
What the Public Suffix List is, and why your URL parser needs it

"foo.co.uk" is one domain. Naive URL parsing splits it wrong. The Public Suffix List explained — what it is, where to get it, and the libraries that use it correctly.

Most URL-parsing bugs come down to the same mistake: treating the last two dotted parts of a hostname as "domain.tld" and everything before them as the subdomain. That's wrong for any country-code TLD with a multi-part suffix — co.uk, com.au, com.br, ne.jp — which covers a large share of the public-internet hostnames you'll encounter. The fix is the Public Suffix List.

What it is

The Public Suffix List (PSL) is a community-maintained list of every domain suffix under which the public can register names. It's hosted at publicsuffix.org by Mozilla and updated continuously. The list has two sections: ICANN-recognized suffixes (.com, .uk, .co.uk, etc.) and a private section for vendor-controlled suffixes where users can register subdomains (github.io, vercel.app, herokuapp.com).

Without the PSL, a URL parser cannot reliably tell where 'a domain' ends. With it, the rule is simple: the registrable domain is the suffix on the list, plus exactly one label to the left of it.

What goes wrong without it

Naive parsing of "https://shop.bbc.co.uk/cart" treats "co.uk" as the domain and "shop.bbc" as the subdomain. Cookies set on the naive 'domain' would scope to every .co.uk site — a real security bug, which is why every browser ships with the PSL embedded. The same logic applies to:

  • Cookie scoping — Set-Cookie should not be allowed to set on a public suffix.
  • Same-origin checks for embedded content and OAuth redirect validation.
  • Email domain matching for SSO — "@bbc.co.uk" and "@news.bbc.co.uk" share an organization; "@something-else.co.uk" does not.
  • Spam and phishing detection — phishing domains often abuse subdomain confusion (paypal-login.co.uk vs login.paypal.co.uk).
  • Analytics — counting unique sites by registrable domain, not by raw hostname.
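The cookie-scoping bullet is worth a concrete sketch: a server-side guard can reject any Set-Cookie Domain attribute that names a bare public suffix. The two-entry suffix set below is an illustrative stand-in for the full PSL, not real coverage.

```python
# A server-side guard: refuse Set-Cookie Domain attributes that name a
# public suffix. PUBLIC_SUFFIXES stands in for the full PSL.

PUBLIC_SUFFIXES = {"co.uk", "com"}  # illustrative subset of the real list

def cookie_domain_allowed(domain: str) -> bool:
    # A cookie must not be scoped to a bare public suffix: that would
    # make it visible to every site registered under that suffix.
    return domain.lstrip(".").lower() not in PUBLIC_SUFFIXES

cookie_domain_allowed("bbc.co.uk")  # allowed
cookie_domain_allowed("co.uk")      # rejected: would leak to all .co.uk sites
```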

How to use it: the three steps

  1. Fetch the latest PSL — either at build time (bundled), at runtime with periodic refresh, or via a library that caches it for you. The list is small (a few hundred KB).
  2. Match the hostname against the list, finding the longest suffix that matches.
  3. The registrable domain is that suffix plus one label. Anything to the left of that is the subdomain.
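The three steps fit in a few lines. In this sketch, the hard-coded SUFFIXES set stands in for a freshly fetched PSL (step 1); the loop is step 2, and the longest match wins because longer candidates are tried first. A real implementation also needs the list's wildcard and exception rules, which are omitted here.

```python
# Sketch of PSL-based hostname splitting. SUFFIXES is a stand-in for
# the fetched list; wildcard ("*.") and exception ("!") rules omitted.

SUFFIXES = {"uk", "co.uk", "com", "jp", "ne.jp"}

def split_host(host: str):
    labels = host.lower().split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])          # longest candidates first
        if suffix in SUFFIXES and i > 0:
            registrable = ".".join(labels[i - 1:])  # suffix + one label
            subdomain = ".".join(labels[:i - 1])
            return subdomain, registrable, suffix
    return "", host, ""

split_host("shop.bbc.co.uk")  # ('shop', 'bbc.co.uk', 'co.uk')
```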

Library support, by language

Python: tldextract

import tldextract

ext = tldextract.extract("https://shop.bbc.co.uk/cart")
ext.subdomain          # 'shop'
ext.domain             # 'bbc'
ext.suffix             # 'co.uk'
ext.registered_domain  # 'bbc.co.uk'

tldextract caches the PSL locally and refreshes on demand. Use the include_psl_private_domains=True flag if you want to treat github.io as a public suffix (and therefore each user's *.github.io as a separate registrable domain).

JavaScript / Node: tldts

import { parse } from "tldts";

const { domain, subdomain, publicSuffix } = parse("https://shop.bbc.co.uk");
// domain: 'bbc.co.uk', subdomain: 'shop', publicSuffix: 'co.uk'

tldts ships the PSL inline and is dependency-free. For the browser, it's the right pick because it doesn't need a runtime fetch.

Go: golang.org/x/net/publicsuffix

import "golang.org/x/net/publicsuffix"

etld1, _ := publicsuffix.EffectiveTLDPlusOne("shop.bbc.co.uk")
// etld1 == "bbc.co.uk"

The Go x/net package is what most production Go services use. It's a generated lookup table — fast and zero-allocation per call.

Rust, Ruby, others

Rust has publicsuffix on crates.io. Ruby has the public_suffix gem. Java has Guava's InternetDomainName. Every mainstream language ecosystem has at least one good binding — there is no good reason to roll your own.

ICANN vs private suffixes — the part that catches people

The PSL has two sections. ICANN suffixes are real public TLDs and country-codes (.com, .co.uk, .com.au). Private suffixes are vendor-managed (github.io, vercel.app, herokuapp.com, blogspot.com). Treat them the same for cookie isolation; treat them differently for organizational ownership.

An example: alice.github.io and bob.github.io are different registrable domains for the purpose of cookies and same-origin policy (a security boundary). But they're both on github.io — the organizational owner is GitHub, not Alice or Bob. Defaults vary by library (tldextract, for instance, excludes private suffixes unless you pass include_psl_private_domains=True), so check the flag: include private suffixes when the question is "what's the security boundary?", and exclude them when it's "who owns this domain?"
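A minimal sketch of the two modes, with hand-picked ICANN and private entries standing in for the real list (wildcard and exception rules omitted):

```python
# Toggling private suffixes changes the answer for vendor domains.
ICANN = {"io", "com"}          # illustrative ICANN-section entries
PRIVATE = {"github.io"}        # illustrative private-section entry

def registrable(host: str, include_private: bool = True) -> str:
    suffixes = ICANN | (PRIVATE if include_private else set())
    labels = host.lower().split(".")
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[max(i - 1, 0):])  # suffix + one label
    return host

registrable("alice.github.io")                         # 'alice.github.io' (security boundary)
registrable("alice.github.io", include_private=False)  # 'github.io' (organizational owner)
```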

Keeping the list fresh

The PSL changes frequently — new TLDs, new private suffixes (pages.dev, Cloudflare's deployment domain, is on it, for example). For long-running services, refresh periodically: tldextract can re-fetch the list on demand, and tldts publishes updated releases often. Pin the version in your build, but pin a recent one.
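For a service that refreshes at runtime, the fetch itself is simple: the list is plain text at a stable URL. The parser below is a deliberate simplification; it skips the PSL's wildcard (`*.`) and exception (`!`) rules, which a production parser must handle.

```python
import urllib.request

PSL_URL = "https://publicsuffix.org/list/public_suffix_list.dat"

def parse_psl(text: str) -> set[str]:
    # Keep rule lines; drop comments ("//") and blanks.
    # Simplification: wildcard and exception rules are not interpreted.
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("//"):
            rules.add(line)
    return rules

def fetch_suffixes() -> set[str]:
    # Call this on a timer (hours to days) and swap the set atomically.
    with urllib.request.urlopen(PSL_URL, timeout=10) as resp:
        return parse_psl(resp.read().decode("utf-8"))
```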

When you need more than parsing

Once you have the registrable domain, the next step is usually fetching it and pulling structured data — title, OG tags, JSON-LD, contact details, prices. URL parsing gets you the right key; the value still has to be extracted from the page.

Tool
Past the URL — pull structured data from any website
Drop a URL into ExtractFox and get back the page's structured data — title, OG/Twitter tags, JSON-LD, prices, contacts — as Excel, CSV, or JSON. Built on top of the same URL-parsing primitives.
