Engineering · April 9, 2026 · 4 min read

How to extract the host or domain from a URL

URL parsing in Python, JavaScript, Bash, and SQL — and the public-suffix-list trick that makes "co.uk" come out right.

By Dawid Sibinski

Splitting a URL into its parts looks simple until you hit the public suffix list. "foo.com" is one registrable domain; "foo.co.uk" is also one, even though it has more dots. Quick approaches that just keep the last two labels get this wrong and return "co.uk".
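To make the failure concrete, here is a minimal sketch of the "keep the last two labels" shortcut. The function name naive_registered_domain is made up for this illustration; only urlparse comes from the standard library.

```python
from urllib.parse import urlparse


def naive_registered_domain(url: str) -> str:
    """Naive shortcut: keep the last two dot-separated labels of the hostname."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


print(naive_registered_domain("https://www.foo.com/"))    # foo.com
print(naive_registered_domain("https://www.foo.co.uk/"))  # co.uk (a public suffix, not a domain)
```

The second result is the bug: "co.uk" is a suffix that many unrelated sites share, so grouping or deduplicating by it lumps them together.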

Python: stdlib + tldextract

    from urllib.parse import urlparse

    u = urlparse("https://www.example.co.uk/path?q=1")
    u.hostname  # 'www.example.co.uk'
    u.netloc    # 'www.example.co.uk' (includes port if present)

For the eTLD+1 ("example.co.uk"), use tldextract — it ships with the public suffix list:

    import tldextract

    ext = tldextract.extract("https://blog.example.co.uk/post")
    ext.subdomain          # 'blog'
    ext.domain             # 'example'
    ext.suffix             # 'co.uk'
    ext.registered_domain  # 'example.co.uk'
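For intuition about what a public-suffix lookup does, the sketch below matches the longest known suffix and keeps one more label to its left. It is not how tldextract is implemented: TOY_SUFFIXES and etld_plus_one are hypothetical names, and the three-entry set stands in for the real list, which has thousands of entries plus wildcard and exception rules that this sketch ignores.

```python
from typing import Optional

# Toy stand-in for the public suffix list (assumption: illustration only).
TOY_SUFFIXES = frozenset({"com", "uk", "co.uk"})


def etld_plus_one(hostname: str, suffixes=TOY_SUFFIXES) -> Optional[str]:
    """Return the longest matching suffix plus one label, or None if there is none."""
    labels = hostname.lower().rstrip(".").split(".")
    # Walk candidates from longest to shortest so "co.uk" is tried before "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched


print(etld_plus_one("blog.example.co.uk"))  # example.co.uk
print(etld_plus_one("www.foo.com"))         # foo.com
print(etld_plus_one("co.uk"))               # None
```

In real code, prefer a maintained library over a hand-rolled set, since the list changes over time.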

JavaScript / TypeScript

    const u = new URL("https://www.example.co.uk/path");
    u.hostname; // 'www.example.co.uk'
    u.host;     // 'www.example.co.uk' (includes the port if present)
    u.pathname; // '/path'

For eTLD+1 in JS, use the psl (Public Suffix List) package from npm:

    import psl from "psl";

    psl.parse("blog.example.co.uk").domain; // 'example.co.uk'

Bash

    # Quote the field separator so the shell cannot glob-expand [/:]
    echo "https://www.example.com/path?q=1" | awk -F'[/:]' '{print $4}'
    # www.example.com

Crude, but it works for clean URLs. A missing scheme or a user:pass@ prefix shifts the fields and returns the wrong piece. For anything more involved, shell out to Python.
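One possible shape for that Python helper is sketched below. The function name hosts and the sample list are made up for the example; the parsing itself is plain urllib.parse.

```python
from urllib.parse import urlparse


def hosts(lines):
    """Yield the hostname for each non-empty line, or '' when none can be parsed."""
    for line in lines:
        url = line.strip()
        if not url:
            continue
        try:
            host = urlparse(url).hostname
        except ValueError:  # raised for input such as an unclosed IPv6 bracket
            host = None
        yield host or ""


# Sample input standing in for lines read from a pipe.
sample = [
    "https://www.example.com/path?q=1\n",
    "https://user:pw@example.org:8443/x\n",
    "not a url\n",
]
for host in hosts(sample):
    print(host)
```

Passing sys.stdin instead of the sample list turns this into a filter that can sit at the end of a shell pipeline, printing one hostname per input line and an empty line where nothing could be parsed.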

SQL

PostgreSQL:

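    -- [^/?#]+ stops at the path, query, or fragment, so "https://example.com?q=1" also works
    SELECT regexp_replace(url, '^https?://([^/?#]+).*$', '\1') AS host
    FROM logs;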

BigQuery has NET.HOST(url) built in, which handles malformed inputs without erroring. Snowflake has PARSE_URL with a host field. MySQL needs a regex or a UDF.

Edge cases

  • URLs with no scheme: urlparse treats them as relative paths and returns no hostname. Prefix with // or http:// before parsing.
  • Internationalized domain names (IDN): a hostname that arrives in punycode (xn--...) stays in punycode after tldextract splits it. Convert with idna.decode for display.
  • Ports: the hostname attribute of a urlparse result strips them; netloc keeps them. Pick the right field.
  • userinfo (user:pass@host): netloc includes it; hostname strips it. Be careful when logging URLs that might contain credentials.
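The edge cases above can be handled with a few small helpers, sketched here with the standard library only. The names host_and_port, redact_userinfo, and to_unicode are hypothetical. For the IDN step the sketch swaps in Python's built-in "idna" codec (which follows the older IDNA 2003 rules) in place of the third-party idna package, so results may differ for some newer domains.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit


def host_and_port(url: str) -> Tuple[Optional[str], Optional[int]]:
    """Return (hostname, port), tolerating input that has no scheme."""
    parts = urlsplit(url)
    if not parts.netloc:
        # No network location found: retry with "//" so the host is recognised.
        parts = urlsplit("//" + url)
    # Note: .port raises ValueError if the port is not a valid number.
    return parts.hostname, parts.port


def redact_userinfo(url: str) -> str:
    """Drop any user:pass@ prefix so the URL is safer to log."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        return url
    host_part = parts.netloc.rpartition("@")[2]  # keeps host[:port] only
    return urlunsplit(parts._replace(netloc=host_part))


def to_unicode(hostname: str) -> str:
    """Decode a punycode hostname for display using the stdlib codec."""
    return hostname.encode("ascii").decode("idna")


print(host_and_port("www.example.co.uk/path"))  # ('www.example.co.uk', None)
print(host_and_port("example.com:8080/x"))      # ('example.com', 8080)
print(redact_userinfo("https://user:pw@example.com:8443/a?b=1"))  # https://example.com:8443/a?b=1
print(to_unicode("xn--mnchen-3ya.de"))          # münchen.de
```

Retrying with a "//" prefix is a heuristic: it assumes scheme-less input starts with a host, which will misread strings such as mailto: addresses.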
