How to extract the host or domain from a URL
URL parsing in Python, JavaScript, Bash, and SQL — and the public-suffix-list trick that makes "co.uk" come out right.
Splitting a URL into its parts is one of those tasks that looks simple until you hit the public suffix list. "foo.com" is one domain; "foo.co.uk" is also one domain, even though it has more dots. Most quick approaches get this wrong.
Python: stdlib + tldextract
from urllib.parse import urlparse u = urlparse("https://www.example.co.uk/path?q=1") u.hostname # 'www.example.co.uk' u.netloc # 'www.example.co.uk' (includes port if present)
For the eTLD+1 ("example.co.uk"), use tldextract — it ships with the public suffix list:
import tldextract ext = tldextract.extract("https://blog.example.co.uk/post") ext.subdomain # 'blog' ext.domain # 'example' ext.suffix # 'co.uk' ext.registered_domain # 'example.co.uk'
JavaScript / TypeScript
const u = new URL("https://www.example.co.uk/path"); u.hostname; // 'www.example.co.uk' u.host; // 'www.example.co.uk' (with port) u.pathname; // '/path'
For eTLD+1 in JS: psl (Public Suffix List) on npm:
import psl from "psl"; psl.parse("blog.example.co.uk").domain; // 'example.co.uk'
Bash
echo "https://www.example.com/path?q=1" | awk -F[/:] '{print $4}' # www.example.com
Crude but works for clean URLs. For anything more involved, shell out to Python.
SQL
PostgreSQL:
SELECT regexp_replace(url, '^https?://([^/]+).*$', '\1') FROM logs;
BigQuery has NET.HOST(url) built in, which handles malformed inputs without erroring. Snowflake has PARSE_URL with a host field. MySQL needs a regex or a UDF.
Edge cases
- URLs with no scheme — urllib.parse treats them as relative paths and returns no hostname. Prefix with // or http:// before parsing.
- Internationalized domain names (IDN) — tldextract returns the punycode form by default. Convert with idna.decode for display.
- Ports — urllib.parse.hostname strips them; netloc keeps them. Pick the right field.
- userinfo (user:pass@host) — netloc includes it; hostname strips it. Be careful when logging URLs that might contain credentials.