Engineering · May 3, 2026 · 5 min read

How to extract links from HTML or plain text

BeautifulSoup, Cheerio, PHP DOMDocument, and the regex you should and shouldn't use — every reliable way to pull URLs out of a string of HTML or text.

By Dawid Sibinski

Extracting links from HTML is a one-liner if you use a parser, and a footgun if you use regex on the raw markup. Plain text is the opposite — regex is the right tool because there's no DOM to traverse.

Python: BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]

Catches every <a href> in the document. To resolve relative URLs against a base, pair with urllib.parse.urljoin(base, link). For images and stylesheets, find_all("img", src=True) and find_all("link", rel="stylesheet") work the same way.
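Resolving relative hrefs is a one-liner with the stdlib; a quick illustration (the base URL here is just an example):

```python
from urllib.parse import urljoin

base = "https://example.com/blog/post.html"

# Relative paths resolve against the base document's location.
print(urljoin(base, "/about"))        # https://example.com/about
print(urljoin(base, "../img/a.png"))  # https://example.com/img/a.png

# Already-absolute URLs pass through unchanged.
print(urljoin(base, "https://other.test/x"))
```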

Node: Cheerio

import * as cheerio from "cheerio";

const $ = cheerio.load(html);
const links = $("a[href]").map((_, el) => $(el).attr("href")).get();

Same jQuery-style API, server-side. The standard JS choice for HTML parsing outside the browser.

PHP: DOMDocument

$dom = new DOMDocument();
@$dom->loadHTML($html);
$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

The @ suppresses warnings on imperfect HTML, which is most real-world HTML. For modern PHP, the Symfony DomCrawler component wraps DOMDocument with a friendlier API.

Plain text: regex

For text that isn't structured HTML — emails, chat logs, PDF text dumps, comments — a URL regex is the right call. A pragmatic pattern that handles the vast majority of real-world URLs:

https?://[\w\-._~:/?#[\]@!$&'()*+,;=%]+

Trailing punctuation is the recurring trap — sentences like "see https://example.com." produce a regex match that includes the trailing period. Strip with a separate cleanup step (rstrip(".,;:!?)") or use a more conservative trailing-character set in the regex.
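Putting the pattern and the cleanup step together in Python (the sample sentence is illustrative):

```python
import re

# The character class mirrors the pattern above; trailing punctuation
# like "." or ")" can legally appear in a URL, so the regex over-matches.
URL_RE = re.compile(r"https?://[\w\-._~:/?#[\]@!$&'()*+,;=%]+")

def extract_urls(text):
    # Match greedily, then strip common sentence punctuation off the end.
    return [m.group(0).rstrip(".,;:!?)") for m in URL_RE.finditer(text)]

print(extract_urls("See https://example.com/docs. Also (https://example.org/a)."))
# ['https://example.com/docs', 'https://example.org/a']
```

The trade-off: this also strips a genuine trailing ")" from Wikipedia-style URLs like .../Foo_(bar), which is why some extractors balance parentheses instead of stripping them blindly.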

When you need real URL parsing, not just matching

Python's stdlib urllib.parse handles parsing once you've matched. For more aggressive URL detection (bare domains without http://, IDN handling, validation), the urlextract library on PyPI uses Mozilla's Public Suffix List of known TLDs and catches things a scheme-anchored regex misses.
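Once a URL is matched, urlparse splits it into named components — useful for the hostname filtering mentioned below (the URL here is an example):

```python
from urllib.parse import urlparse

parts = urlparse("https://example.com:8080/path/page?q=1#top")

print(parts.scheme)    # https
print(parts.netloc)    # example.com:8080
print(parts.hostname)  # example.com  (port stripped, lowercased)
print(parts.path)      # /path/page
print(parts.query)     # q=1
```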

What you should never do

Parse HTML with regex. Yes, the famous Stack Overflow answer is hyperbole, but for anything more complex than "find <a href> tags in tame markup," regex on HTML breaks on edge cases — embedded scripts, attribute order, single vs double quotes, multi-line attributes. Use a parser. They're free and fast.

Niche cases

  • Instagram links in messy text — same URL regex; filter results by hostname (instagram.com, instagr.am).
  • Markdown — the link form is [text](url); a separate regex (\[([^\]]+)\]\(([^)]+)\)) catches them.
  • Email addresses — different regex altogether: [\w._%+-]+@[\w.-]+\.[A-Za-z]{2,}.
  • M3U playlist files — they're plain text, one URL per line; just split on newlines and filter for lines starting with http.
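The Markdown pattern from the list, in Python — findall with two capture groups returns (text, url) pairs (the sample string is illustrative):

```python
import re

# Captures [link text](url); breaks on nested brackets, which are rare.
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

md = "Read the [docs](https://example.com/docs) and [FAQ](/faq)."
links = MD_LINK.findall(md)
# [('docs', 'https://example.com/docs'), ('FAQ', '/faq')]
```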


Stop reading, start extracting

Drop a PDF or image into ExtractFox and get structured data back in seconds.

Try a free extraction →