How to extract metadata from a website
Title tags, Open Graph, Twitter cards, JSON-LD structured data — what every page exposes and how to pull it out cleanly for SEO audits, link previews, or content indexing.
Websites carry metadata in four standard layers, and a fifth informal one. Most extractors handle two or three; the good ones cover all five.
What's in the head
- <title> and <meta name="description"> — the original SEO pair, still the most important.
- Open Graph tags (og:title, og:description, og:image, og:url) — what Facebook, LinkedIn, and Slack use for link previews.
- Twitter card tags (twitter:card, twitter:title, etc.) — same role, different namespace.
- JSON-LD structured data — Schema.org types like Article, Product, BreadcrumbList. What Google uses for rich results.
- Microdata and RDFa — older alternatives to JSON-LD, still common on legacy sites.
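As a rough illustration of where each layer lives, the sketch below pulls the title, meta tags, and JSON-LD out of an HTML string using only Python's standard library. The `HeadMetadataParser` class and the `SAMPLE` page are made up for this example. Microdata and RDFa are left out because they are attributes spread through the markup and need a dedicated parser.

```python
import json
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collects <title>, <meta> tags, and JSON-LD blocks from an HTML string."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}        # e.g. "description", "og:title", "twitter:card"
        self.json_ld = []     # parsed JSON-LD documents
        self._capture = None  # "title" or "json-ld" while inside those tags
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._capture, self._buffer = "title", []
        elif tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self._capture, self._buffer = "json-ld", []
        elif tag == "meta":
            # Open Graph tags use property=..., most others use name=...
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta.setdefault(key.lower(), attrs["content"])

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer).strip()
        if tag == "title" and self._capture == "title":
            self.title = self.title or text  # keep the first <title> seen
            self._capture = None
        elif tag == "script" and self._capture == "json-ld":
            try:
                self.json_ld.append(json.loads(text))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is skipped rather than crashing
            self._capture = None


# Hypothetical page used only to exercise the parser.
SAMPLE = """<html><head>
<title>Example Article</title>
<meta name="description" content="A short summary.">
<meta property="og:title" content="Example Article (OG)">
<meta name="twitter:card" content="summary_large_image">
<script type="application/ld+json">{"@type": "Article", "headline": "Example Article"}</script>
</head><body></body></html>"""

parser = HeadMetadataParser()
parser.feed(SAMPLE)
print(parser.title)
print(parser.meta)
print(parser.json_ld)
```

This is a teaching aid rather than a production extractor: it ignores character-set detection, duplicate tags beyond the first, and every body-level syntax.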
Quick extract: curl + grep
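For one-off poking, this is enough:
    curl -s https://example.com | grep -iE 'meta|title|ld\+json'
The ld\+json pattern matches the application/ld+json script type, which is how JSON-LD blocks are actually marked up. The approach is useless at scale: curl never executes JavaScript, so client-rendered tags are missing, and grep matches lines of text rather than parsing HTML, so minified or malformed markup produces noisy or incomplete output.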
Python: BeautifulSoup + extruct
BeautifulSoup is enough for the title and description; extruct is the right library for everything beyond that. It handles JSON-LD, microdata, RDFa, Open Graph, and microformats from a single HTML string:
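    import extruct
    import requests
    url = "https://example.com"
    # Fetch the raw HTML; no JavaScript is executed at this step.
    html = requests.get(url, timeout=10).text
    # base_url lets extruct resolve relative URLs found in the markup.
    data = extruct.extract(html, base_url=url)
    print(data["json-ld"])
    print(data["opengraph"])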
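For JS-rendered sites, swap requests for a headless browser to get the post-render HTML before passing it to extruct. Playwright has official Python bindings; Puppeteer is a Node library, so it fits better when the fetch step already lives in a Node service.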
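The post-render step can be sketched with Playwright's synchronous Python API. The function name `render_html` and the 30-second default are choices made for this example, and waiting for "networkidle" is only a rough proxy for "the page has finished rendering"; some sites need an explicit wait on a selector instead.

```python
def render_html(url, timeout_ms=30_000):
    """Return the page's HTML after client-side scripts have run."""
    # Imported here so the module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles before reading the DOM.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

The returned string can be handed to extruct.extract in place of the requests response text. Running it requires `pip install playwright` followed by `playwright install chromium`.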
Node: metascraper
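metascraper is a modular metadata scraper: each field (title, author, image, date, publisher) is handled by its own rules package, such as metascraper-title or metascraper-author, and you load only the ones you need. It is used in production by several link-preview services and returns a normalized object regardless of which tag layer the site actually uses.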
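The idea of a normalized object can be illustrated in Python with a simple precedence list per field. This is not metascraper's API or its actual rule order; the `normalize` function, the field names, and the precedence are all assumptions chosen for the example.

```python
def first_present(*candidates):
    """Return the first non-empty string among candidates, else None."""
    for value in candidates:
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


def normalize(title, meta, json_ld):
    """Merge tag layers into one record using an assumed precedence order."""
    # Use the first JSON-LD item that is a plain object; ignore the rest.
    ld = next((item for item in json_ld if isinstance(item, dict)), {})
    ld_image = ld.get("image")
    return {
        "title": first_present(
            meta.get("og:title"), meta.get("twitter:title"), ld.get("headline"), title
        ),
        "description": first_present(
            meta.get("og:description"),
            meta.get("twitter:description"),
            ld.get("description"),
            meta.get("description"),
        ),
        "image": first_present(
            meta.get("og:image"),
            meta.get("twitter:image"),
            ld_image if isinstance(ld_image, str) else None,
        ),
        "date": first_present(
            ld.get("datePublished"), meta.get("article:published_time")
        ),
    }


record = normalize(
    "Fallback title",
    {"twitter:title": "From Twitter", "description": "Plain description"},
    [{"@type": "Article", "headline": "From JSON-LD", "datePublished": "2024-01-01"}],
)
print(record)
```

Callers then read `record["title"]` without caring whether it came from Open Graph, a Twitter card, JSON-LD, or the plain title tag.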
Hosted APIs
Microlink, LinkPreview, and OpenGraph.io return cleaned metadata for any URL. Worth it when you don't want to maintain a fetcher and rendering layer yourself, especially for a link-preview feature in a chat or CMS product.
What "metadata" doesn't include
The <head> tags are author-declared and often missing, lying, or stale. For real content extraction — actual title, byline, publish date, body text — you need a content extractor (Readability, Mercury, Trafilatura) that infers these from the visible page when the metadata is wrong. ExtractFox's website extractor takes the same approach: it reads the rendered page, not just the head.
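A toy version of that cross-check compares the declared title with the first visible heading. It is far cruder than what Readability or Trafilatura do, and the `check_title` helper and its "consistent" rule (the heading text appears inside the declared title) are assumptions made for this sketch.

```python
from html.parser import HTMLParser


class TitleAndHeading(HTMLParser):
    """Records the declared <title> and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.found = {"title": None, "h1": None}
        self._tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only for the first occurrence of each tag.
        if tag in self.found and self.found[tag] is None and self._tag is None:
            self._tag, self._buffer = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace left behind by nested inline tags.
            self.found[tag] = " ".join("".join(self._buffer).split())
            self._tag = None


def check_title(html):
    """Report the declared title, the visible heading, and whether they agree."""
    parser = TitleAndHeading()
    parser.feed(html)
    declared, visible = parser.found["title"], parser.found["h1"]
    consistent = bool(declared and visible and visible.lower() in declared.lower())
    return {"declared": declared, "visible": visible, "consistent": consistent}


page = (
    "<html><head><title>Home</title></head>"
    "<body><h1>How to <em>extract</em> metadata</h1></body></html>"
)
print(check_title(page))
```

When `consistent` is false, the visible heading is usually the safer choice for display, and the mismatch is worth logging for an SEO audit.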
Edge cases
- Single-page apps that render in JS and don't update the head dynamically — Next.js, Remix, and similar frameworks fix this server-side; SPAs with client-only routing often don't.
- Paywalled articles where the metadata says one thing and the actual content is a paywall — mostly a problem for content syndication tools.
- Sites that return different metadata to bots vs. browsers — check both with a Googlebot user-agent if rankings depend on it.
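The bot-versus-browser check in the last item can be sketched with the standard library alone. The helper names are invented for this example, the browser user-agent string is just one plausible value, and only the title is compared to keep it short. Sites that verify Googlebot by reverse DNS will not be fooled by a spoofed header, so treat the result as a hint.

```python
import re
import urllib.request

# Example user-agent strings; the browser one is an arbitrary recent Chrome value.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch_html(url, user_agent, timeout=10):
    """Fetch a URL with the given User-Agent header and return decoded text."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def title_of(html):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def diff_fields(browser_meta, bot_meta):
    """Return {field: (browser_value, bot_value)} for fields that differ."""
    keys = browser_meta.keys() | bot_meta.keys()
    return {
        key: (browser_meta.get(key), bot_meta.get(key))
        for key in sorted(keys)
        if browser_meta.get(key) != bot_meta.get(key)
    }
```

A typical use is to call `fetch_html` once with each user agent, build a small dict such as `{"title": title_of(html)}` for each response, and pass both to `diff_fields`; an empty result means the compared fields match.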