Engineering
Every engineering post on the ExtractFox blog.
Passport MRZ format explained: fields, check digits, and parser code
How to parse passport MRZ data under ICAO 9303: TD3 field positions, check digits, example MRZ lines, Python parser code, and when to extract the full passport instead.
How to remove metadata from a PDF (for privacy)
Author, software, GPS, edit history — every PDF leaks more than you think. The reliable ways to strip metadata before sharing, in any tool you already have.
Public Suffix List: get the registrable domain from a URL
How to get the registrable domain from a URL with the Public Suffix List, including vercel.app, github.io, co.uk, and copy-paste Python, JavaScript, and Go examples.
How to extract metadata from a PDF (Python, C#, online tools)
How to read PDF metadata — author, creation date, producer software, custom XMP fields — in Python, C#, or with free online tools. Step-by-step guide with copy-paste code examples.
How to extract links from HTML or plain text
BeautifulSoup, Cheerio, PHP DOMDocument, and the regex you should and shouldn't use — every reliable way to pull URLs out of a string of HTML or text.
How to extract keywords from text, a website, or a job description
RAKE, YAKE, KeyBERT, and TextRank: the four open-source keyword-extraction algorithms that still matter in the LLM era, plus when to use each.
How to extract sentiment from text
VADER, TextBlob, fine-tuned transformers, and the LLM-with-typed-schema pattern — four sentiment-extraction methods, what each one is best at, and the few lines of code to start.
How to extract topics from text or interview transcripts
BERTopic, LDA, and the LLM-with-clustering pattern that's quietly taken over qualitative research. What each one is best at, and the few lines of code to start.
How to extract all links from a website
Crawling a whole site for every internal and external URL — the right way with a sitemap, the brute-force way with Scrapy, and what to do when neither works.
How to extract a table from a PDF using Python
pdfplumber, Camelot, Tabula, and the API-based fallback — what each library handles well, what it breaks on, and the code you actually need.
The end of templates: how AI extraction actually works
Per-supplier templates were the only way to extract structured data from PDFs for two decades. Multimodal models change the shape of the problem.
How to extract key-value pairs from documents
"Name: John Smith. Date: 2024-04-12. Total: $1,420." — every form, invoice, and structured PDF is full of key-value pairs. Here's how to pull them out reliably across formats.
How to extract files from a Docker image or container
docker cp from a running container, dive for digging into image layers, and the few tricks for getting files out of an image you can't run.
How to extract the host or domain from a URL (with code examples)
URL parsing in Python, JavaScript, Bash, and SQL — and the public-suffix-list trick that makes "co.uk" come out right. Code examples for every approach.
Extract city and country from a location string: Python, JS, and LLM fallback
How to identify city, region, and country from location strings like Leiston, Greater London, NYC, or San Francisco Bay Area with Python, JavaScript, geocoding, and schema-based AI fallback.
How to extract schema from SQL Server, MongoDB, JSON, XML, or Parquet
INFORMATION_SCHEMA, sp_help, mongoexport, jsonschema, parquet-tools: the right command for every common data store, in two lines or fewer per format.
How to extract facts from text (Python NLP tutorial)
How to extract named entities, claims, and relations from text using Python NLP libraries. What fact extraction means, which tools to use, and how LLMs changed the game. Free tutorial with copy-paste code.
How to extract LinkedIn profile data with Python (legally)
Official APIs, unofficial libraries, headless browsers, and the legal boundaries of extracting LinkedIn profile data with Python, plus a free no-code alternative that exports to Excel or JSON. Includes code examples.
How to extract text from an image using Python
Tesseract via pytesseract, EasyOCR, PaddleOCR, and the API-based path — what each one is best at, what they break on, and the few lines of code to get started.
How to extract data from a PDF in C#
A working engineer's tour of the PDF extraction libraries in the .NET ecosystem — iText, PdfPig, Azure Document Intelligence, and the API-first alternative when you don't want to ship a parser at all.