20 posts

Engineering

Every engineering post on the ExtractFox blog.

Engineering · 8 min read

Passport MRZ format explained: fields, check digits, and parser code

How to parse passport MRZ data under ICAO 9303: TD3 field positions, check digits, example MRZ lines, Python parser code, and when to extract the full passport instead.

Engineering · 5 min read

How to remove metadata from a PDF (for privacy)

Author, software, GPS, edit history — every PDF leaks more than you think. The reliable ways to strip metadata before sharing, in any tool you already have.

Engineering · 7 min read

Public Suffix List: get the registrable domain from a URL

How to get the registrable domain from a URL with the Public Suffix List, including vercel.app, github.io, co.uk, and copy-paste Python, JavaScript, and Go examples.

Engineering · 5 min read

How to extract metadata from a PDF (Python, C#, online tools)

How to read PDF metadata — author, creation date, producer software, custom XMP fields — in Python, C#, or with free online tools. Step-by-step guide with copy-paste code examples.

Engineering · 5 min read

How to extract links from HTML or plain text

BeautifulSoup, Cheerio, PHP DOMDocument, and the regex you should and shouldn't use — every reliable way to pull URLs out of a string of HTML or text.

Engineering · 6 min read

How to extract keywords from text, a website, or a job description

RAKE, YAKE, KeyBERT, and TextRank: the four open-source keyword-extraction algorithms that still matter in the LLM era, plus when to use each.

Engineering · 5 min read

How to extract sentiment from text

VADER, TextBlob, fine-tuned transformers, and the LLM-with-typed-schema pattern — four sentiment-extraction methods, what each one is best at, and the few lines of code to start.

Engineering · 5 min read

How to extract topics from text or interview transcripts

BERTopic, LDA, and the LLM-with-clustering pattern that's quietly taken over qualitative research. What each one is best at, and the few lines of code to start.

Engineering · 6 min read

How to extract all links from a website

Crawling a whole site for every internal and external URL — the right way with a sitemap, the brute-force way with Scrapy, and what to do when neither works.

Engineering · 7 min read

How to extract a table from a PDF using Python

pdfplumber, Camelot, Tabula, and the API-based fallback — what each library handles well, what it breaks on, and the code you actually need.

Engineering · 5 min read

The end of templates: how AI extraction actually works

Per-supplier templates were the only way to extract structured data from PDFs for two decades. Multimodal models change the shape of the problem.

Engineering · 6 min read

How to extract key-value pairs from documents

"Name: John Smith. Date: 2024-04-12. Total: $1,420." — every form, invoice, and structured PDF is full of key-value pairs. Here's how to pull them out reliably across formats.

Engineering · 4 min read

How to extract files from a Docker image or container

docker cp from a running container, dive for digging into image layers, and the few tricks for getting files out of an image you can't run.

Engineering · 4 min read

How to extract the host or domain from a URL (with code examples)

URL parsing in Python, JavaScript, Bash, and SQL — and the public-suffix-list trick that makes "co.uk" come out right. Code examples for every approach.

Engineering · 8 min read

Extract city and country from a location string: Python, JS, and LLM fallback

How to identify city, region, and country from location strings like Leiston, Greater London, NYC, or San Francisco Bay Area with Python, JavaScript, geocoding, and schema-based AI fallback.

Engineering · 5 min read

How to extract schema from SQL Server, MongoDB, JSON, XML, or Parquet

INFORMATION_SCHEMA, sp_help, mongoexport, jsonschema, parquet-tools: the right command for every common data store, in two lines or fewer per format.

Engineering · 5 min read

How to extract facts from text (Python NLP tutorial)

How to extract named entities, claims, and relations from text using Python NLP libraries. What fact extraction means, which tools to use, and how LLMs changed the game. Free tutorial with copy-paste code.

Engineering · 7 min read

How to extract LinkedIn profile data with Python (legally)

How to extract LinkedIn profile data with Python: official APIs, unofficial libraries, headless browsers, and the legal boundaries, plus a no-code alternative that exports to Excel or JSON. Free guide with code examples.

Engineering · 7 min read

How to extract text from an image using Python

Tesseract via pytesseract, EasyOCR, PaddleOCR, and the API-based path — what each one is best at, what they break on, and the few lines of code to get started.

Engineering · 7 min read

How to extract data from a PDF in C#

A working engineer's tour of the PDF extraction libraries in the .NET ecosystem — iText, PdfPig, Azure Document Intelligence, and the API-first alternative when you don't want to ship a parser at all.