17 posts

Engineering

Every engineering post on the ExtractFox blog.

Engineering · 5 min read

How to extract metadata from a PDF

Author, creation date, producer software, custom XMP fields — every PDF carries metadata most people never see. Here's how to read it, in any language, and what's worth pulling for indexing or audit.

Engineering · 5 min read

How to extract links from HTML or plain text

BeautifulSoup, Cheerio, PHP DOMDocument, and the regex you should and shouldn't use — every reliable way to pull URLs out of a string of HTML or text.

Engineering · 6 min read

How to extract keywords from text, a website, or a job description

RAKE, YAKE, KeyBERT, and TextRank: the four open-source keyword-extraction algorithms that still matter in the LLM era, plus when to use each.

Engineering · 5 min read

How to extract sentiment from text

VADER, TextBlob, fine-tuned transformers, and the LLM-with-typed-schema pattern — four sentiment-extraction methods, what each one is best at, and the few lines of code to start.

Engineering · 5 min read

How to extract topics from text or interview transcripts

BERTopic, LDA, and the LLM-with-clustering pattern that's quietly taken over qualitative research. What each one is best at, and the few lines of code to start.

Engineering · 6 min read

How to extract all links from a website

Crawling a whole site for every internal and external URL — the right way with a sitemap, the brute-force way with Scrapy, and what to do when neither works.

Engineering · 7 min read

How to extract a table from a PDF using Python

pdfplumber, Camelot, Tabula, and the API-based fallback — what each library handles well, what it breaks on, and the code you actually need.

Engineering · 5 min read

The end of templates: how AI extraction actually works

Per-supplier templates were the only way to extract structured data from PDFs for two decades. Multimodal models change the shape of the problem.

Engineering · 6 min read

How to extract key-value pairs from documents

"Name: John Smith. Date: 2024-04-12. Total: $1,420." — every form, invoice, and structured PDF is full of key-value pairs. Here's how to pull them out reliably across formats.

Engineering · 4 min read

How to extract files from a Docker image or container

docker cp from a running container, dive for digging into image layers, and the few tricks for getting files out of an image you can't run.

Engineering · 4 min read

How to extract the host or domain from a URL

URL parsing in Python, JavaScript, Bash, and SQL — and the public-suffix-list trick that makes "co.uk" come out right.

Engineering · 5 min read

How to extract city, state, and country from a location string

"San Francisco Bay Area," "Greater London," "NYC." Free-text location fields are messy. Here's how to parse them into clean city/state/country with libpostal, geopy, or an LLM.

Engineering · 5 min read

How to extract schema from SQL Server, MongoDB, JSON, XML, or Parquet

INFORMATION_SCHEMA, sp_help, mongoexport, jsonschema, parquet-tools: the right command for every common data store, in two lines or fewer per format.

Engineering · 5 min read

How to extract facts from text

Named entities, claims, relations — what "fact extraction" actually means in NLP, the libraries that handle each piece, and how the LLM era changed which ones are worth using.

Engineering · 7 min read

How to extract LinkedIn profile data using Python

LinkedIn is the most legally fraught scraping target on the internet. Here's what's allowed, what's gray, and the safest patterns for engineers who need profile or company data programmatically.

Engineering · 7 min read

How to extract text from an image using Python

Tesseract via pytesseract, EasyOCR, PaddleOCR, and the API-based path: what each one is best at, what each one breaks on, and the few lines of code to get started.

Engineering · 7 min read

How to extract data from a PDF in C#

A working engineer's tour of the PDF extraction libraries in the .NET ecosystem — iText, PdfPig, Azure Document Intelligence, and the API-first alternative when you don't want to ship a parser at all.