Engineering
Every engineering post on the ExtractFox blog.
How to extract metadata from a PDF
Author, creation date, producer software, custom XMP fields — every PDF carries metadata most people never see. Here's how to read it, in any language, and what's worth pulling for indexing or audit.
How to extract links from HTML or plain text
BeautifulSoup, Cheerio, PHP DOMDocument, and the regex you should and shouldn't use — every reliable way to pull URLs out of a string of HTML or text.
How to extract keywords from text, a website, or a job description
RAKE, YAKE, KeyBERT, and TextRank: the four open-source keyword-extraction algorithms that still matter in the LLM era, plus when to use each.
How to extract sentiment from text
VADER, TextBlob, fine-tuned transformers, and the LLM-with-typed-schema pattern — four sentiment-extraction methods, what each one is best at, and the few lines of code to start.
How to extract topics from text or interview transcripts
BERTopic, LDA, and the LLM-with-clustering pattern that's quietly taken over qualitative research. What each one is best at, and the few lines of code to start.
How to extract all links from a website
Crawling a whole site for every internal and external URL — the right way with a sitemap, the brute-force way with Scrapy, and what to do when neither works.
How to extract a table from a PDF using Python
pdfplumber, Camelot, Tabula, and the API-based fallback — what each library handles well, what it breaks on, and the code you actually need.
The end of templates: how AI extraction actually works
Per-supplier templates were the only way to extract structured data from PDFs for two decades. Multimodal models change the shape of the problem.
How to extract key-value pairs from documents
"Name: John Smith. Date: 2024-04-12. Total: $1,420." — every form, invoice, and structured PDF is full of key-value pairs. Here's how to pull them out reliably across formats.
How to extract files from a Docker image or container
docker cp from a running container, dive for digging into image layers, and the few tricks for getting files out of an image you can't run.
How to extract the host or domain from a URL
URL parsing in Python, JavaScript, Bash, and SQL — and the public-suffix-list trick that makes "co.uk" come out right.
How to extract city, state, and country from a location string
"San Francisco Bay Area," "Greater London," "NYC." Free-text location fields are messy. Here's how to parse them into clean city/state/country with libpostal, geopy, or an LLM.
How to extract schema from SQL Server, MongoDB, JSON, XML, or Parquet
INFORMATION_SCHEMA, sp_help, mongoexport, jsonschema, parquet-tools: the right command for every common data store, in two lines or fewer per format.
How to extract facts from text
Named entities, claims, relations — what "fact extraction" actually means in NLP, the libraries that handle each piece, and how the LLM era changed which ones are worth using.
How to extract LinkedIn profile data using Python
LinkedIn is the most legally fraught scraping target on the internet. Here's what's allowed, what's gray, and the safest patterns for engineers who need profile or company data programmatically.
How to extract text from an image using Python
Tesseract via pytesseract, EasyOCR, PaddleOCR, and the API-based path: what each one is best at, what each one breaks on, and the few lines of code to get started.
How to extract data from a PDF in C#
A working engineer's tour of the PDF extraction libraries in the .NET ecosystem — iText, PdfPig, Azure Document Intelligence, and the API-first alternative when you don't want to ship a parser at all.