57 posts

The ExtractFox blog

Notes from the team — how AI extraction works, workflow guides, and product decisions.

Tutorial · May 3, 2026 · 5 min read

How to extract data from a chart in Excel

Get the underlying numbers back out of an Excel chart — even if the source range was deleted, the workbook is locked, or you only have the chart as an image.

Engineering · May 3, 2026 · 5 min read

How to extract metadata from a PDF

Author, creation date, producer software, custom XMP fields: every PDF carries metadata most people never see. Here's how to read it in any programming language, and what's worth pulling for indexing or audit.

Tutorial · May 3, 2026 · 5 min read

How to extract images from a PDF

From a one-off Preview save to batch extraction with pdfimages, pypdf, or iText: the realistic ways to pull every embedded image out of a PDF at full quality.

Engineering · May 3, 2026 · 5 min read

How to extract links from HTML or plain text

BeautifulSoup, Cheerio, PHP DOMDocument, and the regex you should and shouldn't use — every reliable way to pull URLs out of a string of HTML or text.

Tutorial · May 3, 2026 · 4 min read

How to extract a zip file (Mac, Windows, Linux, Android)

What "extracting a file" actually means, the one-click way to do it on every major OS, and how to handle the file that won't open.

Engineering · May 3, 2026 · 6 min read

How to extract keywords from text, a website, or a job description

RAKE, YAKE, KeyBERT, TextRank, and the LLM era — the four open-source keyword-extraction algorithms that still matter, plus when to use each.

Engineering · May 3, 2026 · 5 min read

How to extract sentiment from text

VADER, TextBlob, fine-tuned transformers, and the LLM-with-typed-schema pattern — four sentiment-extraction methods, what each one is best at, and the few lines of code to start.

Workflow · May 2, 2026 · 5 min read

How to extract a chart of accounts from QuickBooks

Three reliable ways to get your full chart of accounts out of QuickBooks Online or Desktop — built-in export, IIF, and the API route — plus how to clean it up before re-importing somewhere else.

Workflow · May 1, 2026 · 6 min read

How to extract data from a website to Excel automatically

Power Query, Make/Zapier, and a no-code AI route — three ways to set up an automated pipeline from a website to an Excel file that updates on its own.

Tutorial · April 30, 2026 · 5 min read

How to extract data from a pivot table in Excel

GETPIVOTDATA, Show Details, copying values, and converting a pivot back into a flat table — the four ways to get data out of an Excel pivot, and when each one is the right call.

Tutorial · April 30, 2026 · 4 min read

How to extract RAR, tar.gz, jar, and other archive formats

Beyond zip — every other archive format you'll meet (RAR, 7z, tar.gz, gz, jar, war, ear) and the right tool for each on every OS.

Engineering · April 30, 2026 · 5 min read

How to extract topics from text or interview transcripts

BERTopic, LDA, and the LLM-with-clustering pattern that's quietly taken over qualitative research. What each one is best at, and the few lines of code to start.

Tutorial · April 29, 2026 · 6 min read

How to extract data from Zillow listings

Zillow doesn't offer a public scraping API and actively blocks bots. Here's how to get listing data into a spreadsheet without getting your IP banned — and the legal lines to stay behind.

Engineering · April 29, 2026 · 6 min read

How to extract all links from a website

Crawling a whole site for every internal and external URL — the right way with a sitemap, the brute-force way with Scrapy, and what to do when neither works.

Tutorial · April 28, 2026 · 5 min read

How to extract photos and frames from a video

Pull a still from a video — on iPhone, Android, in DaVinci Resolve, with FFmpeg, or from a YouTube URL. The right tool depends on whether you want one frame or every frame.

Tutorial · April 27, 2026 · 4 min read

How to extract text from a PowerPoint file

Three ways to pull text out of .pptx files — the built-in outline view, scripting with python-pptx, and image-based extraction for slides where the text is baked into pictures.

Tutorial · April 26, 2026 · 5 min read

How to extract EXIF and GPS metadata from a photo

Where the photo was taken, what camera made it, when it was shot, even the focal length: most JPEGs straight from a camera or phone carry all of it. Here's how to read EXIF from a photo and the privacy lines you should know about.

Engineering · April 25, 2026 · 7 min read

How to extract a table from a PDF using Python

pdfplumber, Camelot, Tabula, and the API-based fallback — what each library handles well, what it breaks on, and the code you actually need.

Tutorial · April 24, 2026 · 5 min read

How to extract images from a website (or a URL)

Browser tools, wget, gallery-dl, and the legal lines around scraping images from sites like Instagram, Pinterest, and stock photography. What's safe, what's gray, and what to skip.

Tutorial · April 23, 2026 · 4 min read

How to extract metadata from a video file

FFprobe, MediaInfo, and yt-dlp — three tools that cover every format from MP4 to MKV to a YouTube URL. What each one is best at, and what you can pull out.

Engineering · April 22, 2026 · 5 min read

The end of templates: how AI extraction actually works

Per-supplier templates were the only way to extract structured data from PDFs for two decades. Multimodal models change the shape of the problem.

Tutorial · April 22, 2026 · 4 min read

How to extract hyperlinks from Excel and Google Sheets

Excel hides hyperlinks behind display text — getting the actual URL out takes a HYPERLINK trick or a tiny VBA function. Google Sheets has its own quirks. Here's the full set.

Tutorial · April 21, 2026 · 4 min read

How to extract images from Google Docs and Google Slides

Three reliable ways to get every image out of a Google Doc or Slides deck — with or without losing resolution — including the publish-to-web trick that beats every other method.

Tutorial · April 20, 2026 · 5 min read

How to extract a signature from a PDF or an image

Whether you need to verify a signature exists, lift it as a transparent PNG for reuse, or pull every signed name as text — here are the right tools for each version of the question.

Tutorial · April 19, 2026 · 3 min read

How to extract images from a Word document

The fastest way to pull every image out of a .docx file at original resolution — using nothing more than the file extension trick that works in any unzip tool.

Tutorial · April 18, 2026 · 7 min read

How to extract numbers from a cell in Excel

Whether you need digits out of a product code, an order ID, or a free-text field, here are the formulas (old and new), the Power Query route, and what to do when the data isn't actually in Excel yet.

Tutorial · April 17, 2026 · 5 min read

How to extract metadata from a website

Title tags, Open Graph, Twitter cards, JSON-LD structured data — what every page exposes and how to pull it out cleanly for SEO audits, link previews, or content indexing.

Tutorial · April 16, 2026 · 4 min read

Built-in image-to-text features in Mac, OneNote, and Excel

Live Text on macOS, OneNote's Copy Text from Picture, and Excel's Data from Picture — the OCR features already on your machine that most people don't know exist.

Tutorial · April 15, 2026 · 4 min read

How to extract a table from a Word document

Native Word-to-Excel paste, python-docx for scripting, and what to do when the table is actually a screenshot inside the document.

Tutorial · April 14, 2026 · 4 min read

How to extract embedded files and attachments from a PDF

PDFs can carry attached files — Excel sheets, source data, supporting docs. Acrobat shows them; most other readers don't. Here's how to get them out, on any OS.

Tutorial · April 14, 2026 · 4 min read

How to extract zip and postal codes from addresses (Excel, Sheets, Python)

Three reliable patterns for pulling postal codes out of free-text addresses — Excel formulas for clean US data, regex for international, libpostal when nothing else works.

Tutorial · April 13, 2026 · 4 min read

How to extract video links from a YouTube playlist

Three ways to get every video URL from a YouTube playlist into a flat list — yt-dlp for scripts, the official API for production, and the browser console for one-offs.

Tutorial · April 12, 2026 · 5 min read

How to extract phone numbers from text

A regex that handles most cases, the libphonenumber library that handles the rest, and what to do when the phone numbers are trapped inside PDFs, screenshots, or messaging exports.

Engineering · April 10, 2026 · 6 min read

How to extract key-value pairs from documents

"Name: John Smith. Date: 2024-04-12. Total: $1,420." — every form, invoice, and structured PDF is full of key-value pairs. Here's how to pull them out reliably across formats.

Engineering · April 9, 2026 · 4 min read

How to extract files from a Docker image or container

docker cp from a running container, dive for digging into image layers, and the few tricks for getting files out of an image you can't run.

Engineering · April 9, 2026 · 4 min read

How to extract the host or domain from a URL

URL parsing in Python, JavaScript, Bash, and SQL — and the public-suffix-list trick that makes "co.uk" come out right.

Tutorial · April 8, 2026 · 5 min read

How to extract text from a YouTube video

Three reliable ways to turn a YouTube video into searchable, citable text — using the built-in transcript, yt-dlp + Whisper, or browser tools — and when each one is the right call.

Engineering · April 6, 2026 · 5 min read

How to extract city, state, and country from a location string

"San Francisco Bay Area," "Greater London," "NYC." Free-text location fields are messy. Here's how to parse them into clean city/state/country with libpostal, geopy, or an LLM.

Engineering · April 6, 2026 · 5 min read

How to extract schema from SQL Server, MongoDB, JSON, XML, or Parquet

INFORMATION_SCHEMA, sp_help, mongoexport, jsonschema, parquet-tools: the right command for every common data store, in two lines or fewer per format.

Tutorial · April 5, 2026 · 4 min read

How to extract formulas from an Excel file or PDF

Showing all formulas in a sheet, exporting them programmatically with openpyxl, and pulling math from a PDF where the formulas are rendered images, not LaTeX.

Workflow · April 4, 2026 · 5 min read

How to extract key information from emails

Sender, dates, attachments, action items, deal numbers — emails are unstructured text wrapping structured data. Here's how to pull the structure out reliably for CRM, support, or finance workflows.

Workflow · April 2, 2026 · 4 min read

How to extract a Gantt chart from MS Project

Get a Project file's tasks and timeline out into Excel, image, or a presentation-friendly format — without buying a copy of MS Project if you don't already have one.

Tutorial · April 2, 2026 · 4 min read

How to extract code from a video tutorial

Three workflows for getting the source code out of a programming tutorial video — from manual frame capture to a full transcript-plus-screenshot pipeline.

Workflow · March 30, 2026 · 5 min read

How to extract an organization chart from Microsoft Teams

Teams shows you reporting lines but doesn't give you a clean export. Here's how to get the org structure out via Microsoft Graph, the People app, or, as a fallback, screenshots.

Tutorial · March 28, 2026 · 5 min read

How to extract a summary from a PDF, video, or article

Three classes of summarization tools (extractive, abstractive, hybrid), how each one fails, and the practical setup for getting a useful 10-line summary out of a 90-minute talk or 300-page report.

Tutorial · March 26, 2026 · 6 min read

How to extract data from Amazon product pages

Title, price, ASIN, ratings, variations: Amazon makes scraping hard and its APIs easy to misuse. Here are the legitimate options, the gray ones, and the screenshot-based fallback.

Workflow · March 25, 2026 · 3 min read

Slack's auto-extract links setting (and what it actually does)

Slack quietly fetches link previews for every URL pasted in a channel. Here's how the setting works, why teams turn it off, and how to control it per-channel or per-message.

Workflow · March 22, 2026 · 5 min read

How to extract a chart of accounts from SAP

SAP exposes the chart of accounts through a few transactions and a few APIs — here's the practical map for finance teams who need a clean export, not a consultant's project.

Engineering · March 19, 2026 · 5 min read

How to extract facts from text

Named entities, claims, relations — what "fact extraction" actually means in NLP, the libraries that handle each piece, and how the LLM era changed which ones are worth using.

Workflow · March 14, 2026 · 4 min read

Closing the month without re-keying invoices

A practical workflow for accountants and bookkeepers: take a folder of supplier PDFs, get a clean Excel sheet, and reconcile in one sitting.

Engineering · March 12, 2026 · 7 min read

How to extract LinkedIn profile data using Python

LinkedIn is the most legally fraught scraping target on the internet. Here's what's allowed, what's gray, and the safest patterns for engineers who need profile or company data programmatically.

Engineering · March 8, 2026 · 7 min read

How to extract text from an image using Python

Tesseract via pytesseract, EasyOCR, PaddleOCR, and the API-based path: what each one is best at, what it breaks on, and the few lines of code to get started.

Industry · March 5, 2026 · 6 min read

What "unstructured data" actually means — and how to extract from it

Unstructured data isn't disorganized data — it's data without a schema. Here's the practical taxonomy, why it became extractable in the last few years, and the patterns that work.

Workflow · March 2, 2026 · 4 min read

How to extract questions and responses from a Google Form

Three ways to get Google Forms data out — the built-in Sheets export, the Forms API for automation, and the screenshot fallback for when you only have access to the published form.

Engineering · February 28, 2026 · 7 min read

How to extract data from a PDF in C#

A working engineer's tour of the PDF extraction libraries in the .NET ecosystem — iText, PdfPig, Azure Document Intelligence, and the API-first alternative when you don't want to ship a parser at all.

Workflow · February 15, 2026 · 5 min read

How to extract data and metadata from PDFs with Power Automate

The built-in actions for PDF text extraction, the AI Builder model for invoices and receipts, and how to wire either one into a flow that drops structured data into Excel or Dataverse.

Product · February 8, 2026 · 3 min read

Why we don't store your documents

A short note on how ExtractFox processes files in-flight, what we keep, and what we throw away — and why that's a deliberate product decision.

Try it on a real document

The free tier is genuine. Upload a PDF or image and see what comes out.
