Author profile

Dawid Sibinski

Founder and engineer at ExtractFox. Writes about AI document extraction, multimodal models, OCR vs IDP, and structured-output patterns.

About

I build ExtractFox, an AI-native data extraction tool that turns PDFs, images, and scanned documents into structured data — without per-vendor templates. Most of what I write is engineering notes from doing that, plus workflow guides for the people who use it.

For background reading, the glossary covers the terms (OCR, IDP, multimodal AI, MRZ, structured output) that keep coming up in the posts below. More about ExtractFox.

Posts (27)

Workflow · June 28, 2026 · 6 min read

How to extract data from an annual report

Four practical methods to pull financials, tables, and metrics from annual report and 10-K PDFs — from copy-paste to Python to AI-powered extraction.

Workflow · June 28, 2026 · 7 min read

How to extract data from a contract PDF

How to pull parties, dates, key clauses, and obligations from contract PDFs — Python, regex, CLM tools, and AI extraction for due diligence.

Workflow · June 28, 2026 · 5 min read

How to extract data from a receipt photo or PDF (3 methods)

Compare OCR apps, Python (easyocr, pytesseract), and AI to pull merchant, line items, tax, and totals from receipt photos — including crumpled thermal receipts.

Workflow · June 28, 2026 · 7 min read

How to extract data from a bank statement PDF (4 methods)

Compare copy-paste, Tabula, Camelot, pdfplumber, and AI to pull transactions and balances from bank statement PDFs — digital or scanned, any bank layout.

Workflow · June 28, 2026 · 5 min read

How to extract data from a purchase order PDF (3 methods)

Compare ERP export, Camelot/pdfplumber, and AI to pull PO number, buyer, supplier, line items, and delivery terms from any ERP-generated PO layout.

Workflow · June 28, 2026 · 5 min read

How to extract data from an ID document (3 methods)

Compare Tesseract OCR, MRZ parsing (TD1/TD2), and AI to pull identity fields from driver's licenses, national IDs, and residence permits — any country.

Workflow · June 28, 2026 · 6 min read

How to extract data from an invoice PDF or image (4 methods)

Compare copy-paste, Python (pdfplumber, invoice2data), template parsers, and AI to pull vendor, line items, tax, and totals from any invoice — digital PDF, scan, or photo.

Tutorial · June 28, 2026 · 7 min read

How to decode a QR code from an image or PDF

Decode QR codes from photos, screenshots, and PDFs using pyzbar, OpenCV, and AI. Covers URLs, Wi-Fi configs, vCards, payment requests, and multiple QR codes per image.

Tutorial · June 28, 2026 · 7 min read

How to extract a clean recipe from a website, video, or image

Pull ingredients and steps from recipe blogs, TikTok, YouTube, and Instagram using schema.org scraping, yt-dlp transcription, and AI extraction — without wading through life stories.

Tutorial · June 28, 2026 · 7 min read

How to extract and digitize handwritten text

Transcribe handwritten notes, letters, clinical notes, and historical documents using HTR tools, Tesseract with preprocessing, and AI vision models. Includes accuracy expectations by script type.

Tutorial · June 28, 2026 · 7 min read

How to extract data from a filled form (PDF, scan, or photo)

Pull field values from any filled form — AcroForm PDF fields, printed text, or handwriting — with pypdf, pdfminer, and AI extraction for mixed or handwritten forms.

Workflow · June 28, 2026 · 7 min read

How to extract data from an insurance policy PDF

Pull coverage limits, premiums, exclusions, and policy dates from any insurance policy PDF — declarations pages, endorsements, and full policies from any carrier.

Tutorial · June 28, 2026 · 6 min read

How to extract data from a chart or graph (image or PDF)

Reverse-engineer bar, line, pie, and scatter charts back into numbers using WebPlotDigitizer, Python, and AI extraction. Works on screenshots, report PDFs, and dashboard photos.

Tutorial · June 28, 2026 · 7 min read

How to convert a PDF to Excel

Turn PDF tables and structured data into clean Excel files. Covers Tabula, pdfplumber, pandas, and AI extraction for invoices, statements, and scanned PDFs — with real column headers and numeric values.

Tutorial · June 28, 2026 · 7 min read

How to convert a PDF to JSON

Turn PDFs into structured JSON for APIs, databases, and AI pipelines. Covers PyMuPDF for raw text, pdfplumber for tables, and AI extraction for any document type with schema output.

Tutorial · June 28, 2026 · 8 min read

How to extract data from Zillow listings

Three ways to get structured data from Zillow — the Zillow API (Bridge Interactive), Zillow's bulk listing exports, and AI extraction from listing PDFs and screenshots.

Tutorial · June 28, 2026 · 10 min read

How to extract a table from a PDF with Python

Three Python libraries for PDF table extraction — pdfplumber, Tabula-py, and Camelot — with code, when to use each, and how to handle scanned PDFs where text-based extraction fails.

Engineering · May 6, 2026 · 8 min read

Passport MRZ format explained: fields, check digits, and parser code

How to parse passport MRZ data under ICAO 9303: TD3 field positions, check digits, example MRZ lines, Python parser code, and when to extract the full passport instead.

Workflow · May 6, 2026 · 5 min read

How to bulk-clean LinkedIn 'Location' fields into city/country pairs

Recruiters and sales ops teams inherit LinkedIn exports full of "Greater London," "Bay Area," and "remote." A practical workflow to turn those into clean city/country/region columns, at any scale.

Engineering · May 6, 2026 · 6 min read

How to remove metadata from a PDF (for privacy)

Author, software, GPS, edit history — every PDF leaks more than you think. The reliable ways to strip metadata before sharing, in any tool you already have.

Tutorial · April 30, 2026 · 8 min read

Extract data from a pivot table in Excel

Extract data from an Excel PivotTable with Paste Values, GETPIVOTDATA, Show Details, Power Query, or screenshot-to-Excel extraction.

Tutorial · April 27, 2026 · 9 min read

Extract text from PowerPoint (.pptx): Outline View, python-pptx, speaker notes

Extract all text from a .pptx with Outline View, python-pptx (tables, groups, speaker notes), or OCR for image slides — copy-paste code for automation pipelines.

Tutorial · April 23, 2026 · 9 min read

Extract video metadata as JSON with FFprobe, MediaInfo, or yt-dlp

Copy-paste ffprobe, MediaInfo, and yt-dlp commands to extract duration, codecs, resolution, bitrate, chapters, and YouTube metadata as JSON — plus batch scripts for whole folders.

Tutorial · April 22, 2026 · 8 min read

Extract hyperlinks from Excel and Google Sheets (VBA, Apps Script, Python)

Copy-paste VBA, Office Scripts, Google Apps Script, and Python openpyxl to extract the real URL behind Excel and Sheets hyperlinks — including bulk export from .xlsx XML.

Tutorial · April 2, 2026 · 8 min read

How to extract code from a video tutorial

Extract source code from a programming tutorial video with clean screenshots, a code-aware extractor, or an FFmpeg frame pipeline for longer recordings.

Workflow · March 30, 2026 · 8 min read

How to extract an organization chart from Microsoft Teams

Export a Microsoft Teams organization chart to Excel or CSV with Microsoft Graph, Entra PowerShell, or screenshots when you do not have admin access.

Workflow · March 2, 2026 · 9 min read

Extract questions and responses from Google Forms (API, Sheets, Apps Script)

Export Google Forms responses to Excel/CSV, dump the full question schema as JSON via the Forms API, and automate with Apps Script or Python — including forms you don't own.