Extract data from any PDF

Drop any PDF — a contract, a form, a report, a statement — and pull the data you want as JSON, CSV, or Excel. No templates to set up, no per-document tuning.

PDF or image · up to 20 MB

Processed in-flight — never stored on our servers.

Why this matters

There's no single 'right' schema for an arbitrary PDF: a lease, a lab report, and a shipping manifest don't share a shape. ExtractFox skips the schema-first approach. Describe what you want in plain English, and the model finds those fields wherever they live in the document, whether or not the document fits an invoice, contract, or statement mold. Pick the output format (JSON, CSV, or Excel) after extraction, not before.

Two examples show where other approaches fall short. A procurement team gets 30-page vendor quotes where pricing tables span pages 8–12 and terms sit in prose on page 22; Tabula returns fragments that don't line up. Compliance auditors need every date and dollar amount from a decade of scanned filings, but regex over raw text can't tell whether '$4.2M' is revenue or a liability.
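
The regex limitation is easy to reproduce. The sketch below uses made-up sample text (not taken from any real filing) and a simple amount pattern; it finds both figures but has no way to say which one is revenue and which is a liability.

```python
import re

# Hypothetical raw text, roughly what a plain text dump of a filing might contain.
RAW_TEXT = (
    "Total revenue for the period was $4.2M. "
    "Accrued liabilities stood at $4.2M at year end."
)

# Matches amounts such as $4.2M, $310K, or $1,250.00.
AMOUNT = re.compile(r"\$\d[\d,]*(?:\.\d+)?[KMB]?")

# findall returns only the matched strings, with no surrounding context.
amounts = AMOUNT.findall(RAW_TEXT)
print(amounts)
```

Both matches are the same string, so any labeling has to come from reading the surrounding sentence, which is the part a pattern match alone does not do.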

How it works

Step 1
Upload the PDF
Native PDFs, scanned PDFs, image-based PDFs, multi-page — all fine.
Step 2
Choose how you want to extract
Use a prebuilt schema for common document types, or a free-text instruction for everything else.
Step 3
Get clean JSON or a spreadsheet
Inspect the result as a table, then export to JSON, CSV, or Excel.

Common use cases

Native PDFs — invoices, contracts, reports built from text

Scanned PDFs — receipts, statements, archived paperwork

Image-based PDFs — phone scans, faxed documents

Multi-page PDFs — long statements, annual reports, contracts

Form PDFs — government forms, applications, intake forms

Mixed PDFs — text pages plus scanned pages in one file

Vendor quote comparison — line items, totals, and payment terms from multi-page proposals

Due diligence — parties, dates, and financial figures from a data room of mixed PDF types

Sample output

Example: free-text extraction from an annual report PDF

Request: "pull total revenue, net income, and EPS for each year shown"

Result:
{
  "metrics_by_year": [
    { "year": 2024, "total_revenue": 12450000000, "net_income": 1820000000, "eps": 4.82 },
    { "year": 2025, "total_revenue": 13980000000, "net_income": 2104000000, "eps": 5.49 },
    { "year": 2026, "total_revenue": 15210000000, "net_income": 2387000000, "eps": 6.12 }
  ]
}
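
If you download the JSON and later want spreadsheet rows, a result shaped like the one above flattens easily. This is a minimal sketch using only Python's standard library; the helper name `metrics_to_csv` is illustrative and not part of ExtractFox.

```python
import csv
import io
import json

# The sample result shown above, as it might arrive in a JSON export.
RESULT_JSON = """
{
  "metrics_by_year": [
    { "year": 2024, "total_revenue": 12450000000, "net_income": 1820000000, "eps": 4.82 },
    { "year": 2025, "total_revenue": 13980000000, "net_income": 2104000000, "eps": 5.49 },
    { "year": 2026, "total_revenue": 15210000000, "net_income": 2387000000, "eps": 6.12 }
  ]
}
"""


def metrics_to_csv(result: dict) -> str:
    """Flatten the metrics_by_year array into CSV text, one row per year."""
    fieldnames = ["year", "total_revenue", "net_income", "eps"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(result["metrics_by_year"])
    return buffer.getvalue()


csv_text = metrics_to_csv(json.loads(RESULT_JSON))
print(csv_text)
```

The first line of `csv_text` is the header row, followed by one line per year in the same order as the JSON array.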

Frequently asked questions

How do I extract data from a PDF?

Upload the PDF on this page, pick a document type or describe what you want extracted, and click Extract. Download the result as JSON, CSV, or Excel.

What types of PDFs are supported?

Native PDFs (text-based), scanned PDFs (image-based), and form PDFs all work. Files can be up to 20 MB, and multi-page documents are supported.

Can I extract specific fields rather than the whole document?

Yes. In the description box below the document tiles, type exactly what you want — for example, 'just the total and the due date' — and ExtractFox will return only those fields.

How does this compare to traditional PDF extraction libraries like pdfplumber or Tabula?

pdfplumber and Tabula read a PDF's embedded text layer, so they work best on clean tables and predictable layouts and do not OCR scanned pages. ExtractFox interprets document structure semantically, so it also handles messy real-world PDFs, including scans, mixed layouts, and documents where the data isn't in a tidy grid.

Will extraction preserve the order of items in the original document?

Yes. Lists, tables, and ordered data come back in the order they appear in the PDF — top to bottom, left to right.

Can I extract data from password-protected PDFs?

No — remove the password before uploading. We don't store decrypted versions of your files.

How do I extract just the raw text from a PDF (not structured data)?

Use the PDF-to-text extractor. It runs on the same engine but is tuned for plain-text output, with Markdown, body-only, headings-only, and table-only modes, and it handles both digital and scanned PDFs in one pass.

Can I extract data from only specific pages of a long PDF?

Yes. In the description box, specify the page range — e.g., 'extract pricing table from pages 8–12'. The model returns only fields from those pages.

What if the PDF has both tables and free-form paragraphs I need in one schema?

Describe the full shape you want — e.g., 'header fields plus every table as rows plus the termination clause as text'. ExtractFox returns nested JSON with tables as arrays and prose as string fields in one object.
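
To make that shape concrete, here is a sketch of what such a result could look like and how it might be consumed. Every field name (`header`, `tables`, `rows`, `termination_clause`) and every value is hypothetical; the actual keys depend on how you word the request.

```python
import json

# Hypothetical result: all field names and values are invented for illustration.
RESULT_JSON = """
{
  "header": { "vendor": "Acme Supply", "quote_date": "2025-03-14" },
  "tables": [
    {
      "name": "pricing",
      "rows": [
        { "item": "Widget A", "quantity": 100, "unit_price": 2.50 },
        { "item": "Widget B", "quantity": 40, "unit_price": 7.25 }
      ]
    }
  ],
  "termination_clause": "Either party may terminate with 30 days written notice."
}
"""

result = json.loads(RESULT_JSON)

# Header fields are plain key/value pairs.
vendor = result["header"]["vendor"]

# Each table is an array of row objects, so a total is a simple sum over rows.
pricing_rows = result["tables"][0]["rows"]
quote_total = sum(row["quantity"] * row["unit_price"] for row in pricing_rows)

# Prose comes back as an ordinary string field in the same object.
clause = result["termination_clause"]

print(vendor, quote_total)
```

Keeping tables as arrays of row objects means the same result can feed both a spreadsheet export and code that only needs the prose fields.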