20 terms

Document extraction glossary

Plain-English definitions for the terms you'll meet doing AI document extraction — OCR, IDP, multimodal models, MRZ, line items, structured output, and the rest.

Category

AI & OCR

IDP (Intelligent Document Processing)

Also known as: intelligent document processing, document AI

An umbrella term for software that turns documents into structured data using AI.

IDP describes any system that combines OCR, layout understanding, classification, and field extraction to produce structured records from documents. The category historically covered template-driven tools like Docparser, Rossum, and Nanonets — and now also includes pure-LLM tools like ExtractFox that skip templates entirely. The win from modern IDP isn't accuracy on a single document; it's eliminating per-vendor template maintenance.

Multimodal AI

An AI model that accepts images, text, and other input types in one pass.

Multimodal models — Google Gemini, GPT-4o, Claude — read pixels and text together. For document extraction, that means the model sees the document as a person sees it: layout, tables, and visual cues are part of the input, not stripped away by an OCR pre-step. Multimodal AI is what makes template-free extraction work; it's the engine inside tools like ExtractFox.

OCR (Optical Character Recognition)

Also known as: optical character recognition, text recognition

Software that converts an image of text into machine-readable characters.

OCR is the classical technique for turning a scan or photo of text into editable strings. It works well on clean, printed pages and predictable layouts, but breaks on rotated, photographed, or noisy documents and on anything that requires interpretation — like grouping line items into a table or distinguishing a total from a subtotal. Modern AI document extraction (like ExtractFox) skips per-character recognition and reads documents end-to-end with a multimodal model, which handles layout variation natively.

Category

Extraction concepts

Free-text extraction

Describing what you want in plain English and letting the model figure out the structure.

Instead of picking a prebuilt template, you write a prompt — "line items where amount > $500", "every transaction in February with a counterparty" — and the system infers a schema and extracts against it. Free-text extraction is the reason a single tool can replace dozens of per-vendor parsers: any new document type is a prompt away.

Schema

A definition of the shape and types of expected output data.

In document extraction, a schema lists the fields you want and their types: vendor (string), date (ISO date), line_items (array of {description, quantity, price}). Per-document-type schemas (invoice, passport, contract) fix the output shape in advance, so every document of a given type comes back with the same fields. Free-text extraction generates a schema on the fly from your prompt.
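
As a rough illustration, the invoice schema above could be written as Python dataclasses. This is only a sketch: the class names and the `invoice_from_dict` helper are hypothetical and not part of any particular tool's API.

```python
import datetime as dt
from dataclasses import dataclass


@dataclass
class LineItem:
    description: str
    quantity: float
    price: float


@dataclass
class Invoice:
    vendor: str
    date: dt.date              # parsed from an ISO date string such as "2024-02-29"
    line_items: list[LineItem]


def invoice_from_dict(raw: dict) -> Invoice:
    """Build a typed Invoice from a plain dict; raises on missing or malformed fields."""
    return Invoice(
        vendor=str(raw["vendor"]),
        date=dt.date.fromisoformat(raw["date"]),
        line_items=[
            LineItem(
                description=str(item["description"]),
                quantity=float(item["quantity"]),
                price=float(item["price"]),
            )
            for item in raw["line_items"]
        ],
    )
```

Anything that doesn't fit the shape (a missing vendor, a date that isn't ISO-formatted) fails loudly at this step instead of leaking into a downstream system.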

Structured data

Data organized into fixed fields and types (e.g. rows in a table, keys in JSON).

Structured data has a known shape: an invoice has a vendor, a date, line items, and a total. Extracting structured data from documents means producing output that matches a predefined schema, so it can be imported into another system without manual cleanup. The opposite is unstructured content (paragraphs, scans, transcripts), which downstream tools can't operate on directly.

Structured output

Also known as: typed output, schema-constrained output

Forcing an LLM to return data that conforms to a defined JSON schema.

Modern LLM APIs accept a schema (a JSON Schema, often generated from a Zod object) and constrain their output to match it. This is how reliable extraction works at scale: the model isn't free-form summarizing; it's filling in a typed object with named fields. ExtractFox uses structured output to guarantee the result is always parseable, with no post-processing.
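
To make "conforms to a schema" concrete, here is a minimal sketch using only Python's standard library. The `SCHEMA` dict and the `conforms` function are hypothetical simplifications: real APIs enforce a full JSON Schema during generation, while this check only covers a flat object with string and number fields.

```python
import json

# Hypothetical, simplified schema: field name -> required Python type.
SCHEMA = {"vendor": str, "invoice_number": str, "total": float}


def conforms(reply_text: str, schema: dict) -> bool:
    """Return True if reply_text is a JSON object with exactly the schema's fields and types."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != set(schema):
        return False
    for name, expected in schema.items():
        value = data[name]
        # JSON has a single number type, so accept ints where a float is expected
        # (but not booleans, which Python treats as a subclass of int).
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            continue
        if not isinstance(value, expected):
            return False
    return True
```

A free-form reply such as "The total is 120" fails at the parsing step, and a reply with a missing field or a string where a number belongs fails the field check.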

Category

Document types

Accounts payable (AP)

The function that processes supplier invoices and pays them.

AP teams receive invoices in dozens of layouts, code them to GL accounts, route for approval, and pay. Document extraction collapses the data-entry step — vendor, invoice number, amount, due date all parsed in one upload — letting the team focus on coding and approvals instead of typing.

Bank statement parsing

Converting a PDF bank statement into a clean transaction table.

Statement formats vary across banks — column orders, date formats, balance/credit/debit conventions. Reliable parsing extracts every transaction (date, description, amount, running balance), opening and closing balances, and account metadata, normalizing them into a single schema regardless of source bank.
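
A minimal sketch of that normalization step, assuming two hypothetical source layouts: one bank reports a single signed "amount" column, another splits "debit" and "credit". The field names and the `normalize_row` helper are illustrative only.

```python
from datetime import datetime
from decimal import Decimal


def normalize_row(row: dict, date_format: str) -> dict:
    """Map one raw statement row onto a common schema.

    Handles two hypothetical layouts: a single signed "amount" column,
    or separate "debit" and "credit" columns.
    """
    if "amount" in row:
        amount = Decimal(row["amount"])
    else:
        credit = Decimal(row.get("credit") or "0")
        debit = Decimal(row.get("debit") or "0")
        amount = credit - debit  # money out becomes negative
    return {
        "date": datetime.strptime(row["date"], date_format).date().isoformat(),
        "description": row["description"].strip(),
        "amount": amount,
    }
```

Whatever the source bank, each row comes out with an ISO date, a trimmed description, and one signed amount.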

Invoice extraction

Pulling vendor, line items, totals, and tax from a supplier invoice.

Invoice extraction is the canonical IDP use case because invoices vary wildly by vendor but share a fixed conceptual shape: a vendor sells line items, the line items sum to a subtotal, tax is applied, and the resulting total is due by a date. Modern extractors (ExtractFox, Klippa, Rossum) handle layout variation natively; older tools require per-vendor templates.

KYC (Know Your Customer)

Regulatory verification of customer identity, often using passport or ID document extraction.

Financial-services and crypto onboarding requires verified identity data — name, DOB, document number, expiry. Passport extraction feeds the KYC pipeline by turning a phone-photo upload into the structured fields a verification system can match against sanctions lists and existing customer records.

Line items

Individual rows on an invoice, receipt, or order — one per product or service.

Each line item typically has a description, quantity, unit price, and amount. Extracting line items reliably is harder than extracting top-level fields like vendor or total because they live in tables that span pages, wrap rows, or include subtotals — and because the schema must accommodate variable counts.
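
One common way to catch wrapped rows or a stray subtotal that slipped in as a line item is to check the arithmetic after extraction. The sketch below is an illustration under assumed field names (`quantity`, `unit_price`, `amount`); `check_line_items` is a hypothetical helper, not a feature of any specific extractor.

```python
from decimal import Decimal


def check_line_items(
    items: list[dict],
    subtotal: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> list[str]:
    """Return human-readable problems; an empty list means the arithmetic adds up."""
    problems = []
    for index, item in enumerate(items, start=1):
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["amount"]) > tolerance:
            problems.append(
                f"line {index}: quantity x unit price is {expected}, amount is {item['amount']}"
            )
    total = sum((item["amount"] for item in items), Decimal("0"))
    if abs(total - subtotal) > tolerance:
        problems.append(f"line items sum to {total}, subtotal is {subtotal}")
    return problems
```

The function accepts any number of rows, which mirrors the point above: the schema has to treat line items as a variable-length array.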

MRZ (Machine-Readable Zone)

The two-line code at the bottom of a passport's data page (three lines on some ID cards), encoded for automated reading.

The MRZ encodes name, passport number, nationality, date of birth, sex, expiry date, and several check digits in the fixed-width OCR-B font. Border-control kiosks read it. So does ExtractFox's passport extractor, which pulls MRZ values out of a phone photo or scan and exposes the parsed fields as structured data.
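
MRZ check digits follow the scheme defined in ICAO Doc 9303: each character is mapped to a number (digits keep their value, A to Z become 10 to 35, the filler `<` is 0), multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. A minimal sketch follows; the function name `mrz_check_digit` is illustrative, and nothing here describes how any particular extractor implements it.

```python
def mrz_check_digit(field: str) -> int:
    """Compute the ICAO 9303 check digit for one MRZ field."""
    weights = (7, 3, 1)
    total = 0
    for position, char in enumerate(field):
        if "0" <= char <= "9":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10  # A=10 ... Z=35
        elif char == "<":
            value = 0  # filler character
        else:
            raise ValueError(f"invalid MRZ character: {char!r}")
        total += value * weights[position % 3]
    return total % 10
```

For example, a date-of-birth field of `740812` gives 49 + 12 + 0 + 56 + 3 + 2 = 122, so the check digit is 2. If a value read from a photo doesn't reproduce its check digit, at least one character was misread.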

Passport extraction

Pulling identity data — name, passport number, MRZ, dates — from a passport photo or scan.

Passport extraction normalizes data across hundreds of country-specific layouts. Reliable extractors read both the visual zone (which varies by country) and the MRZ (which doesn't), then reconcile them. Common downstream uses: KYC, travel booking, HR onboarding for international hires.

PDF data extraction

Pulling structured fields out of a PDF, regardless of source layout.

PDFs come in two flavors: text-based (where text is selectable) and image-based (a scan saved as a PDF). Modern extractors handle both — a multimodal model reads pixels regardless of the PDF's internal structure, so the same extractor works on a Word-export invoice and a phone photo of the same invoice.

Reconciliation

Matching transactions across systems — typically bank statements against the general ledger.

Reconciliation is the close-process bottleneck for most accounting teams. Extracting bank statements into a clean transaction table — date, description, amount — turns reconciliation from manual data entry plus matching into matching alone.
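
Once both sides are clean tables, the matching step itself can be quite small. Here is a deliberately simple sketch that pairs entries on exact date and amount; the field names and the `reconcile` function are assumptions for illustration, and real reconciliation usually also needs fuzzy date windows and description matching.

```python
from collections import defaultdict
from decimal import Decimal


def reconcile(bank: list[dict], ledger: list[dict]):
    """Pair bank and ledger entries that share a date and amount.

    Returns (matches, unmatched_bank, unmatched_ledger). Each ledger entry
    is used at most once, so duplicate payments are not double-matched.
    """
    ledger_by_key = defaultdict(list)
    for entry in ledger:
        ledger_by_key[(entry["date"], entry["amount"])].append(entry)

    matches, unmatched_bank = [], []
    for txn in bank:
        candidates = ledger_by_key[(txn["date"], txn["amount"])]
        if candidates:
            matches.append((txn, candidates.pop(0)))
        else:
            unmatched_bank.append(txn)

    unmatched_ledger = [e for entries in ledger_by_key.values() for e in entries]
    return matches, unmatched_bank, unmatched_ledger
```

The two "unmatched" lists are what a person then reviews: transactions the ledger doesn't know about, and ledger entries that never hit the bank.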

Category

Output formats

CSV

Comma-separated values — a flat-table text format that opens directly in Excel and Google Sheets.

CSV is the right output when you have a single table (every line item, every transaction). It's the wrong output when you have nested data (an invoice with header fields and line items). For nested data, JSON or Excel with multiple sheets is better.
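
To see why nesting is awkward in CSV, here is a small sketch that flattens an invoice into one row per line item using Python's standard `csv` module. The field names and the `invoice_to_csv` helper are hypothetical.

```python
import csv
import io


def invoice_to_csv(invoice: dict) -> str:
    """Flatten a nested invoice into one CSV row per line item.

    Header fields (vendor, invoice_number) are repeated on every row,
    which is the usual price of squeezing nested data into a flat table.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["vendor", "invoice_number", "description", "quantity", "amount"])
    for item in invoice["line_items"]:
        writer.writerow([
            invoice["vendor"],
            invoice["invoice_number"],
            item["description"],
            item["quantity"],
            item["amount"],
        ])
    return buffer.getvalue()
```

Every header field is duplicated on each row, and an invoice with no line items produces no data rows at all. That is the kind of loss JSON or a multi-sheet workbook avoids.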

JSON

JavaScript Object Notation — a text format for structured data, used as the default output of modern document extractors.

JSON is the lingua franca of structured output. Most modern extractors return JSON because nearly every modern system can ingest it. ExtractFox returns JSON natively and offers Excel/CSV as derived views.

PDF to Excel

Converting a PDF — invoice, statement, report — into an Excel spreadsheet.

The phrase covers two very different operations: copying a single table out of a PDF (where tools like Tabula work), and extracting structured fields from a non-table document like an invoice (where AI extraction wins). ExtractFox handles both — table-shaped PDFs become sheet rows; structured documents become labeled fields.
