Engineering | April 10, 2026 | 6 min read

How to extract key-value pairs from documents

"Name: John Smith. Date: 2024-04-12. Total: $1,420." — every form, invoice, and structured PDF is full of key-value pairs. Here's how to pull them out reliably across formats.

By Dawid Sibinski

Most semi-structured documents are key-value documents. Forms, receipts, invoices, applications, contracts — all of them have a consistent pattern of label-then-value, with some prose in between. The extraction problem is making that pattern machine-readable across thousands of layouts that vary in detail.

Three approaches that actually work

1. Template-based (only when layouts are stable)

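Define x/y coordinates for each field, parse the page with PdfPig (.NET) or pypdf (Python), and return whatever text falls inside each region. This is brittle, since a layout shift of a couple of points breaks it, but it is unbeatable when you control the document template (your own forms, your own contracts).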
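A minimal sketch of the idea, assuming a PDF parser has already produced words with their positions. The `Word` and `Box` types, the `INVOICE_TEMPLATE` coordinates, and `extract_fields` are hypothetical names for illustration, not part of PdfPig or pypdf:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge, in PDF points
    y: float  # baseline, in PDF points


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word: Word) -> bool:
        return self.x0 <= word.x <= self.x1 and self.y0 <= word.y <= self.y1


# Hypothetical template: field name -> region where the value is printed.
INVOICE_TEMPLATE = {
    "name": Box(100, 700, 300, 715),
    "total": Box(400, 100, 500, 115),
}


def extract_fields(words, template):
    """Return {field: text}, joining the words inside each box left to right."""
    result = {}
    for name, box in template.items():
        hits = sorted((w for w in words if box.contains(w)), key=lambda w: w.x)
        result[name] = " ".join(w.text for w in hits) or None
    return result


# Made-up word positions standing in for real parser output.
words = [
    Word("Name:", 50, 705),
    Word("John", 110, 705),
    Word("Smith", 140, 705),
    Word("Total:", 350, 105),
    Word("$1,420", 410, 105),
]
fields = extract_fields(words, INVOICE_TEMPLATE)
# fields == {"name": "John Smith", "total": "$1,420"}
```

The brittleness is visible here: move "John" up to y=720 and the "name" field comes back as None, because the box no longer contains it. The sketch also only handles single-line values, since it sorts by x alone.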

2. Cloud Document AI

AWS Textract, Azure Document Intelligence, and Google Document AI all expose key-value extraction as a first-class operation. They return pairs with bounding boxes and confidence scores:

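```python
import boto3

client = boto3.client("textract")

# Synchronous call: accepts an image or a single-page PDF as raw bytes.
# Multi-page PDFs need the asynchronous start_document_analysis API.
with open("form.pdf", "rb") as f:
    res = client.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["FORMS"],
    )

# Keys and values both arrive as KEY_VALUE_SET blocks, each with
# a Confidence score and a Geometry (bounding box).
for block in res["Blocks"]:
    if block["BlockType"] == "KEY_VALUE_SET":
        print(block)
```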

Pricing is per page. Accuracy is good on standard layouts and lower on hand-filled forms.
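Printing raw blocks is only the first step. In a Textract response each key and each value is its own KEY_VALUE_SET block, linked by ID, and the text lives in child WORD blocks. The helper below is one way to stitch them into pairs. `key_value_pairs` and `_text_of` are illustrative names, and `sample` is a trimmed, hand-written stand-in for a real response:

```python
def _text_of(block, by_id):
    """Join the WORD (or checkbox) children of a KEY or VALUE block."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])  # SELECTED / NOT_SELECTED
    return " ".join(parts)


def key_value_pairs(response):
    """Return a list of (key_text, value_text, key_confidence) tuples."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = []
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue  # VALUE blocks are reached through their KEY
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_parts.extend(_text_of(by_id[i], by_id) for i in rel["Ids"])
        pairs.append(
            (_text_of(block, by_id), " ".join(value_parts), block.get("Confidence"))
        )
    return pairs


# Hand-written sample in the shape of a Textract response, not real output.
sample = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Confidence": 99.0,
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Confidence": 98.0,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "John"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Smith"},
]}
pairs = key_value_pairs(sample)
# pairs == [("Name:", "John Smith", 99.0)]
```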

3. Multimodal LLM with a JSON schema

Send the document and the target schema; get JSON back. The model handles layout variance because it reads the page semantically, not geometrically. ExtractFox's form data extractor and free-text mode both work this way — useful when the form layout varies across thousands of suppliers or when you want fields the cloud Document AI services don't natively recognize.
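The shape of that round trip can be sketched without committing to any vendor. In the code below, `call_model` is a hypothetical callable that takes a prompt and the document bytes and returns the model's text reply; it stands in for whichever multimodal API you use and is not ExtractFox's or any provider's actual interface:

```python
import json


def build_prompt(schema):
    """Turn a JSON Schema into extraction instructions for the model."""
    return (
        "Extract the fields described by this JSON Schema from the attached "
        "document. Reply with a single JSON object and nothing else. "
        "Use null for any field that is not present.\n\n"
        + json.dumps(schema, indent=2)
    )


def extract(document_bytes, schema, call_model):
    """call_model is hypothetical: (prompt: str, document: bytes) -> str."""
    reply = call_model(build_prompt(schema), document_bytes)
    data = json.loads(reply)  # raises ValueError on a non-JSON reply
    if not isinstance(data, dict):
        raise ValueError("model did not return a JSON object")
    # Keep only the declared fields; anything missing becomes None.
    return {name: data.get(name) for name in schema["properties"]}
```

Dropping undeclared keys and filling missing ones with None keeps the downstream record shape stable even when the model adds or omits fields.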

Schema design matters

How you describe the fields shapes the output quality. Vague schemas produce vague extractions. "name" returns the first name-shaped string on the page. "applicant_full_name" with type string and a description "Full legal name of the person filing the application" returns the right field even when the document also has a witness name and an emergency contact name.

Useful schema patterns:

Use enums for known categorical fields (status, country, document_type).
Make optional fields explicitly nullable rather than forcing the model to invent values.
Add format hints in descriptions: "ISO 8601 date," "E.164 phone number," "three-letter currency code."
Pass example outputs in the prompt for ambiguous fields.
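As an illustration, the patterns above might combine into a schema like the following. The field names and enum values are made up for the example, and some model APIs accept only a subset of JSON Schema, so treat the exact keywords as an assumption to check against your provider:

```python
import json

APPLICATION_SCHEMA = {
    "type": "object",
    "properties": {
        # Specific name plus description, instead of a bare "name".
        "applicant_full_name": {
            "type": "string",
            "description": "Full legal name of the person filing the application",
        },
        # Enum for a known categorical field.
        "document_type": {
            "type": "string",
            "enum": ["application", "invoice", "receipt", "contract"],
        },
        # Optional fields are explicitly nullable, with a format hint.
        "filing_date": {
            "type": ["string", "null"],
            "description": "ISO 8601 date, e.g. 2024-04-12; null if absent",
        },
        "phone": {
            "type": ["string", "null"],
            "description": "E.164 phone number, e.g. +14155550123; null if absent",
        },
        "currency": {
            "type": ["string", "null"],
            "description": "Three-letter currency code, e.g. USD; null if absent",
        },
    },
    "required": ["applicant_full_name", "document_type"],
    "additionalProperties": False,
}

# An example output to include in the prompt for ambiguous fields.
EXAMPLE_OUTPUT = {
    "applicant_full_name": "John Smith",
    "document_type": "application",
    "filing_date": "2024-04-12",
    "phone": None,
    "currency": "USD",
}

schema_text = json.dumps(APPLICATION_SCHEMA, indent=2)
```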

Validation and confidence

Whatever extractor you use, validate the output against domain rules before storing it: check that dates parse, that totals match line-item sums, and that currency codes are valid. The cheapest accuracy gain is usually post-extraction validation, not a better model.
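A minimal sketch of such checks, assuming the extractor returns a dict with `date`, `currency`, `total`, and `line_items` keys (amounts as strings). Those key names and the short currency list are assumptions for the example; a real pipeline would use the full ISO 4217 list:

```python
from datetime import date
from decimal import Decimal

KNOWN_CURRENCIES = {"USD", "EUR", "GBP", "PLN"}  # illustrative subset only


def validate(record):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []

    try:
        date.fromisoformat(record["date"])
    except (KeyError, TypeError, ValueError):
        errors.append("date is not a valid ISO 8601 date")

    if record.get("currency") not in KNOWN_CURRENCIES:
        errors.append("currency is not a known code")

    try:
        total = Decimal(record["total"])
        line_sum = sum((Decimal(x) for x in record["line_items"]), Decimal("0"))
        if total != line_sum:
            errors.append(f"total {total} != line-item sum {line_sum}")
    except (KeyError, TypeError, ArithmeticError):
        # decimal.InvalidOperation is a subclass of ArithmeticError
        errors.append("total or line items are not numeric")

    return errors


good = {"date": "2024-04-12", "currency": "USD",
        "total": "1420.00", "line_items": ["1000.00", "420.00"]}
bad = {"date": "12/04/2024", "currency": "US$",
       "total": "1420.00", "line_items": ["1000.00", "400.00"]}

good_errors = validate(good)  # []
bad_errors = validate(bad)    # three problems: date, currency, total
```

Decimal is used instead of float so that sums of currency amounts compare exactly.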

When templates beat AI

Government forms with strict layouts. High-volume single-supplier invoices. Anything where you control the source. As long as the layout matches, templates are deterministic, nearly free to run, and audit-friendly. Use AI extraction for the long tail; use templates for the high-volume head where you can afford to set them up.
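One way to wire the two together is a small router that prefers a template when one exists for the supplier. Everything here is hypothetical scaffolding: `template_extract` and `ai_extract` stand for whichever extractors you have, and falling back to AI when a template leaves a field empty is a design choice of this sketch, not something the approaches above require:

```python
def extract_document(supplier_id, document, templates, template_extract, ai_extract):
    """Return (method, fields), using a template for known suppliers."""
    template = templates.get(supplier_id)
    if template is not None:
        fields = template_extract(document, template)
        # An empty field suggests the layout drifted; let the AI path try.
        if all(value is not None for value in fields.values()):
            return "template", fields
    return "ai", ai_extract(document)
```

Returning the method alongside the fields keeps the audit trail clear about which path produced each record.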