Tutorial · June 28, 2026 · 5 min read

How to extract data from a filled form (PDF, scan, or photo)

Pull field values from any filled form (AcroForm PDF fields, printed text, or handwriting) with pypdf, pdfplumber, and AI extraction for mixed or handwritten forms.

By Dawid Sibinski

Extracting data from a filled form starts with knowing what kind of form you're dealing with. A digital PDF with form fields (AcroForm) is the easiest case: the values are stored as field data inside the PDF file itself, not in the document metadata. A printed form that was filled out by hand is the hardest: it requires OCR that can cope with handwriting, then layout-aware parsing to match each answer to its label.

Case 1: Digital PDF with AcroForm fields

Many PDFs created from government forms, tax software, or document management systems are interactive AcroForms — they have fillable fields whose values are stored in the PDF file. pypdf reads these directly:

pip install pypdf

from pypdf import PdfReader

reader = PdfReader('filled_form.pdf')
# get_fields() returns {field_name: field_object}, or None when the PDF has no form fields
fields = reader.get_fields() or {}
for name, field in fields.items():
    print(name, field.value)

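get_fields() returns a flat dict mapping each field name to a field object; the stored value is available as its .value attribute. Checkbox fields typically return '/Off' when unchecked and an on-state name such as '/Yes' when checked (the on-state name is chosen by the form's author, so it can differ). Radio buttons return the selected option value. Signature fields return an object (not the signature content itself). This is the most accurate extraction path when it works: there's no reconstruction, just reading the stored values.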
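The raw values described above can be turned into plain Python values with a small helper. This is a minimal sketch, not part of pypdf: normalize_value and read_form are hypothetical names, and the flag bits are the radio and push-button bits of the field flags (/Ff) as defined in the PDF specification. The sketch assumes any checkbox value other than '/Off' means "checked".

```python
RADIO_FLAG = 1 << 15       # bit 16 of /Ff: the button field is a radio group
PUSHBUTTON_FLAG = 1 << 16  # bit 17 of /Ff: push button, which stores no value


def normalize_value(field_type, value, flags=0):
    """Convert a raw AcroForm value into a plain Python value (hypothetical helper)."""
    flags = flags or 0
    if field_type == "/Btn":
        if flags & PUSHBUTTON_FLAG:
            return None
        if flags & RADIO_FLAG:
            # Radio group: return the selected option name without the leading slash
            if value in (None, "/Off"):
                return None
            return str(value).lstrip("/")
        # Checkbox: assume anything other than /Off (or a missing value) means checked
        return value not in (None, "/Off")
    return None if value is None else str(value)


def read_form(path):
    """Read every field of a PDF into {name: normalized value}."""
    from pypdf import PdfReader  # imported here so the helper above runs without pypdf

    fields = PdfReader(path).get_fields() or {}
    return {
        name: normalize_value(f.field_type, f.value, f.flags)
        for name, f in fields.items()
    }
```

Calling read_form('filled_form.pdf') would then give booleans for checkboxes and plain strings for text fields and radio groups, which is easier to export than the raw PDF name objects.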

Gotcha: some PDFs flatten the form during creation or print-to-PDF — the visual appearance of checked boxes and filled text is preserved but the underlying field data is gone. get_fields() returns None or empty for flattened forms. Check whether the PDF has interactive fields before assuming AcroForm extraction will work.
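One way to make that check explicit is to decide on an extraction path before doing any real work. The sketch below is an assumption about how such a triage could look, not a pypdf feature: choose_extraction_path and triage are hypothetical names, and only the first page's text is inspected.

```python
def choose_extraction_path(field_values, page_text):
    """Pick a strategy from what the PDF actually contains (hypothetical helper).

    field_values: {field_name: value}, or None/{} when there are no form fields
    page_text:    text extracted from a page, '' or None when there is no text layer
    """
    if field_values and any(v not in (None, "") for v in field_values.values()):
        return "acroform"    # interactive fields with stored values
    if page_text and page_text.strip():
        return "text-layer"  # flattened or printed-to-PDF form with selectable text
    return "ocr"             # image-only scan or photo


def triage(path):
    """Open a PDF with pypdf and classify it."""
    from pypdf import PdfReader  # imported here so the helper above runs without pypdf

    reader = PdfReader(path)
    fields = reader.get_fields() or {}
    values = {name: f.value for name, f in fields.items()}
    text = reader.pages[0].extract_text() or ""
    return choose_extraction_path(values, text)
```

Note that a fillable form nobody has filled in yet has fields but no values, so this sketch routes it to the text-layer path; adjust the rule if blank forms need separate handling.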

Case 2: Printed form (PDF or scan), filled with typed text

A form that was filled in on a typewriter, printed from a typed-in Word document, or scanned after being printed and typed on has no AcroForm fields. The answers are baked into the visual layer. You need text extraction plus layout awareness to match answers to labels.

pdfplumber gives you word-level bounding boxes, which lets you find answer text by its position relative to label text:

import pdfplumber

with pdfplumber.open('form.pdf') as pdf:
    words = pdf.pages[0].extract_words()

# Each word is a dict with 'text', 'x0', 'x1', 'top', 'bottom' (top/bottom are measured from the top of the page)
# Labels are keyed by their last word: 'Date of Birth:' is split into three words, so its key is 'Birth:'
labels = {w['text']: w for w in words if w['text'].endswith(':')}
for label, lw in labels.items():
    # Collect words on the same line (within 5 points vertically) that start to the right of the label
    answer_words = [w['text'] for w in words if abs(w['top'] - lw['top']) < 5 and w['x0'] > lw['x1']]
    print(label, ' '.join(answer_words))

This works when the form was typed on a computer and the text layer is clean. It breaks on multi-column forms, forms where the answer field extends below the label, and any scan without an OCR text layer.
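When the form's labels are known in advance, the same-line matching can be made somewhat sturdier: group words into lines, find each multi-word label, and take the words up to the next label as the answer. The following is a sketch under that assumption. group_into_lines and extract_answers are hypothetical helpers, and the input is a list of word dicts in the shape pdfplumber's extract_words() produces ('text', 'x0', 'x1', 'top').

```python
def group_into_lines(words, y_tolerance=3.0):
    """Group word dicts into visual lines by their 'top' coordinate."""
    lines = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(lines[-1][0]["top"] - w["top"]) <= y_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    # Within each line, read left to right
    return [sorted(line, key=lambda w: w["x0"]) for line in lines]


def extract_answers(words, labels, y_tolerance=3.0):
    """Return {label: answer text} for known labels such as 'Date of Birth:'."""
    label_tokens = {label: label.split() for label in labels}
    answers = {}
    for line in group_into_lines(words, y_tolerance):
        texts = [w["text"] for w in line]
        # Locate every known label on this line as (start, end, label)
        hits = []
        for label, tokens in label_tokens.items():
            n = len(tokens)
            for i in range(len(texts) - n + 1):
                if texts[i:i + n] == tokens:
                    hits.append((i, i + n, label))
        hits.sort()
        # The answer runs from the end of a label to the next label (or end of line)
        for k, (start, end, label) in enumerate(hits):
            stop = hits[k + 1][0] if k + 1 < len(hits) else len(texts)
            answers[label] = " ".join(texts[end:stop])
    return answers
```

This handles two label and answer pairs on one line, but answers that sit below their label still need a different rule, for example searching the next line within the label's horizontal extent.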

Case 3: Handwritten forms (scanned or photographed)

Handwritten forms are the hardest case. Tesseract and standard OCR tools are built for printed text — accuracy on handwriting is typically 50-70%, dropping further on messy cursive or non-standard pen strokes. The result needs heavy cleanup even when it runs.

The right approach for handwritten forms is a vision model that reads the form the way a person does — it finds the field labels (even if printed), reads the handwritten answers in the appropriate cells, and pairs them up:

Tool

Form Data Extractor

Upload a filled form — scanned, photographed, or PDF — and get every field and its value as structured rows. Handles printed text and handwriting, checkboxes (checked/unchecked), and signature presence flags. Works on intake forms, applications, surveys, and government paperwork. Export as JSON or Excel.

Checkboxes and radio buttons

Checkboxes in AcroForm PDFs read cleanly with get_fields(). Checkboxes in printed/scanned forms are harder — you need to detect the presence of a check mark or X inside a box boundary. AI extraction handles this by identifying each checkbox in context and returning checked: true/false for each.
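For forms with a fixed template, a simpler non-AI heuristic is to measure how much ink sits inside each box. The sketch below illustrates that idea only; it is not how the AI extraction works. dark_ratio and is_checked are hypothetical names, the box coordinates are assumed to be known from the template, and the image is assumed to be already loaded as rows of 0-255 grayscale values (for example via an imaging library).

```python
def dark_ratio(gray, box, margin=2, dark_below=128):
    """Fraction of dark pixels inside a checkbox, ignoring its printed border.

    gray: 2D list of grayscale rows (0 = black, 255 = white)
    box:  (left, top, right, bottom) in pixels, right/bottom exclusive
    """
    left, top, right, bottom = box
    # Shrink the box so the printed outline itself is not counted as ink
    left, top, right, bottom = left + margin, top + margin, right - margin, bottom - margin
    total = dark = 0
    for y in range(top, bottom):
        for x in range(left, right):
            total += 1
            if gray[y][x] < dark_below:
                dark += 1
    return dark / total if total else 0.0


def is_checked(gray, box, threshold=0.1, margin=2, dark_below=128):
    """Treat the box as checked when enough of its interior is dark."""
    return dark_ratio(gray, box, margin, dark_below) >= threshold
```

The threshold and margin are guesses that need tuning per scanner and form; stray marks, skewed scans, and boxes that were filled in and then crossed out are the cases where this heuristic tends to fail.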

Multi-page forms

For multi-page scanned forms — a 10-page intake packet, a multi-section application — the same extraction runs across all pages. In the AcroForm case, pypdf's get_fields() collects fields from all pages in one call. For scanned forms, upload the full PDF rather than individual page images.

Table-formatted forms (inspection checklists, survey grids)

Some forms are structured as a grid: rows of items, columns for pass/fail/NA, a notes column at the end. These look like tables but aren't database tables — the row label is the inspection item, the column header is the answer option. AI extraction handles these as a list of records: { item, result, notes }.
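The record shape described above can be sketched as a small conversion step. This assumes the marks in each row have already been detected (by whatever extraction method is in use); grid_to_records and the input layout are hypothetical and only illustrate the { item, result, notes } output.

```python
def grid_to_records(options, rows):
    """Convert checklist rows into {item, result, notes} records.

    options: answer columns in order, e.g. ["Pass", "Fail", "NA"]
    rows:    dicts with 'item', 'marks' ({option: bool}) and optional 'notes'
    """
    records = []
    for row in rows:
        marked = [opt for opt in options if row.get("marks", {}).get(opt)]
        # Exactly one mark is a clean answer; none or several is left as None for review
        result = marked[0] if len(marked) == 1 else None
        records.append({
            "item": row["item"],
            "result": result,
            "notes": row.get("notes", ""),
        })
    return records
```

Leaving ambiguous rows as None rather than guessing keeps the uncertain cases visible for manual review.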

Frequently asked questions

How do I extract form field values from a PDF?

If the PDF has interactive fields (AcroForm), use pypdf's get_fields(), which returns all field names and values directly from the field data stored in the PDF. If the form was flattened or printed-to-PDF, the field data is gone and you need text extraction or AI extraction instead.

How do I extract data from a handwritten form?

Tesseract OCR has low accuracy on handwriting. Use AI extraction with a vision model that reads handwriting in context of the form's printed labels — it returns label-value pairs regardless of whether the answers are typed or handwritten.

How do I check if a PDF form has AcroForm fields?

Open the PDF in Acrobat and click a field — if it highlights and accepts input, it's an AcroForm. In Python, pypdf's get_fields() returns a populated dict for forms with fields and None for flat PDFs.

Can I extract data from a form that was filled out and scanned?

Yes. Scanned forms have no text layer, so pypdf and pdfplumber return nothing. AI extraction applies OCR and reads the printed labels and handwritten or typed-in values from the image, returning the same label-value output as for digital forms.