How to extract data from a filled form (PDF, scan, or photo)
Pull field values from any filled form, whether the answers live in AcroForm PDF fields, printed text, or handwriting, using pypdf, pdfplumber, and AI extraction for mixed or handwritten forms.
Extracting data from a filled form requires knowing what kind of form you're dealing with. A digital PDF with form fields (AcroForm) is the easiest case: the values are stored in the PDF's form field objects and can be read directly. A printed form that was filled out by hand is the hardest: it requires OCR and then layout-aware parsing to match each answer to its label.
Case 1: Digital PDF with AcroForm fields
Many PDFs created from government forms, tax software, or document management systems are interactive AcroForms — they have fillable fields whose values are stored in the PDF file. pypdf reads these directly:
    # Install first: pip install pypdf
    from pypdf import PdfReader

    reader = PdfReader('filled_form.pdf')
    # get_fields() returns {field_name: field_object}, or None if the PDF has no form fields
    fields = reader.get_fields() or {}
    for name, field in fields.items():
        print(name, field.value)
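get_fields() returns a flat dict that maps each field name to a field object; the stored value is on the object's value attribute. Checkbox fields typically return '/Yes' or '/Off', though the name of the checked state is chosen by the form's author and can differ. Radio buttons return the selected option value. Signature fields return an object, not the signature content itself. This is the most accurate extraction path when it works: nothing is reconstructed, the stored values are simply read.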
Gotcha: some PDFs flatten the form during creation or print-to-PDF — the visual appearance of checked boxes and filled text is preserved but the underlying field data is gone. get_fields() returns None or empty for flattened forms. Check whether the PDF has interactive fields before assuming AcroForm extraction will work.
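A minimal sketch of that check, combined with a pass that turns field objects into plain values, might look like the following. The helper names simplify_fields and read_form_values are made up for this example, and the code assumes pypdf's field objects can be read like dicts keyed by PDF names ('/V' for the value, '/FT' for the field type). Treating any button state other than '/Off' as "checked" is also an assumption.

```python
from typing import Any, Mapping, Optional


def simplify_fields(fields: Optional[Mapping[str, Mapping[str, Any]]]) -> dict:
    """Turn the result of get_fields() into {field_name: plain value}.

    Returns {} when fields is None or empty, which is what a flattened
    (non-interactive) PDF is expected to produce.
    """
    simple = {}
    for name, field in (fields or {}).items():
        value = field.get("/V")       # stored value, if any
        field_type = field.get("/FT")  # '/Tx' text, '/Btn' checkbox or radio, ...
        if field_type == "/Btn":
            # Assumption: a missing state or '/Off' means unchecked. Any other
            # state name ('/Yes', '/On', a radio option) is kept as-is.
            simple[name] = False if value in (None, "/Off") else str(value)
        else:
            simple[name] = None if value is None else str(value)
    return simple


def read_form_values(path: str) -> dict:
    """Read a PDF and return simplified field values ({} if it has no fields)."""
    from pypdf import PdfReader  # imported here so simplify_fields has no dependency

    return simplify_fields(PdfReader(path).get_fields())
```

An empty result from read_form_values('filled_form.pdf') is the signal to fall back to text extraction or AI extraction.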
Case 2: Printed form (PDF or scan), filled with typed text
A form that was filled in on a typewriter, printed from a typed-in Word document, or scanned after being typed on has no AcroForm fields. The answers are baked into the visual layer. You need text extraction plus layout awareness to match answers to labels.
pdfplumber gives you word-level bounding boxes, which lets you find answer text by its position relative to label text:
    import pdfplumber

    with pdfplumber.open('form.pdf') as pdf:
        words = pdf.pages[0].extract_words()

    # Each word is a dict with 'text', 'x0', 'x1', 'top', 'bottom' ('top' is measured from the top of the page)
    # A label like 'Date of Birth:' arrives as separate words, so key on the word that ends with ':'
    labels = [w for w in words if w['text'].endswith(':')]
    for lw in labels:
        # Collect words on the same line (tops within 5 points) that start to the right of the label
        same_line = [w for w in words if abs(w['top'] - lw['top']) < 5 and w['x0'] > lw['x1']]
        answer = ' '.join(w['text'] for w in sorted(same_line, key=lambda w: w['x0']))
        print(lw['text'], answer)
This works when the form was typed on a computer and the text layer is clean. It breaks on multi-column forms, forms where the answer field extends below the label, and any scan (no text layer).
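For forms with multi-word labels, it can help to group words into lines first and then split each line at the colon. The sketch below works on the same word dicts that extract_words() produces and needs only the 'text', 'x0', and 'top' keys. The function names, the 3-point line tolerance, and the assumption of one label per line are illustrative choices, not pdfplumber behavior.

```python
from typing import Dict, List


def group_into_lines(words: List[dict], tolerance: float = 3.0) -> List[List[dict]]:
    """Group word dicts into visual lines using their 'top' coordinate."""
    lines: List[List[dict]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Same line if this word's top is close to the top of the line's first word
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Restore left-to-right reading order within each line
    return [sorted(line, key=lambda w: w["x0"]) for line in lines]


def pair_labels_with_answers(words: List[dict]) -> Dict[str, str]:
    """Assumes one label per line: text up to the first ':' is the label, the rest is the answer."""
    pairs: Dict[str, str] = {}
    for line in group_into_lines(words):
        texts = [w["text"] for w in line]
        # Index of the word that closes the label, e.g. 'Birth:' in 'Date of Birth:'
        colon_at = next((i for i, t in enumerate(texts) if t.endswith(":")), None)
        if colon_at is None:
            continue  # no label on this line
        label = " ".join(texts[: colon_at + 1]).rstrip(":")
        pairs[label] = " ".join(texts[colon_at + 1:])
    return pairs
```

With pdfplumber, the input would be pdf.pages[0].extract_words(). Lines that hold two label and answer pairs side by side still need column detection, which this sketch does not attempt.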
Case 3: Handwritten forms (scanned or photographed)
Handwritten forms are the hardest case. Tesseract and standard OCR tools are built for printed text — accuracy on handwriting is typically 50-70%, dropping further on messy cursive or non-standard pen strokes. The result needs heavy cleanup even when it runs.
The right approach for handwritten forms is a vision model that reads the form the way a person does: it finds the field labels (which are usually printed), reads the handwritten answers in the appropriate cells, and pairs them up.
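As a rough sketch of how that pairing could be wired up: send the page image with a prompt that asks for label and value pairs as JSON, then parse the reply defensively. The prompt wording, the JSON shape, and the ask_model callable are all hypothetical here. ask_model stands in for whichever vision model client you use and is not a real library function.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical prompt; adjust the wording to the model you use
PROMPT = (
    "Read this filled form. Return only a JSON array of objects with the keys "
    '"label" and "value", one per field. Use null for fields left blank.'
)


def parse_pairs(reply: str) -> Dict[str, Optional[str]]:
    """Parse a model reply into {label: value}, skipping malformed entries."""
    # Models often wrap JSON in extra text, so cut out the outermost array
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        return {}
    try:
        items = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return {}
    pairs: Dict[str, Optional[str]] = {}
    for item in items:
        if isinstance(item, dict) and isinstance(item.get("label"), str):
            value = item.get("value")
            pairs[item["label"].strip()] = None if value is None else str(value).strip()
    return pairs


def extract_handwritten_form(
    image_bytes: bytes, ask_model: Callable[[bytes, str], str]
) -> Dict[str, Optional[str]]:
    """ask_model is a hypothetical callable: (image bytes, prompt) -> reply text."""
    return parse_pairs(ask_model(image_bytes, PROMPT))
```

Keeping the model call behind a plain callable makes the parsing testable without any network access, and makes it easy to swap providers.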
Checkboxes and radio buttons
Checkboxes in AcroForm PDFs read cleanly with get_fields(). Checkboxes in printed/scanned forms are harder — you need to detect the presence of a check mark or X inside a box boundary. AI extraction handles this by identifying each checkbox in context and returning checked: true/false for each.
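If you already know where each box sits on the scanned page, a simple non-AI baseline is to measure how much ink falls inside it. The sketch below assumes the image is a grayscale grid of 0-255 values (rows of pixels) and that box coordinates are known. The margin, darkness cutoff, and 8% threshold are illustrative guesses that would need tuning on real scans.

```python
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixel coordinates


def dark_ratio(
    image: Sequence[Sequence[int]], box: Box, margin: int = 2, dark_below: int = 128
) -> float:
    """Fraction of dark pixels inside the box.

    A margin is skipped on every side so the printed border of the
    checkbox itself is not counted as ink.
    """
    left, top, right, bottom = box
    total = dark = 0
    for row in image[top + margin: bottom - margin]:
        for pixel in row[left + margin: right - margin]:
            total += 1
            dark += pixel < dark_below  # True counts as 1
    return dark / total if total else 0.0


def is_checked(image: Sequence[Sequence[int]], box: Box, min_ratio: float = 0.08) -> bool:
    """Assumption: a tick or X covers at least min_ratio of the box interior."""
    return dark_ratio(image, box) >= min_ratio
```

This kind of heuristic is sensitive to skew, stray marks, and boxes filled with a dot rather than a tick, which is where context-aware extraction tends to do better.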
Multi-page forms
For multi-page scanned forms — a 10-page intake packet, a multi-section application — the same extraction runs across all pages. In the AcroForm case, pypdf's get_fields() collects fields from all pages in one call. For scanned forms, upload the full PDF rather than individual page images.
Table-formatted forms (inspection checklists, survey grids)
Some forms are structured as a grid: rows of items, columns for pass/fail/NA, a notes column at the end. These look like tables but aren't database tables — the row label is the inspection item, the column header is the answer option. AI extraction handles these as a list of records: { item, result, notes }.
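To make the { item, result, notes } shape concrete, here is a small sketch that assigns each row's mark to the nearest answer column by horizontal position. The input format (a 'mark_x' coordinate per row and a dict of column centers) is an assumption for illustration; in practice those positions would come from whatever detection step found the marks and headers.

```python
from typing import Dict, List


def nearest_column(x: float, columns: Dict[str, float]) -> str:
    """Name of the column whose center is closest to x."""
    return min(columns, key=lambda name: abs(columns[name] - x))


def grid_to_records(rows: List[dict], columns: Dict[str, float]) -> List[dict]:
    """Convert grid rows into records.

    rows:    [{'item': str, 'mark_x': float or None, 'notes': str}, ...]
    columns: {'Pass': x_center, 'Fail': x_center, ...}
    """
    records = []
    for row in rows:
        mark_x = row.get("mark_x")
        records.append({
            "item": row["item"],
            # No mark on the row means no answer was given
            "result": nearest_column(mark_x, columns) if mark_x is not None else None,
            "notes": row.get("notes", ""),
        })
    return records
```

For example, with columns {'Pass': 300, 'Fail': 350, 'N/A': 400}, a row whose mark sits at x = 348 comes out with result 'Fail'.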
Frequently asked questions
How do I extract form field values from a PDF?
If the PDF has interactive fields (AcroForm), use pypdf's get_fields(), which returns all field names and values directly from the form data stored in the PDF. If the form was flattened or printed to PDF, the field data is gone and you need text extraction or AI extraction instead.
How do I extract data from a handwritten form?
Tesseract OCR has low accuracy on handwriting. Use AI extraction with a vision model that reads handwriting in context of the form's printed labels — it returns label-value pairs regardless of whether the answers are typed or handwritten.
How do I check if a PDF form has AcroForm fields?
Open the PDF in Acrobat and click a field — if it highlights and accepts input, it's an AcroForm. In Python, pypdf's get_fields() returns a populated dict for forms with fields and None for flat PDFs.
Can I extract data from a form that was filled out and scanned?
Yes. Scanned forms have no text layer, so pypdf and pdfplumber return nothing. AI extraction applies OCR and reads the printed labels and handwritten or typed-in values from the image, returning the same label-value output as for digital forms.