How to convert a PDF to JSON

Turn PDFs into structured JSON for APIs, databases, and AI pipelines. Covers PyMuPDF for raw text, pdfplumber for tables, and AI extraction for any document type with schema output.

Converting a PDF to JSON can mean two very different things: extracting the raw text structure of the PDF (characters, positions, fonts) into a JSON tree — or extracting the meaningful data (invoice fields, transaction rows, contract metadata) into a clean application JSON object. The first is low-level and usually only useful for PDF analysis tooling. The second is what most people actually want.

Raw PDF structure to JSON (for PDF tooling)

PyMuPDF (fitz) can serialize an entire PDF page's text blocks into JSON with bounding boxes and font metadata. Useful if you're building a PDF editor or spatial text analysis tool; not useful if you want the actual data content.

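  • pip install pymupdf
  • import fitz, json
  • doc = fitz.open('document.pdf')
  • pages = [page.get_text('dict') for page in doc]
  • # Keep text blocks (type 0) only; image blocks hold raw bytes that json cannot serialize
  • for p in pages:
  •     p['blocks'] = [b for b in p['blocks'] if b['type'] == 0]
  • with open('structure.json', 'w') as f:
  •     json.dump(pages, f, indent=2)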

get_text('dict') returns every text span with its bounding box, font name, font size, and color. The result is verbose: because every span carries its own metadata, the JSON is typically many times larger than the page's plain text. This format is for spatial analysis, not data ingestion.
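
If only a few fields per span are needed, the nested layout can be flattened into a compact list before saving. The following is a minimal sketch, assuming the page dictionary layout that PyMuPDF documents for get_text('dict') (text blocks have type 0 and contain lines, which contain spans). flatten_spans is a hypothetical helper name, and the sample page is hand-written for illustration, not real output.

```python
import json

def flatten_spans(page_dict):
    """Reduce a get_text('dict')-style page to a flat list of span records."""
    records = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks; only text blocks have lines
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                records.append({
                    "text": span["text"],
                    "bbox": [round(v, 1) for v in span["bbox"]],
                    "font": span["font"],
                    "size": span["size"],
                })
    return records

# Hand-written sample in the same shape (illustrative, not real PyMuPDF output).
sample_page = {
    "width": 612, "height": 792,
    "blocks": [
        {"type": 0, "bbox": (72, 72, 200, 90), "lines": [
            {"bbox": (72, 72, 200, 90), "spans": [
                {"text": "Invoice 1042", "bbox": (72.0, 72.04, 200.0, 90.0),
                 "font": "Helvetica-Bold", "size": 14.0},
            ]},
        ]},
        {"type": 1, "bbox": (72, 100, 172, 200), "image": b"\x89PNG"},
    ],
}

flat = flatten_spans(sample_page)
print(json.dumps(flat, indent=2))
```

With a real document, the same function could be applied to each page.get_text('dict') result in turn.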

Table data to JSON with pdfplumber

For a PDF where the data lives in a table, pdfplumber extracts rows and you convert to a list-of-dicts:

  • pip install pdfplumber
  • import pdfplumber, json
  • with pdfplumber.open('report.pdf') as pdf:
  •     rows = []
  •     for page in pdf.pages:
  •         table = page.extract_table()
  •         if table:
  •             headers = table[0]
  •             rows.extend(dict(zip(headers, row)) for row in table[1:])
  • print(json.dumps(rows, indent=2))

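This works when the table has a clear header row and consistent columns. The output is a JSON array of objects, one object per row, keyed by column header. The loop treats the first row on every page as the header, which is correct when a multi-page table repeats its header on each page. If continuation pages omit the header, reuse the first page's headers instead; otherwise the first data row on those pages is consumed as column names.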
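
One way to handle multi-page tables, whether or not the header repeats, is to fix the header once and skip any later row that duplicates it. This is a sketch under that assumption; tables_to_records is a hypothetical helper, and the sample tables are made up to mimic the list-of-rows shape that extract_table() returns (or None when a page has no table).

```python
import json

def tables_to_records(tables):
    """Merge per-page tables (lists of rows) into one list of dicts.

    The first row of the first non-empty table is taken as the header.
    On later pages, a first row identical to that header is skipped.
    """
    headers = None
    records = []
    for table in tables:
        if not table:          # None or empty: page had no table
            continue
        rows = table
        if headers is None:
            headers, rows = table[0], table[1:]
        elif table[0] == headers:
            rows = table[1:]   # header repeated on this page
        records.extend(dict(zip(headers, row)) for row in rows)
    return records

page_tables = [
    [["Date", "Amount"], ["2024-01-02", "10.00"]],  # page 1, with header
    [["Date", "Amount"], ["2024-01-03", "12.50"]],  # page 2, header repeated
    [["2024-01-04", "7.25"]],                       # page 3, no header
    None,                                           # page with no table
]
records = tables_to_records(page_tables)
print(json.dumps(records, indent=2))
```

With pdfplumber, the input could be built as [page.extract_table() for page in pdf.pages].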

Document-level structured JSON with a schema

For documents where the meaningful data isn't a simple table — invoices (header fields + line items array), contracts (parties + dates + clauses), bank statements (account info + transactions) — you need schema-aware extraction. The JSON shape isn't rows; it's a nested object matching the document's natural structure.
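
To make the difference from a row list concrete, here is a sketch of what such a nested shape might look like for an invoice. The field names follow the ones used elsewhere in this post (vendor, invoice_number, date, line_items, total); the dataclass names and the sample values are illustrative assumptions, not output from a real document.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class LineItem:
    description: str
    quantity: float
    price: float

@dataclass
class Invoice:
    vendor: str
    invoice_number: str
    date: str    # ISO 8601 date string
    total: float
    line_items: List[LineItem] = field(default_factory=list)

# Made-up sample values for illustration only.
invoice = Invoice(
    vendor="Acme Supplies",
    invoice_number="INV-1042",
    date="2024-03-15",
    total=150.0,
    line_items=[LineItem("Widget", 10.0, 12.5), LineItem("Shipping", 1.0, 25.0)],
)

# asdict() converts nested dataclasses recursively, so line_items becomes a list of dicts.
invoice_json = json.dumps(asdict(invoice), indent=2)
print(invoice_json)
```

The header fields sit at the top level and the line items form an array inside the same object, which a flat list of rows cannot express without repeating the header fields on every row.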

You can do this with an LLM call on top of pdfplumber text extraction:

  • import pdfplumber, openai, json
  • # Requires openai>=1.0 and the OPENAI_API_KEY environment variable
  • with pdfplumber.open('invoice.pdf') as pdf:
  •     text = ' '.join(p.extract_text() or '' for p in pdf.pages)
  • resp = openai.chat.completions.create(
  •     model='gpt-4o-mini',
  •     messages=[{'role': 'user', 'content': f'Extract this invoice as JSON with vendor, invoice_number, date, line_items, total: {text}'}],
  •     response_format={'type': 'json_object'}
  • )
  • print(json.loads(resp.choices[0].message.content))
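
JSON mode makes the reply parse as JSON, but it does not guarantee that the field names or value types match what the prompt asked for; amounts often come back as strings such as "$1,234.50". Below is a minimal sketch of normalizing such a reply before using it. The helper names and the sample reply string are hypothetical, and the number parsing assumes a period as the decimal separator.

```python
import json
import re

def to_number(value):
    """Convert '$1,234.50'-style strings to float; pass numbers through; else None."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return value
    if isinstance(value, str):
        # Assumption: '.' is the decimal separator; strip currency symbols and commas.
        cleaned = re.sub(r"[^\d.\-]", "", value)
        try:
            return float(cleaned)
        except ValueError:
            return None
    return None

def normalize_invoice(raw_json):
    """Parse a model reply and coerce the numeric fields to real numbers."""
    data = json.loads(raw_json)
    data["total"] = to_number(data.get("total"))
    items = data.get("line_items") or []
    for item in items:
        for key in ("quantity", "price"):
            if key in item:
                item[key] = to_number(item[key])
    data["line_items"] = items
    return data

# Hand-written stand-in for resp.choices[0].message.content
reply = (
    '{"vendor": "Acme Supplies", "invoice_number": "INV-1042", '
    '"total": "$1,234.50", '
    '"line_items": [{"description": "Widget", "quantity": "10", "price": 123.45}]}'
)
normalized = normalize_invoice(reply)
print(json.dumps(normalized, indent=2))
```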

AI extraction with a stable, reusable schema

For production use — processing many PDFs of the same type with a consistent output schema — a dedicated extraction tool is more reliable than ad-hoc LLM calls. The schema is stable across every document, numbers come back as numbers (not strings), and the tool handles scanned PDFs and images directly.

Tool: PDF to JSON Converter
Upload any PDF and get structured JSON back. Pick a prebuilt schema (invoice, bank statement, contract, receipt) or describe the fields you want in plain English. Numbers as numbers, dates as ISO strings, arrays for lists. Download as .json or hit the API. No signup required.

When to use JSON vs CSV vs Excel

  • JSON — for APIs, databases, AI pipelines, and nested data structures (invoice with line items array). Preserves types.
  • CSV — for flat tabular data going into a database, spreadsheet, or script. Universal support, no type info.
  • Excel — for human use: sorting, filtering, pivot tables, charts. Best when a person needs to work with the data.
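
The trade-off between nested JSON and flat CSV can be seen by flattening one invoice into rows. This is an illustrative sketch using only the standard library; the invoice dictionary and the invoice_to_csv helper are made up for the example.

```python
import csv
import io

invoice = {
    "vendor": "Acme Supplies",
    "invoice_number": "INV-1042",
    "total": 150.0,
    "line_items": [
        {"description": "Widget", "quantity": 10, "price": 12.5},
        {"description": "Shipping", "quantity": 1, "price": 25.0},
    ],
}

def invoice_to_csv(inv):
    """One CSV row per line item, with the invoice-level fields repeated on each row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["invoice_number", "vendor", "description", "quantity", "price"],
        lineterminator="\n",
    )
    writer.writeheader()
    for item in inv["line_items"]:
        writer.writerow({
            "invoice_number": inv["invoice_number"],
            "vendor": inv["vendor"],
            **item,
        })
    return buffer.getvalue()

csv_text = invoice_to_csv(invoice)
print(csv_text)

# Reading it back: every value is now a string, and the nesting is gone.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["quantity"], type(rows[0]["quantity"]).__name__)
```

The header fields are duplicated on each row and the numbers have to be re-parsed by whoever reads the file, which is the cost of CSV's universal support.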

Building a PDF-to-JSON pipeline

A production pipeline for processing PDFs at volume has four stages: receive files (email attachment, S3 event, webhook), extract as JSON via API, validate the result (check required fields and numeric ranges), and insert into your database or forward to the next system. The extraction API is the only PDF-specific part; everything else is standard backend work.
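
The validate-then-store step can be sketched as follows. This is a minimal outline, not a real integration: extract, store, and reject are hypothetical callables supplied by the caller (the extraction API call itself is deliberately left out), and the required field names are assumed from the invoice examples in this post.

```python
REQUIRED_FIELDS = ("vendor", "invoice_number", "date", "line_items", "total")

def validate_invoice(doc):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name in REQUIRED_FIELDS:
        if doc.get(name) in (None, "", []):
            problems.append(f"missing field: {name}")
    total = doc.get("total")
    if total is not None:
        if isinstance(total, bool) or not isinstance(total, (int, float)):
            problems.append("total is not a number")
        elif total < 0:
            problems.append("total is negative")
    return problems

def process(pdf_bytes, extract, store, reject):
    """Run one file through extract -> validate -> store or reject."""
    doc = extract(pdf_bytes)
    problems = validate_invoice(doc)
    if problems:
        reject(doc, problems)
    else:
        store(doc)
    return problems

# Stub wiring for illustration: the extractors below just return canned dictionaries.
stored, rejected = [], []

def good_extract(_pdf_bytes):
    return {
        "vendor": "Acme Supplies", "invoice_number": "INV-1042", "date": "2024-03-15",
        "line_items": [{"description": "Widget", "quantity": 10, "price": 12.5}],
        "total": 125.0,
    }

def bad_extract(_pdf_bytes):
    return {"vendor": "Acme Supplies", "total": "125"}

def on_reject(doc, problems):
    rejected.append((doc, problems))

ok_result = process(b"%PDF-", good_extract, stored.append, on_reject)
bad_result = process(b"%PDF-", bad_extract, stored.append, on_reject)
print(ok_result, bad_result)
```

Passing the extractor in as a parameter keeps the validation and storage logic testable without network access.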

Frequently asked questions

How do I convert a PDF to JSON in Python?

For table data: use pdfplumber to extract rows, then convert with dict(zip(headers, row)) for each row. For document-level structured data (invoices, contracts), use AI extraction, which returns a schema-aware JSON object rather than a flat list of rows.

What does 'PDF to JSON' actually return?

Depends on the tool. Low-level tools return the raw PDF structure: text spans with font and position info. Extraction tools return the data content: for an invoice, that means {vendor, invoice_number, line_items: [{description, quantity, price}], total}. The second is almost always what you want.

How do I convert a PDF invoice to JSON?

Use AI extraction with the invoice preset. The output is a JSON object with vendor, customer, invoice_number, dates, line_items as an array, and totals. Numbers are numbers, not strings — ready to insert into a database.

Can I convert a scanned PDF to JSON?

Yes, with AI extraction, which applies OCR automatically. pdfplumber and PyMuPDF read the PDF's text layer, so on a scanned PDF without one they return empty results unless you add a separate OCR step.
