Engineering · April 25, 2026 · 7 min read

How to extract a table from a PDF using Python

pdfplumber, Camelot, Tabula, and the API-based fallback — what each library handles well, what it breaks on, and the code you actually need.

By Dawid Sibinski

PDF table extraction in Python isn't a solved problem — it's a tradeoff problem. The right library depends on whether your PDFs have ruled lines, consistent columns, scanned pages, or all three.

pdfplumber: start here

MIT-licensed, pure Python, no Java dependency. Handles most clean PDFs with text layers and works well when columns are visually consistent.

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            for row in table:
                print(row)
```

Tweak the table_settings dict (vertical_strategy, horizontal_strategy) for tables without visible borders. pdfplumber struggles when columns have variable spacing or merged cells.

Camelot: better on bordered tables

Two flavors: lattice (uses ruling lines, very accurate when they exist) and stream (uses whitespace heuristics, similar to pdfplumber). Returns DataFrames directly:

```python
import camelot

tables = camelot.read_pdf("report.pdf", pages="1-end", flavor="lattice")
for t in tables:
    t.to_csv(f"page_{t.page}_table_{t.order}.csv")
```

Lattice is the right choice for financial statements, scientific papers, and government reports: anything with consistently ruled tables. Note that Camelot requires Ghostscript to be installed on the system.

Tabula-py: Java wrapper, battle-tested

Wraps the Java tabula tool. Slowest of the three to set up because of the JVM dependency, but historically the most accurate on awkward layouts. Good fallback when Camelot misses rows.

Scanned PDFs: OCR first

None of the above work on scanned PDFs without a text layer. You need OCR first: ocrmypdf converts scans into searchable PDFs in one command, then you feed the result back into pdfplumber or Camelot. Quality on hand-scanned pages varies a lot — expect manual cleanup.

Long-tail layouts: API-based extraction

When tables span pages, have merged headers, contain footnotes, or vary by document, traditional libraries hit a wall. A multimodal model handles these because it reads the page the way a person does instead of relying on geometry heuristics.

```python
import os
import requests

API_KEY = os.environ["EXTRACTFOX_API_KEY"]  # supply your own key

with open("report.pdf", "rb") as f:
    r = requests.post(
        "https://extractfox.com/api/extract",
        files={"file": f},
        data={"prompt": "Extract every table as a flat CSV."},
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
result = r.json()
```

ExtractFox returns structured JSON; serialize to CSV with the standard library and you're done.
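That last step takes a few lines with the `csv` module. The response shape below is illustrative, not ExtractFox's documented schema; adjust the field names to match the actual payload:

```python
import csv
import io

# Hypothetical response: one entry per extracted table, rows as lists.
result = {
    "tables": [
        {"rows": [["Year", "Revenue"], ["2024", "1.2M"], ["2025", "1.8M"]]},
    ]
}

def tables_to_csv(result):
    """Serialize each extracted table to a CSV string."""
    out = []
    for table in result["tables"]:
        buf = io.StringIO()
        csv.writer(buf).writerows(table["rows"])
        out.append(buf.getvalue())
    return out
```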

Quick decision matrix

  • Clean PDFs, simple tables → pdfplumber.
  • Bordered tables, accuracy matters → Camelot lattice.
  • Awkward layouts, willing to install Java → Tabula.
  • Scanned pages → ocrmypdf, then any of the above.
  • Long-tail variety, don't want to maintain a parser → ExtractFox API.
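The matrix above, encoded as a small helper. The function and its argument names are illustrative, not any library's API:

```python
def pick_extractor(scanned, ruled_lines, layout_varies, java_ok=False):
    """Return this post's recommended tool for a given kind of PDF."""
    if scanned:
        return "ocrmypdf first, then re-run this choice on the output"
    if layout_varies:
        return "ExtractFox API"
    if ruled_lines:
        return "Camelot (lattice)"
    if java_ok:
        return "Tabula"
    return "pdfplumber"
```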


Stop reading, start extracting

Drop a PDF or image into ExtractFox and get structured data back in seconds.

Try a free extraction →