How to extract a table from a PDF using Python
pdfplumber, Camelot, Tabula, and the API-based fallback — what each library handles well, what it breaks on, and the code you actually need.
PDF table extraction in Python isn't a solved problem — it's a tradeoff problem. The right library depends on whether your PDFs have ruled lines, consistent columns, scanned pages, or all three.
pdfplumber: start here
MIT-licensed, pure Python, no Java dependency. Handles most clean PDFs with text layers and works well when columns are visually consistent.
```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            for row in table:
                print(row)
```
Tweak the table_settings dict (vertical_strategy, horizontal_strategy) for tables without visible borders. pdfplumber struggles when columns have variable spacing or merged cells.
Camelot: better on bordered tables
Two flavors: lattice (uses ruling lines, very accurate when they exist) and stream (uses whitespace heuristics, similar to pdfplumber). Each detected table carries a pandas DataFrame and can be written straight to CSV:
```python
import camelot

tables = camelot.read_pdf("report.pdf", pages="1-end", flavor="lattice")
for t in tables:
    t.to_csv(f"page_{t.page}_table_{t.order}.csv")
```
Lattice is the right choice for financial statements, scientific papers, and government reports — anything with consistent ruled tables. Requires ghostscript installed on the system.
Tabula-py: Java wrapper, battle-tested
Wraps the Java tabula tool. Slowest of the three to set up because of the JVM dependency, but historically the most accurate on awkward layouts. Good fallback when Camelot misses rows.
Scanned PDFs: OCR first
None of the above work on scanned PDFs without a text layer. You need OCR first: ocrmypdf converts scans into searchable PDFs in one command, then you feed the result back into pdfplumber or Camelot. Quality on hand-scanned pages varies a lot — expect manual cleanup.
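The one-command conversion looks like this (filenames are placeholders; `--skip-text` leaves any pages that already have a text layer untouched):

```shell
# Add a searchable text layer to a scanned PDF.
ocrmypdf --skip-text scanned.pdf searchable.pdf
```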
Long-tail layouts: API-based extraction
When tables span pages, have merged headers, contain footnotes, or vary by document, traditional libraries hit a wall. A multimodal model handles these because it reads the page the way a person does instead of relying on geometry heuristics.
```python
import os

import requests

API_KEY = os.environ["EXTRACTFOX_API_KEY"]

with open("report.pdf", "rb") as f:
    r = requests.post(
        "https://extractfox.com/api/extract",
        files={"file": f},
        data={"prompt": "Extract every table as a flat CSV."},
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
result = r.json()
```
ExtractFox returns structured JSON; serialize to CSV with the standard library and you're done.
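The stdlib serialization step might look like this. The response shape here is an assumption for illustration (a `"tables"` key holding lists of rows); adapt the key names to the actual schema:

```python
import csv
import io
import json

def tables_to_csv(result):
    """Turn each table in the response (a list of rows) into CSV text.

    Assumes a response shaped like {"tables": [[["h1", "h2"], ["a", "b"]]]};
    the key names are illustrative, not ExtractFox's documented schema.
    """
    outputs = []
    for table in result["tables"]:
        buf = io.StringIO()
        csv.writer(buf).writerows(table)
        outputs.append(buf.getvalue())
    return outputs

sample = json.loads('{"tables": [[["name", "qty"], ["widget", "3"]]]}')
print(tables_to_csv(sample)[0])
```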
Quick decision matrix
- Clean PDFs, simple tables → pdfplumber.
- Bordered tables, accuracy matters → Camelot lattice.
- Awkward layouts, willing to install Java → Tabula.
- Scanned pages → ocrmypdf, then any of the above.
- Long-tail variety, don't want to maintain a parser → ExtractFox API.
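The matrix above can be sketched as a dispatch function. Everything here is illustrative (the function name, parameters, and return strings are mine, and real documents blur these categories):

```python
def pick_extractor(has_text_layer, ruled_lines, layout_varies, java_ok=False):
    """Hypothetical helper mirroring the decision matrix above."""
    if not has_text_layer:
        return "ocrmypdf, then retry"       # scanned pages: OCR first
    if layout_varies:
        # awkward one-off layouts -> Tabula; long-tail variety -> API
        return "tabula" if java_ok else "extractfox-api"
    if ruled_lines:
        return "camelot-lattice"            # bordered tables
    return "pdfplumber"                     # clean, simple tables

print(pick_extractor(has_text_layer=True, ruled_lines=True, layout_varies=False))
```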