How to extract a table from a PDF with Python
Three Python libraries for PDF table extraction (pdfplumber, tabula-py, and Camelot), with code, guidance on when to use each, and how to handle scanned PDFs where text-based extraction fails.
PDF tables are one of the most common and most frustrating data-extraction tasks. The PDF format has no native concept of a 'table' — it just places characters at absolute coordinates. Libraries that extract tables are reverse-engineering that layout geometry to reconstruct rows and columns. This guide covers three Python approaches and when each one works.
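To make the coordinate problem concrete, here is a minimal sketch of row reconstruction. The `(x, y, text)` tuples and the `group_into_rows` helper are hypothetical and only illustrate the idea; real libraries rely on far more elaborate heuristics.

```python
# Hypothetical input: (x, y, text) tuples, roughly what a PDF stores for each word.
WORDS = [
    (50, 100, "Item"), (200, 100, "Qty"),
    (50, 120.4, "Apples"), (200, 119.8, "3"),
    (50, 140, "Pears"), (200, 140.2, "7"),
]


def group_into_rows(words, y_tolerance=2.0):
    """Bucket words with nearly equal y coordinates, then order each row by x."""
    rows = []  # each entry: [y of the row's first word, [(x, text), ...]]
    for x, y, text in sorted(words, key=lambda w: w[1]):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    return [[text for _, text in sorted(cells)] for _, cells in rows]


print(group_into_rows(WORDS))
```

The tolerance matters because words on the same visual line rarely share an exact y coordinate, which is one reason extraction results differ between libraries.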
Method 1: pdfplumber
pdfplumber is one of the most widely used Python libraries for PDF table extraction. It works well on digital PDFs where the text layer is clean and tables have visible borders or consistent spacing.
Install: `pip install pdfplumber`
Basic usage:
- `import pdfplumber` — open the PDF with `pdfplumber.open('file.pdf')`
- Loop over `pdf.pages` and call `page.extract_tables()` on each page
- Each table comes back as a list of rows, each row a list of strings
- Pass to `pandas.DataFrame` for further cleaning
Full example:
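```python
import pdfplumber
import pandas as pd

all_tables = []
with pdfplumber.open('report.pdf') as pdf:
    for page in pdf.pages:
        # Each table is a list of rows; the first row is treated as the header
        for table in page.extract_tables():
            if len(table) < 2:
                continue  # skip tables that have no data rows
            df = pd.DataFrame(table[1:], columns=table[0])
            all_tables.append(df)

# pd.concat raises ValueError on an empty list, so guard against PDFs with no tables
if all_tables:
    result = pd.concat(all_tables, ignore_index=True)
    result.to_csv('output.csv', index=False)
```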
pdfplumber also supports `table_settings` to tune the algorithm — useful when the default horizontal/vertical line detection misses a table or splits it incorrectly. Key settings: `vertical_strategy`, `horizontal_strategy`, and `snap_tolerance`.
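As a rough sketch of tuning, the settings below switch both strategies to `"text"`, which infers cell edges from text alignment instead of ruled lines. The helper name `extract_borderless_tables` is an illustrative assumption, and the tolerance value is only a starting point to adjust per document.

```python
# Settings for a table without ruled lines: infer cell edges from text alignment.
TEXT_TABLE_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 3,  # starting guess; tune per document
}


def extract_borderless_tables(pdf_path, settings=TEXT_TABLE_SETTINGS):
    """Return every table found on every page using the given table_settings."""
    # Imported here so the settings can be inspected without pdfplumber installed.
    import pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables(table_settings=settings))
    return tables
```

Calling `extract_borderless_tables('report.pdf')` returns the same list-of-rows structure as the default call. If a table is still missed or split, pdfplumber's visual debugging (`page.to_image().debug_tablefinder(settings)`) can help show which edges the finder detected.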
Method 2: tabula-py
tabula-py is a Python wrapper around tabula-java, the Java library behind Tabula. It's often more accurate on tables with explicit grid lines and handles multi-page tables well. It requires Java 8 or later to be installed.
Install: `pip install tabula-py`
```python
import tabula

# Read all tables from all pages; the result is a list of DataFrames
tables = tabula.read_pdf('report.pdf', pages='all', multiple_tables=True)

for i, df in enumerate(tables):
    df.to_csv(f'table_{i}.csv', index=False)
```
tabula-py has two extraction modes: `lattice` (for tables with visible borders) and `stream` (for tables with whitespace-delimited columns). Specify with `lattice=True` or `stream=True`. When in doubt, try both and compare.
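One way to compare the two modes, sketched under assumptions: run both and print a small summary per table. The `summarize` and `compare_modes` helpers are hypothetical names, not part of tabula-py, and the share of empty cells is only a rough signal that columns were split badly.

```python
def summarize(tables):
    """One (rows, columns, share_of_empty_cells) tuple per DataFrame."""
    summary = []
    for df in tables:
        rows, cols = df.shape
        empty = int(df.isna().sum().sum())
        share = empty / df.size if df.size else 0.0
        summary.append((rows, cols, round(share, 2)))
    return summary


def compare_modes(pdf_path):
    """Extract with lattice and stream modes and print a summary of each."""
    import tabula  # needs Java available on the system

    results = {
        "lattice": tabula.read_pdf(pdf_path, pages="all", multiple_tables=True, lattice=True),
        "stream": tabula.read_pdf(pdf_path, pages="all", multiple_tables=True, stream=True),
    }
    for mode, tables in results.items():
        print(mode, summarize(tables))
    return results
```

A mode that returns many tables with a high share of empty cells has often misread the column boundaries, but the final judgment is best made by looking at the output next to the PDF.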
Method 3: Camelot
Camelot is more configurable than pdfplumber and gives you accuracy scores per table so you can filter low-confidence extractions. It has two parsers: `lattice` (visible lines) and `stream` (whitespace).
Install: `pip install "camelot-py[cv]"` (requires OpenCV and Ghostscript; the quotes stop shells such as zsh from interpreting the brackets)
```python
import camelot

tables = camelot.read_pdf('report.pdf', pages='all', flavor='lattice')

for i, table in enumerate(tables):
    print(f'Accuracy: {table.accuracy:.1f}%')
    if table.accuracy > 80:
        # Include the index so several tables on one page do not overwrite each other
        table.to_csv(f'table_{table.page}_{i}.csv')
```
The accuracy score makes Camelot useful in automated pipelines — you can flag low-accuracy tables for manual review rather than silently outputting garbage.
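A minimal sketch of that review step, assuming a threshold of 80 as in the example above. The helper names `split_by_accuracy` and `process` are hypothetical; only `camelot.read_pdf`, `table.accuracy`, `table.page`, and `table.to_csv` come from Camelot itself.

```python
def split_by_accuracy(tables, threshold=80.0):
    """Partition tables into (accepted, needs_review) by their accuracy score."""
    accepted, needs_review = [], []
    for table in tables:
        (accepted if table.accuracy > threshold else needs_review).append(table)
    return accepted, needs_review


def process(pdf_path, threshold=80.0):
    """Write confident tables to CSV and report the rest for manual review."""
    import camelot  # lattice flavor also needs Ghostscript

    tables = camelot.read_pdf(pdf_path, pages="all", flavor="lattice")
    accepted, needs_review = split_by_accuracy(tables, threshold)
    for i, table in enumerate(accepted):
        table.to_csv(f"table_p{table.page}_{i}.csv")
    for table in needs_review:
        print(f"Review page {table.page}: accuracy {table.accuracy:.1f}")
    return accepted, needs_review
```

The right threshold depends on the documents, so it is worth checking a sample of accepted and rejected tables by hand before trusting it.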
When all three fail: scanned PDFs
pdfplumber, tabula-py, and Camelot all work on the text layer of a PDF. A scanned PDF is just an image embedded in a PDF container, so there's no text layer to parse. All three libraries will return empty or nonsense output.
For scanned PDFs you need OCR first. Options:
- **pytesseract** — open-source Tesseract OCR wrapper. Run OCR on each page image, then pass the text through a table-detection heuristic. Accuracy is lower than commercial models.
- **pdf2image + pytesseract** — convert each page to an image, run Tesseract, reconstruct the table from character bounding boxes. Complex and fragile on multi-column layouts.
- **AI extraction** — a multimodal model reads the page image directly and reconstructs the table semantically. No bounding-box geometry needed.
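The pdf2image + pytesseract route can be sketched as follows. This is a simplified illustration, not a robust parser: `rows_from_ocr`, `ocr_pdf_tables`, and the `max_gap` value are assumptions, and the approach trusts Tesseract's own line grouping, which can break on skewed scans or multi-column layouts.

```python
from collections import defaultdict


def rows_from_ocr(data, max_gap=25):
    """Turn pytesseract image_to_data output (dict form) into rows of cells.

    Words are grouped by Tesseract's block/paragraph/line numbers; words on a
    line separated by at most max_gap pixels are merged into one cell.
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip the non-word entries Tesseract emits
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], data["width"][i], text))

    rows = []
    for _, words in sorted(lines.items()):
        cells, prev_right = [], None
        for left, width, text in sorted(words):
            if prev_right is not None and left - prev_right <= max_gap:
                cells[-1] += " " + text
            else:
                cells.append(text)
            prev_right = left + width
        rows.append(cells)
    return rows


def ocr_pdf_tables(pdf_path, dpi=300):
    """Rasterize each page, run Tesseract, and return rows of cells per page."""
    from pdf2image import convert_from_path  # needs Poppler installed
    import pytesseract  # needs the Tesseract binary installed

    pages = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
        pages.append(rows_from_ocr(data))
    return pages
```

The pixel gap that separates two cells depends on the scan resolution and font size, so `max_gap` usually needs adjusting for each document type.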
Comparison: when to use each
- **pdfplumber** — best default for digital PDFs. Pythonic API, no Java dependency, good community support.
- **tabula-py** — better on tables with explicit grid lines. Handles multi-page table spanning well.
- **Camelot** — best when you need accuracy scores for automated QA. More setup (OpenCV, Ghostscript).
- **AI extraction** — best for scanned PDFs, unusual table layouts, tables embedded in reports with surrounding text. No code required.
Common pitfalls
- **Merged cells**: pdfplumber and tabula-py handle merged header cells inconsistently. Check the first row carefully and normalize headers before using them as column names.
- **Multi-page tables**: a table that spans pages is often extracted as two separate DataFrames. Detect the continuation by comparing column counts, then concatenate.
- **Rotated pages**: a landscape-rotated page embedded in a portrait PDF confuses coordinate detection. Check the page's rotation (pdfplumber exposes it as `page.rotation`) and pre-process rotated pages with a tool such as PyMuPDF.
- **Whitespace-only cells**: numeric columns with missing values come back as empty strings, not NaN. Run `df = df.replace(r'^\s*$', pd.NA, regex=True)` after extraction; `replace` returns a new DataFrame, so assign the result.
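For the merged-cell pitfall, one possible normalization is sketched below. It assumes a merged header shows up as text in the first cell followed by `None` or empty cells, which is common but not guaranteed; `normalize_headers` is a hypothetical helper, not a library function.

```python
def normalize_headers(header_row):
    """Fill blank header cells from the left and make every name unique."""
    names, seen = [], {}
    last = "column"  # fallback name when the row starts with a blank cell
    for cell in header_row:
        text = " ".join((cell or "").split())  # collapse newlines and extra spaces
        if text:
            last = text
        else:
            text = last
        count = seen.get(text, 0)
        seen[text] = count + 1
        names.append(text if count == 0 else f"{text}_{count + 1}")
    return names


print(normalize_headers(["Region", "Sales\n2023", None, ""]))
```

With a raw pdfplumber table, the result can be passed straight to pandas: `pd.DataFrame(table[1:], columns=normalize_headers(table[0]))`.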
Frequently asked questions
Which Python library is best for extracting tables from PDFs?
pdfplumber is the best starting point for most digital PDFs — it has a clean API, no Java dependency, and good accuracy on most table layouts. Use tabula-py if you need multi-page table support or Camelot if you need per-table accuracy scores for automated QA.
Why does pdfplumber return an empty list from extract_tables()?
Most likely cause: the PDF is scanned (no text layer). Check with `page.extract_text()` — if that also returns empty, it's a scanned document and needs OCR or AI extraction first.
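That check can be wrapped in a small helper. This is a sketch under assumptions: `looks_scanned`, `is_scanned_pdf`, and the 20-character cutoff are illustrative choices, and a PDF that mixes scanned and digital pages would need a per-page decision instead.

```python
def looks_scanned(page_texts, min_chars=20):
    """True when no page yields at least min_chars of extractable text."""
    return all(len((text or "").strip()) < min_chars for text in page_texts)


def is_scanned_pdf(pdf_path):
    """Apply the heuristic to every page of a PDF via pdfplumber."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is consumed inside the with block, while the file is open.
        return looks_scanned(page.extract_text() for page in pdf.pages)
```

If `is_scanned_pdf('report.pdf')` returns `True`, text-based table extraction is unlikely to work and OCR should come first.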
How do I extract tables from a scanned PDF in Python?
Run OCR first using pdf2image + pytesseract to get a text layer, then pass through a table parser. Alternatively, use an AI-based extraction service that reads the page image directly — this is more accurate on complex layouts and requires no geometry tuning.
How do I combine tables from multiple pages into one DataFrame?
Extract all tables page by page, collect into a list, then `pd.concat(all_dfs, ignore_index=True)`. For multi-page tables that span a page break, check if the first row of the next page matches the column structure of the previous table and concatenate accordingly.
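The continuation check can be sketched on raw tables (lists of rows) before they become DataFrames. `stitch_tables` is a hypothetical helper, and matching on column count alone is a simplification: two unrelated tables with the same number of columns would also be merged.

```python
def stitch_tables(tables):
    """Merge consecutive raw tables that look like continuations of each other.

    A table continues the previous one when its column count matches. If its
    first row repeats the previous table's header, that row is dropped.
    """
    merged = []
    for table in tables:
        if not table:
            continue
        if merged and len(table[0]) == len(merged[-1][0]):
            header = merged[-1][0]
            rows = table[1:] if table[0] == header else table
            merged[-1].extend(rows)
        else:
            merged.append(list(table))  # copy so the input list is not modified
    return merged


page1 = [["Item", "Qty"], ["Apples", "3"]]
page2 = [["Pears", "7"]]                      # continuation without a header
page3 = [["Item", "Qty"], ["Plums", "2"]]     # continuation with a repeated header
print(stitch_tables([page1, page2, page3]))
```

Each stitched table can then be converted with `pd.DataFrame(table[1:], columns=table[0])`.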
Can I extract tables from a PDF without Python?
Yes — Tabula has a standalone GUI that requires no coding. For scanned documents or complex layouts, AI extraction tools like ExtractFox handle any table in any PDF via a browser upload.