How to convert a PDF to CSV
Four methods to export PDF tables and structured data as CSV — pdfplumber, Tabula, Camelot, and AI extraction. When to use each, with code examples and common pitfalls.
Converting a PDF to CSV means getting the tabular data inside a PDF into a comma-separated file you can open in Excel, import into a database, or process with a script. The difficulty depends entirely on what's in the PDF: a clean single-table report is trivial, while a 50-page bank statement with multi-page tables and inconsistent column spacing calls for a tool matched to the layout and some cleanup afterwards.
When you want CSV specifically
CSV is the right output format when you're importing into a database, feeding a script that expects delimited input, or loading into Google Sheets. If you're going to open it in Excel and work with it there, Excel (.xlsx) is usually better, because CSV loses type information and doesn't support multiple sheets. For API integration or downstream processing, JSON is often cleaner. But CSV is the most widely supported of the three: almost every tool can read it.
Method 1: pdfplumber (Python, most flexible)
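pdfplumber is a good starting point for machine-generated PDFs. By default it finds tables from the ruling lines drawn on the page, and with its text-based strategies it can also handle tables that are defined by whitespace rather than visible grid lines.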
    # Install first: pip install pdfplumber
    import csv
    import pdfplumber

    rows = []
    with pdfplumber.open('report.pdf') as pdf:
        for page in pdf.pages:
            table = page.extract_table()  # largest table on the page, or None
            if table:
                rows.extend(table)

    # Skip rows in which every cell is empty or None
    with open('output.csv', 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(r for r in rows if any(r))
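extract_table() returns the largest table on the page as a list of rows, or None if it finds no table. By default it locates cell boundaries from the lines drawn on the page; for whitespace-aligned tables, pass table_settings={'vertical_strategy': 'text', 'horizontal_strategy': 'text'}. It works well on most machine-generated PDFs. When a table spans several pages, the column header usually repeats on each page and so appears in rows more than once; filter out the duplicates. If a page holds more than one table, use extract_tables(), which returns all of them.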
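The duplicate-header filter can be sketched as follows. This is a minimal version that assumes the repeated header is cell-for-cell identical to the first row; the function name drop_repeated_headers and the sample rows are illustrative, not part of pdfplumber.

```python
def drop_repeated_headers(rows):
    """Keep the first row as the header and drop later rows identical to it."""
    if not rows:
        return []
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]


# Illustrative data: the header shows up again where page 2 begins
rows = [
    ["Date", "Description", "Amount"],
    ["2024-01-03", "Coffee", "4.50"],
    ["Date", "Description", "Amount"],
    ["2024-01-04", "Books", "32.00"],
]
cleaned = drop_repeated_headers(rows)
```

If the header text varies slightly between pages (extra spaces, line breaks), the cells would need normalizing before the comparison.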
Method 2: Tabula (point-and-click or Python)
Tabula comes both as a desktop GUI and a Python library (tabula-py). The GUI lets you draw a selection box around the table — useful for scoping exactly which region to export.
    # Install first: pip install tabula-py
    import tabula

    # Returns a list of pandas DataFrames, one per detected table
    dfs = tabula.read_pdf('report.pdf', pages='all', multiple_tables=True)
    for i, df in enumerate(dfs):
        df.to_csv(f'table_{i}.csv', index=False)
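tabula-py wraps the Java library tabula-java, so it needs a Java runtime installed. It's reliable on PDFs where table borders are clear, less so on whitespace-delimited layouts; read_pdf() accepts lattice=True or stream=True if you want to force one extraction mode. With multiple_tables=True, each detected table comes back as its own DataFrame, which handles PDFs with more than one table per page.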
Method 3: Camelot (better for bordered tables)
Camelot is a Python library with two modes: lattice (for tables with visible grid lines, very accurate) and stream (for whitespace tables, similar to pdfplumber). It also produces an accuracy score per table, which is useful for flagging uncertain results.
    # Install first: pip install "camelot-py[cv]"  (quotes keep the shell from expanding the brackets)
    import camelot

    tables = camelot.read_pdf('report.pdf', pages='all', flavor='lattice')
    print(tables[0].parsing_report)  # dict that includes an 'accuracy' score
    tables[0].to_csv('table.csv')
    # If lattice fails, try flavor='stream'
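One way to use the accuracy score is to collect each table's parsing_report and flag the ones below a cutoff for manual review. The sketch below works on plain dictionaries so it runs without Camelot installed. The helper name flag_uncertain, the 90.0 threshold, and the sample values are assumptions for illustration.

```python
def flag_uncertain(reports, min_accuracy=90.0):
    """Return indices of tables whose reported accuracy is below the threshold."""
    return [
        i for i, report in enumerate(reports)
        if report.get("accuracy", 0.0) < min_accuracy
    ]


# With Camelot this list would be: [t.parsing_report for t in tables]
# The numbers here are made up for the example.
reports = [
    {"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1},
    {"accuracy": 71.5, "whitespace": 40.0, "order": 1, "page": 2},
]
to_review = flag_uncertain(reports)
```

The right threshold depends on the documents; it is worth checking a few flagged and unflagged tables by hand before trusting it.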
Method 4: AI extraction (complex PDFs, scanned documents, structured data that isn't a clean table)
pdfplumber, Tabula, and Camelot all depend on the PDF's text layer, so they fail on scanned PDFs, which are just page images. They also struggle when the table structure is implied rather than visually obvious, as in invoices, bank statements, and insurance forms. AI extraction tools typically run OCR where needed and then use a model to map the content onto a consistent schema, so the output has the same fields from one document to the next.
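Before choosing a method, it can help to check whether the PDF has a usable text layer at all. The sketch below takes the text extracted from each page and reports whether the document looks scanned. The function name looks_scanned and the 20-character cutoff are assumptions, and the sample strings are illustrative.

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True if no page yields at least min_chars of extractable text."""
    return all(len((text or "").strip()) < min_chars for text in page_texts)


# With pdfplumber the input would be:
#   page_texts = [page.extract_text() for page in pdf.pages]
scanned = looks_scanned([None, "", "   "])
digital = looks_scanned(["Invoice 2024-001  Total due 1,234.56"])
```

A document that mixes scanned and digital pages would pass this check, so a per-page test may be more useful for such files.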
Common CSV output problems and fixes
- Currency symbols in number cells: strip with val.replace('$', '').replace(',', '') before converting to float
- Numbers in parentheses meaning negative: (1,234.56) → -1234.56. Fix: re.sub(r'\(([\d,.]+)\)', r'-\1', val), then strip the commas as above
- Multi-line cell values: may contain embedded newlines. csv.writer quotes these correctly, but some importers still mishandle them, so it is often simpler to join the lines with a space during extraction
- Merged header cells across two rows: join row 0 and row 1 cell by cell before writing the header to CSV
- UTF-8 issues in Excel: open with Data → From Text/CSV and specify UTF-8, or write the file as UTF-8 with BOM (encoding='utf-8-sig' in Python)
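The fixes above can be combined into a small cleanup pass. This is a sketch under stated assumptions: amounts use '$' and comma thousands separators, and the two header rows have the same number of cells. The helper names (parse_amount, flatten_cell, merge_header_rows) and the sample table are hypothetical.

```python
import csv
import io
import re


def parse_amount(val):
    """Convert '$1,234.56' or '(1,234.56)' to a float; parentheses mean negative."""
    text = val.strip()
    text = re.sub(r'^\((.*)\)$', r'-\1', text)  # (1,234.56) -> -1,234.56
    text = text.replace('$', '').replace(',', '')
    return float(text)


def flatten_cell(cell):
    """Collapse embedded newlines and repeated whitespace; None becomes ''."""
    return " ".join((cell or "").split())


def merge_header_rows(top, bottom):
    """Join two header rows cell by cell, skipping empty parts."""
    return [
        " ".join(part for part in (flatten_cell(a), flatten_cell(b)) if part)
        for a, b in zip(top, bottom)
    ]


# Illustrative extraction result: two header rows, then data rows
table = [
    ["Account", None],
    ["Name", "Balance\n(USD)"],
    ["Savings", "(1,234.56)"],
    ["Checking", "$2,000.00"],
]
header = merge_header_rows(table[0], table[1])
body = [[flatten_cell(name), parse_amount(amount)] for name, amount in table[2:]]

# Written to memory here; for a file that Excel reads as UTF-8, use
# open(path, 'w', newline='', encoding='utf-8-sig') instead.
buffer = io.StringIO()
csv.writer(buffer).writerows([header] + body)
csv_text = buffer.getvalue()
```

When a merged header cell spans several columns, the extracted top row often has the label in the first column and empty cells after it, so the top row may need forward-filling before the merge.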
Frequently asked questions
How do I convert a PDF table to CSV in Python?
Use pdfplumber for most PDFs: open the file, call page.extract_table() for each page, and write with csv.writer. For bordered tables, Camelot with flavor='lattice' is more accurate. For scanned PDFs or complex structured documents, use AI extraction.
Why does my CSV have all data in one column?
pdfplumber or Tabula couldn't detect the column boundaries, most likely because the PDF aligns its columns with whitespace and has no clear column markers. Try Camelot's stream mode or pdfplumber's text-based table strategies, or open the PDF in Tabula's desktop GUI and manually draw the table area.
How do I convert a scanned PDF to CSV?
Scanned PDFs have no text layer, so pdfplumber, Tabula, and Camelot return nothing. Use AI extraction, which applies OCR automatically before extracting the structured data.
How do I convert multiple PDFs to CSV at once?
Loop over the directory with for path in Path('pdfs/').glob('*.pdf'), run pdfplumber or the AI extraction API on each file, and save the output as path.stem + '.csv'. The API approach gives consistent column names across all files.
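A minimal sketch of that loop is shown here. The extraction step is passed in as a function, so the same loop works with pdfplumber or any other extractor. The names convert_directory and pdfplumber_rows and the directory layout are assumptions.

```python
import csv
from pathlib import Path


def pdfplumber_rows(pdf_path):
    """Default extractor: the largest table on each page, via pdfplumber."""
    import pdfplumber  # imported here so other extractors don't require it

    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table()
            if table:
                rows.extend(table)
    return rows


def convert_directory(src_dir, out_dir, extract_rows=pdfplumber_rows):
    """Write one CSV per PDF in src_dir and return the CSV paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src_dir).glob('*.pdf')):
        rows = extract_rows(path)
        target = out / (path.stem + '.csv')
        with open(target, 'w', newline='', encoding='utf-8') as f:
            csv.writer(f).writerows(rows)
        written.append(target)
    return written
```

Calling convert_directory('pdfs/', 'csv/') would use pdfplumber; passing extract_rows=some_other_function swaps in a different extractor without changing the loop.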