How to convert a PDF to Excel
Turn PDF tables and structured data into clean Excel files. Covers Tabula, pdfplumber, pandas, and AI extraction for invoices, statements, and scanned PDFs — with real column headers and numeric values.
PDF to Excel is one of the most common data conversion needs — and one of the most frustrating when the tools don't work. The core problem is that PDFs were designed for printing, not data extraction: columns are defined by visual position rather than structure, and there's no concept of 'this is a table' baked into the format.
The three types of PDF tables (and how hard each is)
- Bordered tables with visible grid lines — easiest. Camelot's lattice mode handles these reliably.
- Whitespace-aligned tables (columns separated by spaces, no borders) — medium difficulty. pdfplumber and Tabula usually work.
- Scanned or image-based tables — hardest. No text layer, so standard tools return nothing. Need OCR or AI extraction.
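Before picking a method, it can help to check which case a file falls into. The sketch below is one possible check using pdfplumber's page.chars list, assuming that a page with almost no extractable characters is a scan. The function names and the 20-character threshold are illustrative choices, not part of any library.

```python
def classify_pages(char_counts, min_chars=20):
    """Label each page 'text' or 'scanned' from its extractable character count.

    min_chars is an arbitrary threshold (an assumption): a scanned page
    usually yields zero characters, but stray marks can yield a few.
    """
    return ["text" if n >= min_chars else "scanned" for n in char_counts]


def count_chars_per_page(path):
    """Return the number of text-layer characters on each page of a PDF."""
    import pdfplumber  # imported here so classify_pages works without it

    with pdfplumber.open(path) as pdf:
        # page.chars lists every character object in the page's text layer
        return [len(page.chars) for page in pdf.pages]
```

Calling classify_pages(count_chars_per_page('report.pdf')) gives one label per page. A scan that was already run through OCR software carries a text layer and will be labeled 'text', so treat the result as a hint rather than a verdict.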
Method 1: Tabula desktop (no code, point-and-click)
Download Tabula from tabula.technology. Open your PDF, draw a box around the table, and click Export. This is the fastest option for a one-off conversion and requires no programming. The GUI shows you the detected columns in real time so you can adjust the selection before exporting.
Limitation: manual, one table at a time, requires Java. Not suitable for processing many PDFs.
Method 2: pdfplumber + openpyxl (Python)
For automating the conversion across many PDFs:
    # Shell: pip install pdfplumber openpyxl
    import pdfplumber
    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    with pdfplumber.open('report.pdf') as pdf:
        for page in pdf.pages:
            # extract_table() returns the largest table on the page, or None
            table = page.extract_table()
            if table:
                for row in table:
                    ws.append(row)
    wb.save('output.xlsx')
This writes raw text into cells. Numbers come out as strings, so Excel formulas such as SUM will not work on them until they are converted. For amount columns, strip the formatting and write a float back into the cell, for example ws.cell(row=r, column=c).value = float(val.replace(',','').replace('$','')).
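The conversion described above can be wrapped in a small helper so that non-numeric cells pass through untouched. This is a minimal sketch: to_number and clean_row are hypothetical names, and the handling of parenthesized negatives such as (500.00) is an assumption about how the source PDF formats amounts.

```python
import re


def to_number(value):
    """Convert an extracted cell such as '$1,234.56' to a float.

    Returns the original value unchanged when it does not look numeric,
    so text cells (names, descriptions) are left alone.
    """
    if value is None:
        return None
    text = str(value).strip()
    # Assumption: accounting-style negatives are written as (1,234.56)
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    cleaned = text.replace("$", "").replace(",", "").strip()
    if not re.fullmatch(r"-?\d+(\.\d+)?", cleaned):
        return value
    number = float(cleaned)
    return -number if negative else number


def clean_row(row):
    """Apply to_number to every cell of one extracted table row."""
    return [to_number(cell) for cell in row]
```

In the pdfplumber loop, ws.append(clean_row(row)) would replace ws.append(row). Note that this converts every numeric-looking cell, including invoice numbers and years, so applying to_number only to known amount columns may be the safer choice.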
Method 3: tabula-py + pandas (cleaner for data analysis)
tabula-py wraps Tabula as a Python library (so it also requires Java) and integrates directly with pandas, which makes type conversion and Excel writing cleaner:
    # Shell: pip install tabula-py pandas openpyxl
    import tabula
    import pandas as pd

    # Returns a list of DataFrames, one per detected table
    dfs = tabula.read_pdf('report.pdf', pages='all', multiple_tables=True)
    # Write each table to its own sheet (alternatively, combine them into one first):
    with pd.ExcelWriter('output.xlsx') as writer:
        for i, df in enumerate(dfs):
            df.to_excel(writer, sheet_name=f'Table_{i+1}', index=False)
pandas infers types automatically: columns that contain only numbers become int64 or float64, so Excel formulas work without cleanup. In invoices and financial documents, where cells contain dollar signs and commas, the column stays as text, and you still need to strip those characters before it can be converted to a numeric type.
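One way to do that cleanup on a whole DataFrame is sketched below. It assumes US-style formatting ('$' and ',' as the only decoration) and converts a column only when every non-missing value parses, so description columns stay as text. coerce_amount_columns is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def coerce_amount_columns(df):
    """Convert text columns to numbers when every non-missing value parses."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # pandas already inferred a numeric type
        # Strip dollar signs and thousands separators, then try to parse
        text = out[col].astype(str).str.replace(r"[$,]", "", regex=True).str.strip()
        converted = pd.to_numeric(text, errors="coerce")
        filled = out[col].notna().sum()
        # Keep the conversion only if no non-missing value failed to parse
        if filled > 0 and converted.notna().sum() == filled:
            out[col] = converted
    return out
```

Applied to each table before writing, for example coerce_amount_columns(df).to_excel(...), this leaves mixed columns alone instead of silently turning unparseable cells into blanks.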
Method 4: AI extraction (for invoices, statements, forms, and scans)
For documents where the data isn't a simple table — invoices with vendor headers and line items, bank statements with account info plus transaction rows, forms with field-value pairs — or for any scanned PDF, AI extraction produces a clean Excel file with proper structure: header strip on one sheet, data rows on another, numeric values already typed correctly.
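The two-sheet layout described above might be produced along these lines. The article does not name a specific extraction service, so the shape of the result dictionary (a "header" mapping plus a list of "line_items") is purely hypothetical, as are the function names. Only the openpyxl calls are real library API.

```python
def build_sheets(result):
    """Turn one extraction result into rows for a header sheet and a data sheet.

    `result` has a hypothetical shape:
    {"header": {...}, "line_items": [{...}, ...]}
    """
    header_rows = [["Field", "Value"]]
    header_rows += [[key, value] for key, value in result["header"].items()]

    items = result["line_items"]
    # Use the keys of the first item as column headers (assumes uniform items)
    columns = list(items[0].keys()) if items else []
    item_rows = [columns] + [[item.get(col) for col in columns] for item in items]
    return {"Header": header_rows, "Line items": item_rows}


def write_workbook(sheets, path):
    """Write each named list of rows to its own worksheet."""
    from openpyxl import Workbook  # imported here so build_sheets has no dependencies

    wb = Workbook()
    wb.remove(wb.active)  # drop the default empty sheet
    for name, rows in sheets.items():
        ws = wb.create_sheet(title=name)
        for row in rows:
            ws.append(row)
    wb.save(path)
```

Because the values arrive as Python numbers rather than formatted strings, openpyxl writes them as numeric cells and no currency stripping is needed afterwards.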
Why 'PDF to Excel' tools often produce garbage
Most online 'PDF to Excel' converters take a shortcut: they rasterize the page, run OCR, and dump the raw text into cells. The result looks like the original PDF but isn't actually structured — all data in column A, numbers as text strings you can't sum, merged cells that break filters. The useful conversion isn't 'put the text where it appears', it's 'understand what the data is and put it in columns'.
Getting proper numeric values in Excel
The most common problem after extraction is amount fields that look like numbers but that Excel treats as text. The signs: values are left-aligned in the cell, SUM returns 0, and arithmetic returns a #VALUE! error. To fix it, use Data → Text to Columns → Delimited → Finish, wrap the cell in a VALUE() formula, or fix the currency-symbol stripping in the extraction step.
Frequently asked questions
How do I convert a PDF to Excel for free?
Tabula (free desktop app) is the best free option for clean PDF tables — no code required. For Python automation, pdfplumber + openpyxl is free and handles most machine-generated PDFs. For scanned PDFs or complex documents like invoices, AI extraction is needed.
Why are my PDF numbers showing as text in Excel?
The extraction tool preserved the currency symbols or comma-thousands separators from the PDF, so Excel sees '$1,234.56' as text, not a number. Strip the symbols during extraction (or after with Text to Columns), or use an extraction tool that returns typed numeric values.
How do I convert a scanned PDF to Excel?
Scanned PDFs have no text layer, so Tabula, pdfplumber, and tabula-py return nothing. Use AI extraction, which applies OCR automatically and returns structured data with correct types.
How do I convert an invoice PDF to Excel with line items?
Standard PDF-to-Excel tools dump the raw layout — vendor name in one cell, line items mixed with totals. Use AI invoice extraction, which returns a properly structured Excel: vendor/date/number in a header strip, one line item per row in the data table.