How to extract data from a bank statement PDF
Four methods to pull transactions, balances, and account info from bank statement PDFs — from copy-paste to Python to AI — with export to Excel or CSV.
Bank statements are notoriously difficult to extract data from: multi-page transaction tables, inconsistent column orders between banks, currency symbols baked into number cells, and page footers that break copy-paste. Here are four methods that actually work.
Method 1: Copy-paste into Excel
Open the PDF in your browser or Acrobat, select the transaction table, and paste into Excel. Works on about half of digital bank statement PDFs — the rest either collapse into a single column or paste transactions and descriptions into separate rows. If the bank's PDF generator uses complex table layouts, this will fail immediately.
Best for: one-off statements from a bank whose PDF pastes cleanly. Fails on: scanned statements, PDFs with complex column layouts, statements with multi-line description fields.
Method 2: Tabula or Camelot (free tools)
Tabula (free desktop app) lets you draw a box around a transaction table and exports it as CSV. Camelot is a Python library with two parsing modes — lattice (for tables with visible grid lines) and stream (for tables defined by whitespace).
- pip install "camelot-py[cv]" pandas
- import camelot, pandas as pd
- tables = camelot.read_pdf('statement.pdf', pages='all', flavor='lattice')
- pd.concat([t.df for t in tables], ignore_index=True).to_csv('transactions.csv', index=False)  # all pages, not only tables[0]
- If lattice fails, try flavor='stream' — many banks use whitespace-delimited tables.
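The lattice-then-stream fallback can be wrapped in a small helper. This is a minimal sketch, not a definitive implementation: the function name read_with_fallback and its reader parameter are hypothetical, and it assumes that "lattice fails" means Camelot returned zero tables. It uses only camelot.read_pdf with the pages and flavor arguments shown above.

```python
from typing import Any, Callable, Optional


def read_with_fallback(path: str, reader: Optional[Callable[..., Any]] = None) -> Any:
    """Try Camelot's lattice mode first, then stream if no tables were found.

    `reader` is a hypothetical hook for swapping in another function with the
    same signature; by default camelot.read_pdf is used.
    """
    if reader is None:
        import camelot  # imported lazily so this module loads without Camelot

        reader = camelot.read_pdf

    tables = reader(path, pages="all", flavor="lattice")
    if len(tables) == 0:
        # Assumption: zero tables means there are no ruled grid lines,
        # so retry with the whitespace-based parser.
        tables = reader(path, pages="all", flavor="stream")
    return tables
```

Calling read_with_fallback('statement.pdf') returns whatever Camelot found. Lattice can also return tables that are present but badly split, so it is still worth inspecting the result before exporting.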
Limitations: both tools fail on scanned statements (no text layer). They also struggle with multi-page tables where the column header only appears on the first page — transactions on subsequent pages lose their column context.
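One way to restore column context across pages is to take the header from the first page and apply it to the rows of every later page. The sketch below works on plain lists of rows (for example, each Camelot table converted with table.df.values.tolist()). The function name stitch_pages is hypothetical, and it assumes the first non-empty row of the first page is the header and that every page has the same column order.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[str]]


def stitch_pages(pages: Sequence[Sequence[Row]]) -> List[List[str]]:
    """Merge per-page row lists into one table with a single header row."""
    header: List[str] = []
    body: List[List[str]] = []
    for rows in pages:
        for row in rows:
            cells = [(cell or "").strip() for cell in row]
            if not any(cells):
                continue  # skip rows that are entirely empty
            if not header:
                header = cells  # assumption: first non-empty row is the header
            elif cells == header:
                continue  # header repeated at the top of a later page
            else:
                body.append(cells)
    return [header] + body if header else []
```

This handles both cases: statements that repeat the header on every page (the repeats are dropped) and statements that print it only once (later rows simply inherit it).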
Method 3: Python with pdfplumber
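pdfplumber gives you lower-level control over how tables are reconstructed: through its table settings you can choose whether cell boundaries come from drawn lines or from text alignment (the vertical_strategy and horizontal_strategy options). It takes longer to set up, but handles more edge cases than Camelot on statements with unusual layouts.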
- pip install pdfplumber pandas
- import pdfplumber, pandas as pd
- with pdfplumber.open('statement.pdf') as pdf:
-     rows = [row for page in pdf.pages for row in (page.extract_table() or [])]
- df = pd.DataFrame(rows[1:], columns=rows[0])
- df.to_csv('transactions.csv', index=False)
You'll still need to clean the output — amount fields may have currency symbols, negative amounts may use parentheses instead of minus signs, and descriptions often span two rows in the original PDF and need merging. Budget 30-60 minutes of cleanup per new bank format.
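The cleanup steps described above might look like the following sketch. It uses only the standard library. The helper names and the column keys 'date', 'description', and 'amount' are hypothetical, and parse_amount assumes '.' is the decimal separator and ',' groups thousands, which will not hold for every bank.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Dict, List, Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Convert strings like '$1,234.56', '(45.00)' or '-£12.00' to Decimal."""
    raw = text.strip()
    if not raw:
        return None
    # Parentheses or a minus sign (leading or trailing) both mean negative.
    negative = (raw.startswith("(") and raw.endswith(")")) or "-" in raw
    digits = re.sub(r"[^0-9.]", "", raw)  # drop currency symbols and commas
    try:
        value = Decimal(digits)
    except InvalidOperation:
        return None  # not a recognisable number
    return -value if negative else value


def merge_wrapped_descriptions(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Fold rows with no date and no amount into the previous description."""
    merged: List[Dict[str, str]] = []
    for row in rows:
        is_continuation = (
            not row.get("date", "").strip() and not row.get("amount", "").strip()
        )
        if is_continuation and merged:
            extra = row.get("description", "").strip()
            previous = merged[-1].get("description", "")
            merged[-1]["description"] = f"{previous} {extra}".strip()
        else:
            merged.append(dict(row))  # copy so the input is not mutated
    return merged
```

Decimal is used instead of float so that amounts keep their exact cents when summed. Statements that mark debits with a 'DR' suffix or use ',' as the decimal separator would need extra rules.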
Method 4: AI-powered extraction (handles any bank, any layout, scanned or digital)
AI extraction reads the statement semantically — it understands that a column labeled 'CR' means credit regardless of its position, that parentheses mean debits, and that a running balance column exists even if the header says 'Balance fwd'. Upload the PDF and describe what you want, or use the bank statement preset.
This handles the cases the other methods miss: scanned statements (no text layer), statements where transactions wrap across two lines, PDFs where the bank uses a non-standard column order, and foreign currency accounts where amounts need normalization.
What you can extract from a bank statement
- Account info — bank name, account holder, account number, IBAN/routing number
- Statement period — start date and end date
- Opening and closing balance
- Every transaction — date, description, debit/credit amount, running balance
- Categorized transactions — if you describe the categories in the extraction prompt
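As an illustration, the fields above could be held in a structure like the one below. This is a hypothetical schema, not the output format of any particular tool, and all class and field names are assumptions. The reconciles method is a simple sanity check for any extraction method: opening balance plus all transaction amounts should equal the closing balance.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass
class Transaction:
    posted: date
    description: str
    amount: Decimal  # signed: credits positive, debits negative
    balance: Optional[Decimal] = None  # running balance, if the statement has one
    category: Optional[str] = None


@dataclass
class Statement:
    bank_name: str
    account_holder: str
    account_number: str
    period_start: date
    period_end: date
    opening_balance: Decimal
    closing_balance: Decimal
    transactions: List[Transaction] = field(default_factory=list)

    def reconciles(self) -> bool:
        """True when opening balance plus all amounts equals closing balance."""
        total = sum((t.amount for t in self.transactions), Decimal("0"))
        return self.opening_balance + total == self.closing_balance
```

If reconciles() returns False, a row was probably dropped, duplicated, or parsed with the wrong sign.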
Scanned vs digital statements
Older statements (pre-2010), some foreign bank statements, and statements exported from certain banking apps come as image-only PDFs with no text layer. Copy-paste, Tabula, Camelot, and pdfplumber will return nothing. For scanned statements, run OCR before extraction, or use an AI extractor that applies OCR automatically.
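A quick way to tell which kind of PDF you have is to check whether any text can be extracted at all. The sketch below uses pdfplumber's page.extract_text(). The function names and the min_chars threshold of 20 are assumptions chosen for illustration, not values from any library.

```python
from typing import Iterable, List, Optional


def has_text_layer(page_texts: Iterable[Optional[str]], min_chars: int = 20) -> bool:
    """Return True if the extracted page texts contain a meaningful amount of text."""
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars  # threshold is an arbitrary assumption


def page_texts_from_pdf(path: str) -> List[Optional[str]]:
    """Collect the extracted text of every page using pdfplumber."""
    import pdfplumber  # lazy import so has_text_layer works without it

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

For example, has_text_layer(page_texts_from_pdf('statement.pdf')) returning False suggests an image-only scan that needs OCR before any of the text-based methods will work.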
Processing multiple months at once
If you need a full year of transactions from twelve monthly statements, upload each PDF separately and concatenate the transaction tables in Excel or pandas. Sort by date to restore chronological order across months. AI extraction returns the same schema for every statement from the same account, so the concatenation is a simple row-append.
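The append-and-sort step can be sketched with the standard library alone. The function name combine_statements, the 'date' key, and the default ISO date format are assumptions; adjust date_format to whatever your extracted rows actually contain.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def combine_statements(
    monthly_rows: Iterable[List[Dict[str, str]]],
    date_format: str = "%Y-%m-%d",
) -> List[Dict[str, str]]:
    """Append the rows of every statement, then sort by the 'date' column."""
    combined = [row for rows in monthly_rows for row in rows]
    # sorted() is stable, so transactions on the same day keep the order
    # they had in their original statement.
    return sorted(
        combined, key=lambda row: datetime.strptime(row["date"], date_format)
    )
```

Each element of monthly_rows can come from csv.DictReader over one exported file. Parsing the dates before sorting matters: sorting '01/02/2024' style strings as text would not give chronological order.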
Frequently asked questions
How do I extract transactions from a bank statement PDF?
Upload the PDF to an AI extractor or try Tabula for a free option. For code, use pdfplumber or Camelot in Python. Copy-paste works on some digital PDFs but fails on scanned statements and complex layouts.
Can I extract data from a scanned bank statement?
Yes, but only with OCR-capable tools. Tabula, Camelot, and pdfplumber require a text layer and return nothing on scanned PDFs. An AI extractor applies OCR automatically and works on both scanned and digital bank statement PDFs.
How do I convert a bank statement PDF to Excel?
Upload the PDF to ExtractFox's bank statement extractor and download the result as .xlsx. The output has one row per transaction with date, description, amount, and balance, plus account metadata on a separate sheet.
Why does copy-paste from a bank statement PDF give messy results?
A PDF stores text as individually positioned fragments, not as table cells, so the column boundaries you see on screen are not part of the text itself. When you paste, cells collapse into a single column or transactions split across two rows. Tools like Tabula or AI extraction read the visual structure instead of relying on the text order.