Workflow | June 28, 2026 | 4 min read

How to extract data from financial statements

Pull revenue, profit, balance sheet, and cash flow numbers from P&L, income statement, and financial report PDFs — with Python, pandas, and AI extraction.

By Dawid Sibinski

Financial statements — income statements, balance sheets, cash flow statements — are where the numbers everyone wants actually live. The challenge is getting them out of PDFs (or Excel files with merged cells and subtotals) in a clean, comparable form.

What counts as a financial statement

Income statement (P&L) — revenue, cost of goods, gross profit, operating expenses, EBITDA, net income
Balance sheet — assets, liabilities, equity
Cash flow statement — operating, investing, financing activities; free cash flow
These appear in annual reports (10-K, 20-F), quarterly reports (10-Q), earnings releases, pitch decks, and standalone CFO reports.

Method 1: Python + pandas for Excel-format financials

If the financial statements are in Excel or Google Sheets (not PDF), pandas handles them directly.

pip install pandas openpyxl
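import pandas as pd
# header=2 uses the third row as column names, skipping the company name and report title rows
df = pd.read_excel('financials.xlsx', sheet_name='Income Statement', header=2)
# case=False also catches 'Net revenues' and 'Total revenue'; it can match 'Cost of revenue' too, so inspect the result
revenue = df[df.iloc[:, 0].astype(str).str.contains('revenue', case=False)]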

The hard part: every company structures their financial Excel differently. The revenue row might be labeled 'Revenue', 'Net revenues', 'Total revenue', 'Sales', or 'Net sales'. Merged cells for subtotals break pandas indexing. Multi-year comparison tables often have unnamed columns. Expect cleanup work.
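One way to tame the labeling problem is to map each row label to a canonical field name before you look anything up. The sketch below is a minimal illustration, not a complete taxonomy: the alias table and the function names (normalize_label, canonical_label) are assumptions made up for this example, and you would extend the aliases with whatever labels your own files use.

```python
import re

# Hypothetical alias table: extend it with the labels you actually encounter.
LABEL_ALIASES = {
    "revenue": {"revenue", "revenues", "net revenue", "net revenues",
                "total revenue", "total revenues", "sales", "net sales"},
    "net_income": {"net income", "net earnings", "net income loss"},
}

def normalize_label(label):
    """Lowercase a row label and replace anything that is not a letter with a space."""
    text = re.sub(r"[^a-z]+", " ", str(label).lower())
    return text.strip()

def canonical_label(label):
    """Return the canonical field name for a row label, or None if unknown."""
    cleaned = normalize_label(label)
    for field, aliases in LABEL_ALIASES.items():
        if cleaned in aliases:
            return field
    return None

labels = ["Net revenues", "Cost of revenue", "Total Revenue:", "Net income (loss)", None]
mapped = [canonical_label(label) for label in labels]
print(mapped)
```

Because the match is exact after normalization, 'Cost of revenue' is not mistaken for revenue. With pandas you could apply the same function to the label column, for example df.iloc[:, 0].map(canonical_label), and filter on the result.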

Method 2: pdfplumber + Camelot for PDF financial tables

For PDF financial statements with clear table structure (most modern 10-Ks), Camelot extracts the tables well.

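pip install "camelot-py[cv]" pandas
import camelot
# flavor='lattice' needs ruled lines between cells; try flavor='stream' for whitespace-separated tables
tables = camelot.read_pdf('10k.pdf', pages='45-47', flavor='lattice')
income_df = tables[0].df  # inspect each table to find the right index
# the header text sits in the first data rows, so skip the default integer column labels
income_df.to_csv('income_statement.csv', index=False, header=False)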

Limitation: you need to know which pages the financial statements are on (varies per filing), and the column headers are often split across two rows in the PDF so you need to merge them manually. Scanned annual reports won't work — Camelot requires a text layer.
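The manual header merge can be scripted once you know which rows hold the header text. The following is a rough sketch that assumes the header occupies exactly two rows and that a blank cell in the top row belongs to the nearest labeled cell on its left; merge_header_rows is a hypothetical helper, and the sample cells are invented for illustration.

```python
def merge_header_rows(top, bottom):
    """Combine two header rows into one list of column names.

    Blank cells in the top row inherit the nearest non-blank cell to their
    left, which is how a spanning header such as 'Year ended December 31'
    often comes out of a PDF table.
    """
    merged = []
    carried = ""
    for upper, lower in zip(top, bottom):
        upper, lower = str(upper).strip(), str(lower).strip()
        if upper:
            carried = upper
        merged.append(" ".join(part for part in (carried, lower) if part))
    return merged

# Invented example cells, shaped like the first two rows of an extracted table.
top = ["", "Year ended December 31", "", ""]
bottom = ["(in millions)", "2025", "2024", "2023"]
columns = merge_header_rows(top, bottom)
print(columns)
```

Applied to the Camelot output, this could look like income_df.columns = merge_header_rows(income_df.iloc[0], income_df.iloc[1]) followed by income_df = income_df.iloc[2:]. Check the result by eye: the carry-left rule will mislabel a column if the spanning header does not actually cover it.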

Method 3: AI extraction for any financial statement format

AI extraction is the right approach when you're dealing with inconsistent labeling across companies, scanned or image-based PDFs, or statements that span multiple pages, or when you need the same fields from many companies for comparison.

Tool

Financial Statement Extractor

Upload a 10-K, annual report, earnings release, or standalone financial statement PDF and extract the income statement, balance sheet, or cash flow statement — all years shown — as a clean table. Revenue, gross profit, operating income, net income, EPS. Export to Excel or JSON.

Extracting the same metrics across many companies

Competitive financial analysis — pulling the same line items from five competitors' annual reports — is where AI extraction pays off most. You describe the fields once ('revenue, gross margin, operating income, net income by year') and run the same extraction on each company's filing. The schema is consistent even when company A calls it 'net revenues' and company B calls it 'total sales'.

Common pitfalls when extracting financial data

Units: numbers may be in thousands or millions — check the table header for '(in thousands)'
Currency: multi-national companies may report segment data in local currency
Restated figures: prior-year numbers may be restated from what was originally reported
Non-GAAP metrics: 'adjusted EBITDA' and 'non-GAAP operating income' aren't standard — label them clearly
Negative numbers: losses may be shown in parentheses instead of with a minus sign
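Two of these pitfalls, units and parenthesized negatives, are easy to handle in code. Here is a minimal sketch, assuming US-style number formatting (comma as thousands separator, period as decimal point); detect_multiplier and parse_amount are hypothetical helper names, not part of any library.

```python
import re

UNIT_MULTIPLIERS = {"thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000}

def detect_multiplier(header_text):
    """Find an '(in thousands)'-style note in a table header; default to 1."""
    match = re.search(r"in\s+(thousands|millions|billions)", header_text.lower())
    return UNIT_MULTIPLIERS[match.group(1)] if match else 1

def parse_amount(cell, multiplier=1):
    """Convert a statement cell such as '(1,234)' or '$ 5,678' to a float.

    Returns None for blanks, dash placeholders, and non-numeric text.
    """
    text = str(cell).strip()
    if text in {"", "-", "\u2013", "\u2014"}:  # hyphen, en dash, em dash
        return None
    negative = text.startswith("(") and text.endswith(")")
    digits = re.sub(r"[^0-9.\-]", "", text)  # drop currency symbols and commas
    try:
        value = float(digits)
    except ValueError:
        return None
    if negative:
        value = -abs(value)
    return value * multiplier

scale = detect_multiplier("(in thousands, except per share data)")
print(parse_amount("(1,234)", scale))
print(parse_amount("$ 5,678.5"))
```

Note that per-share figures such as EPS are usually excluded from the '(in thousands)' scaling, so apply the multiplier only to the rows it covers.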

Frequently asked questions

How do I extract revenue and profit from a financial statement PDF?

For text-based PDFs, use Camelot to extract the income statement table by page range. For scanned PDFs or when you need a consistent schema across companies, use AI extraction — describe the fields you want and get structured output regardless of how each company labels their line items.

How do I extract balance sheet data from a PDF?

Same approach as the income statement. Camelot works on digital PDFs with clear table structure. AI extraction handles any format and lets you specify the exact balance sheet fields you want — total assets, current assets, long-term debt, equity — as named fields in the output.

Can I extract financial data from a scanned annual report?

Yes, with AI extraction, which applies OCR automatically. Camelot and pdfplumber require a text layer and return nothing on scanned PDFs.

How do I compare financial metrics across multiple companies?

Run the same AI extraction prompt on each company's filing. The output uses consistent field names regardless of how each company labels their line items — so a JOIN across the JSON outputs gives you a comparable table.
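The combining step might look something like the sketch below. It assumes each extraction returns one JSON object per company and year with identical field names; the JSON shape, the field names, and the figures are all made-up placeholders for illustration, not the output format of any particular tool.

```python
import json

# Hypothetical extraction outputs: one JSON document per company, same schema.
# The figures are invented placeholders, not taken from real filings.
raw_outputs = [
    '{"company": "Company A", "year": 2025, "revenue": 1200, "net_income": 150}',
    '{"company": "Company B", "year": 2025, "revenue": 800, "net_income": -20}',
]
FIELDS = ["revenue", "net_income"]

def build_comparison(outputs, fields):
    """Map (company, year) to the requested fields, failing loudly on gaps."""
    table = {}
    for raw in outputs:
        record = json.loads(raw)
        missing = [f for f in fields if f not in record]
        if missing:
            raise KeyError(f"{record.get('company')} is missing {missing}")
        table[(record["company"], record["year"])] = {f: record[f] for f in fields}
    return table

comparison = build_comparison(raw_outputs, FIELDS)
for (company, year), values in sorted(comparison.items()):
    margin = values["net_income"] / values["revenue"]
    print(f"{company} {year}: revenue={values['revenue']} net margin={margin:.1%}")
```

Raising on a missing field is a deliberate choice here: a silent gap in one company's column is harder to spot later than an error at load time. If you prefer a table, pd.DataFrame([json.loads(raw) for raw in raw_outputs]) gives the same records as a pandas DataFrame.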