Workflow · June 28, 2026 · 5 min read

How to extract data from an annual report

Four practical methods to pull financials, tables, and metrics from annual report and 10-K PDFs — from copy-paste to Python to AI-powered extraction.

By Dawid Sibinski

Annual reports and 10-K filings pack the numbers you need — revenue, net income, segment breakdowns, EPS — into dense PDFs that weren't designed for copying. Here are four methods to get the data out, ordered from simplest to most powerful.

Method 1: Copy-paste (works for one table)

Open the PDF in Adobe Acrobat or your browser. Click and drag to select a table. Paste into Excel. This works about half the time on text-based PDFs — the rest of the time, the columns collapse or the numbers land in the wrong cells.

Use this when you need one table from one filing and you can spend five minutes cleaning it up. It breaks immediately on multi-page tables, scanned reports, and anything with merged cells.

Method 2: Tabula (free, open-source)

Tabula is a free desktop app that draws a selection box around a table in a PDF and exports the contents as CSV. It works well on clean, text-based PDFs where the columns are visually obvious.

Download Tabula (tabula.technology) and open your annual report PDF.
Draw a selection box around the income statement or balance sheet.
Click Export and choose CSV or Excel.
Repeat for each table you need.

Limitations: Tabula can't read scanned PDFs (it needs a text layer), it fails on tables that span multiple pages, and you have to select each table manually — so extracting five years of filings is tedious.

Method 3: Python with pdfplumber

For batch extraction — pulling the income statement from twelve quarterly filings, or extracting the same metric across a dozen competitors — Python gives you a repeatable pipeline.

pdfplumber is the most reliable library for table extraction from text-based PDFs:

pip install pdfplumber pandas

import pdfplumber
import pandas as pd

# pages is a 0-based list, so pages[12] is the 13th page; the right page varies by filing
with pdfplumber.open('10k.pdf') as pdf:
    tables = pdf.pages[12].extract_tables()  # empty list if no table is detected

df = pd.DataFrame(tables[0][1:], columns=tables[0][0])  # first row becomes the header
df.to_csv('income_statement.csv', index=False)

The hard part is that page numbers and table positions change between filers and between years. You'll need to inspect each PDF to find the right page, and the column names will vary ("Net revenues" vs "Total revenue" vs "Revenue, net"). For consistent schema extraction across multiple companies, you need something that understands the semantics, not just the layout.
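One way to avoid hard-coding a page index is to search each page's text for the statement heading. The sketch below is a rough illustration, not a complete solution: the heading variants in STATEMENT_HEADINGS and the helper names are assumptions made for this example, and real filings may use other wording.

```python
import re

# Illustrative heading variants (assumed); extend this tuple for your filers.
STATEMENT_HEADINGS = (
    "consolidated statements of operations",
    "consolidated statements of income",
    "consolidated income statement",
)


def normalize(text):
    """Lowercase and collapse whitespace so a heading split across lines still matches."""
    return re.sub(r"\s+", " ", text or "").strip().lower()


def find_statement_pages(page_texts, headings=STATEMENT_HEADINGS):
    """Return the 0-based indexes of pages whose text contains any of the headings."""
    hits = []
    for index, text in enumerate(page_texts):
        clean = normalize(text)
        if any(heading in clean for heading in headings):
            hits.append(index)
    return hits


def load_page_texts(pdf_path):
    """Read the text of every page with pdfplumber.

    pdfplumber is imported here so the pure helpers above can be used
    (and tested) without it installed.
    """
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can come back empty on pages with no text layer
        return [page.extract_text() or "" for page in pdf.pages]
```

A call such as find_statement_pages(load_page_texts('10k.pdf')) returns candidate page indexes to pass to pdf.pages[...]. The table of contents usually mentions the same heading, so expect more than one hit and inspect each candidate before extracting.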

Method 4: AI-powered extraction (handles scans, inconsistent layouts, any schema)

AI extraction reads the document semantically instead of by layout coordinates. You describe what you want — "income statement with revenue, operating income, and net income by year" — and the model finds and normalizes the data regardless of how the filing is formatted.

Tool

Annual Report Extractor

Drop a 10-K or annual report PDF and get income statement, balance sheet, cash flow, segment data, and risk factors back as JSON or Excel — across all years shown. Works on scanned and digital PDFs, no template needed.

This handles the cases that break Tabula and pdfplumber: scanned annual reports, multi-page tables, IFRS vs GAAP formatting differences, non-US currencies, and consistent field extraction across many companies with different layouts.

What you can extract from an annual report

Income statement — revenue, gross profit, operating income, net income, EPS by year
Balance sheet — total assets, current assets, liabilities, equity, cash, long-term debt
Cash flow — operating, investing, and financing activities; free cash flow; capex
Segment breakdown — revenue and profit by business unit or geography
Key metrics — growth rates, margins, guidance, KPIs
Risk factors — material risks as structured rows with title and summary
Executive compensation — named officers, base salary, stock awards, total comp
ESG data — emissions (scope 1/2/3), diversity %, board governance

Scanned vs digital PDFs

If your annual report was scanned — common for older filings and international companies — copy-paste and Tabula won't work at all. pdfplumber will return empty tables. AI extraction (method 4) handles scanned PDFs because it applies OCR before extraction.
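A quick check for a text layer can tell you which route to take before you start. The following is a minimal sketch under assumptions: the min_chars and threshold defaults are arbitrary starting points, and the function names are made up for this example.

```python
def text_layer_ratio(page_texts, min_chars=20):
    """Return the fraction of pages with at least min_chars non-whitespace characters."""
    if not page_texts:
        return 0.0
    pages_with_text = sum(
        1 for text in page_texts
        if len("".join((text or "").split())) >= min_chars
    )
    return pages_with_text / len(page_texts)


def looks_scanned(page_texts, threshold=0.5):
    """Guess that a PDF is scanned when fewer than `threshold` of its pages carry text."""
    return text_layer_ratio(page_texts) < threshold


def page_texts_from_pdf(pdf_path):
    """Collect per-page text with pdfplumber (imported lazily so the helpers above run without it)."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]
```

If looks_scanned(page_texts_from_pdf('report.pdf')) is true, skip methods 1 to 3 and go straight to an OCR-capable tool. Some scanned PDFs carry a low-quality OCR text layer, so a passing check does not guarantee clean tables.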

For most modern 10-K and 20-F filings from public companies, the PDF is text-based, so all four methods apply. If you're pulling older filings (pre-2010) or private-company annual reports, assume scanned and plan accordingly.

Extracting the same fields across multiple companies

Competitive analysis — pulling the same income statement metrics from five companies' annual reports — is where the difference between methods is most visible. Copy-paste and Tabula require manual effort per filing. Python scripts require per-filer tuning. AI extraction uses the same prompt on all filings and returns a consistent schema regardless of how each company labels their line items.
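For the Python route, much of the per-filer tuning comes down to mapping each company's labels onto one schema. Here is a hedged sketch of that mapping: the three revenue variants are the ones named earlier in this article, while the remaining synonyms and the field names are assumptions for illustration and are far from exhaustive.

```python
import re

# Canonical field -> label variants. Illustrative only; extend per filer.
FIELD_SYNONYMS = {
    "revenue": ["Net revenues", "Total revenue", "Revenue, net", "Revenue", "Net sales"],
    "net_income": ["Net income", "Net earnings"],
}


def _key(label):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", label.lower()).split())


# Reverse lookup from a normalized label to its canonical field name.
LOOKUP = {
    _key(synonym): field
    for field, synonyms in FIELD_SYNONYMS.items()
    for synonym in synonyms
}


def canonical_field(label):
    """Return the canonical field for a row label, or None if the label is unknown."""
    return LOOKUP.get(_key(label))


def normalize_rows(rows):
    """Turn (label, value) pairs into {canonical_field: value}, skipping unknown labels.

    When two labels map to the same field, the first one wins.
    """
    result = {}
    for label, value in rows:
        field = canonical_field(label)
        if field is not None and field not in result:
            result[field] = value
    return result
```

Exact-match lookup like this is deliberately strict: an unseen label returns None instead of a wrong guess, which makes gaps in the synonym list easy to spot.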

The output is a JSON array in which each element is one year and each key is a field you asked for, ready to load into Excel or import into a model.
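To open that kind of output in Excel, one option is to convert it to CSV with the Python standard library. This sketch assumes the shape described above (an array of per-year objects); the field names and numbers in the sample are placeholders, not real figures or a documented output format.

```python
import csv
import io
import json


def json_rows_to_csv(json_text, fields=None):
    """Convert a JSON array of flat objects into CSV text.

    If `fields` is omitted, columns appear in the order the keys are first seen.
    Keys missing from a row are written as empty cells.
    """
    rows = json.loads(json_text)
    if not isinstance(rows, list) or not all(isinstance(row, dict) for row in rows):
        raise ValueError("expected a JSON array of objects")

    if fields is None:
        fields = []
        for row in rows:
            for key in row:
                if key not in fields:
                    fields.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        restval="",              # fill missing keys with an empty cell
        extrasaction="ignore",   # drop keys not listed in `fields`
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder data for illustration only.
sample = '[{"year": 2023, "revenue": 100, "net_income": 10}, {"year": 2024, "revenue": 120}]'
print(json_rows_to_csv(sample))
```

Write the returned string to a .csv file and Excel will open it directly. Passing an explicit fields list keeps the column order identical across companies.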

Frequently asked questions

How do I extract data from an annual report PDF?

For a single table from a text-based PDF, copy-paste or Tabula work. For batch extraction across multiple filings, use Python with pdfplumber. For scanned reports or consistent cross-company extraction, use AI-powered extraction, which reads the document semantically.

Can I extract data from a scanned annual report?

Yes, but only with OCR-capable tools. Copy-paste, Tabula, and pdfplumber require a text layer and will return nothing on scanned PDFs. AI extraction applies OCR automatically and works on both scanned and digital PDFs.

How do I extract income statement data from a 10-K?

Open the 10-K PDF, find the Consolidated Statements of Operations (usually around page 50–80), and use Tabula to select the table, or use an AI extractor with a prompt like 'income statement with revenue, operating income, net income, and EPS by year'. The AI approach works even when the page number varies between filings.

How do I extract the same metrics from annual reports of multiple companies?

Python scripts can batch-process multiple PDFs but require tuning per filer since page numbers and column names differ. AI extraction is more reliable for cross-company comparisons: describe the fields once ('revenue, gross margin, net income by year') and run the same extraction on each company's filing.