
How to extract data from an invoice PDF or image

Pull vendor, line items, totals, and tax from any invoice — PDF, photo, or scan — using copy-paste, Python, or AI extraction. No templates needed.

Invoice data extraction sounds simple — vendor, dates, line items, total — but invoices from different suppliers look wildly different. The vendor name might be at the top or the bottom. Line items might be a table or a paragraph. Tax might be labeled 'VAT', 'GST', 'Tax', or nothing at all. Here are the methods that actually handle this variety.

Method 1: Copy-paste (works on tidy digital PDFs)

Open the invoice PDF in your browser, select the line-item table, and paste it into Excel. This works on about 40% of invoices, generally those where the PDF was generated by software rather than scanned. On the rest, the columns collapse or the amounts paste as text strings that Excel can't sum.
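When the amounts do paste as text strings, a few lines of Python can turn them back into numbers before summing. This is a minimal sketch that assumes US-style formatting (comma as thousands separator, period as decimal point); the helper name parse_amount is illustrative only.

```python
import re
from decimal import Decimal

def parse_amount(cell: str) -> Decimal:
    """Convert a pasted cell such as '$1,240.50' or '(35.00)' to a Decimal."""
    s = cell.strip()
    # Accounting-style negatives are written in parentheses
    negative = s.startswith('(') and s.endswith(')')
    # Drop currency symbols, thousands separators, spaces and parentheses
    s = re.sub(r'[^\d.\-]', '', s)
    value = Decimal(s)
    return -value if negative else value

pasted = ['$1,240.50', ' 99.00 ', '(35.00)']
total = sum(parse_amount(c) for c in pasted)
print(total)  # 1304.50
```

Decimal is used instead of float so that currency values add up without rounding drift.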

Method 2: Python with pdfplumber + regex

For a fixed supplier you process at volume — same ERP, same layout every time — a Python script pays off.

    pip install pdfplumber

    import pdfplumber, re

    with pdfplumber.open('invoice.pdf') as pdf:
        # "or ''" guards against pages that yield no text
        text = ' '.join(p.extract_text() or '' for p in pdf.pages)
    # \b keeps 'Total' from matching inside 'Subtotal'
    total = re.search(r'\b(?:Total|Amount Due)[:\s]+\$?([\d,]+\.\d{2})', text, re.I)
    if total:
        print(total.group(1))

The limitation: regex breaks the moment the supplier changes their template. 'Total', 'Grand Total', 'Amount Due', 'Invoice Total', 'Balance Due' all mean the same thing — you need a different pattern for each. For invoices from many suppliers, maintaining this becomes a project.
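The maintenance burden is easy to see in a sketch. The version below tries a list of known labels in order, most specific first; every new supplier wording means another entry. The label list and the function name find_total are assumptions for illustration, and the amount pattern still assumes a dollar-style figure with two decimals.

```python
import re

# Labels seen so far, most specific first so 'Grand Total' wins over 'Total'
TOTAL_LABELS = ['Grand Total', 'Invoice Total', 'Amount Due', 'Balance Due', 'Total']
AMOUNT = r'[:\s]+\$?([\d,]+\.\d{2})'

def find_total(text: str):
    """Return the first amount that follows a known label, or None."""
    for label in TOTAL_LABELS:
        # \b stops 'Total' from matching inside 'Subtotal'
        match = re.search(r'\b' + re.escape(label) + AMOUNT, text, re.I)
        if match:
            return match.group(1)
    return None

print(find_total('Subtotal: $100.00\nTax: $8.00\nBalance Due: $108.00'))  # 108.00
```

A supplier who prints 'Total payable' or puts the amount on the next line in a different format would still return None here, which is the point of the limitation.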

Method 3: Dedicated invoice parsing libraries

invoice2data (Python, open-source) uses YAML templates: you define the fields and patterns once per supplier, and it extracts them from every future invoice from that supplier. It is a good fit if you have under 20 suppliers and stable layouts. It falls apart when a supplier updates their system and the layout changes, because the template has to be updated to match.
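The per-supplier template idea can be sketched in plain Python. To be clear, this is not invoice2data's template format or API, only an illustration of the approach; the supplier name, keywords, and patterns are made up.

```python
import re

# One entry per supplier: keywords identify the supplier,
# fields map output names to regex patterns with one capture group.
TEMPLATES = [
    {
        'issuer': 'Acme Corp',
        'keywords': ['Acme Corp'],
        'fields': {
            'invoice_number': r'Invoice No\.?\s*(\w+)',
            'total': r'Amount Due[:\s]+\$?([\d,]+\.\d{2})',
        },
    },
]

def extract_with_templates(text: str):
    """Use the first template whose keywords all appear in the text."""
    for template in TEMPLATES:
        if all(keyword in text for keyword in template['keywords']):
            result = {'issuer': template['issuer']}
            for name, pattern in template['fields'].items():
                match = re.search(pattern, text)
                result[name] = match.group(1) if match else None
            return result
    return None  # unknown supplier: no template matched

sample = 'Acme Corp\nInvoice No. A1023\nAmount Due: $1,240.50'
print(extract_with_templates(sample))
# {'issuer': 'Acme Corp', 'invoice_number': 'A1023', 'total': '1,240.50'}
```

If Acme renames 'Amount Due' to something else, the total comes back as None until someone edits the template.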

Method 4: AI extraction (any invoice, any layout, no templates)

Multimodal AI reads invoices semantically: it understands that 'Bill To' means the customer regardless of position, that in a line-item table quantity × unit price normally equals the amount, and that the total is usually the largest figure near the bottom. No template needed, no regex to maintain.
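The same arithmetic relationships make a cheap sanity check on extracted data, whichever method produced it. The sketch below assumes a simple invoice where the line amounts sum to the subtotal and subtotal plus tax equals the total (no discounts or shipping); the field names and the check_invoice helper are hypothetical.

```python
from decimal import Decimal

def check_invoice(line_items, subtotal, tax, total, tolerance=Decimal('0.01')):
    """Return a list of problems; an empty list means the numbers agree."""
    problems = []
    for i, item in enumerate(line_items, start=1):
        expected = item['quantity'] * item['unit_price']
        if abs(expected - item['amount']) > tolerance:
            problems.append(f'line {i}: quantity x unit price is {expected}, not {item["amount"]}')
    lines_sum = sum(item['amount'] for item in line_items)
    if abs(lines_sum - subtotal) > tolerance:
        problems.append(f'subtotal {subtotal} does not match sum of lines {lines_sum}')
    if abs(subtotal + tax - total) > tolerance:
        problems.append(f'total {total} does not match subtotal + tax {subtotal + tax}')
    return problems

items = [
    {'quantity': Decimal('2'), 'unit_price': Decimal('500.00'), 'amount': Decimal('1000.00')},
    {'quantity': Decimal('3'), 'unit_price': Decimal('50.00'), 'amount': Decimal('150.00')},
]
print(check_invoice(items, Decimal('1150.00'), Decimal('90.50'), Decimal('1240.50')))  # []
```

Invoices that fail the check are good candidates for manual review rather than automatic posting.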

Invoice Extractor
Drop any invoice PDF, photo, or scan and get vendor, invoice number, dates, every line item with quantity and unit price, subtotal, tax, and total — as Excel, CSV, or JSON. Works on invoices from any country, any ERP, any layout. No signup required.

What you can extract from an invoice

  • Vendor — company name, address, tax ID / VAT number
  • Customer — bill-to name and address
  • Invoice number and purchase order reference
  • Issue date and payment due date
  • Line items — description, quantity, unit price, line total
  • Subtotal, tax (VAT/GST/HST), discounts, shipping
  • Total amount and currency
  • Payment terms and bank details (where printed)
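One way to hold these fields in code is a pair of dataclasses. This is a hypothetical schema for illustration, not the output format of any particular tool; the field names and the sample values are assumptions.

```python
from dataclasses import dataclass, field, asdict
from decimal import Decimal
from typing import List, Optional

@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal

@dataclass
class Invoice:
    vendor_name: str
    invoice_number: str
    issue_date: str                      # ISO 8601 date string
    total: Decimal
    currency: str
    # Fields below are optional because many invoices do not print them
    vendor_tax_id: Optional[str] = None
    customer_name: Optional[str] = None
    po_reference: Optional[str] = None
    due_date: Optional[str] = None
    subtotal: Optional[Decimal] = None
    tax: Optional[Decimal] = None
    payment_terms: Optional[str] = None
    line_items: List[LineItem] = field(default_factory=list)

invoice = Invoice(vendor_name='Acme Corp', invoice_number='A1023',
                  issue_date='2024-03-01', total=Decimal('1240.50'), currency='USD')
invoice.line_items.append(
    LineItem('Widgets', Decimal('2'), Decimal('500.00'), Decimal('1000.00')))
print(asdict(invoice)['line_items'][0]['description'])  # Widgets
```

Keeping the rarely printed fields optional lets the same structure hold both a detailed ERP invoice and a one-line receipt.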

Scanned vs digital invoices

Invoices received by email as PDF attachments are usually digital: they have a text layer. Invoices sent by post and scanned, or photos taken of paper invoices, have no text layer. pdfplumber returns nothing on scanned invoices, and so does invoice2data unless it is configured with an OCR input reader such as Tesseract. AI extraction applies OCR automatically and handles both.
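A quick way to tell the two apart is to check whether text extraction returns anything meaningful. The sketch below separates the decision (a pure function) from the pdfplumber call; the 20-character threshold is an arbitrary assumption, and both function names are illustrative.

```python
def has_text_layer(page_texts, min_chars=20):
    """True if the page texts hold at least min_chars non-whitespace characters."""
    count = sum(len(''.join((text or '').split())) for text in page_texts)
    return count >= min_chars

def pdf_has_text_layer(path):
    """Apply the check to a PDF file using pdfplumber."""
    import pdfplumber  # imported here so has_text_layer works without it installed
    with pdfplumber.open(path) as pdf:
        return has_text_layer(page.extract_text() for page in pdf.pages)

print(has_text_layer(['Invoice 1023\nTotal: $1,240.50']))  # True
print(has_text_layer([None, '', '  \n']))                   # False
```

Files that fail the check can be routed to an OCR or AI extraction path instead of the regex path.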

Processing invoices from many suppliers

For AP automation, meaning extraction from a high-volume accounts payable inbox, the practical setup is an API call per invoice. ExtractFox's API takes a file upload and returns the same JSON schema regardless of which supplier sent the invoice. You can write the AP system integration once and add new suppliers without touching the code.
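To illustrate why a single schema matters, here is a sketch of the "write once" integration side: it flattens one extracted invoice into one row per line item, ready for an AP import. The JSON field names are assumptions made for this example and are not ExtractFox's documented schema.

```python
import json

def invoice_to_rows(invoice: dict):
    """Flatten an extracted invoice into one row per line item."""
    header = {
        'vendor': invoice.get('vendor'),
        'invoice_number': invoice.get('invoice_number'),
        'currency': invoice.get('currency'),
    }
    # Each row repeats the header fields next to the line-item fields
    return [{**header, **item} for item in invoice.get('line_items', [])]

# Hypothetical API response body
payload = json.loads('''{
  "vendor": "Acme Corp",
  "invoice_number": "A1023",
  "currency": "USD",
  "line_items": [
    {"description": "Widgets", "quantity": 2, "unit_price": 500.0, "amount": 1000.0},
    {"description": "Brackets", "quantity": 3, "unit_price": 50.0, "amount": 150.0}
  ]
}''')
rows = invoice_to_rows(payload)
print(len(rows), rows[0]['vendor'], rows[1]['description'])  # 2 Acme Corp Brackets
```

Because the function only depends on the schema, an invoice from a new supplier goes through the same code path unchanged.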

Frequently asked questions

How do I extract line items from an invoice PDF?

For a quick one-off, copy-paste the table from the PDF into Excel. For any serious volume, use AI extraction — it returns every line with description, quantity, unit price, and amount in a table regardless of the invoice layout.

Can I extract invoice data from a scanned PDF or photo?

Yes, with AI extraction. Traditional tools like pdfplumber require a text layer and return nothing on scanned invoices. AI extraction applies OCR automatically and works on both scanned and digital PDFs, as well as phone photos of paper invoices.

How do I automate invoice data extraction?

Route invoice emails to a processor that sends each attached PDF to the ExtractFox API. The API returns JSON (vendor, invoice number, line items, totals) that your AP or ERP system ingests directly. No per-supplier templates to maintain.
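The first half of that pipeline, pulling PDF attachments out of an email, needs only the Python standard library. This is a minimal sketch; the function name pdf_attachments is illustrative, and the API call itself is left out because the endpoint details are not covered here.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def pdf_attachments(raw_email: bytes):
    """Return (filename, content_bytes) for every PDF attached to a raw message."""
    message = message_from_bytes(raw_email, policy=policy.default)
    found = []
    for part in message.iter_attachments():
        name = part.get_filename() or ''
        if part.get_content_type() == 'application/pdf' or name.lower().endswith('.pdf'):
            found.append((name, part.get_content()))
    return found

# Build a small demo message instead of reading a real inbox
demo = EmailMessage()
demo['Subject'] = 'Invoice A1023'
demo.set_content('Please find the invoice attached.')
demo.add_attachment(b'%PDF-1.4 demo', maintype='application',
                    subtype='pdf', filename='A1023.pdf')

files = pdf_attachments(demo.as_bytes())
print([name for name, _ in files])  # ['A1023.pdf']
```

Each (filename, bytes) pair is what the processor would then upload for extraction.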

What's the difference between invoice extraction and OCR?

OCR returns raw text — all the characters on the page in roughly their original order. Invoice extraction takes that text (or the raw image) and returns named fields: vendor='Acme Corp', total=1240.50, line_items=[...]. You need extraction logic on top of OCR to get structured data.
