Workflow · 4 min read

How to extract data from a receipt (photo or PDF)

Three ways to pull merchant, items, totals, and tax from receipt photos and PDFs — manual OCR apps, Python, and AI extraction for expense reporting and accounting.

Receipts are among the messiest documents to extract data from: thermal paper fades, photos arrive wrinkled or rotated, and no two merchants use the same layout. Here are three approaches that work in practice.

Method 1: OCR apps (for one-off receipts)

For occasional receipt capture, dedicated apps work well: Adobe Scan, Microsoft Lens, or Google Drive's built-in OCR (upload a photo, right-click → Open with → Google Docs) all extract text from a receipt photo. You get the raw text and then pull the fields out by hand.

The limitation: you still have to read the OCR output and find the merchant name, total, and line items yourself. For a single receipt that's fine; for twenty receipts a week, it's unsustainable.

Method 2: Python with pytesseract or easyocr

For batch receipt processing in a pipeline (expense management systems, accounting automation), Python lets you run OCR at scale.

  • pip install easyocr pillow
  • import easyocr
  • reader = easyocr.Reader(['en'])  # loads the English model (downloaded on first use)
  • result = reader.readtext('receipt.jpg', detail=0)  # detail=0 returns only the text strings
  • text = ' '.join(result)

The hard part isn't the OCR; it's parsing the output into structured fields. Merchant names appear in different positions. Totals may be labeled 'Total', 'Amount Due', or 'Grand Total', or may simply be the last dollar figure on the receipt. Tax shows up as 'Tax', 'GST', 'VAT', 'HST', or a percentage. Writing robust regex or parsing logic for all of this is a significant engineering project.
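To give a feel for that parsing step, here is a minimal sketch using only the label variants named above. It assumes the OCR output has already been grouped into text lines, with each label and its amount on the same line. Raw OCR boxes often do not arrive that way. The function names and patterns are illustrative, not a complete parser.

```python
import re
from decimal import Decimal

# Label variants from the paragraph above; real receipts will need more.
SUBTOTAL_RE = re.compile(r"\bsub[\s-]*total\b", re.I)
TOTAL_RE = re.compile(r"\b(?:grand\s+total|amount\s+due|total)\b", re.I)
TAX_RE = re.compile(r"\b(?:tax|gst|vat|hst)\b", re.I)
# Amounts with exactly two decimals, with or without thousands separators.
MONEY_RE = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2}")


def parse_amount(line):
    """Return the last money-like figure on a line, or None if there is none."""
    matches = MONEY_RE.findall(line)
    return Decimal(matches[-1].replace(",", "")) if matches else None


def parse_receipt_lines(lines):
    """Pull total and tax out of OCR text lines (assumed one label per line)."""
    fields = {"total": None, "tax": None}
    for line in lines:
        amount = parse_amount(line)
        if amount is None or SUBTOTAL_RE.search(line):
            continue  # no figure on this line, or it is the subtotal
        if TAX_RE.search(line):
            fields["tax"] = amount
        elif TOTAL_RE.search(line):
            fields["total"] = amount
    if fields["total"] is None:
        # Fallback: treat the last dollar figure on the receipt as the total.
        amounts = [a for a in map(parse_amount, lines) if a is not None]
        fields["total"] = amounts[-1] if amounts else None
    return fields
```

Even this small sketch needs a special case for subtotals and a fallback for unlabeled totals, which is where the edge-case work starts to accumulate.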

If you're processing receipts from a narrow set of known merchants with consistent layouts, the parsing logic is tractable. If you're processing receipts from any restaurant, retail store, or service provider, you'll spend more time on edge cases than the automation saves.

Method 3: AI extraction (structured output from any receipt photo)

Multimodal AI extraction reads the receipt the way a person does: it understands that 'SUBTOTAL', 'Sub-Total', and 'Sub Total' all mean the same thing, that the merchant name is usually at the top, often in a larger font, and that line items are the rows between the merchant header and the subtotal. You get a clean structured object back.

Tool: Receipt Extractor
Upload a receipt photo (HEIC, JPEG, PNG) or PDF and get back merchant name, date, every line item with quantity and price, subtotal, tax, total, and payment method — as Excel or JSON. Works on crumpled, rotated, and faded thermal receipts.

What you can extract from a receipt

  • Merchant name and address
  • Date and time
  • Line items — description, quantity, unit price
  • Subtotal, tax (GST/VAT/HST), tips, discounts
  • Total and payment method (cash, card, card type)
  • Receipt or transaction number
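One way to picture the structured result is as a typed record covering the fields listed above. The sketch below is illustrative only: the class and field names are assumptions made for this example, not the schema any particular extractor returns.

```python
from dataclasses import dataclass, field, asdict
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    description: str
    quantity: Decimal = Decimal("1")
    unit_price: Optional[Decimal] = None


@dataclass
class Receipt:
    # Every field is optional because faded or cropped receipts may omit it.
    merchant_name: Optional[str] = None
    merchant_address: Optional[str] = None
    date: Optional[str] = None          # e.g. "2024-05-01"
    time: Optional[str] = None          # e.g. "13:45"
    line_items: List[LineItem] = field(default_factory=list)
    subtotal: Optional[Decimal] = None
    tax: Optional[Decimal] = None       # GST/VAT/HST amount
    tip: Optional[Decimal] = None
    discount: Optional[Decimal] = None
    total: Optional[Decimal] = None
    payment_method: Optional[str] = None  # "cash" or "card"
    card_type: Optional[str] = None
    receipt_number: Optional[str] = None
```

Calling `asdict(receipt)` turns a record like this into a plain dictionary, which is convenient when the data needs to go into JSON or a spreadsheet row.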

Receipt photos vs PDFs

Phone photos of receipts are harder than PDFs: they may be rotated, poorly lit, or curled at the edges. Multimodal AI models handle this because they process the image visually rather than relying on a clean text layer. pytesseract struggles with rotated or low-contrast thermal paper; AI extraction copes with both far better.

Building a receipt processing pipeline

A common setup: employees forward receipt photos to a shared inbox, a Zapier/Make automation routes each attachment to ExtractFox's API, and the structured JSON lands in a Google Sheet or accounting tool. The engineering surface is one API call per receipt rather than a custom OCR pipeline.
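For a sense of what "one API call per receipt" could look like without Zapier or Make in the middle, here is a sketch that builds a multipart HTTP POST with the Python standard library. The endpoint URL, the form field name `file`, and the bearer-token header are all placeholders assumed for illustration; they are not ExtractFox's documented API, so check the real API reference before using any of them.

```python
import mimetypes
import urllib.request
import uuid

# Placeholder endpoint (an assumption, not a real ExtractFox URL).
EXTRACT_URL = "https://api.example.com/v1/receipts"


def build_receipt_request(image_bytes, filename, api_key, url=EXTRACT_URL):
    """Build (but do not send) a multipart/form-data POST with one receipt image."""
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        # "file" is an assumed form field name.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    request = urllib.request.Request(
        url, data=head + image_bytes + tail, method="POST"
    )
    request.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    request.add_header("Authorization", f"Bearer {api_key}")  # assumed auth scheme
    return request


# Sending it would look like this (not executed here):
#   with urllib.request.urlopen(build_receipt_request(data, "receipt.jpg", key)) as resp:
#       receipt = json.load(resp)
```

Building the request separately from sending it keeps the sketch easy to inspect and test without network access.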

Frequently asked questions

How do I extract data from a receipt photo?

Upload the photo to an AI extractor like ExtractFox's receipt extractor. It reads the image directly — no need to convert to PDF first — and returns merchant, items, totals, and tax as structured data you can export to Excel.

Can I extract data from a crumpled or faded receipt?

Yes, with a multimodal AI extractor. Traditional OCR (Tesseract, or pytesseract, its Python wrapper) struggles with low-contrast thermal paper. AI models trained on real-world receipt photos handle crumpling, rotation, and faded ink much better.

How do I extract receipt data in Python?

Use easyocr or pytesseract for the raw text layer, then parse the output with regex or an LLM call to extract merchant, total, and line items. For a lower-maintenance approach, call ExtractFox's API (an HTTP POST with the image file), which returns structured JSON directly, with no parser to write.

What's the best way to extract data from many receipts at once?

Batch processing via API. Upload each receipt to ExtractFox's API and collect the JSON responses. The output schema is consistent across all receipts — same field names regardless of merchant — so you can directly concatenate rows into a spreadsheet or database.
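Because every response shares the same field names, turning a batch into spreadsheet rows can be a short loop. The sketch below assumes each response has already been parsed into a dictionary; the keys used here (`merchant_name`, `date`, `line_items`, `total`, and so on) are hypothetical and should be swapped for whatever the real output schema uses.

```python
import csv
import io

COLUMNS = ["merchant_name", "date", "description", "quantity", "unit_price", "total"]


def receipts_to_rows(receipts):
    """Flatten parsed receipts into one row per line item."""
    rows = []
    for receipt in receipts:
        for item in receipt.get("line_items", []):
            rows.append({
                "merchant_name": receipt.get("merchant_name"),
                "date": receipt.get("date"),
                "description": item.get("description"),
                "quantity": item.get("quantity"),
                "unit_price": item.get("unit_price"),
                "total": receipt.get("total"),  # receipt-level total, repeated per row
            })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

One row per line item keeps the sheet easy to filter by merchant or date; a one-row-per-receipt layout would work equally well if only totals matter.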
