Workflow · June 28, 2026 · 4 min read

How to extract data from a purchase order PDF

Pull PO number, buyer, supplier, line items, and delivery terms from any purchase order PDF — with Python, ERP exports, and AI extraction for AP three-way matching.

By Dawid Sibinski

Purchase orders come from every ERP system on the market — SAP, Oracle, NetSuite, QuickBooks, Microsoft Dynamics, and custom-built systems. Each generates a different PDF layout. Extracting the same fields (PO number, buyer, supplier, line items, totals) from all of them without per-supplier templates is the challenge.

What to extract from a purchase order

PO header — PO number, PO date, buyer company and address, supplier company and address
Delivery info — ship-to address, delivery date, incoterms
Line items — line number, SKU/item code, description, quantity, unit price, line total
Payment terms — net 30, 2/10 net 30, etc.
Totals — subtotal, shipping, tax, grand total
Signature and approval references (where present)
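The field list above can be written down as a target schema before choosing an extraction method. The following is a minimal sketch in Python; the class and attribute names are assumptions made for illustration, not a standard or the schema of any particular tool.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    line_number: int
    sku: str
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal


@dataclass
class PurchaseOrder:
    # Header fields
    po_number: str
    po_date: str  # kept as a string here; parse to a date once the format is known
    buyer: str
    supplier: str
    # Delivery info and payment terms are optional because not every PO prints them
    ship_to: Optional[str] = None
    delivery_date: Optional[str] = None
    incoterms: Optional[str] = None
    payment_terms: Optional[str] = None
    line_items: List[LineItem] = field(default_factory=list)
    # Printed totals
    subtotal: Optional[Decimal] = None
    shipping: Optional[Decimal] = None
    tax: Optional[Decimal] = None
    grand_total: Optional[Decimal] = None

    def computed_subtotal(self) -> Decimal:
        """Sum of line totals, useful for checking the printed subtotal."""
        return sum((item.line_total for item in self.line_items), Decimal("0"))
```

Using Decimal rather than float avoids rounding drift when the computed subtotal is compared with the printed one.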

Method 1: ERP export (skip extraction where possible)

If you control the ERP that generated the PO, export the data directly rather than parsing the PDF. Most ERPs (SAP, Oracle, NetSuite) have API endpoints or export functions that return PO data as JSON or XML. The PDF is a human-readable representation; the source data is in the ERP. Extraction is for when you're on the receiving end of a PO from a buyer whose system you don't control.
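As a rough illustration of why the export route is simpler, here is a sketch that flattens an exported PO into one row per line item. The JSON payload and every key in it are hypothetical; real export formats differ per ERP and per configuration, so treat the field names as placeholders.

```python
import json
from typing import Dict, List

# Hypothetical export payload: the keys below are invented for illustration
# and will differ per ERP and per export configuration.
SAMPLE_EXPORT = """
{
  "po_number": "PO-2026-0042",
  "supplier": "Example Supplies Ltd",
  "lines": [
    {"sku": "A1", "quantity": 10, "unit_price": "2.50"},
    {"sku": "B2", "quantity": 5, "unit_price": "100.00"}
  ]
}
"""


def flatten_po(payload: str) -> List[Dict]:
    """Return one flat row per line item, each carrying the PO number."""
    po = json.loads(payload)
    return [{"po_number": po["po_number"], **line} for line in po["lines"]]
```

No layout parsing is involved: the structure is already in the data, which is the reason to prefer an export whenever you have access to the source system.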

Method 2: Python with Camelot for structured PO PDFs

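POs generated by major ERPs often contain well-formed tables. When the table has ruled cell borders, Camelot's lattice flavor (used below) handles it well; for tables separated only by whitespace, the stream flavor is the alternative.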

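# Shell: install the dependencies (quotes keep the shell from expanding the brackets)
pip install "camelot-py[cv]" pandas

# Python: read ruled tables from every page and save the line-item grid
import camelot
tables = camelot.read_pdf('purchase_order.pdf', flavor='lattice', pages='all')
if tables.n == 0:
    raise SystemExit('No ruled tables found; the stream flavor may suit this PDF better')
line_items = tables[0].df  # a pandas DataFrame; usually the first table is the line-item grid
line_items.to_csv('po_lines.csv', index=False)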

Camelot extracts the table but not the header metadata (PO number, buyer, supplier, dates). To get those, combine it with pdfplumber text extraction and regular expressions keyed to the labels printed on the page. This is workable for a single buyer whose PO template you know well, but it does not scale across many buyers, because each new layout needs its own patterns.
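A sketch of that header step might look like the following. The regular expressions are illustrative assumptions tuned to common English labels, not patterns taken from any specific ERP template, and the helper names are made up for this example. pdfplumber is imported lazily so the regex function can be tried on plain text first.

```python
import re
from typing import Dict, List, Optional

# Illustrative patterns only; real templates need their own label variants.
HEADER_PATTERNS = {
    "po_number": re.compile(
        r"\b(?:PO|Purchase\s+Order)\s*(?:No\.?|Number|#)\s*[:\-]?\s*"
        r"([A-Z0-9\-/]*\d[A-Z0-9\-/]*)",
        re.IGNORECASE,
    ),
    "po_date": re.compile(
        r"(?:PO|Order)\s+Date\s*[:\-]?\s*"
        r"(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})",
        re.IGNORECASE,
    ),
    "payment_terms": re.compile(
        r"Payment\s+Terms\s*[:\-]?\s*([^\n]+)", re.IGNORECASE
    ),
}


def extract_header_fields(text: str) -> Dict[str, Optional[str]]:
    """Apply each pattern to the page text; missing fields come back as None."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


def read_first_page_text(pdf_path: str) -> str:
    """Return the text layer of page 1, or an empty string if there is none."""
    import pdfplumber  # lazy import: only needed when reading a real PDF

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[0].extract_text() or ""
```

Typical use would be `extract_header_fields(read_first_page_text('purchase_order.pdf'))`. Buyer and supplier names are deliberately left out: they rarely follow a label, which is one reason this approach stays template-specific.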

Method 3: AI extraction (any buyer, any ERP-generated layout)

AI extraction treats a purchase order as a document rather than a grid of characters. A model trained on business documents can usually infer that the prominent company name near the top is the buyer, that 'PO No.', 'Purchase Order Number', and 'PO #' all refer to the same field, and that in the line-item table quantity × unit price = line total regardless of column order.
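Two of the regularities mentioned above, label synonyms and the line-total arithmetic, can also be used as post-extraction sanity checks whatever method produced the data. The sketch below is a plain rule-based illustration, not a description of how any AI model works internally; the alias table, function names, and tolerance are assumptions.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical alias table; extend it with whatever labels your buyers print.
LABEL_ALIASES = {
    "po no": "po_number",
    "po number": "po_number",
    "po #": "po_number",
    "po#": "po_number",
    "purchase order number": "po_number",
}


def normalize_label(label: str) -> Optional[str]:
    """Map a printed label to a canonical field name, or None if unknown."""
    key = re.sub(r"\s+", " ", label.strip().lower()).rstrip(".:").strip()
    return LABEL_ALIASES.get(key)


def line_total_is_consistent(
    quantity: Decimal,
    unit_price: Decimal,
    line_total: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Check quantity x unit price against the printed line total."""
    return abs(quantity * unit_price - line_total) <= tolerance
```

A row that fails the arithmetic check often means two columns were swapped or a digit was misread, so it is a cheap signal for routing a document to manual review.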

Tool

Purchase Order Extractor

Upload any PO PDF and get the header (PO number, buyer, supplier, date, delivery terms) plus every line item as a row with SKU, description, quantity, unit price, and line total. Export to Excel for a procurement spreadsheet or JSON for ERP import. No signup required.

AP three-way matching

Three-way matching, which verifies that the PO, goods receipt, and invoice all agree, requires extracting comparable fields from all three documents. Extract the PO line items, the received quantities, and the invoice line items, then match them by SKU and compare quantity and unit price. Any discrepancy in price or quantity is flagged for review. AI extraction returns the same schema for POs and invoices, making the comparison a straightforward data operation rather than a manual check.
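Once the three documents share a schema, the match itself is a few lines of code. The sketch below assumes one line per SKU per document and flags an invoice line when its price differs from the PO or when its quantity exceeds what was ordered or received; both the data shapes and that policy are assumptions, and real AP rules often add tolerances.

```python
from decimal import Decimal
from typing import Dict, List, NamedTuple


class Line(NamedTuple):
    sku: str
    quantity: Decimal
    unit_price: Decimal


def three_way_match(
    po: List[Line],
    received_qty: Dict[str, Decimal],
    invoice: List[Line],
) -> List[str]:
    """Return human-readable discrepancies; an empty list means a clean match."""
    po_by_sku = {line.sku: line for line in po}
    issues: List[str] = []
    for inv in invoice:
        po_line = po_by_sku.get(inv.sku)
        if po_line is None:
            issues.append(f"{inv.sku}: invoiced but not on the PO")
            continue
        if inv.unit_price != po_line.unit_price:
            issues.append(
                f"{inv.sku}: unit price {inv.unit_price} differs from PO {po_line.unit_price}"
            )
        if inv.quantity > po_line.quantity:
            issues.append(f"{inv.sku}: invoiced quantity exceeds ordered quantity")
        if inv.quantity > received_qty.get(inv.sku, Decimal("0")):
            issues.append(f"{inv.sku}: invoiced quantity exceeds received quantity")
    return issues
```

Returning a list of messages rather than a single pass/fail flag keeps the review queue informative: the approver sees which line failed and why.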

Receiving POs from many buyers

Suppliers typically receive POs from many buyers, each using their own ERP system and PO template. Per-buyer parsing templates break every time a buyer upgrades their system or changes a field label. AI extraction handles each new buyer's template automatically — no template to build, no maintenance when templates change.

Frequently asked questions

How do I extract line items from a purchase order PDF?

For a PO from a single known buyer with a consistent template, Camelot (Python) extracts the line-item table well. For POs from many buyers with different layouts, AI extraction returns the same schema (SKU, description, quantity, unit price, line total) regardless of layout.

How do I automate purchase order data entry from PDFs?

Route incoming PO emails to a processor that sends each PDF to an extraction API. The API returns JSON — PO number, buyer, line items, totals — that your ERP or procurement system ingests directly. No manual re-keying.
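The email-handling half of that pipeline can be sketched with Python's standard library alone. The `extract` callable below stands in for whichever extraction API you use; its signature (PDF bytes in, dict out) is an assumption, since no specific API is documented here.

```python
from email import message_from_bytes, policy
from typing import Callable, Dict, List, Tuple


def pdf_attachments(raw_email: bytes) -> List[Tuple[str, bytes]]:
    """Return (filename, content) for every PDF attached to a raw email."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    found: List[Tuple[str, bytes]] = []
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/pdf":
            filename = part.get_filename() or "attachment.pdf"
            found.append((filename, part.get_content()))
    return found


def process_email(
    raw_email: bytes,
    extract: Callable[[bytes], Dict],  # hypothetical client for your extraction API
) -> List[Dict]:
    """Send each attached PDF through the extractor and collect the results."""
    return [extract(content) for _filename, content in pdf_attachments(raw_email)]
```

Passing the extractor in as a parameter keeps the mail parsing testable without network access and makes it easy to swap providers later.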

Can I extract PO data from scanned or signed POs?

Yes, with AI extraction. Scanned POs (including POs with wet signatures that were scanned after signing) are usually image-only PDFs with no text layer, so Camelot and pdfplumber have no text to work with and return nothing useful. AI extraction applies OCR automatically.
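A quick way to tell which route a given PDF needs is to check whether it has a usable text layer. This is a minimal sketch: the 20-character threshold is an arbitrary assumption, and the helper names are made up for this example.

```python
from typing import List, Optional


def has_text_layer(page_texts: List[Optional[str]], min_chars: int = 20) -> bool:
    """True if the extracted page texts contain at least min_chars characters."""
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars


def pdf_page_texts(pdf_path: str) -> List[Optional[str]]:
    """Extract the text of each page with pdfplumber (may be empty for scans)."""
    import pdfplumber  # lazy import: only needed when reading a real PDF

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

If `has_text_layer(pdf_page_texts(path))` is false, the file is most likely a scan and needs OCR before any text-based method can work.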

How is a PO extractor different from an invoice extractor?

POs and invoices share similar structure (line items, buyer/seller, totals) but different fields matter. POs emphasize delivery terms, ship-to addresses, and item codes. Invoices emphasize payment terms, due dates, and tax. Using a purpose-built extractor for each ensures the right fields are returned in the right schema.