How to extract data from a purchase order PDF
Pull PO number, buyer, supplier, line items, and delivery terms from any purchase order PDF — with Python, ERP exports, and AI extraction for AP three-way matching.
Purchase orders come from every ERP system on the market — SAP, Oracle, NetSuite, QuickBooks, Microsoft Dynamics, and custom-built systems. Each generates a different PDF layout. Extracting the same fields (PO number, buyer, supplier, line items, totals) from all of them without per-supplier templates is the challenge.
What to extract from a purchase order
- PO header — PO number, PO date, buyer company and address, supplier company and address
- Delivery info — ship-to address, delivery date, incoterms
- Line items — line number, SKU/item code, description, quantity, unit price, line total
- Payment terms — net 30, 2/10 net 30, etc.
- Totals — subtotal, shipping, tax, grand total
- Signature and approval references (where present)
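The field list above can be written down as a target schema before choosing an extraction method. The following is a minimal sketch using Python dataclasses; the class and attribute names are illustrative assumptions, not a standard PO format.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass
class Party:
    name: str
    address: str = ""


@dataclass
class LineItem:
    line_number: int
    sku: str
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal


@dataclass
class PurchaseOrder:
    # Header
    po_number: str
    po_date: Optional[date]
    buyer: Party
    supplier: Party
    # Delivery info
    ship_to: str = ""
    delivery_date: Optional[date] = None
    incoterms: str = ""
    # Payment terms, e.g. "Net 30" or "2/10 Net 30"
    payment_terms: str = ""
    line_items: List[LineItem] = field(default_factory=list)
    # Totals; None means the value was not found on the document
    subtotal: Optional[Decimal] = None
    shipping: Optional[Decimal] = None
    tax: Optional[Decimal] = None
    grand_total: Optional[Decimal] = None
    # Signature or approval reference, where present
    approval_reference: str = ""
```

Using Decimal rather than float for money avoids rounding drift when totals are compared later.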
Method 1: ERP export (skip extraction where possible)
If you control the ERP that generated the PO, export the data directly rather than parsing the PDF. Most ERPs (SAP, Oracle, NetSuite) have API endpoints or export functions that return PO data as JSON or XML. The PDF is a human-readable representation; the source data is in the ERP. Extraction is for when you're on the receiving end of a PO from a buyer whose system you don't control.
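As a rough illustration of why the export route is simpler, the sketch below flattens a JSON export into one row per line item. The payload shape and key names (po_number, supplier, lines, qty, unit_price) are hypothetical; each ERP uses its own field names, so treat this as a pattern rather than any vendor's format.

```python
import json
from decimal import Decimal

# Hypothetical export payload; real ERP exports use their own field names.
SAMPLE_EXPORT = """
{
  "po_number": "PO-10042",
  "supplier": {"name": "Acme Fasteners"},
  "lines": [
    {"sku": "BOLT-M8", "qty": "100", "unit_price": "0.12"},
    {"sku": "NUT-M8", "qty": "100", "unit_price": "0.05"}
  ]
}
"""


def po_lines_from_export(raw: str) -> list:
    """Flatten an exported PO into one dict per line item."""
    po = json.loads(raw)
    rows = []
    for line in po["lines"]:
        qty = Decimal(line["qty"])
        price = Decimal(line["unit_price"])
        rows.append({
            "po_number": po["po_number"],
            "supplier": po["supplier"]["name"],
            "sku": line["sku"],
            "quantity": qty,
            "unit_price": price,
            "line_total": qty * price,
        })
    return rows


rows = po_lines_from_export(SAMPLE_EXPORT)
```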
Method 2: Python with Camelot for structured PO PDFs
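POs generated by major ERPs usually contain text-based tables with ruled borders, which Camelot's lattice mode parses reliably. If the table columns are separated only by whitespace, use flavor='stream' instead.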
    pip install "camelot-py[cv]" pandas

    import camelot

    tables = camelot.read_pdf('purchase_order.pdf', flavor='lattice', pages='all')
    line_items = tables[0].df  # usually the first table is the line-item grid
    line_items.to_csv('po_lines.csv', index=False)
Camelot extracts the table but not the header metadata (PO number, buyer, supplier, dates). You'd combine it with pdfplumber text extraction and regex to pull the header fields. This is a workable approach for a single buyer whose PO template you know well — not scalable across many buyers.
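The header step could look roughly like the sketch below: read the first page's text with pdfplumber, then apply one regex per field. The label variants in the patterns are assumptions based on common PO wording and would need tuning against the buyer's actual template.

```python
import re

# Label variants are assumptions; extend them for the template you receive.
HEADER_PATTERNS = {
    "po_number": re.compile(
        r"\b(?:Purchase\s+Order\s+(?:Number|No\b\.?)|PO\s*(?:Number|No\b\.?|#))"
        r"\s*[:#]?\s*([A-Z0-9][A-Z0-9\-/]*)",
        re.IGNORECASE,
    ),
    "po_date": re.compile(
        r"\b(?:PO\s+Date|Order\s+Date)\s*:?\s*"
        r"(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})",
        re.IGNORECASE,
    ),
    "payment_terms": re.compile(r"\bPayment\s+Terms\s*:?\s*([^\n]+)", re.IGNORECASE),
}


def extract_header_fields(text: str) -> dict:
    """Return the first match for each header field, or None if absent."""
    fields = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


def first_page_text(pdf_path: str) -> str:
    """Read the first page's text layer with pdfplumber (imported lazily)."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[0].extract_text() or ""
```

Calling extract_header_fields(first_page_text('purchase_order.pdf')) returns a dict of header values. Every new buyer template tends to need its own pattern adjustments, which is the scaling problem this method runs into.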
Method 3: AI extraction (any buyer, any ERP-generated layout)
AI extraction reads a purchase order as a document rather than a grid of coordinates. It infers that the most prominent company name near the top is usually the buyer, that 'PO No.', 'Purchase Order Number', and 'PO #' all refer to the same field, and that the line-item table satisfies quantity × unit price = line total regardless of column order.
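The two regularities mentioned here (label synonyms and the line-total arithmetic) are also useful as checks on any extractor's output. The sketch below is not how an AI model works internally; it is a small rule-based illustration, and the alias table is an assumed, non-exhaustive list.

```python
import re
from decimal import Decimal

# Illustrative alias table (an assumption, not an exhaustive list).
LABEL_ALIASES = {
    "po_number": {"po no", "po number", "po #", "purchase order number", "purchase order no"},
    "quantity": {"qty", "quantity", "order qty"},
    "unit_price": {"unit price", "price", "unit cost", "rate"},
    "line_total": {"line total", "amount", "extended price", "total"},
}


def canonical_field(label: str):
    """Map a printed label to a canonical field name, or None if unknown."""
    key = re.sub(r"[.:]", "", label).strip().lower()
    key = re.sub(r"\s+", " ", key)
    for field_name, aliases in LABEL_ALIASES.items():
        if key in aliases:
            return field_name
    return None


def line_is_consistent(quantity, unit_price, line_total, tolerance=Decimal("0.01")) -> bool:
    """Check quantity times unit price against the printed line total."""
    return abs(quantity * unit_price - line_total) <= tolerance
```

A line that fails line_is_consistent often points to swapped columns or a misread digit, so it is a cheap signal for routing a document to review.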
AP three-way matching
Three-way matching verifies that the PO, goods receipt, and invoice all agree, which requires extracting comparable fields from all three documents. Extract the line items from each, match them by SKU (or PO line number), then compare quantities across all three and unit prices between the PO and the invoice. Discrepancies in price or quantity are flagged for review. AI extraction returns the same schema for POs and invoices, making the comparison a straightforward data operation rather than a manual check.
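Once the three documents share a schema, the match itself is a short comparison. The sketch below assumes each document has at most one line per SKU and that lines are dicts with sku, quantity, and (for PO and invoice) unit_price keys; those names and the issue labels are illustrative.

```python
from decimal import Decimal


def three_way_match(po_lines, receipt_lines, invoice_lines, price_tolerance=Decimal("0.00")):
    """Compare PO, goods receipt, and invoice lines keyed by SKU.

    Assumes at most one line per SKU in each document. Returns a list of
    discrepancy records; an empty list means the invoice matches.
    """
    po = {line["sku"]: line for line in po_lines}
    received = {line["sku"]: line for line in receipt_lines}
    issues = []
    for inv in invoice_lines:
        sku = inv["sku"]
        if sku not in po:
            issues.append({"sku": sku, "issue": "not_on_po"})
            continue
        if abs(inv["unit_price"] - po[sku]["unit_price"]) > price_tolerance:
            issues.append({"sku": sku, "issue": "price_mismatch"})
        received_qty = received[sku]["quantity"] if sku in received else Decimal("0")
        if inv["quantity"] > received_qty:
            issues.append({"sku": sku, "issue": "billed_more_than_received"})
        if inv["quantity"] > po[sku]["quantity"]:
            issues.append({"sku": sku, "issue": "billed_more_than_ordered"})
    return issues
```

Real AP policies usually add tolerances for quantity and handle partial deliveries across several receipts; those rules would sit on top of this comparison.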
Receiving POs from many buyers
Suppliers typically receive POs from many buyers, each using their own ERP system and PO template. Per-buyer parsing templates break every time a buyer upgrades their system or changes a field label. AI extraction handles each new buyer's template automatically — no template to build, no maintenance when templates change.
Frequently asked questions
How do I extract line items from a purchase order PDF?
For a PO from a single known buyer with a consistent template, Camelot (Python) extracts the line-item table well. For POs from many buyers with different layouts, AI extraction returns the same schema (SKU, description, quantity, unit price, line total) regardless of layout.
How do I automate purchase order data entry from PDFs?
Route incoming PO emails to a processor that sends each PDF to an extraction API. The API returns JSON — PO number, buyer, line items, totals — that your ERP or procurement system ingests directly. No manual re-keying.
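Before the JSON reaches the ERP, a small gate can decide whether a document is ingested automatically or held for review. The sketch below assumes a response with po_number, buyer, line_items, shipping, tax, and grand_total keys; that shape is hypothetical, so map it to whatever your extraction API actually returns.

```python
from decimal import Decimal

# Hypothetical response keys; adjust to the extraction API you use.
REQUIRED_FIELDS = ("po_number", "buyer", "line_items", "grand_total")


def route_extraction(result: dict, tolerance=Decimal("0.01")) -> str:
    """Return "ingest" if the extracted PO passes basic checks, else "review"."""
    # Any missing or empty required field sends the document to a person.
    if any(not result.get(name) for name in REQUIRED_FIELDS):
        return "review"
    lines_sum = sum(
        (Decimal(str(item["line_total"])) for item in result["line_items"]),
        Decimal("0"),
    )
    extras = Decimal(str(result.get("shipping") or 0)) + Decimal(str(result.get("tax") or 0))
    grand_total = Decimal(str(result["grand_total"]))
    # Line totals plus shipping and tax should reconcile with the grand total.
    if abs(lines_sum + extras - grand_total) > tolerance:
        return "review"
    return "ingest"
```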
Can I extract PO data from scanned or signed POs?
Yes, with AI extraction. Scanned POs (including POs scanned after a wet signature was added) usually have no text layer, so Camelot and pdfplumber return no usable text. AI extraction applies OCR automatically.
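A quick way to tell which kind of PDF you have is to check whether any page yields text. The sketch below uses pdfplumber for the read; the 20-character threshold is an assumed heuristic to ignore stray marks, not a documented cutoff.

```python
def has_text_layer(texts, min_chars: int = 20) -> bool:
    """True if any page yielded at least min_chars of non-whitespace text."""
    return any(len("".join((text or "").split())) >= min_chars for text in texts)


def page_texts(pdf_path: str) -> list:
    """Extract text per page with pdfplumber (imported lazily)."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

If has_text_layer(page_texts('purchase_order.pdf')) is False, the file needs OCR before any text-based tool can work with it.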
How is a PO extractor different from an invoice extractor?
POs and invoices share similar structure (line items, buyer/seller, totals) but different fields matter. POs emphasize delivery terms, ship-to addresses, and item codes. Invoices emphasize payment terms, due dates, and tax. Using a purpose-built extractor for each ensures the right fields are returned in the right schema.