Workflow | February 15, 2026 | 5 min read

How to extract data and metadata from PDFs with Power Automate

The built-in actions for PDF text extraction, the AI Builder model for invoices and receipts, and how to wire either one into a flow that drops structured data into Excel or Dataverse.

By Dawid Sibinski

Power Automate has two distinct paths for getting data out of PDFs: built-in actions for plain text and metadata, and AI Builder for structured extraction (invoices, receipts, IDs, custom forms). They're priced differently and they solve different problems.

Built-in: text and metadata

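In cloud flows, these actions come from connectors such as Plumsail Documents and Encodian (free tier available), which you add to a flow like any other action. They cover most extraction needs without spending AI Builder credits. Exact action names differ between connectors, but the capabilities are: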

Get PDF Metadata: title, author, creation date, page count.
Convert PDF to Text: flat text dump, no layout preservation.
Extract Tables from PDF: works on bordered tables, less reliable on borderless ones.
Fill PDF Form / Read PDF Form Fields: for true PDF forms with AcroForm fields.

Wire the output into Excel Online (Add row to table) or SharePoint (Update item) actions to land the data where you need it.
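Because a text conversion returns one flat string, individual fields usually have to be pulled out by pattern matching before they map onto table columns. The following is a minimal sketch of that step in Python, as it might run in a small helper service called from the flow. The labels ("Invoice No", "Total") and the column names are hypothetical and would need adjusting to the wording of your own PDFs.

```python
import re

# Hypothetical label patterns; adapt them to the text your PDFs actually contain.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?)(?![A-Za-z])\s*[:#]?\s*(\S+)", re.IGNORECASE
    ),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total": re.compile(
        r"\bTotal\s*(?:Due)?\s*:?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def text_to_row(text: str) -> dict:
    """Map a flat text dump onto column names; missing fields become None."""
    row = {}
    for column, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[column] = match.group(1) if match else None
    return row


if __name__ == "__main__":
    sample = "ACME Corp\nInvoice No: INV-1042\nSubtotal 100.00\nTotal: $1,250.00\n"
    print(text_to_row(sample))
```

Returning None for a missing field, rather than raising, lets the flow decide whether to write a partial row or route the file for manual review.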

AI Builder: structured extraction

AI Builder ships prebuilt models for invoices, receipts, IDs, and business cards, plus a custom-trainable model for arbitrary forms. A typical flow looks like this:

Trigger: When a file is created (SharePoint, OneDrive) or When a new email arrives (Outlook attachment).
Action: Extract information from invoices using AI Builder.
Action: Add a row to an Excel table or create a Dataverse record from the parsed fields.

Pricing is credit-based; check the AI Builder calculator for current per-page costs. Training a custom model takes a few dozen sample documents; once published, the model can be called from any flow.

Image metadata

Power Automate doesn't have first-party EXIF support. Workarounds: use the AI Builder "Extract information from images" action with a custom model, which captures what is visible in the image rather than the embedded EXIF block, or call an Azure Function that runs ExifTool and returns JSON. The Azure Function path is more flexible if you need full EXIF including GPS.
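To make the Azure Function route concrete, here is a rough sketch of the core logic such a function could wrap. It assumes the exiftool binary is installed and on the PATH of the host; the function names are illustrative, and the Azure Functions HTTP binding itself is left out.

```python
import json
import subprocess


def parse_exiftool_json(stdout: str) -> dict:
    """ExifTool's -json output is a JSON list with one object per input file."""
    records = json.loads(stdout)
    if not records:
        raise ValueError("ExifTool returned no metadata")
    return records[0]


def read_exif(path: str) -> dict:
    # -json: machine-readable output
    # -n: print numeric values instead of human-readable strings
    result = subprocess.run(
        ["exiftool", "-json", "-n", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if exiftool exits non-zero
    )
    return parse_exiftool_json(result.stdout)


def gps_of(tags: dict):
    """Return (latitude, longitude) if both tags are present, else None."""
    lat, lon = tags.get("GPSLatitude"), tags.get("GPSLongitude")
    if lat is None or lon is None:
        return None
    return (lat, lon)
```

The flow would then POST the image to the function with an HTTP action and read the returned JSON with Parse JSON. Whether the coordinates come back signed depends on which tag group ExifTool reports, so it is worth checking them against GPSLatitudeRef and GPSLongitudeRef on a known image.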

When AI Builder isn't enough

AI Builder shines on the prebuilt domains (invoices, receipts). For long-tail document types such as leases, scientific reports, and multi-page contracts, it requires training a custom model, and the accuracy ceiling is typically lower than that of a multimodal LLM. Two pragmatic alternatives inside a Power Automate flow:

HTTP action calling an Azure Document Intelligence prebuilt or custom model.
HTTP action calling ExtractFox's API with a JSON schema for the document type.

Both let you keep the rest of the flow (trigger, downstream Excel/SharePoint/Dataverse actions) intact and swap in better extraction.
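For the Azure Document Intelligence option, the HTTP action needs a method, URI, headers, and body. The sketch below builds those four values in Python so the shape is easy to see. It assumes the v4.0 REST route with api-version 2024-11-30 and key-based authentication; check both against the current API reference for your resource. The helper names are illustrative.

```python
import json


def build_analyze_request(endpoint, model_id, api_key, pdf_base64,
                          api_version="2024-11-30"):
    """Return the four values an HTTP action needs to start an analysis."""
    base = endpoint.rstrip("/")
    return {
        "method": "POST",
        "uri": (
            f"{base}/documentintelligence/documentModels/"
            f"{model_id}:analyze?api-version={api_version}"
        ),
        "headers": {
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": api_key,
        },
        # The file content is sent inline as base64; a public URL could be
        # passed as "urlSource" instead.
        "body": json.dumps({"base64Source": pdf_base64}),
    }


def simplify_fields(analyze_response: dict) -> dict:
    """Flatten the first analyzed document to {field name: extracted text}."""
    documents = analyze_response.get("analyzeResult", {}).get("documents", [])
    if not documents:
        return {}
    fields = documents[0].get("fields", {})
    return {name: field.get("content") for name, field in fields.items()}
```

The analyze call is asynchronous: the service replies with 202 and an Operation-Location header, and the flow polls that URL (for example in a Do until loop) until the status is succeeded before reading the fields.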

Practical pattern

Use AI Builder for the standard document types where the prebuilt models work. Fall back to an external extraction API for anything custom. Keep the orchestration in Power Automate so non-developers can maintain the flow.
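The routing decision in this pattern is small enough to state as code. The sketch below is only an illustration of the rule; in a real flow it would be a Switch or Condition action, and the document type labels and route names are hypothetical.

```python
# Document types assumed to be covered by an AI Builder prebuilt model.
PREBUILT_TYPES = {"invoice", "receipt", "id", "business_card"}


def choose_extractor(document_type: str) -> str:
    """Pick the extraction route: prebuilt model if available, else external API."""
    normalized = document_type.strip().lower()
    return "ai_builder" if normalized in PREBUILT_TYPES else "external_api"


if __name__ == "__main__":
    for doc_type in ("Invoice", "lease", "receipt"):
        print(doc_type, "->", choose_extractor(doc_type))
```

Keeping the list of prebuilt types in one place means adding a new route later is a one-line change rather than a rewrite of the flow.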