Engineering · February 28, 2026 · 7 min read

How to extract data from a PDF in C#

A working engineer's tour of the PDF extraction libraries in the .NET ecosystem — iText, PdfPig, Azure Document Intelligence, and the API-first alternative when you don't want to ship a parser at all.

By Dawid Sibinski

PDF extraction in C# splits into two questions: do you have structured PDFs (text layer present, predictable layout) or unstructured ones (scans, images, varying layouts)? The library you reach for depends entirely on which side you're on.
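One rough way to tell which side a given file is on is to check how much text actually comes out of it. The sketch below is a heuristic under stated assumptions: the name `LooksScanned`, the 20-character floor, and the more-than-half rule are illustrative choices, not part of any library. It takes the per-page text as plain strings, so it works with whichever reader produces them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Returns true when more than half the pages carry almost no extractable text,
// which usually means the file is a scan and needs OCR or a hosted service.
// minCharsPerPage is an arbitrary starting point; tune it on your own files.
static bool LooksScanned(IEnumerable<string> pageTexts, int minCharsPerPage = 20)
{
    var pages = pageTexts.ToList();
    if (pages.Count == 0) return true; // nothing readable at all

    int sparsePages = pages.Count(
        t => (t ?? "").Count(ch => !char.IsWhiteSpace(ch)) < minCharsPerPage);
    return sparsePages * 2 > pages.Count;
}

// With PdfPig, the input could come from: doc.GetPages().Select(p => p.Text)
var samplePages = new[] { "Invoice No: INV-1001 issued to Contoso Ltd on 2026-02-28", "" };
Console.WriteLine(LooksScanned(samplePages));
```

A mixed file (text pages plus a scanned appendix) will sit near the boundary, so treat the result as a routing hint rather than a verdict.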

Structured PDFs

PdfPig

Free, Apache-2.0-licensed, pure .NET. Reads text, letters, and word positions from any PDF with a text layer. Writing support is basic, and there is no form filling or signing; its strength is clean reading.

    using System;
    using UglyToad.PdfPig;

    // Open the document and print the raw text of each page.
    using var doc = PdfDocument.Open("invoice.pdf");
    foreach (var page in doc.GetPages())
    {
        Console.WriteLine(page.Text);
    }

For structured invoices and reports with selectable text, this gets you 80% of the way in 10 lines.
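The remaining work is usually plain string handling on the extracted text. As a minimal sketch, assuming the invoice prints an "Invoice No:" label and a "Total:" label and that the text comes out with whitespace between labels and values (that depends on the PDF), two regular expressions are enough. `ParseInvoice` and both patterns are hypothetical and tied to that one layout.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Pulls an invoice number and a total out of raw page text.
// Both patterns assume one specific layout; adjust them to your documents.
static (string? Number, decimal? Total) ParseInvoice(string text)
{
    var numberMatch = Regex.Match(text, @"\bInvoice\s+No\.?\s*:?\s*([A-Z0-9-]+)");
    // \b keeps "Subtotal" from being mistaken for "Total".
    var totalMatch = Regex.Match(text, @"\bTotal\b\s*:?\s*\$?\s*([0-9][0-9,]*\.[0-9]{2})");

    string? number = numberMatch.Success ? numberMatch.Groups[1].Value : null;
    decimal? total = totalMatch.Success
        ? decimal.Parse(totalMatch.Groups[1].Value, NumberStyles.Number, CultureInfo.InvariantCulture)
        : (decimal?)null;
    return (number, total);
}

var sampleText = "ACME Corp\nInvoice No: INV-1001\nSubtotal: $1,000.00\nTotal: $1,234.50";
var invoice = ParseInvoice(sampleText);
Console.WriteLine($"{invoice.Number} {invoice.Total}");
```

Parsing with the invariant culture keeps "1,234.50" from being misread on machines whose locale uses a comma as the decimal separator.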

iText 7

More powerful, with read, write, fill-form, and digital signature support. Note the licensing: iText is AGPL by default, so closed-source use requires a commercial license. For a lot of teams that alone rules it out; for others it's worth the spend.

PDFsharp / MigraDoc

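Mature, MIT-licensed alternative. Much stronger on PDF generation and modification than on extraction: it can open existing files, but it has no high-level text-extraction API, so pulling text out means walking content streams yourself. Less surface area than iText.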

Tables specifically

PdfPig has a community table-extraction add-on, but for serious table work most .NET teams either shell out to a Java tool like Tabula or call a hosted service. Native C# table extraction is the weakest part of the ecosystem.
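To see why tables are hard, it helps to look at what a do-it-yourself approach involves. The sketch below is a deliberately naive illustration, not how Tabula or any add-on works: it groups positioned words into rows by their vertical coordinate and sorts each row left to right. `GroupIntoRows` and the 2-point tolerance are assumptions, and the input is a plain tuple rather than a library type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Groups words into rows: a word joins the current row when its Y is within
// yTolerance of that row's first word. Each row is then sorted by X.
static List<List<string>> GroupIntoRows(
    IEnumerable<(string Text, double X, double Y)> words, double yTolerance = 2.0)
{
    var rows = new List<(double Y, List<(string Text, double X)> Cells)>();
    // PDF coordinates grow upward, so descending Y means top of page first.
    foreach (var word in words.OrderByDescending(item => item.Y))
    {
        if (rows.Count > 0 && Math.Abs(rows[^1].Y - word.Y) <= yTolerance)
            rows[^1].Cells.Add((word.Text, word.X));
        else
            rows.Add((word.Y, new List<(string Text, double X)> { (word.Text, word.X) }));
    }
    return rows
        .Select(row => row.Cells.OrderBy(cell => cell.X).Select(cell => cell.Text).ToList())
        .ToList();
}

// With PdfPig, the input could come from:
//   page.GetWords().Select(w => (w.Text, w.BoundingBox.Left, w.BoundingBox.Bottom))
var sampleWords = new[]
{
    ("Qty", 200.0, 700.0), ("Item", 50.0, 700.5),
    ("2", 200.0, 680.0), ("Widget", 50.0, 679.2),
};
foreach (var line in GroupIntoRows(sampleWords))
    Console.WriteLine(string.Join(" | ", line));
```

This handles a clean grid and little else. Wrapped cells, merged headers, and columns with missing values all need column-boundary detection on top, which is where dedicated tools earn their keep.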

Unstructured PDFs (scans, images, varying layouts)

Azure AI Document Intelligence

Microsoft's hosted service. Pre-built models for invoices, receipts, and IDs; custom models you train on your own samples. The .NET SDK is first-class:

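    using Azure;
    using Azure.AI.FormRecognizer.DocumentAnalysis;

    // endpoint is a Uri; credential is an AzureKeyCredential (or a TokenCredential).
    var client = new DocumentAnalysisClient(endpoint, credential);
    var operation = await client.AnalyzeDocumentAsync(
        WaitUntil.Completed, "prebuilt-invoice", fileStream);
    var result = operation.Value; // AnalyzeResult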

Pricing is per-page; for high volumes it adds up. Quality is good on the prebuilt invoice/receipt models, less reliable on long-tail document types.
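Because reliability varies by document type, it is worth acting on the per-field confidence the service returns instead of trusting every value. A minimal sketch, assuming a 0.8 cut-off (an arbitrary number to tune against your own data) and a hypothetical `Triage` helper that works on plain tuples so it stays independent of the SDK types:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Splits extracted fields into accepted values and names that need human review.
// A field with no confidence score is treated as needing review.
static (Dictionary<string, string> Accepted, List<string> NeedsReview) Triage(
    IEnumerable<(string Name, string Content, float? Confidence)> fields,
    float threshold = 0.8f)
{
    var accepted = new Dictionary<string, string>();
    var needsReview = new List<string>();
    foreach (var field in fields)
    {
        if (field.Confidence is float score && score >= threshold)
            accepted[field.Name] = field.Content;
        else
            needsReview.Add(field.Name);
    }
    return (accepted, needsReview);
}

// With the SDK result above, the input could come from something like:
//   result.Documents[0].Fields.Select(kv => (kv.Key, kv.Value.Content, kv.Value.Confidence))
var sampleFields = new (string, string, float?)[]
{
    ("VendorName", "Contoso", 0.97f),
    ("InvoiceTotal", "1,234.50", 0.42f),
};
var outcome = Triage(sampleFields);
Console.WriteLine($"accepted: {outcome.Accepted.Count}, review: {outcome.NeedsReview.Count}");
```

Routing only the low-confidence fields to a person keeps review cost proportional to how unusual the documents are.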

AWS Textract

Equivalent service from AWS, accessible from .NET via AWSSDK.Textract. Strong on tables and forms, weaker than Azure on prebuilt domain models.

API-first: ExtractFox

If you don't want to ship a PDF library at all and prefer to call a hosted endpoint with HttpClient, ExtractFox exposes a JSON API that handles structured and unstructured PDFs through the same call. You send the file (or a URL), declare what you want as a JSON schema or a free-text description, and get structured output back.

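    using System.Net.Http;

    // http is a shared HttpClient; fileStream is the open PDF.
    var content = new MultipartFormDataContent();
    content.Add(new StreamContent(fileStream), "file", "invoice.pdf");
    content.Add(new StringContent("invoice"), "vertical");
    var response = await http.PostAsync("https://extractfox.com/api/extract", content);
    response.EnsureSuccessStatusCode(); // throw on a non-2xx reply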

See the API docs for the full schema and authentication details.

Choosing

  • Structured PDFs, simple text extraction → PdfPig.
  • Structured PDFs, complex manipulation → iText (mind the license) or PDFsharp.
  • Scanned/varied invoices and receipts → Azure Document Intelligence or ExtractFox.
  • Don't want to maintain any of this → ExtractFox, or one of the hosted services.
