Engineering · May 3, 2026 · 5 min read

How to extract metadata from a PDF

Author, creation date, producer software, custom XMP fields — every PDF carries metadata most people never see. Here's how to read it, in any language, and what's worth pulling for indexing or audit.

By Dawid Sibinski

PDF metadata lives in two places: the document information dictionary (the old way: Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate) and an XMP packet (the modern way: arbitrary RDF/XML, including custom schemas). Most tools read both and present a unified view.
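To make the XMP side concrete, here is a minimal sketch that parses a packet with Python's standard library. The sample packet is illustrative rather than taken from a real file, the acme namespace is a hypothetical custom schema, and read_xmp is a name chosen for this example. Real packets are usually wrapped in xpacket processing instructions and may use RDF constructs this sketch does not handle.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Illustrative packet (not from a real file). "acme" is a hypothetical custom schema.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      xmlns:acme="http://example.com/ns/acme/1.0/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Quarterly report</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></dc:creator>
   <xmp:CreatorTool>ExampleWriter 1.0</xmp:CreatorTool>
   <acme:ReviewStatus>approved</acme:ReviewStatus>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def read_xmp(packet: str) -> dict:
    """Return XMP properties keyed by '{namespace-uri}LocalName'."""
    root = ET.fromstring(packet)
    out = {}
    for desc in root.iter(f"{{{RDF}}}Description"):
        # Simple properties may be written as attributes on rdf:Description.
        for key, value in desc.attrib.items():
            if key != f"{{{RDF}}}about":
                out[key] = value
        # Other properties are child elements; list-valued ones wrap rdf:li items.
        for child in desc:
            items = [li.text for li in child.iter(f"{{{RDF}}}li")]
            out[child.tag] = items if items else (child.text or "").strip()
    return out


meta = read_xmp(SAMPLE_XMP)
for key, value in meta.items():
    print(key, "=", value)
```

Keys come back qualified by namespace URI, which keeps custom schemas from colliding with the standard ones.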

Quick check in any reader

Acrobat / Adobe Reader: File → Properties. Preview on macOS: Tools → Show Inspector. Both surface the standard fields; neither shows the full XMP packet.

Command line: ExifTool

exiftool report.pdf prints every tag it can read: the info dictionary, XMP, page count, and encryption status. This is the right tool for forensics or one-off inspection.
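For scripting, ExifTool's -j flag emits JSON, which is easier to consume than the default text output. The sketch below assumes exiftool is installed and on PATH. The helper names run_exiftool and parse_exiftool_json are made up for this example, and the sample string only illustrates the output shape; it was not captured from a real run.

```python
import json
import subprocess


def run_exiftool(path: str) -> str:
    """Call the exiftool CLI (assumed to be on PATH) and return its JSON output."""
    result = subprocess.run(
        ["exiftool", "-j", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_exiftool_json(text: str) -> dict:
    """exiftool -j prints a JSON array with one object per input file."""
    records = json.loads(text)
    return records[0] if records else {}


# Illustrative output shape only, not captured from a real run.
SAMPLE = '[{"SourceFile": "report.pdf", "PageCount": 12, "Producer": "ExampleWriter 1.0"}]'
tags = parse_exiftool_json(SAMPLE)
print(tags.get("Producer"), tags.get("PageCount"))
```

Against a real file, the call would be parse_exiftool_json(run_exiftool("report.pdf")).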

Python: pypdf or pdfminer.six

    from pypdf import PdfReader

    reader = PdfReader("report.pdf")
    print(reader.metadata)
    print(reader.metadata.title, reader.metadata.author)
    xmp = reader.xmp_metadata  # XMP packet if present, otherwise None

reader.metadata returns a DocumentInformation object that behaves like a dict, or None when the file has no information dictionary. The .xmp_metadata attribute exposes the XMP packet's fields, grouped by namespace (dc, pdf, xmp and others), as Python objects.
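Because the metadata object is dict-like, with keys such as "/Title", it is straightforward to normalize before indexing. The following is a sketch, not part of pypdf: normalize_info is a hypothetical helper, and the sample dict stands in for what a reader would return so the example runs without a PDF on disk.

```python
# Keys defined for the PDF document information dictionary.
STANDARD_KEYS = ("/Title", "/Author", "/Subject", "/Keywords",
                 "/Creator", "/Producer", "/CreationDate", "/ModDate")


def normalize_info(info):
    """Split a dict-like info dictionary into standard and custom fields."""
    info = info or {}  # the metadata object can be None
    standard = {key.lstrip("/"): None for key in STANDARD_KEYS}
    custom = {}
    for key, value in info.items():
        name = str(key).lstrip("/")
        if key in STANDARD_KEYS:
            standard[name] = str(value)
        else:
            custom[name] = str(value)
    return {"standard": standard, "custom": custom}


# Stand-in for reader.metadata; "/Department" is a made-up custom key.
sample = {
    "/Title": "Quarterly report",
    "/Producer": "ExampleWriter 1.0",
    "/Department": "Finance",
}
result = normalize_info(sample)
print(result["standard"]["Title"], result["custom"])
```

With pypdf, the equivalent call would be normalize_info(PdfReader("report.pdf").metadata). Missing standard fields come back as None, so downstream code can rely on a fixed set of keys.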

C#: iText 7 or PdfPig

PdfPig (Apache 2.0-licensed) for read-only metadata access:

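    using System;
    using UglyToad.PdfPig;

    // Open the file and read the standard info-dictionary fields.
    using var doc = PdfDocument.Open("report.pdf");
    var info = doc.Information;
    Console.WriteLine($"{info.Title} by {info.Author}");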

Use iText when you need full XMP read/write, but mind the AGPL license: closed-source projects need iText's commercial license instead.

Java: Apache PDFBox

PDDocument.load() (PDFBox 2.x; in 3.x use Loader.loadPDF()) followed by getDocumentInformation() gives you the standard fields, and getDocumentCatalog().getMetadata() gives you the XMP stream. Mature, Apache-licensed, the canonical Java choice.

C++: PoDoFo or Qt Pdf

PoDoFo is the standard pure-C++ option:

    PdfMemDocument doc;
    doc.Load("report.pdf");
    auto info = doc.GetInfo();

Clean enough, but the project's documentation is sparser than the Python and Java equivalents.

What's actually worth extracting

For document indexing or DAM systems: title, author, subject, keywords, page count, creation/modification dates. For audit and provenance: producer (the software that made the PDF — telling for forensic work), original creation date vs. modification date (lets you spot edits), digital signature info if present.
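Comparing creation and modification dates means parsing the PDF date format first, which looks like D:YYYYMMDDHHmmSS followed by an optional UTC offset such as +02'00' or Z. Here is a minimal sketch using only the standard library. parse_pdf_date and was_modified are names invented for this example, the sample dates are illustrative, and treating a missing offset as UTC is an assumption (strictly, the offset is unknown in that case).

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS plus an optional offset (+HH'mm', -HH'mm', or Z).
# Everything after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string into a timezone-aware datetime."""
    match = _PDF_DATE.match(value.strip())
    if not match:
        raise ValueError(f"not a PDF date: {value!r}")
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    # Assumption: a missing offset is treated as UTC so values stay comparable.
    offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
    if sign == "-":
        offset = -offset
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=timezone(offset),
    )


def was_modified(creation_date: str, mod_date: str) -> bool:
    """True when ModDate is later than CreationDate."""
    return parse_pdf_date(mod_date) > parse_pdf_date(creation_date)


created = "D:20260503101500+02'00'"  # illustrative values
modified = "D:20260504090000Z"
print(parse_pdf_date(created).isoformat(), was_modified(created, modified))
```

A later ModDate only shows that some tool rewrote the file. It does not say what changed, so treat it as a prompt for closer inspection rather than proof of tampering.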

When the metadata is unreliable

Author and title are set by whatever tool created the PDF and almost no one fixes them. "Microsoft® Word for Microsoft 365" as Producer with Title "Document1" tells you nothing useful. For real document classification, you usually need to extract from the content itself — that's the AI extraction layer, separate from the metadata layer.
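One way to act on this is to filter out obviously useless titles before they reach an index. The patterns below are hypothetical heuristics chosen for illustration, not an exhaustive or authoritative list, and title_is_usable is a name made up for this sketch. Tune them against your own corpus.

```python
import re

# Hypothetical heuristics: default names from office tools, plus titles that
# are really just a filename with an extension.
_PLACEHOLDER_TITLE = re.compile(
    r"^(untitled|document\d*|presentation\d*|book\d*|microsoft word - .*)$"
    r"|\.(docx?|xlsx?|pptx?|pdf|indd)$",
    re.IGNORECASE,
)


def title_is_usable(title) -> bool:
    """Return False for empty, default, or filename-like titles."""
    if not title or not title.strip():
        return False
    return _PLACEHOLDER_TITLE.search(title.strip()) is None


for candidate in ["Document1", "report_final.docx", "Q3 Supplier Audit", ""]:
    print(repr(candidate), title_is_usable(candidate))
```

When the check fails, fall back to content-based extraction instead of indexing the placeholder.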
