How to extract metadata from a PDF
Author, creation date, producer software, custom XMP fields — every PDF carries metadata most people never see. Here's how to read it, in any language, and what's worth pulling for indexing or audit.
PDF metadata lives in two places: the document information dictionary (the old way: Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate) and an XMP packet (the modern way: arbitrary RDF/XML, including custom schemas). Most tools read both and present a unified view.
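To make the XMP side concrete, here is a minimal sketch that reads a packet with nothing but Python's standard library. The sample packet is made up for illustration and the helper name xmp_simple_fields is hypothetical. Real packets can nest structures more deeply than this handles, so treat it as a starting point rather than a full XMP parser.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
XMP = "http://ns.adobe.com/xap/1.0/"
PDF = "http://ns.adobe.com/pdf/1.3/"

# Made-up sample packet, for illustration only.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Q3 Report</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>A. Author</rdf:li></rdf:Seq></dc:creator>
   <xmp:CreateDate>2024-01-15T10:30:00+01:00</xmp:CreateDate>
   <pdf:Producer>Example Producer 1.0</pdf:Producer>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def xmp_simple_fields(packet: str) -> dict:
    """Return {"{namespace}Name": [values]} for each property in the packet."""
    root = ET.fromstring(packet)
    fields = {}
    for desc in root.iter(f"{{{RDF}}}Description"):
        # Properties may be written as attributes on rdf:Description...
        for key, value in desc.attrib.items():
            if not key.startswith(f"{{{RDF}}}"):
                fields.setdefault(key, []).append(value)
        # ...or as child elements, optionally wrapped in rdf:Alt/Seq/Bag lists.
        for prop in desc:
            items = [li.text or "" for li in prop.iter(f"{{{RDF}}}li")]
            if not items and prop.text and prop.text.strip():
                items = [prop.text.strip()]
            fields.setdefault(prop.tag, []).extend(items)
    return fields
```

Keys come back in ElementTree's "{namespace}Name" form, so the Dublin Core title is looked up as fields[f"{{{DC}}}title"]. A custom schema shows up the same way under its own namespace URI.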
Quick check in any reader
Acrobat / Adobe Reader: File → Properties. Preview on macOS: Tools → Show Inspector. Both surface the standard fields; neither shows the full XMP packet.
Command line: ExifTool
exiftool report.pdf dumps everything it can parse: the info dictionary, XMP, PDF version, page count, and encryption status. This is the right tool for forensics or one-off inspection.
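If you want ExifTool's output inside a script, its -j flag prints JSON and -G prefixes each tag with its group, which lets you tell info-dictionary values from XMP ones. The sketch below assumes exiftool is on your PATH; the function names are hypothetical, and the exact set of keys depends on the file.

```python
import json
import subprocess


def parse_exiftool_json(text: str) -> dict:
    """exiftool -j prints a JSON array with one object per input file."""
    records = json.loads(text)
    return records[0] if records else {}


def exiftool_metadata(path: str) -> dict:
    """Run exiftool on one file and return its tags as a dict.

    Assumes the exiftool binary is installed and on PATH.
    """
    result = subprocess.run(
        ["exiftool", "-j", "-G", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if exiftool reports failure
    )
    return parse_exiftool_json(result.stdout)
```

With -G, keys take the form "Group:Tag", for example "PDF:Title" next to "XMP:Title", so a mismatch between the two metadata stores is easy to spot.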
Python: pypdf or pdfminer.six
    from pypdf import PdfReader

    reader = PdfReader("report.pdf")
    print(reader.metadata)
    print(reader.metadata.title, reader.metadata.author)
    xmp = reader.xmp_metadata  # XMP data, or None if the file has no XMP packet
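reader.metadata returns a DocumentInformation object that behaves like a dict (keys are PDF names such as "/Title"), or None when the file has no info dictionary. The .xmp_metadata attribute returns an XmpInformation object that exposes common XMP properties as attributes (dc_title, pdf_producer, xmp_create_date, and so on) and collects pdfx custom properties under custom_properties.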
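For indexing you usually want plain strings rather than pypdf objects. Here is a small sketch of that flattening step; both helper names are hypothetical, and pypdf is imported lazily so the flattening function can be tried on an ordinary dict.

```python
def info_to_dict(info) -> dict:
    """Flatten an info-dictionary mapping into plain strings.

    Info dictionary keys are PDF names such as "/Title"; the leading
    slash is dropped here for convenience.
    """
    if info is None:  # no info dictionary in the file
        return {}
    # Index with info[key] so pypdf resolves indirect objects for us.
    return {str(key).lstrip("/"): str(info[key]) for key in info}


def read_pdf_info(path: str) -> dict:
    """Open a PDF with pypdf and return its info dictionary as plain strings."""
    from pypdf import PdfReader  # third-party; assumed to be installed

    reader = PdfReader(path)
    return info_to_dict(reader.metadata)
```

Calling read_pdf_info("report.pdf") then gives you something you can hand straight to json.dumps or a database row.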
C#: iText 7 or PdfPig
PdfPig (Apache-2.0-licensed) for read-only metadata access:
    using System;
    using UglyToad.PdfPig;

    using var doc = PdfDocument.Open("report.pdf");
    var info = doc.Information;
    Console.WriteLine($"{info.Title} by {info.Author}");
Use iText when you need full XMP read/write, and mind the AGPL license.
Java: Apache PDFBox
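In PDFBox 2.x, PDDocument.load() followed by getDocumentInformation() gives you the standard fields, and getDocumentCatalog().getMetadata() gives the XMP stream. PDFBox 3.x replaces PDDocument.load() with Loader.loadPDF(); the other two calls are unchanged. Mature, Apache-licensed, the canonical Java choice.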
C++: PoDoFo or Qt Pdf
PoDoFo is the standard pure-C++ option:

    PdfMemDocument doc;
    doc.Load("report.pdf");
    auto info = doc.GetInfo();

Clean enough, but the project's documentation is sparser than the Python and Java equivalents.
What's actually worth extracting
For document indexing or DAM systems: title, author, subject, keywords, page count, creation/modification dates. For audit and provenance: producer (the software that made the PDF — telling for forensic work), original creation date vs. modification date (lets you spot edits), digital signature info if present.
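Comparing creation and modification dates means parsing PDF's own date format first: info-dictionary dates look like D:20240115103000+01'00', with everything after the year optional. A minimal sketch, assuming that format; the function names are hypothetical, and some libraries (pypdf's creation_date, for one) will do the parsing for you.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by Z or +HH'mm' / -HH'mm'; fields after YYYY are optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)(?:00'?(?:00'?)?)?|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(raw: str) -> datetime:
    """Parse a PDF date string into a datetime (naive if no offset is given)."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    year, month, day, hour, minute, second, zulu, sign, off_h, off_m = match.groups()
    tz = None
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(-offset if sign == "-" else offset)
    # Missing month/day default to 1, missing time fields to 0.
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tz,
    )


def was_modified(creation: str, modification: str) -> bool:
    """True when ModDate is later than CreationDate.

    Both strings must carry an offset, or neither; Python refuses to
    compare naive and timezone-aware datetimes.
    """
    return parse_pdf_date(modification) > parse_pdf_date(creation)
```

A later ModDate only tells you the file was saved again, not what changed, and both dates can be rewritten by whoever edits the file. Use it as a flag for closer inspection rather than as proof.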
When the metadata is unreliable
Author and title are set by whatever tool created the PDF and almost no one fixes them. "Microsoft® Word for Microsoft 365" as Producer with Title "Document1" tells you nothing useful. For real document classification, you usually need to extract from the content itself — that's the AI extraction layer, separate from the metadata layer.
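One cheap way to decide when to fall back to content extraction is to flag titles that look like tool defaults. The patterns below are hypothetical examples, not a vetted list; tune them against your own corpus.

```python
import re
from typing import Optional

# Hypothetical patterns for default or leaked-filename titles.
_PLACEHOLDER_TITLE = re.compile(
    r"^(?:document\d*"              # Document1, Document2, ...
    r"|untitled(?:[-\s]?\d+)?"      # Untitled, Untitled-1
    r"|microsoft word - .*"         # Microsoft Word - draft.docx
    r"|.*\.(?:docx?|pptx?|xlsx?|indd))$",  # a source filename used as title
    re.IGNORECASE,
)


def looks_like_placeholder(title: Optional[str]) -> bool:
    """True when the title is missing, blank, or matches a known default."""
    if title is None or not title.strip():
        return True
    return _PLACEHOLDER_TITLE.match(title.strip()) is not None
```

looks_like_placeholder("Document1") is True while looks_like_placeholder("Q3 Financial Report") is False, so an indexer can keep trustworthy titles and route the rest to content-based classification.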