All posts
TutorialMay 3, 20265 min read

How to extract images from a PDF

From a one-off Preview save to batch extraction with pdfimages, pypdf, or iText — every realistic way to pull every embedded image out of a PDF at full quality.

By Dawid Sibinski

PDFs embed images as separate objects inside the file. "Extracting" them really means dumping those objects to disk in their original format. That's good news — there's no quality loss when you do it right, because you're getting the bytes that were embedded, not a re-render.

One image, no install

Acrobat: right-click the image → Copy Image, paste into any image editor, save. Preview on macOS: select the image with the rectangular selection tool, copy, paste into Preview's File → New from Clipboard. Both are fine for one or two images. Both lose quality if the embedded image was higher-resolution than your screen.

Every image, full quality: pdfimages

Comes with the poppler-utils package. The right tool for batch:

pdfimages -all report.pdf out/img # writes out/img-001.jpg, out/img-002.png, etc.

-all writes each image in its native format (JPEG stays JPEG, PNG stays PNG). -j writes JPEGs and converts everything else; -png forces PNG output. -list inspects without writing — useful to see how many images and at what resolution before extracting.

Python: pypdf

from pypdf import PdfReader reader = PdfReader("report.pdf") for page_num, page in enumerate(reader.pages, 1): for img in page.images: with open(f"page_{page_num}_{img.name}", "wb") as f: f.write(img.data)

page.images iterates every embedded image in original format. Faster than pdfimages for tight integration with the rest of a Python pipeline.

C#: PdfPig or iText

PdfPig (MIT) for read access:

using var doc = PdfDocument.Open("report.pdf"); foreach (var page in doc.GetPages()) foreach (var img in page.GetImages()) File.WriteAllBytes($"img_{img.Name}.png", img.RawBytes.ToArray());

iText 7 covers the same ground with finer control over color spaces and JBIG2 (mind the AGPL license).

Java: PDFBox

PDFBox's PDFStreamEngine plus a custom processor extracts images during page rendering — the canonical Java pattern. Apache-licensed, mature, handles every PDF image encoding including JBIG2 and JPEG2000 if you add the optional jai-imageio-jpeg2000 dependency.

Quality and quirks

  • PDFs sometimes split a single visible image into many small tiles. pdfimages will dump them as separate files; merging requires positional metadata.
  • Some images are stored with a separate stencil mask (alpha channel as a second object). pdfimages flags these — use -all and check for paired images with similar timestamps.
  • JBIG2-encoded images won't open in most image viewers. Re-encode with pdfimages -png if you need portability.
  • Vector graphics (logos drawn with PDF path operators, not embedded as raster) aren't "images" in this sense — pdfimages skips them. For those, render the page region to PNG with a tool like pdf2image.

Image plus surrounding text

If the goal is each image alongside its caption or the prose around it (figure cataloging in scientific papers, or building a multimodal training set), pdfimages alone isn't enough — you need positional info to pair images with nearby text spans. pdfplumber's page.images returns bounding boxes you can intersect with text spans. For unstructured documents at scale, ExtractFox's PDF extractor returns each image's bounding box alongside extracted text, so figure-and-caption pairing falls out naturally.

More on tutorial

Stop reading, start extracting

Drop a PDF or image into ExtractFox and get structured data back in seconds.

Try a free extraction →