How to extract text from a PowerPoint file
Three ways to pull text out of .pptx files — the built-in outline view, scripting with python-pptx, and image-based extraction for slides where the text is baked into pictures.
Most .pptx files are easy to extract from — text lives in shapes that any tool can read. The hard cases are slides built as images, screenshots of dashboards, or decks exported to PDF where the original .pptx is gone.
1. PowerPoint's outline view
View → Outline View shows the text in each slide's title and body placeholders as plain text in slide order. Select all, copy, and paste into your destination. This catches title and body text but misses manually added text boxes, tables, content inside grouped shapes, SmartArt, and image-based text.
Faster variant: File → Save As → Outline/RTF (.rtf). You get a single file containing the same outline text for every slide, so the same gaps apply.
2. python-pptx for programmatic access
python-pptx is MIT-licensed and can reach every text shape, including those in groups and tables. The basic loop reads the top-level shapes on each slide:
    from pptx import Presentation

    prs = Presentation("deck.pptx")
    for i, slide in enumerate(prs.slides, 1):
        for shape in slide.shapes:
            if shape.has_text_frame:
                for para in shape.text_frame.paragraphs:
                    print(i, para.text)
Add an extra branch for shape.has_table to walk table cells (shape.table.rows, then row.cells, then cell.text). For grouped shapes, recurse into shape.shapes when shape.shape_type is MSO_SHAPE_TYPE.GROUP.
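One way the table branch and the group recursion could fit together is sketched below. The function names iter_shape_text and extract_deck_text are made up for this example. As a simplification, the sketch detects a group by the presence of a .shapes attribute instead of comparing shape_type, which should behave the same for python-pptx objects but is an assumption of this sketch.

```python
def iter_shape_text(shapes):
    """Yield non-empty text strings from shapes, descending into groups and tables."""
    for shape in shapes:
        # Group shapes expose their children via .shapes; recurse into them.
        # (Assumption: only group shapes carry a .shapes attribute.)
        children = getattr(shape, "shapes", None)
        if children is not None:
            yield from iter_shape_text(children)
            continue
        # Tables sit inside graphic frames; read each cell's text row by row.
        if getattr(shape, "has_table", False):
            for row in shape.table.rows:
                for cell in row.cells:
                    if cell.text:
                        yield cell.text
            continue
        # Ordinary text-bearing shapes: titles, body placeholders, text boxes.
        if getattr(shape, "has_text_frame", False):
            for para in shape.text_frame.paragraphs:
                if para.text:
                    yield para.text


def extract_deck_text(path):
    """Return a list of (slide_number, text) pairs for a .pptx file."""
    from pptx import Presentation  # imported lazily; requires python-pptx

    prs = Presentation(path)
    return [
        (number, text)
        for number, slide in enumerate(prs.slides, 1)
        for text in iter_shape_text(slide.shapes)
    ]
```

Calling extract_deck_text("deck.pptx") would then give one entry per paragraph or table cell, tagged with its slide number. Pictures and other shapes without text are skipped.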
3. Slides as images
If the deck is a series of image-only slides (common for branded marketing decks and screenshots-of-dashboards decks), neither of the above works. Two options:
- Export to PDF, then OCR with ocrmypdf or run through a PDF text extractor.
- Export each slide as PNG (File → Export → PNG), then run them through ExtractFox's image data extractor with a prompt like "extract all visible text in reading order."
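For the PDF route, a minimal sketch of driving ocrmypdf from Python follows. It assumes the ocrmypdf command-line tool is installed and uses its --sidecar option to write the recognized text to a .txt file next to the searchable PDF. The function names, the ocr_out directory, and the output file naming are illustrative choices, not part of any tool.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(pdf_path, out_dir="ocr_out"):
    """Build an ocrmypdf command that also writes recognized text to a .txt sidecar."""
    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    searchable_pdf = out_dir / f"{pdf_path.stem}.ocr.pdf"  # hypothetical naming scheme
    sidecar_txt = out_dir / f"{pdf_path.stem}.txt"
    command = [
        "ocrmypdf",
        "--sidecar", str(sidecar_txt),  # plain-text copy of the OCR result
        str(pdf_path),                  # input: the PDF exported from the deck
        str(searchable_pdf),            # output: PDF with a text layer added
    ]
    return command, sidecar_txt


def ocr_pdf_to_text(pdf_path, out_dir="ocr_out"):
    """Run ocrmypdf and return the extracted text."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    command, sidecar_txt = build_ocr_command(pdf_path, out_dir)
    sidecar_txt.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return sidecar_txt.read_text(encoding="utf-8")
```

Keeping the command construction separate from the subprocess call makes it easy to print the command and check it before running OCR across a large batch of decks.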
Online .pptx files
If the file is on SharePoint or Google Slides, both support exporting to PDF or .pptx for free. The Google Slides API also exposes presentation content directly via REST — useful for automated pipelines pulling from a shared Drive.
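For the Google Slides route, the sketch below walks the JSON that the API's presentations.get method returns, collecting text runs from shapes, tables, and element groups. It assumes the response has already been fetched and parsed into a dict. Authentication and the HTTP call are left out, and the function names are invented for this example.

```python
def _text_of(text_content):
    """Join the textRun contents of a Slides API text object into one string."""
    runs = (
        element.get("textRun", {}).get("content", "")
        for element in text_content.get("textElements", [])
    )
    return "".join(runs).strip()


def iter_element_text(page_elements):
    """Yield non-empty text from page elements, descending into groups and tables."""
    for element in page_elements:
        if "elementGroup" in element:
            # Groups nest further page elements under "children".
            yield from iter_element_text(element["elementGroup"].get("children", []))
        elif "table" in element:
            for row in element["table"].get("tableRows", []):
                for cell in row.get("tableCells", []):
                    text = _text_of(cell.get("text", {}))
                    if text:
                        yield text
        elif "shape" in element:
            text = _text_of(element["shape"].get("text", {}))
            if text:
                yield text
        # Images, videos, lines, and charts carry no text runs and are skipped.


def slides_text(presentation):
    """Return (slide_number, text) pairs from a presentations.get response dict."""
    return [
        (number, text)
        for number, slide in enumerate(presentation.get("slides", []), 1)
        for text in iter_element_text(slide.get("pageElements", []))
    ]
```

Because the walker only touches plain dicts, it can be tried against a saved JSON response before being wired into a pipeline.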