TutorialApril 27, 20269 min read

Extract text from PowerPoint (.pptx): Outline View, python-pptx, speaker notes

Q: Does python-pptx extract text from PDF-exported PowerPoint?

No — once exported to PDF, you need a PDF extractor. python-pptx only reads the .pptx XML structure directly.

Q: Why is some slide text missing?

Usually grouped shapes, SmartArt, or text rendered as images. Run the recursive group walker. If still missing, the text is baked into a picture — OCR is the only path.

Q: Can I extract text from Google Slides programmatically?

Yes — use the Google Slides API presentations.get, which returns every text element with its objectId and content. Export to .pptx and use python-pptx if you prefer a file-based pipeline.

Extract all text from a .pptx with Outline View, python-pptx (tables, groups, speaker notes), or OCR for image slides — copy-paste code for automation pipelines.

By Dawid Sibinski · Updated July 15, 2026

To extract text from a PowerPoint file, start with Outline View or Save As Outline for a one-off export. For code, use python-pptx to read text shapes and tables. If the slide is an image or screenshot, use OCR or image-based extraction instead.

Most .pptx files are easy to extract from — text lives in shapes that any tool can read. The hard cases are slides built as images, screenshots of dashboards, or decks exported to PDF where the original .pptx is gone.

1. PowerPoint's outline view

View → Outline View shows every text box as plain text in slide order. Select all, copy, paste into your destination. This catches title and body text but misses content inside grouped shapes, SmartArt, and image-based text.

Faster variant: File → Save As → Outline (.rtf). You get a clean text file of every text element on every slide.

2. python-pptx for programmatic access

MIT-licensed, handles every text shape including those in groups and tables:

from pptx import Presentation prs = Presentation("deck.pptx") for i, slide in enumerate(prs.slides, 1): for shape in slide.shapes: if shape.has_text_frame: for para in shape.text_frame.paragraphs: print(i, para.text)

Add an extra branch for shape.has_table to walk table cells. For grouped shapes, recurse into shape.shapes when shape.shape_type is GROUP.

3. Slides as images

If the deck is a series of image-only slides (common for branded marketing decks and screenshots-of-dashboards decks), neither of the above works. Two options:

Export to PDF, then OCR with ocrmypdf or run through a PDF text extractor.
Export each slide as PNG (File → Export → PNG), then run them through ExtractFox's image data extractor with a prompt like "extract all visible text in reading order."

Online .pptx files

If the file is on SharePoint or Google Slides, both support exporting to PDF or .pptx for free. The Google Slides API also exposes presentation content directly via REST — useful for automated pipelines pulling from a shared Drive.

python-pptx: tables, groups, and speaker notes

A complete extractor that handles the cases Outline View misses:

from pptx import Presentation from pptx.enum.shapes import MSO_SHAPE_TYPE def shape_text(shape): if shape.has_text_frame: return "\n".join(p.text for p in shape.text_frame.paragraphs) if shape.has_table: return "\n".join(" | ".join(cell.text for cell in row.cells) for row in shape.table.rows) if shape.shape_type == MSO_SHAPE_TYPE.GROUP: return "\n".join(shape_text(s) for s in shape.shapes) return "" prs = Presentation("deck.pptx") for i, slide in enumerate(prs.slides, 1): print(f"--- Slide {i} ---") for shape in slide.shapes: t = shape_text(shape) if t.strip(): print(t) if slide.has_notes_slide: print("[Notes]", slide.notes_slide.notes_text_frame.text)

Speaker notes live in a separate notes_slide — easy to miss if you only iterate slide.shapes. Compliance and legal reviews often care about notes more than slide body text.

Legacy .ppt files

python-pptx does not read binary .ppt (pre-2007). Options: open in PowerPoint/LibreOffice and Save As .pptx, or use LibreOffice headless:

soffice --headless --convert-to pptx legacy.ppt

Then run python-pptx on the converted file. Batch-convert a folder before your extraction script runs.

SmartArt and charts

SmartArt text is inside grouped shapes — the recursive shape_text function above catches most of it. Chart titles and data labels are trickier: chart shapes expose chart.chart_title and series names via the chart API, but data values require chart.plots[0].series[0].values. For a full data export, consider exporting the embedded Excel workbook (chart.part.related_parts) rather than parsing the visual.

Choosing the right method

Situation	Best approach
One deck, quick copy-paste	Outline View or Save As .rtf
Automation, real text shapes	python-pptx with group recursion
Need speaker notes	python-pptx notes_slide branch
Slides are screenshots/images	Export PNG → OCR or ExtractFox image extractor
Only have PDF export	PDF text extractor or ocrmypdf
Legacy .ppt	LibreOffice convert → python-pptx

What's inside a .pptx file

A .pptx is a ZIP archive of XML parts. Slide text lives in ppt/slides/slideN.xml inside <a:t> runs. Speaker notes are in ppt/notesSlides/. Tables, groups, and SmartArt each have their own part files. python-pptx abstracts this; if you need a quick grep without Python, unzip the file and search the XML:

unzip -p deck.pptx ppt/slides/slide1.xml | grep -oP '(?<=<a:t>)[^<]+'

That path is brittle — XML entities, split runs, and phonetic hints break naive regex — but it confirms whether text is actually stored or baked into images before you debug python-pptx.

Exporting slides to structured JSON

For a search index or RAG pipeline, flat text per slide isn't enough — you want slide number, title, body, and notes as separate fields:

def slide_to_dict(slide, index): title = "" body = [] for shape in slide.shapes: if not shape.has_text_frame: continue t = shape.text_frame.text.strip() if shape == slide.shapes.title: title = t elif t: body.append(t) notes = "" if slide.has_notes_slide: notes = slide.notes_slide.notes_text_frame.text return {"slide": index, "title": title, "body": body, "notes": notes}

Serialize the list with json.dumps and you have a document ready for embedding or full-text search. Keep slide numbers — citations back to the deck matter in compliance reviews.

Troubleshooting missing text

Text in a picture — no XML text runs exist. Export slide as PNG and OCR.
Embedded video with burned-in captions — captions aren't in slide XML; transcribe the video separately.
Password-protected deck — python-pptx raises PackageNotFoundError. Remove protection in PowerPoint first.
Macros in .pptm — python-pptx reads slide text fine; macro code in vbaProject.bin is ignored.
Right-to-left languages — text extracts correctly but paragraph order may differ from visual order; test with your target locale.

Frequently asked questions

Does python-pptx extract text from PDF-exported PowerPoint?+

No — once exported to PDF, you need a PDF extractor. python-pptx only reads the .pptx XML structure directly.

Why is some slide text missing?+

Usually grouped shapes, SmartArt, or text rendered as images. Run the recursive group walker. If still missing, the text is baked into a picture — OCR is the only path.

Can I extract text from Google Slides programmatically?+

Yes — use the Google Slides API presentations.get, which returns every text element with its objectId and content. Export to .pptx and use python-pptx if you prefer a file-based pipeline.

Extract text from PowerPoint (.pptx): Outline View, python-pptx, speaker notes

1. PowerPoint's outline view

2. python-pptx for programmatic access

3. Slides as images

Online .pptx files

python-pptx: tables, groups, and speaker notes

Legacy .ppt files

SmartArt and charts

Choosing the right method

What's inside a .pptx file

Exporting slides to structured JSON

Troubleshooting missing text

Frequently asked questions

Related reading

Stop reading, start extracting