Tutorial · March 28, 2026 · 5 min read

How to extract a summary from a PDF, video, or article

Three classes of summarization tools (extractive, abstractive, hybrid), how each one fails, and the practical setup for getting a useful 10-line summary out of a 90-minute talk or 300-page report.

By Dawid Sibinski

Summarization isn't one problem. Pulling key sentences from a 5-page article and condensing a 90-minute talk into a memo are different tasks with different failure modes. Pick the wrong tool and you'll get either lossy gibberish or a summary that's longer than the original.

Extractive vs. abstractive

Extractive summarizers pick the most important sentences from the source verbatim. Fast, faithful (every sentence really appeared in the source), but the output reads choppy. Tools: sumy, gensim's summarize (gensim 3.x only; the summarization module was removed in 4.0), bert-extractive-summarizer.

Abstractive summarizers generate new sentences that paraphrase the source. Reads naturally; can hallucinate. Tools: any LLM (Claude, Gemini, GPT-4), or fine-tuned models like BART-large-CNN.

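Hybrid is the practical answer for long documents: extract the most important chunks, then have an LLM rewrite them. This sidesteps context-length limits and leaves less room for hallucination, because the model only sees and rewrites text that was actually selected from the source.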
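A rough sketch of the hybrid pattern, assuming the document is already split into chunks. The `llm` argument is a hypothetical callable (prompt in, text out) standing in for whichever model client you use; the scoring rule and the prompt wording are illustrative assumptions, not a recommended recipe.

```python
import re
from collections import Counter
from typing import Callable


def top_chunks(chunks: list[str], k: int) -> list[str]:
    """Keep the k chunks richest in document-wide frequent words, in source order."""
    def words(text: str) -> list[str]:
        # Words of 4+ letters as a crude stand-in for stopword filtering.
        return re.findall(r"[a-z]{4,}", text.lower())

    freq = Counter(w for chunk in chunks for w in words(chunk))

    def score(chunk: str) -> float:
        tokens = words(chunk)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(chunks)), key=lambda i: score(chunks[i]), reverse=True)
    return [chunks[i] for i in sorted(ranked[:k])]


def hybrid_summary(chunks: list[str], llm: Callable[[str], str], k: int = 3) -> str:
    """Extractive step picks the chunks; abstractive step rewrites only those."""
    selected = top_chunks(chunks, k)
    prompt = (
        "Rewrite the following excerpts as a short summary. "
        "Use only facts stated in the excerpts.\n\n" + "\n\n".join(selected)
    )
    return llm(prompt)  # hypothetical model call
```

The model never sees the chunks that were filtered out, so the prompt stays small regardless of how long the source is.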

PDF summarization

Extract clean text with pdfplumber or pdfminer.six — strip headers, footers, page numbers.
If the PDF is over ~30 pages, chunk by section using the table of contents or heading detection.
Summarize per chunk, then summarize the summaries (map-reduce pattern).
Validate by spot-checking three random chunks against their summaries.
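The steps above can be sketched roughly as follows. The code assumes you already have one text string per page (with pdfplumber that would typically come from `page.extract_text()` on each page of an opened PDF). The `summarize` callable is a hypothetical stand-in for your model call, and the 60% repetition threshold is an arbitrary assumption.

```python
import random
import re
from collections import Counter
from typing import Callable

# Lines that are only a page number, e.g. "12" or "Page 12".
# Caveat: this also drops genuine content lines that consist of a bare number.
PAGE_NUMBER = re.compile(r"(page\s+)?\d+", re.IGNORECASE)


def strip_page_furniture(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop bare page numbers and lines repeated on most pages (headers/footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_share))

    cleaned = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            s = line.strip()
            if not s or PAGE_NUMBER.fullmatch(s) or counts[s] >= threshold:
                continue
            kept.append(s)
        cleaned.append("\n".join(kept))
    return cleaned


def map_reduce(chunks: list[str], summarize: Callable[[str], str]) -> tuple[str, list[str]]:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    partials = [summarize(chunk) for chunk in chunks]
    final = summarize("\n".join(partials))
    return final, partials


def pick_spot_checks(chunks: list[str], partials: list[str], n: int = 3, seed=None):
    """Return n random (chunk, summary) pairs for manual review."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(chunks)), min(n, len(chunks)))
    return [(chunks[i], partials[i]) for i in indices]
```

Returning the per-chunk summaries alongside the final one is a deliberate choice here: the spot check compares each chunk with its own summary, which is only possible if the intermediate results are kept.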

Video / talk summarization

Get the transcript: YouTube auto-transcript, or yt-dlp + Whisper for higher quality.
Add speaker diarization if there's more than one speaker (pyannote.audio).
Summarize with timestamps preserved so the summary cites the moment in the video.
Output as a bulleted list with [HH:MM:SS] anchors — far more useful than a paragraph.
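One way to keep the timestamps attached, sketched under a few assumptions: the transcript is a list of segments with a `start` time in seconds and a `text` field (the shape Whisper's segment output uses), segments are grouped into fixed five-minute windows, and `summarize` is a hypothetical callable for the model.

```python
from typing import Callable


def hhmmss(seconds: float) -> str:
    """Format a second offset as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{(total % 3600) // 60:02d}:{total % 60:02d}"


def window_segments(segments: list[dict], window_s: float = 300.0) -> list[tuple[float, str]]:
    """Merge consecutive segments into windows; each window keeps its first start time."""
    windows: list[tuple[float, str]] = []
    for seg in segments:
        text = seg["text"].strip()
        if not windows or seg["start"] - windows[-1][0] >= window_s:
            windows.append((seg["start"], text))
        else:
            start, so_far = windows[-1]
            windows[-1] = (start, so_far + " " + text)
    return windows


def timestamped_bullets(
    segments: list[dict],
    summarize: Callable[[str], str],
    window_s: float = 300.0,
) -> list[str]:
    """One bullet per window, anchored at the moment the window starts."""
    return [
        f"- [{hhmmss(start)}] {summarize(text)}"
        for start, text in window_segments(segments, window_s)
    ]
```

Fixed windows are the simplest grouping; splitting on speaker changes or topic shifts would likely give better anchors but needs diarization or a segmentation step first.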

Article summarization

For web articles, use a content extractor first (Trafilatura, Readability) to strip nav/ads/footer, then summarize. Skipping this step is the most common reason article summaries are bad — the model is summarizing the cookie banner along with the content.
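To show why the extraction step matters, here is a deliberately crude stand-in built on the standard library's HTML parser. It is not what Trafilatura or Readability do (they use content-density heuristics and handle far messier pages); it only drops text inside a fixed set of non-content tags, and that tag list is an assumption.

```python
from html.parser import HTMLParser

# Elements whose text is treated as page chrome rather than article content.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}


class MainTextParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # > 0 while inside any skipped element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    """Return visible text outside navigation, footer, and script/style elements."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

A cookie banner rendered as a plain `div` would slip through this filter, which is exactly the kind of case the dedicated extractors exist to handle.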

What summaries are bad at

Numbers: LLMs round, conflate, and occasionally invent. Always preserve numeric claims verbatim or extract them separately.
Tables: once collapsed into prose, they lose the structure that made them useful.
Quotes: paraphrased quotes lose the speaker's voice; if the quote matters, extract it as a quote.
Causal chains: "X because Y" often becomes "X and Y" in the summary.
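The numbers problem is the easiest of these to check mechanically. A minimal sketch, assuming a simple regex is good enough to pull out numeric tokens: it flags any number in the summary that does not appear verbatim in the source. The pattern and function names are illustrative assumptions, and the check will not catch a correct number attached to the wrong claim.

```python
import re

# Digits with optional thousands/decimal separators and an optional percent sign.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*%?")


def numeric_tokens(text: str) -> set[str]:
    return set(NUMBER_RE.findall(text))


def unsupported_numbers(source: str, summary: str) -> list[str]:
    """Numbers that appear in the summary but nowhere in the source."""
    return sorted(numeric_tokens(summary) - numeric_tokens(source))
```

For example, a summary that turns "12.4%" into "about 12%" is flagged, because "12%" never occurs in the source text.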

When summarization isn't the right job

If what you actually want is structured data — every recommendation, every action item, every figure mentioned — extraction beats summarization every time. Define the schema, ask the model to fill it. ExtractFox's free-text mode does this against any document; pair it with a transcript for video.
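A sketch of the "define the schema, ask the model to fill it" approach, independent of any particular product. The `ActionItem` fields, the prompt wording, and the assumption that the model replies with a bare JSON list are all hypothetical; the point is that the reply gets validated against the schema instead of being trusted as prose.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    owner: str
    task: str
    due: Optional[str] = None  # None when the document states no due date


def build_prompt(document: str) -> str:
    """Ask for JSON that matches the ActionItem schema."""
    return (
        "Extract every action item from the document below. "
        'Reply with JSON only: a list of objects with keys "owner", "task", '
        'and "due" (use null when no due date is stated).\n\n' + document
    )


def parse_action_items(reply: str) -> list[ActionItem]:
    """Validate the model's reply; raise ValueError if it does not fit the schema."""
    rows = json.loads(reply)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON list")
    items = []
    for row in rows:
        if not isinstance(row, dict):
            raise ValueError("expected each entry to be a JSON object")
        missing = {"owner", "task"} - row.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        items.append(ActionItem(owner=row["owner"], task=row["task"], due=row.get("due")))
    return items
```

Failing loudly on a missing key is the design choice that matters: a summary can silently omit an item, but a schema check makes the omission of a required field visible.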