How to extract a summary from a PDF, video, or article
Three classes of summarization tools (extractive, abstractive, hybrid), how each one fails, and the practical setup for getting a useful 10-line summary out of a 90-minute talk or 300-page report.
Summarization isn't one problem. Pulling key sentences from a 5-page article and condensing a 90-minute talk into a memo are different tasks with different failure modes. Pick the wrong tool and you'll get either lossy gibberish or a summary that's longer than the original.
Extractive vs. abstractive
Extractive summarizers pick the most important sentences from the source verbatim. Fast, faithful (every sentence really appeared), but reads choppy. Tools: sumy, gensim's summarize (gensim 3.x only; the summarization module was removed in 4.0), bert-extractive-summarizer.
Abstractive summarizers generate new sentences that paraphrase the source. Reads naturally; can hallucinate. Tools: any LLM (Claude, Gemini, GPT-4), or fine-tuned models like BART-large-CNN.
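Hybrid is the practical answer for long documents: extract the most important chunks, then have an LLM rewrite them. Only the extracted chunks reach the model, so the input fits inside the context window, and the rewrite stays anchored to sentences that really appear in the source, which keeps hallucinations low.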
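A minimal sketch of the hybrid pattern, under stated assumptions: the extractive step here is a plain word-frequency scorer (a stand-in for sumy or a BERT-based extractor, not their algorithm), and `rewrite` is a hypothetical callable that wraps whichever LLM you use.

```python
import re
from collections import Counter
from typing import Callable, List

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "was", "be", "this", "by"}


def _tokens(text: str) -> List[str]:
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_top_sentences(text: str, k: int = 3) -> List[str]:
    """Extractive step: return the k highest-scoring sentences, verbatim, in source order."""
    sentences = split_sentences(text)
    freq = Counter(_tokens(text))

    def score(sentence: str) -> float:
        toks = _tokens(sentence)
        # Average document frequency of the sentence's words.
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    best = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


def hybrid_summary(text: str, rewrite: Callable[[str], str], k: int = 3) -> str:
    """Abstractive step: hand only the extracted sentences to the (hypothetical) LLM wrapper."""
    return rewrite(" ".join(extract_top_sentences(text, k)))
```

Because the model only ever sees the extracted sentences, you can diff its output against a short, known input when something looks invented.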
PDF summarization
- Extract clean text with pdfplumber or pdfminer.six — strip headers, footers, page numbers.
- If the PDF is over ~30 pages, chunk by section using the table of contents or heading detection.
- Summarize per chunk, then summarize the summaries (map-reduce pattern).
- Validate by spot-checking three random chunks against their summaries.
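The steps above can be sketched roughly as follows. The assumptions are loud ones: `pages` is a list of per-page strings (for example from pdfplumber's `page.extract_text()`), headings are numbered like "2.1 Methods" (real PDFs often need a different pattern), and `summarize` is a hypothetical callable wrapping your model.

```python
import re
from collections import Counter
from typing import Callable, List

# Assumption: section headings look like "1 Intro" or "2.1 Methods".
HEADING = re.compile(r"^\d+(\.\d+)*\s+\S")
# Bare page numbers such as "12" or "Page 12". Note this also drops numeric-only data lines.
PAGE_NUMBER = re.compile(r"(page\s+)?\d+", re.IGNORECASE)


def strip_repeated_lines(pages: List[str], min_share: float = 0.6) -> List[str]:
    """Drop lines that recur on most pages (headers/footers) plus bare page numbers."""
    counts = Counter()
    for page in pages:
        # Use a set so each line is counted once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {ln for ln, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines()
                if ln.strip()
                and ln.strip() not in repeated
                and not PAGE_NUMBER.fullmatch(ln.strip())]
        cleaned.append("\n".join(kept))
    return cleaned


def split_by_headings(text: str) -> List[str]:
    """Start a new chunk at every line that looks like a numbered heading."""
    chunks, current = [], []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Map: summarize each chunk. Reduce: summarize the joined partial summaries."""
    if not chunks:
        return ""
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    return summarize("\n".join(partials))
```

Keeping the per-chunk summaries around (rather than only the final result) is what makes the spot-check step cheap: pick three chunk indices at random and read each chunk next to its partial summary.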
Video / talk summarization
- Get the transcript: YouTube auto-transcript, or yt-dlp + Whisper for higher quality.
- Add speaker diarization if there's more than one speaker (pyannote.audio).
- Summarize with timestamps preserved so the summary cites the moment in the video.
- Output as a bulleted list with [HH:MM:SS] anchors — far more useful than a paragraph.
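One way to keep the timestamps attached, sketched under assumptions: `segments` is a list of dicts with `start` (seconds) and `text` keys, which is the shape of the `segments` list Whisper returns from `transcribe`; the five-minute window and the `summarize` callable are illustrative choices, not fixed parts of the method.

```python
from typing import Callable, Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a second offset as an [HH:MM:SS] anchor."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def group_segments(segments: List[Dict], window_s: float = 300) -> List[Dict]:
    """Merge consecutive segments into blocks; each block keeps its first start time."""
    blocks: List[Dict] = []
    for seg in segments:
        if not blocks or seg["start"] - blocks[-1]["start"] >= window_s:
            blocks.append({"start": seg["start"], "text": seg["text"].strip()})
        else:
            blocks[-1]["text"] += " " + seg["text"].strip()
    return blocks


def timestamped_bullets(segments: List[Dict],
                        summarize: Callable[[str], str],
                        window_s: float = 300) -> str:
    """Summarize each block and prefix it with the moment it starts in the video."""
    return "\n".join(
        f"- {format_timestamp(block['start'])} {summarize(block['text'])}"
        for block in group_segments(segments, window_s)
    )
```

Summarizing per block, rather than summarizing the whole transcript and asking the model to guess timestamps afterwards, means every anchor comes from the transcript itself.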
Article summarization
For web articles, use a content extractor first (Trafilatura, Readability) to strip nav/ads/footer, then summarize. Skipping this step is a common reason article summaries come out bad: the model summarizes the cookie banner along with the content.
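A hedged sketch of the extract-first step. It calls `trafilatura.extract` on an HTML string when the library is installed; the fallback is a deliberately crude standard-library tag filter included only so the example runs anywhere. The fallback will miss banners built from plain `div` elements, so treat it as an illustration rather than a replacement.

```python
from html.parser import HTMLParser
from typing import List

# Elements whose text is almost never article content.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}


class _MainTextParser(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def crude_extract(html: str) -> str:
    """Fallback: keep text outside nav/header/footer/aside/script/style/form."""
    parser = _MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def extract_main_text(html: str) -> str:
    """Prefer Trafilatura; fall back if it is missing or returns nothing."""
    try:
        import trafilatura
    except ImportError:
        return crude_extract(html)
    # trafilatura.extract returns None when it finds no usable main content.
    return trafilatura.extract(html) or crude_extract(html)
```

Whatever extractor you use, print the first few hundred characters of its output before summarizing; if you see menu items or consent text there, the summary will contain them too.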
What summaries are bad at
- Numbers — LLMs round, conflate, and occasionally invent. Always preserve numeric claims verbatim or extract them separately.
- Tables — collapsed into prose lose the structure that made them useful.
- Quotes — paraphrased quotes lose the speaker's voice; if the quote matters, extract it as a quote.
- Causal chains — "X because Y" often becomes "X and Y" in summary.
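For the numbers item in particular, a cheap guard is to check that every figure in the summary appears verbatim in the source. The sketch below assumes a simple regex is good enough to spot numeric claims (digits with optional thousands or decimal separators and an optional percent sign); it ignores spelled-out numbers and units.

```python
import re
from typing import List

# Digits with optional "," or "." groups and an optional trailing percent sign.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def numeric_claims(text: str) -> List[str]:
    """Return every number-like token in the text, in order of appearance."""
    return NUMBER.findall(text)


def unsupported_numbers(source: str, summary: str) -> List[str]:
    """Numbers that appear in the summary but nowhere, verbatim, in the source."""
    source_numbers = set(numeric_claims(source))
    return [n for n in numeric_claims(summary) if n not in source_numbers]


source = "Revenue grew 12.4% to 3,150 units in 2023."
summary = "Revenue grew about 12% to 3,150 units."
print(unsupported_numbers(source, summary))  # ['12%']
```

A non-empty result does not prove the summary is wrong, only that a figure was rounded, reformatted, or invented and deserves a manual look.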
When summarization isn't the right job
If what you actually want is structured data — every recommendation, every action item, every figure mentioned — extraction beats summarization every time. Define the schema, ask the model to fill it. ExtractFox's free-text mode does this against any document; pair it with a transcript for video.
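"Define the schema, ask the model to fill it" might look like the sketch below. Everything named here is illustrative: the `ActionItem` fields are a made-up schema, `ask_model` is a hypothetical callable that sends a prompt to whichever model you use and returns its text reply, and none of it describes how any particular product implements extraction.

```python
import json
from dataclasses import dataclass, fields
from typing import Callable, List


@dataclass
class ActionItem:
    """Hypothetical schema: one row per action item found in the document."""
    owner: str
    task: str
    due: str


def build_prompt(document: str) -> str:
    """Ask for a JSON array whose objects carry exactly the schema's keys."""
    keys = [f.name for f in fields(ActionItem)]
    return (
        "Extract every action item from the document below. "
        f"Reply with only a JSON array of objects with exactly these keys: {keys}. "
        "Use an empty string for anything the document does not state.\n\n"
        + document
    )


def extract_action_items(document: str, ask_model: Callable[[str], str]) -> List[ActionItem]:
    """Send the prompt, parse the reply, and coerce each row into the schema."""
    rows = json.loads(ask_model(build_prompt(document)))
    return [
        ActionItem(**{f.name: str(row.get(f.name, "")) for f in fields(ActionItem)})
        for row in rows
    ]
```

Parsing into a fixed dataclass means a missing key shows up as an empty field you can count, instead of silently vanishing the way it would in a prose summary.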