Tutorial · March 28, 2026 · 5 min read

How to extract a summary from a PDF, video, or article

Three classes of summarization tools (extractive, abstractive, hybrid), how each one fails, and the practical setup for getting a useful 10-line summary out of a 90-minute talk or 300-page report.

By Dawid Sibinski

Summarization isn't one problem. Pulling key sentences from a 5-page article and condensing a 90-minute talk into a memo are different tasks with different failure modes. Pick the wrong tool and you'll get either lossy gibberish or a summary that's longer than the original.

Extractive vs. abstractive

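Extractive summarizers pick the most important sentences from the source verbatim. Fast and faithful (every sentence really appeared in the source), but the output reads choppy because the sentences were never written to sit next to each other. Tools: sumy, bert-extractive-summarizer, and gensim's summarize (gensim 3.x only; the summarization module was removed in 4.0).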
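
To make the idea concrete, here is a minimal sketch of a frequency-based extractive summarizer using only the standard library. It is not how sumy or the BERT-based tools score sentences; the stopword list, the naive sentence splitter, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real tool would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "are", "was", "be", "by"}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, max_sentences=3):
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Pick the top-scoring sentences, then restore source order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]
```

Every sentence returned is copied from the input unchanged, which is where the faithfulness comes from, and also why the result can read choppy.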

Abstractive summarizers generate new sentences that paraphrase the source. Reads naturally; can hallucinate. Tools: any LLM (Claude, Gemini, GPT-4), or fine-tuned models like BART-large-CNN.

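Hybrid is the practical answer for long documents: extract the most important chunks, then have an LLM rewrite them. The extraction step keeps the input under the model's context limit, and because the model only rewrites text that was in the source, hallucinations stay low.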
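
A rough sketch of the hybrid pattern, assuming the document is already split into chunks. The scoring is a simple word-frequency heuristic, the character budget stands in for a real token count, and `rewrite` is a hypothetical callable wrapping whichever LLM you use.

```python
import re
from collections import Counter
from typing import Callable, List


def select_chunks(chunks: List[str], budget_chars: int) -> List[str]:
    """Keep the highest-scoring chunks that fit in budget_chars, in source order."""
    freq = Counter(re.findall(r"[a-z]{4,}", " ".join(chunks).lower()))

    def score(chunk: str) -> float:
        tokens = re.findall(r"[a-z]{4,}", chunk.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(chunks)),
                    key=lambda i: score(chunks[i]), reverse=True)
    kept, used = [], 0
    for i in ranked:
        if used + len(chunks[i]) <= budget_chars:
            kept.append(i)
            used += len(chunks[i])
    return [chunks[i] for i in sorted(kept)]


def hybrid_summary(chunks: List[str], rewrite: Callable[[str], str],
                   budget_chars: int = 8000) -> str:
    # Extractive step: narrow the input. Abstractive step: let the model rewrite it.
    selected = select_chunks(chunks, budget_chars)
    prompt = ("Rewrite the following excerpts as a concise summary. "
              "Use only facts stated in them.\n\n" + "\n\n".join(selected))
    return rewrite(prompt)
```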

PDF summarization

  1. Extract clean text with pdfplumber or pdfminer.six — strip headers, footers, page numbers.
  2. If the PDF is over ~30 pages, chunk by section using the table of contents or heading detection.
  3. Summarize per chunk, then summarize the summaries (map-reduce pattern).
  4. Validate by spot-checking three random chunks against their summaries.
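
Steps 1 and 3 above can be sketched as follows. The code assumes you already have one text string per page (for example from pdfplumber's `page.extract_text()`), and `summarize` is a hypothetical callable that wraps your model. The 60% repetition threshold and the character limit are illustrative values, not recommendations from any library.

```python
import re
from collections import Counter
from typing import Callable, List


def strip_repeated_lines(pages: List[str], min_share: float = 0.6) -> List[str]:
    """Drop lines that recur on most pages (running headers/footers) and bare page numbers."""
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines()
                if line.strip() not in repeated
                and not re.fullmatch(r"\s*\d+\s*", line)]
        cleaned.append("\n".join(kept))
    return cleaned


def map_reduce_summary(chunks: List[str], summarize: Callable[[str], str],
                       max_chars: int = 6000) -> str:
    """Summarize each chunk (map), then summarize the summaries (reduce)."""
    if not chunks:
        return ""
    partials = [summarize(c) for c in chunks]
    while len(partials) > 1:
        # Merge neighbouring summaries into batches that fit max_chars.
        batches, current = [], ""
        for p in partials:
            if current and len(current) + len(p) + 1 > max_chars:
                batches.append(current)
                current = p
            else:
                current = f"{current}\n{p}" if current else p
        batches.append(current)
        if len(batches) == len(partials):
            # Nothing could be merged; combine everything so the loop always shrinks.
            batches = ["\n".join(partials)]
        partials = [summarize(b) for b in batches]
    return partials[0]
```

The reduce loop repeats until a single summary remains, so a 300-page report goes through as many rounds as its size requires.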

Video / talk summarization

  1. Get the transcript: YouTube auto-transcript, or yt-dlp + Whisper for higher quality.
  2. Add speaker diarization if there's more than one speaker (pyannote.audio).
  3. Summarize with timestamps preserved so the summary cites the moment in the video.
  4. Output as a bulleted list with [HH:MM:SS] anchors — far more useful than a paragraph.
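
Steps 3 and 4 might look like this. The sketch assumes transcript segments are dicts with a `start` time in seconds and a `text` field, which is the shape Whisper's transcription result uses for its segments; `summarize` is again a hypothetical model wrapper, and the five-minute window is an arbitrary choice.

```python
from typing import Callable, Dict, List


def hhmmss(seconds: float) -> str:
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def timestamped_bullets(segments: List[Dict], summarize: Callable[[str], str],
                        window_seconds: int = 300) -> List[str]:
    """Group segments into time windows and emit one anchored bullet per window."""
    bullets = []
    window_start, buffer = None, []
    for seg in segments:
        if window_start is None:
            window_start = seg["start"]
        elif seg["start"] - window_start >= window_seconds:
            # Close the current window; its anchor is where the window began.
            bullets.append(f"- [{hhmmss(window_start)}] {summarize(' '.join(buffer))}")
            window_start, buffer = seg["start"], []
        buffer.append(seg["text"].strip())
    if buffer:
        bullets.append(f"- [{hhmmss(window_start)}] {summarize(' '.join(buffer))}")
    return bullets
```

Each bullet carries the start time of its window, so a reader can jump straight to that point in the video.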

Article summarization

For web articles, use a content extractor first (Trafilatura, Readability) to strip nav/ads/footer, then summarize. Skipping this step is the most common reason article summaries are bad — the model is summarizing the cookie banner along with the content.
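
To show what the extraction step removes, here is a deliberately crude standard-library stand-in. It is not what Trafilatura or Readability do internally; it only drops text inside a fixed set of tags, and the tag list is an assumption. With Trafilatura installed, the equivalent is roughly `trafilatura.extract(trafilatura.fetch_url(url))`.

```python
from html.parser import HTMLParser

# Tags whose contents are almost never article text (illustrative list).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form", "noscript"}


class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside any skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Real pages hide boilerplate in plain `div` elements too, which is why a dedicated extractor is worth the dependency.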

What summaries are bad at

  • Numbers — LLMs round, conflate, and occasionally invent. Always preserve numeric claims verbatim or extract them separately.
  • Tables — collapsed into prose lose the structure that made them useful.
  • Quotes — paraphrased quotes lose the speaker's voice; if the quote matters, extract it as a quote.
  • Causal chains — "X because Y" often becomes "X and Y" in summary.
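
The numbers problem is the easiest of these to check mechanically. A minimal sketch, assuming a simple regex is good enough to find figures: it flags any number that appears in the summary but not in the source. The pattern handles thousands separators, decimals, and a trailing percent sign, and nothing more.

```python
import re

# Digits with optional thousands separators, decimal part, and percent sign.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


def numbers_in(text: str) -> set:
    # Normalise "1,250" and "1250" to the same token.
    return {m.replace(",", "") for m in NUMBER.findall(text)}


def unsupported_numbers(source: str, summary: str) -> list:
    """Numbers the summary states that never appear in the source."""
    return sorted(numbers_in(summary) - numbers_in(source))


source = "Revenue rose 12.4% to $1,250 million in 2025."
summary = "Revenue rose about 12% to $1,250 million in 2025."
print(unsupported_numbers(source, summary))  # ['12%']
```

A non-empty result does not prove the summary is wrong, but every flagged figure deserves a manual look.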

When summarization isn't the right job

If what you actually want is structured data — every recommendation, every action item, every figure mentioned — extraction beats summarization every time. Define the schema, ask the model to fill it. ExtractFox's free-text mode does this against any document; pair it with a transcript for video.
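
As a generic illustration of "define the schema, ask the model to fill it" (this is not ExtractFox's API), the sketch below builds a prompt from a schema and validates the JSON that comes back. The field names and the `ask_model` callable are hypothetical.

```python
import json
from typing import Callable, Dict, List

# Hypothetical schema: field name -> plain-language description for the model.
SCHEMA = {
    "recommendations": "list of strings",
    "action_items": "list of strings",
    "figures": "list of strings, each a number quoted exactly with its unit",
}


def build_prompt(document: str, schema: Dict[str, str]) -> str:
    fields = "\n".join(f'- "{key}": {desc}' for key, desc in schema.items())
    return ("Extract the following fields from the document. Return only a JSON "
            "object with exactly these keys. Use an empty list when nothing applies.\n"
            f"{fields}\n\nDocument:\n{document}")


def extract(document: str, ask_model: Callable[[str], str],
            schema: Dict[str, str] = SCHEMA) -> Dict[str, List[str]]:
    data = json.loads(ask_model(build_prompt(document, schema)))
    if not isinstance(data, dict):
        raise ValueError("model did not return a JSON object")
    result = {}
    for key in schema:
        value = data.get(key)
        if not isinstance(value, list):
            raise ValueError(f"field {key!r} is missing or not a list")
        result[key] = value   # keys outside the schema are dropped
    return result
```

Validating the response matters as much as the prompt: a missing field should fail loudly rather than pass silently as "nothing found".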
