Tutorial · April 8, 2026 · 5 min read

How to extract text from a YouTube video

Three reliable ways to turn a YouTube video into searchable, citable text — using the built-in transcript, yt-dlp + Whisper, or browser tools — and when each one is the right call.

By Dawid Sibinski

Whether you're researching, repurposing content for a blog, building citations into a paper, or just want to ctrl-F a two-hour talk, getting text out of a YouTube video is mostly a solved problem. Mostly.

1. The built-in transcript (fastest, free)

Most YouTube videos have an auto-generated transcript. To open it, click the three dots under the video, then "Show transcript." (On some layouts the button sits at the bottom of the expanded description instead.) A panel opens on the right with timestamped lines. Click the three dots in that panel to toggle timestamps off, then select all and copy.

Quality depends on the original audio. Clean studio recordings come out near-perfect. Heavy accents, music, and overlapping speakers degrade it fast. The auto-transcript also won't include speaker labels.
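If you copied the panel with timestamps still on, a few lines of Python can clean it up. This is a minimal sketch that assumes each timestamp sits on its own line (for example "0:07" or "1:02:03"), which is how the panel usually pastes. The function name strip_timestamps is made up for this example.

```python
import re

# Matches a line that is only a timestamp, e.g. "0:07", "12:34" or "1:02:03".
TIMESTAMP_LINE = re.compile(r"^\s*(?:\d{1,2}:)?\d{1,2}:\d{2}\s*$")


def strip_timestamps(pasted: str) -> str:
    """Drop blank and timestamp-only lines, then join the rest into one paragraph."""
    kept = []
    for line in pasted.splitlines():
        if not line.strip() or TIMESTAMP_LINE.match(line):
            continue
        kept.append(line.strip())
    return " ".join(kept)
```

Call strip_timestamps(text) on the pasted text. If your paste puts the timestamp and the words on the same line, the pattern would need adjusting.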

2. yt-dlp + Whisper (best quality, free)

If the auto-transcript is garbage or missing, run the audio through OpenAI's Whisper model. yt-dlp pulls the audio, Whisper transcribes it.

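  1. Install yt-dlp (brew install yt-dlp on macOS, or pip install yt-dlp). Audio extraction and Whisper both need ffmpeg on your PATH (brew install ffmpeg on macOS).
  2. Pull audio: yt-dlp -x --audio-format mp3 -o "audio.%(ext)s" "https://youtube.com/watch?v=..."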
  3. Install Whisper: pip install openai-whisper
  4. Transcribe: whisper audio.mp3 --model medium --output_format txt
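
If you do this often, the download and transcribe steps can be chained from a small script. The sketch below assumes yt-dlp, whisper, and ffmpeg are already installed and on your PATH. The function names build_commands and run_pipeline are made up for this example. The -o "audio.%(ext)s" template pins the download to audio.mp3, which is the filename the transcribe step expects; without it yt-dlp names the file after the video title.

```python
import subprocess


def build_commands(url: str, model: str = "medium") -> list[list[str]]:
    """Return the yt-dlp and whisper command lines as argument lists."""
    download = [
        "yt-dlp", "-x", "--audio-format", "mp3",
        "-o", "audio.%(ext)s",  # fixed output name, so the result is audio.mp3
        url,
    ]
    transcribe = [
        "whisper", "audio.mp3",
        "--model", model,
        "--output_format", "txt",
    ]
    return [download, transcribe]


def run_pipeline(url: str, model: str = "medium") -> None:
    """Run both steps in order; check=True raises if either command fails."""
    for cmd in build_commands(url, model):
        subprocess.run(cmd, check=True)
```

Calling run_pipeline with a video URL should leave audio.mp3 and Whisper's audio.txt in the current directory. Passing the arguments as a list avoids shell quoting problems with URLs that contain an ampersand.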

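The medium model is the sweet spot for most accents and languages. Use large-v3 if you have a GPU and need translation (Whisper translates into English only, via --task translate) or have to deal with hard accents. Expect roughly real-time on a modern CPU for medium, often slower on older machines, and much faster on a GPU.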
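
The same model choice can be made in code with Whisper's Python API, which also gives you per-segment timestamps. This is a sketch, not the only way to do it: pick_model, format_timestamp, and transcribe are names invented here, and the GPU check assumes PyTorch (installed as a Whisper dependency) can see a CUDA device.

```python
def pick_model(has_gpu: bool) -> str:
    """Apply the rule of thumb above: large-v3 on a GPU, medium otherwise."""
    return "large-v3" if has_gpu else "medium"


def format_timestamp(seconds: float) -> str:
    """Render a number of seconds as H:MM:SS, dropping fractions."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def transcribe(path: str) -> list[str]:
    """Transcribe an audio file and return one timestamped line per segment."""
    # Imported lazily so the helpers above work without the heavy dependencies.
    import torch
    import whisper

    model = whisper.load_model(pick_model(torch.cuda.is_available()))
    result = model.transcribe(path)
    return [
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in result["segments"]
    ]
```

Calling transcribe("audio.mp3") returns lines you can join with newlines and save. The first run downloads the model weights, so it takes noticeably longer than later runs.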

3. Browser tools and extensions

Several free sites accept a YouTube URL and return a transcript. They mostly wrap the same auto-caption API the YouTube panel uses, so quality is identical to method 1; the only gain is getting it in one click. Useful if you don't want to install anything.
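
If you already have yt-dlp installed, you can get at the same auto-captions yourself: yt-dlp --skip-download --write-auto-subs --sub-langs en --sub-format vtt followed by the video URL saves them as a WebVTT file. The sketch below flattens such a file into plain text. It is an illustration only: vtt_to_text is a made-up name, and it assumes the file uses inline timing tags and repeats lines across adjacent cues, which is typical for auto-captions but not guaranteed.

```python
import re

# Inline cue tags such as <c>, </c> or <00:00:01.500>.
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Flatten WebVTT captions into plain text, skipping repeated lines."""
    lines = []
    for raw in vtt.splitlines():
        line = TAG.sub("", raw).strip()
        # Skip blanks, the file header, and cue timing lines.
        if not line or line.startswith("WEBVTT") or "-->" in line:
            continue
        # Skip header metadata and comment blocks.
        if line.startswith(("Kind:", "Language:", "NOTE")):
            continue
        # Auto-captions scroll, so the same line often shows up in adjacent cues.
        if lines and lines[-1] == line:
            continue
        lines.append(line)
    return " ".join(lines)
```

Read the .vtt file as UTF-8 text and pass its contents to vtt_to_text. Files that use numeric cue identifiers would need one more filter to drop those lines.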

What to do with the transcript

Once you have raw text, the next step depends on intent: feed it to a summarization model, drop it into a notes app, search for quotes, or extract structured points like decisions, action items, or named entities. ExtractFox doesn't process video, but if your goal is structured extraction from a transcript-style document, paste the text into the free-text mode with a description of what you want and it'll come back as a table.
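
For the quote-hunting case, timestamps make each hit citable. Here is a small sketch that searches timestamped segments for a keyword and builds a link that opens the video at that moment. It assumes segments shaped like Whisper's output (dicts with "start" in seconds and "text"); find_quotes and the video_id argument are hypothetical names for this example.

```python
def find_quotes(segments, keyword, video_id):
    """Return (link, text) pairs for segments whose text contains the keyword."""
    hits = []
    needle = keyword.lower()
    for seg in segments:
        if needle in seg["text"].lower():
            # The t parameter makes YouTube start playback at that second.
            link = f"https://www.youtube.com/watch?v={video_id}&t={int(seg['start'])}s"
            hits.append((link, seg["text"].strip()))
    return hits
```

The match is a plain case-insensitive substring test, so it will miss quotes that span two segments.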
