How to extract text from a YouTube video
Three reliable ways to turn a YouTube video into searchable, citable text — using the built-in transcript, yt-dlp + Whisper, or browser tools — and when each one is the right call.
Whether you're researching, repurposing content for a blog, building citations into a paper, or just want to ctrl-F a two-hour talk, getting text out of a YouTube video is mostly a solved problem. Mostly.
1. The built-in transcript (fastest, free)
Most YouTube videos with speech have an auto-generated transcript. To open it, expand the video description and click "Show transcript" (in some layouts it sits in the three-dot menu under the video). A panel opens on the right with timestamped lines. Use the three-dot menu in that panel to toggle timestamps off, then select all and copy.
Quality depends on the original audio. Clean studio recordings are near-perfect. Heavy accents, background music, and overlapping speakers degrade it fast. The auto-transcript also won't include speaker labels.
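If you copied the panel with timestamps still on, a few lines of Python can clean it up. This is a minimal sketch that assumes the pasted text puts each timestamp on its own line (like 0:04 or 1:02:03), with the caption text on the lines in between. The copied layout can vary by browser, so treat the pattern as a starting point; strip_timestamps is a name made up for this example.

```python
import re

# Assumption: a timestamp sits alone on its line, e.g. "0:05", "12:34", "1:02:03".
TIMESTAMP_LINE = re.compile(r"^\d{1,2}:\d{2}(?::\d{2})?$")


def strip_timestamps(raw: str) -> str:
    """Drop timestamp-only and blank lines, then join the caption lines."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or TIMESTAMP_LINE.match(line):
            continue
        kept.append(line)
    return " ".join(kept)


sample = """0:00
welcome back to the channel
0:04
today we're looking at transcripts
1:02:03
thanks for watching"""

print(strip_timestamps(sample))
```

Lines that merely contain a time in the middle of a sentence are left alone, because the pattern only matches when the timestamp is the whole line.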
2. yt-dlp + Whisper (best quality, free)
If the auto-transcript is garbage or missing, run the audio through OpenAI's Whisper model: yt-dlp pulls the audio, and Whisper transcribes it.
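- Install yt-dlp (brew install yt-dlp on macOS, or pip install yt-dlp). Both yt-dlp's audio extraction and Whisper need ffmpeg on your PATH (brew install ffmpeg on macOS).
- Pull audio: yt-dlp -x --audio-format mp3 -o "audio.%(ext)s" "https://youtube.com/watch?v=..." (the -o template names the file audio.mp3; without it, yt-dlp names the file after the video title).
- Install Whisper: pip install openai-whisper
- Transcribe: whisper audio.mp3 --model medium --output_format txt (this writes audio.txt to the current directory).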
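The medium model is the sweet spot for most accents and languages. Use large-v3 if you have a GPU and need translation or hard accents (add --task translate to get English output from non-English audio). On CPU, medium runs at roughly real time at best, so an hour of audio takes about an hour and often longer on older hardware; a GPU is much faster.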
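The two commands can also be chained from a script. The sketch below builds the same yt-dlp and whisper command lines and runs them with subprocess; it assumes both CLIs and ffmpeg are already installed. The function names build_commands and transcribe_video, the workdir parameter, and the --output_dir flag are choices made for this example. The -o output template pins the audio file name to audio.mp3 so the second command can find it.

```python
import subprocess
from pathlib import Path


def build_commands(url: str, workdir: str = ".", model: str = "medium"):
    """Return the yt-dlp and whisper command lines as argument lists."""
    out = Path(workdir)
    download = [
        "yt-dlp", "-x", "--audio-format", "mp3",
        # Fixed output name, so the result is always <workdir>/audio.mp3.
        "-o", str(out / "audio.%(ext)s"),
        url,
    ]
    transcribe = [
        "whisper", str(out / "audio.mp3"),
        "--model", model,
        "--output_format", "txt",
        "--output_dir", str(out),
    ]
    return download, transcribe


def transcribe_video(url: str, workdir: str = ".", model: str = "medium") -> Path:
    """Download the audio, transcribe it, and return the transcript path."""
    download, transcribe = build_commands(url, workdir, model)
    subprocess.run(download, check=True)    # raises CalledProcessError on failure
    subprocess.run(transcribe, check=True)
    # Whisper names its output after the input file: audio.mp3 -> audio.txt
    return Path(workdir) / "audio.txt"


# Example call (needs network access and both tools installed):
# transcript_path = transcribe_video("https://youtube.com/watch?v=...")
```

Passing the arguments as a list rather than one shell string avoids quoting problems with URLs that contain & or ? characters.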
3. Browser tools and extensions
Several free sites and browser extensions accept a YouTube URL and return a transcript. They mostly wrap the same auto-caption API the YouTube panel uses, so quality matches method 1, just with one click. Useful if you don't want to install anything.
What to do with the transcript
Once you have raw text, the next step depends on intent: feed it to a summarization model, drop it into a notes app, search for quotes, or extract structured points like decisions, action items, or named entities. ExtractFox doesn't process video, but if your goal is structured extraction from a transcript-style document, paste the text into the free-text mode with a description of what you want and it'll come back as a table.
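For the "search for quotes" case, a few lines of Python are often enough. The sketch below assumes the transcript is plain text with one caption or segment per line, and returns each matching line with a little surrounding context; find_quotes and its context parameter are hypothetical names for this example.

```python
def find_quotes(transcript: str, keyword: str, context: int = 1) -> list[str]:
    """Return each line containing keyword, joined with its neighbouring lines."""
    lines = [line.strip() for line in transcript.splitlines() if line.strip()]
    needle = keyword.lower()
    hits = []
    for i, line in enumerate(lines):
        if needle in line.lower():  # case-insensitive substring match
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            hits.append(" ".join(lines[start:end]))
    return hits


transcript = """so let's get started
the budget for next quarter is fixed
we will revisit it in March
any questions"""

for quote in find_quotes(transcript, "budget"):
    print(quote)
```

Setting context=0 returns only the matching lines, which is handy when you need an exact sentence to cite.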