How to extract a clean recipe from a website, video, or image
Pull ingredients and steps from recipe blogs, TikTok, YouTube, and Instagram using schema.org scraping, yt-dlp subtitle downloads, and AI extraction, without wading through life stories.
Recipe websites hide the actual recipe behind personal essays and SEO copy. Cooking videos on TikTok and YouTube bury it inside a 10-minute monologue. The recipe is usually 15% of the content. Here are the practical methods to extract just the ingredients and steps.
Method 1: schema.org Recipe markup (most recipe blogs)
Most recipe blogs add structured data to their pages for Google's recipe cards — schema.org/Recipe markup embedded as JSON-LD in the page's HTML. This markup contains the ingredients and instructions in clean, parseable format. You don't need AI or OCR if you can find this.
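- pip install requests beautifulsoup4
- import json
- import requests
- from bs4 import BeautifulSoup
- url = 'https://example.com/some-recipe'  # placeholder: replace with the recipe page URL
- soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10).text, 'html.parser')
- for script in soup.find_all('script', type='application/ld+json'):
-     data = json.loads(script.string or 'null')  # an empty script tag parses to None
-     if isinstance(data, dict) and '@graph' in data: data = data['@graph']  # some sites nest items under @graph
-     if isinstance(data, list):
-         data = next((d for d in data if isinstance(d, dict) and d.get('@type') == 'Recipe'), None)
-     if isinstance(data, dict) and data.get('@type') == 'Recipe':
-         print(data.get('recipeIngredient', []))
-         print(data.get('recipeInstructions', []))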
This is fast and exact when it works. Limitations: not every recipe site uses schema.org. Older blogs, personal sites, and non-English recipe sites often don't. Some sites block scraping — add delays and respect robots.txt. Instagram and TikTok require authentication and rate-limit aggressively.
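Real-world JSON-LD is often messier than a single top-level Recipe object: the `@type` value can be a list, the recipe can sit inside an `@graph` array, and `recipeInstructions` can be a plain string, a list of strings, or nested `HowToStep` and `HowToSection` objects. The sketch below is one way to handle those shapes once the JSON has been parsed. The function names `find_recipe_node` and `flatten_instructions` are illustrative, not part of any library.

```python
import json


def find_recipe_node(data):
    """Return the first dict whose @type includes 'Recipe', searching lists and @graph."""
    if isinstance(data, list):
        for item in data:
            found = find_recipe_node(item)
            if found is not None:
                return found
        return None
    if not isinstance(data, dict):
        return None
    node_type = data.get('@type')
    types = node_type if isinstance(node_type, list) else [node_type]
    if 'Recipe' in types:
        return data
    # Some sites wrap every item on the page in a top-level @graph array.
    return find_recipe_node(data.get('@graph', []))


def flatten_instructions(instructions):
    """Flatten recipeInstructions into a list of step strings."""
    if isinstance(instructions, str):
        return [instructions.strip()] if instructions.strip() else []
    if isinstance(instructions, dict):
        instructions = [instructions]  # a single HowToStep or HowToSection
    steps = []
    for entry in instructions or []:
        if isinstance(entry, str):
            if entry.strip():
                steps.append(entry.strip())
        elif isinstance(entry, dict):
            if 'itemListElement' in entry:  # HowToSection with nested steps
                steps.extend(flatten_instructions(entry['itemListElement']))
            elif entry.get('text'):  # HowToStep
                steps.append(entry['text'].strip())
    return steps


# Made-up sample data, shaped like what some blog plugins emit.
raw = json.dumps({
    '@context': 'https://schema.org',
    '@graph': [
        {'@type': 'WebPage'},
        {'@type': ['Recipe'],
         'recipeIngredient': ['2 cups flour'],
         'recipeInstructions': [{'@type': 'HowToStep', 'text': 'Mix.'}]},
    ],
})
recipe = find_recipe_node(json.loads(raw))
print(recipe['recipeIngredient'])
print(flatten_instructions(recipe['recipeInstructions']))
```

Each parsed `application/ld+json` block from the page can be passed to `find_recipe_node` until one returns a result.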
Method 2: YouTube recipe extraction via transcript
YouTube cooking videos often state every ingredient and step aloud. yt-dlp (an actively maintained fork of youtube-dl) can pull the uploaded or auto-generated subtitles without downloading the video:
- pip install yt-dlp
- yt-dlp --write-subs --write-auto-subs --skip-download --sub-langs en -o 'recipe.%(ext)s' 'https://youtube.com/watch?v=...'
- # Downloads recipe.en.vtt — a subtitle file with timestamped transcript
- # Parse the VTT and pass the text to an LLM to extract ingredients and steps
The transcript gives you all the spoken words, but the recipe is mixed into the presentation. You need a second step to extract the recipe from the transcript text — either regex (for structured 'now add X of Y' patterns) or an LLM call with the transcript as input.
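Before the transcript goes to a regex or an LLM, the VTT file needs to be reduced to plain text. Auto-generated YouTube captions typically repeat lines across overlapping cues and embed inline timing tags, so a minimal sketch might look like the following. The function name `vtt_to_text` is illustrative, and the code assumes the file has no cue identifier lines, which is an assumption about the input rather than a guarantee of the format.

```python
import html
import re

INLINE_TAG = re.compile(r'<[^>]+>')  # e.g. <00:00:01.000> or <c>...</c>


def vtt_to_text(vtt: str) -> str:
    """Collapse a WebVTT subtitle file into one plain-text transcript string."""
    kept = []
    seen_cue = False
    for raw in vtt.splitlines():
        if '-->' in raw:  # cue timing line such as 00:00:00.000 --> 00:00:02.000
            seen_cue = True
            continue
        if not seen_cue:
            continue  # header block (WEBVTT, Kind:, Language:) before the first cue
        line = html.unescape(INLINE_TAG.sub('', raw)).strip()
        # Rolling captions repeat the previous line in the next cue; keep one copy.
        if line and (not kept or kept[-1] != line):
            kept.append(line)
    return ' '.join(kept)


# Made-up sample in the shape of an auto-generated caption file.
sample = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.000
add two cups<00:00:01.000><c> of flour</c>

00:00:02.000 --> 00:00:04.000
add two cups of flour
and a pinch of salt
"""
print(vtt_to_text(sample))  # add two cups of flour and a pinch of salt
```

The result is a single string that can be passed to the extraction step; timestamps are discarded because the recipe fields do not need them.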
Method 3: Screenshot or photo of a recipe card
For a recipe from a physical cookbook, a screenshot of a paywalled site, or a social media story that doesn't have a link — a photo of the recipe page is your source. An image extractor can read the ingredients list and numbered steps directly from the photo.
Method 4: AI extraction from any source
When the source is a TikTok or Instagram post, a food blog with no schema.org markup, a YouTube video, or a cookbook photo, paste the URL or upload the image into an AI extraction tool, which pulls the recipe from whatever content it can access.
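For text sources such as a transcript or scraped page text, the AI step can be sketched as a prompt plus a validation pass on the reply. The code below is a sketch under stated assumptions: `call_llm` is a hypothetical callable (prompt string in, reply string out) standing in for whichever model client you use, and the choice of which fields count as required is an assumption, not a rule from any extractor.

```python
import json

# Assumption: a usable recipe needs at least these three fields.
REQUIRED_FIELDS = ('title', 'ingredients', 'instructions')


def build_prompt(source_text: str) -> str:
    """Ask for JSON only, using the field names this guide describes."""
    return (
        'Extract the recipe from the text below. Reply with JSON only, using the keys '
        'title, servings, prep_time_minutes, cook_time_minutes, ingredients '
        '(list of objects with quantity, unit, item, notes), instructions '
        '(list of strings), and notes. Use null for anything not stated.\n\n'
        + source_text
    )


def parse_reply(reply: str) -> dict:
    """Pull the JSON object out of the reply and check the required fields."""
    start, end = reply.find('{'), reply.rfind('}')
    if start == -1 or end <= start:
        raise ValueError('no JSON object found in reply')
    recipe = json.loads(reply[start:end + 1])
    missing = [field for field in REQUIRED_FIELDS if not recipe.get(field)]
    if missing:
        raise ValueError(f'reply is missing fields: {missing}')
    return recipe


def extract_recipe(source_text: str, call_llm) -> dict:
    """call_llm is hypothetical: any function that takes a prompt and returns text."""
    return parse_reply(call_llm(build_prompt(source_text)))
```

Slicing from the first `{` to the last `}` tolerates models that wrap the JSON in extra prose; rejecting replies with missing fields keeps bad extractions from reaching your data.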
What the output looks like
A well-structured recipe extraction returns:
- title — the dish name
- servings — number of portions
- prep_time_minutes and cook_time_minutes
- ingredients — each as { quantity, unit, item, notes } so '2 cups all-purpose flour, sifted' becomes quantity: 2, unit: 'cups', item: 'all-purpose flour', notes: 'sifted'
- instructions — numbered steps as an array of strings
- notes — author tips and substitutions (optional, only if present)
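When the source gives ingredients as plain strings (as schema.org's recipeIngredient does), a small parser can split them into the quantity, unit, item, and notes fields listed above. This is a rough sketch: the `UNITS` set is a short, hand-picked list and `parse_ingredient` is an illustrative name, so expect to extend both for real data.

```python
import re
from fractions import Fraction

# Assumption: a small sample of units; real recipes use many more.
UNITS = {'cup', 'cups', 'tbsp', 'tablespoon', 'tablespoons', 'tsp', 'teaspoon',
         'teaspoons', 'g', 'kg', 'ml', 'l', 'oz', 'lb', 'lbs', 'pinch',
         'clove', 'cloves'}

# Matches "1 1/2", "1/2", "2", or "0.5" at the start of the line.
QUANTITY = re.compile(r'^(\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*')


def parse_ingredient(line: str) -> dict:
    """Split one ingredient line into quantity, unit, item, and notes."""
    line = line.strip()
    quantity = None
    match = QUANTITY.match(line)
    if match:
        # "1 1/2" becomes Fraction(1) + Fraction(1, 2) = 1.5
        quantity = float(sum(Fraction(part) for part in match.group(1).split()))
        line = line[match.end():]
    main, _, notes = line.partition(',')  # text after the first comma is a note
    words = main.split()
    unit = None
    if words and words[0].lower().rstrip('.') in UNITS:
        unit = words.pop(0).lower().rstrip('.')
    return {
        'quantity': quantity,
        'unit': unit,
        'item': ' '.join(words),
        'notes': notes.strip() or None,
    }


print(parse_ingredient('2 cups all-purpose flour, sifted'))
print(parse_ingredient('1 1/2 tsp salt'))
print(parse_ingredient('salt to taste'))
```

Lines with no leading number, such as 'salt to taste', come back with a quantity of None instead of a guess.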
Recipe scaling
Once you have the recipe as structured JSON with quantities as numbers, scaling is arithmetic. Multiply every quantity by the scale factor to get a 2x or 0.5x version. Narrative recipes where quantities are embedded in step text ('add a cup of flour and knead until...') can't be scaled programmatically — another reason structured JSON output is worth the extra extraction step.
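As a minimal sketch of that arithmetic, assuming the recipe is a dict in the shape described earlier (an `ingredients` list whose entries carry a numeric `quantity`, plus an optional `servings` number), scaling can look like this. The name `scale_recipe` and the rounding to two decimals are illustrative choices.

```python
import copy


def scale_recipe(recipe: dict, factor: float) -> dict:
    """Return a scaled copy of the recipe; the original is left untouched."""
    scaled = copy.deepcopy(recipe)
    for ingredient in scaled.get('ingredients', []):
        quantity = ingredient.get('quantity')
        if isinstance(quantity, (int, float)):  # skip 'to taste' items with no quantity
            ingredient['quantity'] = round(quantity * factor, 2)
    if isinstance(scaled.get('servings'), (int, float)):
        scaled['servings'] = scaled['servings'] * factor
    return scaled


# Made-up example recipe.
pancakes = {
    'title': 'Pancakes',
    'servings': 4,
    'ingredients': [
        {'quantity': 2, 'unit': 'cups', 'item': 'flour', 'notes': None},
        {'quantity': 0.5, 'unit': 'tsp', 'item': 'salt', 'notes': None},
        {'quantity': None, 'unit': None, 'item': 'butter for the pan', 'notes': None},
    ],
}
doubled = scale_recipe(pancakes, 2)
print([i['quantity'] for i in doubled['ingredients']])
```

Cooking times and pan sizes do not scale linearly, so the sketch leaves them alone.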
Building a recipe database
For a recipe app or meal planning tool, batch extraction lets you populate a structured database from a list of URLs. Run the extractor on each URL, collect the JSON, normalize units (tbsp vs tablespoon), and merge duplicate ingredients into a shopping list. The schema.org scraping approach is fastest for high volume; AI extraction covers the sites that don't use schema.org.
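The normalize-and-merge step can be sketched as follows, assuming each extracted recipe is a dict with an `ingredients` list of quantity, unit, and item entries. The alias table is a small hand-written sample and the function names are illustrative; the sketch only merges entries that share a unit after normalization and does not convert between units such as tbsp and cups.

```python
from collections import defaultdict

# Assumption: a partial alias table; extend it for the units your sources use.
UNIT_ALIASES = {
    'tablespoon': 'tbsp', 'tablespoons': 'tbsp', 'tbs': 'tbsp',
    'teaspoon': 'tsp', 'teaspoons': 'tsp',
    'cups': 'cup',
    'gram': 'g', 'grams': 'g',
    'ounce': 'oz', 'ounces': 'oz',
}


def normalize_unit(unit):
    """Map spelling variants of a unit to one canonical short form."""
    if unit is None:
        return None
    key = unit.strip().lower().rstrip('.')
    return UNIT_ALIASES.get(key, key)


def shopping_list(recipes):
    """Sum quantities across recipes, keyed by (item, normalized unit)."""
    totals = defaultdict(float)
    for recipe in recipes:
        for ingredient in recipe.get('ingredients', []):
            key = (ingredient['item'].strip().lower(),
                   normalize_unit(ingredient.get('unit')))
            totals[key] += ingredient.get('quantity') or 0  # None counts as 0
    return dict(totals)


# Made-up example data.
batch = [
    {'ingredients': [{'quantity': 2, 'unit': 'Tablespoons', 'item': 'Olive oil'}]},
    {'ingredients': [{'quantity': 1, 'unit': 'tbsp', 'item': 'olive oil'},
                     {'quantity': 3, 'unit': 'cups', 'item': 'flour'}]},
]
print(shopping_list(batch))
```

Matching on the lowercased item name is deliberately naive: 'olive oil' and 'extra-virgin olive oil' stay separate unless you add your own item aliases.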
Frequently asked questions
How do I extract a recipe from a recipe website?
Most recipe blogs embed schema.org Recipe markup as JSON-LD. Fetch the page, find the script tag with type='application/ld+json', parse it, and look for @type: Recipe — the ingredients and instructions are in recipeIngredient and recipeInstructions. For sites without schema.org, use AI extraction.
How do I get the recipe from a TikTok or Instagram video?
TikTok and Instagram don't expose structured recipe data and block most scraping. The practical approach is AI extraction — paste the URL and the extractor accesses whatever content is publicly visible and returns the recipe fields.
How do I extract a recipe from a YouTube video?
Download the subtitles with yt-dlp (--write-auto-subs --skip-download), then pass the transcript to an LLM to extract the recipe. Many YouTube cooking creators also post the recipe in the video description — check there first.
How do I convert a recipe to JSON for an app?
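Use AI extraction and request JSON output. The standard fields are title, servings, prep_time_minutes, cook_time_minutes, ingredients (array of {quantity, unit, item, notes}), and instructions (array of strings). These correspond to the schema.org/Recipe properties name, recipeYield, prepTime, cookTime, recipeIngredient, and recipeInstructions if you need standard markup, though schema.org expects times as ISO 8601 durations and ingredients as plain strings, so a small conversion step is needed.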