How to extract keywords from text, a website, or a job description
RAKE, YAKE, KeyBERT, and TextRank: the four open-source keyword-extraction algorithms that still matter, when to use each, and where an LLM with a typed schema now beats all four.
Keyword extraction is one of the oldest tasks in NLP and still one of the most useful. SEO audits, job-description matching, content tagging, search query expansion — all of them want the same thing: the words that actually carry meaning, not the and/of/the noise.
RAKE: Rapid Automatic Keyword Extraction
Statistical, language-agnostic, fast. Splits text on stopwords and punctuation, scores remaining phrases by word frequency and word degree.
```python
from rake_nltk import Rake

r = Rake()  # defaults to NLTK's English stopword list
r.extract_keywords_from_text(text)
r.get_ranked_phrases()[:10]  # top 10 phrases, best first
```
Strength: zero training, deterministic, returns multi-word phrases. Weakness: a phrase's score is the sum of its member words' scores, so long phrases accumulate weight; "the quick brown fox jumps over" can outscore genuinely relevant single words.
YAKE: Yet Another Keyword Extractor
Statistical like RAKE but with five additional features (word casing, position, frequency, context, sentence frequency). More balanced than RAKE on news and prose.
```python
import yake

kw = yake.KeywordExtractor(lan="en", n=3, top=10)
kw.extract_keywords(text)  # list of (phrase, score) tuples
```
n=3 caps phrase length at three words. Lower YAKE scores are better: the score behaves like a penalty, not a similarity, so the best candidates sit closest to zero. Of the unsupervised statistical methods, YAKE is the strongest on most domains.
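Continuing from the snippet above, a quick way to see the score direction in practice. This assumes the current (phrase, score) tuple order and sorts explicitly, since ordering has varied across yake releases:

```python
# Lower score = more relevant, so an ascending sort puts the best phrases first.
for phrase, score in sorted(kw.extract_keywords(text), key=lambda t: t[1]):
    print(f"{score:.4f}  {phrase}")
```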
KeyBERT: BERT-based
Embeds the whole document and every candidate phrase, then returns the phrases whose embeddings are closest (by cosine similarity) to the document's. The right tool when you want semantic relevance, not just frequency.
```python
from keybert import KeyBERT

kw = KeyBERT()  # loads a sentence-transformers model on first use
kw.extract_keywords(text, keyphrase_ngram_range=(1, 3), top_n=10)
```
Slower than RAKE/YAKE because it runs a transformer. Quality is noticeably better on technical and domain-specific text, where statistical methods fall short because the important terms aren't the most frequent ones.
TextRank: graph-based
Builds a co-occurrence graph and runs PageRank over it. Built into spaCy via the pytextrank package:
```python
import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")  # pipe name registered by importing pytextrank
doc = nlp(text)
[p.text for p in doc._.phrases[:10]]
```
Conceptually elegant; in practice it lands at roughly the same quality as YAKE on most documents. Worth keeping if you're already using spaCy for the rest of your pipeline.
LLM with a typed schema
For SEO and content work where keyword intent matters more than frequency, an LLM beats every statistical method. Schema:
```ts
{
  primary_keywords: string[],   // 3-5 high-intent terms
  long_tail: string[],          // 10-20 longer phrases
  entities: { name, type }[],   // people, orgs, products
  topical_themes: string[]      // broader categories
}
```
Cost-effective once you batch — 1k documents through Gemini Flash or Claude Haiku is cheap. Use a statistical method first to filter the candidate set if cost is critical.
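A minimal sketch of that two-stage pipeline, assuming a hypothetical `call_llm` helper that wraps whichever provider client you use and returns a JSON string matching the schema above:

```python
import json

import yake

def extract_typed_keywords(text: str, call_llm) -> dict:
    # Stage 1: a cheap statistical pass narrows the candidate set,
    # which keeps the LLM prompt (and the bill) small.
    extractor = yake.KeywordExtractor(lan="en", n=3, top=30)
    candidates = [phrase for phrase, _ in extractor.extract_keywords(text)]

    # Stage 2: the LLM assigns intent and fills the typed schema.
    prompt = (
        "Return JSON with keys primary_keywords, long_tail, entities, "
        "topical_themes. Prefer terms from the candidate list.\n\n"
        f"Candidates: {candidates}\n\nDocument:\n{text}"
    )
    return json.loads(call_llm(prompt))  # call_llm is a stand-in, not a real API
```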
From a website
Fetch with requests, strip nav/footer with Trafilatura or readability-lxml, then run any of the methods above on the cleaned content. Skipping the cleanup step is the most common reason website keyword extraction returns junk like "cookie" and "newsletter."
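Putting that together, a sketch with requests and Trafilatura (the URL is a placeholder; any extractor from above can replace YAKE at the end):

```python
import requests
import trafilatura
import yake

html = requests.get("https://example.com/pricing", timeout=10).text
text = trafilatura.extract(html)  # strips nav, footers, cookie banners
if text:
    keywords = yake.KeywordExtractor(lan="en", n=3, top=10).extract_keywords(text)
```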
From a job description
Job descriptions have a known structure (responsibilities, requirements, nice-to-haves). The cleanest approach is a typed extraction: skills (required vs. preferred), seniority signals, tech stack, soft skills, perks. RAKE on raw text returns junk like "strong communication" — useful only as a baseline.
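One way to shape that typed extraction, sketched as a dataclass for the LLM to fill. The field names are illustrative, not a fixed standard:

```python
from dataclasses import dataclass, field

@dataclass
class JobDescriptionKeywords:
    required_skills: list[str] = field(default_factory=list)    # must-haves
    preferred_skills: list[str] = field(default_factory=list)   # nice-to-haves
    tech_stack: list[str] = field(default_factory=list)         # named tools, languages, frameworks
    seniority_signals: list[str] = field(default_factory=list)  # "lead", "mentor", "own the roadmap"
    soft_skills: list[str] = field(default_factory=list)
    perks: list[str] = field(default_factory=list)
```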
From a YouTube video
Pull the transcript first (yt-dlp + Whisper or the auto-transcript), then run the keyword extractor on the text. The transcript is the document; everything else is the same as for any other text source.
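For the auto-transcript route, a minimal sketch assuming the youtube-transcript-api package; its interface has shifted between releases, so treat the exact call as illustrative:

```python
from youtube_transcript_api import YouTubeTranscriptApi

segments = YouTubeTranscriptApi.get_transcript("VIDEO_ID")  # placeholder video ID
transcript = " ".join(segment["text"] for segment in segments)
# transcript is now an ordinary document; run RAKE, YAKE, or KeyBERT on it as above.
```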
Choosing
- Quick, no setup, generic text → YAKE.
- Multi-word phrases, language-agnostic → RAKE.
- Domain-specific or technical text → KeyBERT.
- Already using spaCy → TextRank.
- Need typed output (skills, entities, themes) → LLM with a schema.