How to extract keywords from text, a website, or a job description
RAKE, YAKE, KeyBERT, and TextRank: the four open-source keyword-extraction algorithms that still matter, when to use each, and where an LLM with a typed schema now beats all four.
Keyword extraction is one of the oldest tasks in NLP and still one of the most useful. SEO audits, job-description matching, content tagging, search query expansion — all of them want the same thing: the words that actually carry meaning, not the and/of/the noise.
RAKE: Rapid Automatic Keyword Extraction
Statistical, language-agnostic, fast. Splits text on stopwords and punctuation, scores remaining phrases by word frequency and word degree.
```python
from rake_nltk import Rake

r = Rake()  # defaults to NLTK's English stopword list
r.extract_keywords_from_text(text)
r.get_ranked_phrases()[:10]  # top 10 phrases, best first
```
Strength: zero training, deterministic, returns multi-word phrases. Weakness: a phrase's score is the sum of its member words' scores, so long phrases accumulate weight; "the quick brown fox jumps over" can outscore genuinely relevant single words.
YAKE: Yet Another Keyword Extractor
Statistical like RAKE but with five additional features (word casing, position, frequency, context, sentence frequency). More balanced than RAKE on news and prose.
```python
import yake

kw = yake.KeywordExtractor(lan="en", n=3, top=10)
kw.extract_keywords(text)  # list of (phrase, score) tuples
```
n=3 caps phrase length at three words. Lower YAKE scores are better: the score behaves like a penalty, not a similarity, so the best candidates sit closest to zero. Of the unsupervised statistical methods, YAKE is the strongest on most domains.
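Continuing from the snippet above, a quick way to see the score direction in practice. This assumes the current (phrase, score) tuple order and sorts explicitly, since ordering has varied across yake releases:

```python
# Lower score = more relevant, so an ascending sort puts the best phrases first.
for phrase, score in sorted(kw.extract_keywords(text), key=lambda t: t[1]):
    print(f"{score:.4f}  {phrase}")
```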
KeyBERT: BERT-based
Embeds the whole document and every candidate phrase, then returns the phrases whose embeddings are closest (by cosine similarity) to the document's. The right tool when you want semantic relevance, not just frequency.
```python
from keybert import KeyBERT

kw = KeyBERT()  # loads a sentence-transformers model on first use
kw.extract_keywords(text, keyphrase_ngram_range=(1, 3), top_n=10)
```
Slower than RAKE/YAKE because it runs a transformer. Quality is noticeably better on technical and domain-specific text, where statistical methods fall short because the important terms aren't the most frequent ones.
TextRank: graph-based
Builds a co-occurrence graph and runs PageRank over it. Built into spaCy via the pytextrank package:
```python
import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")  # pipe name registered by importing pytextrank
doc = nlp(text)
[p.text for p in doc._.phrases[:10]]
```
Conceptually elegant; in practice it lands at roughly the same quality as YAKE on most documents. Worth keeping if you're already using spaCy for the rest of your pipeline.
LLM with a typed schema
For SEO and content work where keyword intent matters more than frequency, an LLM beats every statistical method. Schema:
```ts
{
  primary_keywords: string[],   // 3-5 high-intent terms
  long_tail: string[],          // 10-20 longer phrases
  entities: { name, type }[],   // people, orgs, products
  topical_themes: string[]      // broader categories
}
```
Cost-effective once you batch — 1k documents through Gemini Flash or Claude Haiku is cheap. Use a statistical method first to filter the candidate set if cost is critical.
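A minimal sketch of that two-stage pipeline, assuming a hypothetical `call_llm` helper that wraps whichever provider client you use and returns a JSON string matching the schema above:

```python
import json

import yake

def extract_typed_keywords(text: str, call_llm) -> dict:
    # Stage 1: a cheap statistical pass narrows the candidate set,
    # which keeps the LLM prompt (and the bill) small.
    extractor = yake.KeywordExtractor(lan="en", n=3, top=30)
    candidates = [phrase for phrase, _ in extractor.extract_keywords(text)]

    # Stage 2: the LLM assigns intent and fills the typed schema.
    prompt = (
        "Return JSON with keys primary_keywords, long_tail, entities, "
        "topical_themes. Prefer terms from the candidate list.\n\n"
        f"Candidates: {candidates}\n\nDocument:\n{text}"
    )
    return json.loads(call_llm(prompt))  # call_llm is a stand-in, not a real API
```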
From a website
Fetch with requests, strip nav/footer with Trafilatura or readability-lxml, then run any of the methods above on the cleaned content. Skipping the cleanup step is the most common reason website keyword extraction returns junk like "cookie" and "newsletter."
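Putting that together, a sketch with requests and Trafilatura (the URL is a placeholder; any extractor from above can replace YAKE at the end):

```python
import requests
import trafilatura
import yake

html = requests.get("https://example.com/pricing", timeout=10).text
text = trafilatura.extract(html)  # strips nav, footers, cookie banners
if text:
    keywords = yake.KeywordExtractor(lan="en", n=3, top=10).extract_keywords(text)
```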
From a job description
Job descriptions have a known structure (responsibilities, requirements, nice-to-haves). The cleanest approach is a typed extraction: skills (required vs. preferred), seniority signals, tech stack, soft skills, perks. RAKE on raw text returns junk like "strong communication" — useful only as a baseline.
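One way to shape that typed extraction, sketched as a dataclass for the LLM to fill. The field names are illustrative, not a fixed standard:

```python
from dataclasses import dataclass, field

@dataclass
class JobDescriptionKeywords:
    required_skills: list[str] = field(default_factory=list)    # must-haves
    preferred_skills: list[str] = field(default_factory=list)   # nice-to-haves
    tech_stack: list[str] = field(default_factory=list)         # named tools, languages, frameworks
    seniority_signals: list[str] = field(default_factory=list)  # "lead", "mentor", "own the roadmap"
    soft_skills: list[str] = field(default_factory=list)
    perks: list[str] = field(default_factory=list)
```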
From a YouTube video
Pull the transcript first (yt-dlp + Whisper or the auto-transcript), then run the keyword extractor on the text. The transcript is the document; everything else is the same as for any other text source.
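For the auto-transcript route, a minimal sketch assuming the youtube-transcript-api package; its interface has shifted between releases, so treat the exact call as illustrative:

```python
from youtube_transcript_api import YouTubeTranscriptApi

segments = YouTubeTranscriptApi.get_transcript("VIDEO_ID")  # placeholder video ID
transcript = " ".join(segment["text"] for segment in segments)
# transcript is now an ordinary document; run RAKE, YAKE, or KeyBERT on it as above.
```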
Choosing
- Quick, no setup, generic text → YAKE.
- Multi-word phrases, language-agnostic → RAKE.
- Domain-specific or technical text → KeyBERT.
- Already using spaCy → TextRank.
- Need typed output (skills, entities, themes) → LLM with a schema.