How to extract topics from text or interview transcripts
BERTopic, LDA, and the LLM-with-clustering pattern that's quietly taken over qualitative research. What each one is best at, and the few lines of code to start.
Topic modeling looks at a corpus of text and clusters it into themes. It used to be the domain of academics and statisticians; now it's widely used for customer feedback analysis, qualitative research coding, document categorization, and content audits.
LDA: the classic
Latent Dirichlet Allocation. Treats documents as mixtures of topics and topics as mixtures of words. Python: gensim's LdaModel:
```python
from gensim import corpora, models

# Build a token→id dictionary and a bag-of-words corpus, then fit LDA.
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(d) for d in tokenized_docs]
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary,
                      random_state=42)  # fixed seed → repeatable topics
lda.print_topics()
```
Strength: well-understood, deterministic with a fixed seed, fast on large corpora. Weakness: requires aggressive preprocessing (stopword removal, stemming/lemmatization, bigram detection), and the topics are bags of words you have to interpret yourself.
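A minimal sketch of that preprocessing step, in pure Python. In practice you'd use spaCy or NLTK for lemmatization and gensim's `Phrases` for bigram detection; the stopword list here is a tiny illustrative subset, not a real one.

```python
import re

# Illustrative subset only; use a full stopword list in practice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def preprocess(doc: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

docs = [
    "The battery life of the phone is excellent.",
    "Battery drains fast and the screen flickers.",
]
tokenized_docs = [preprocess(d) for d in docs]
```

The output is exactly the `tokenized_docs` the LDA snippet above expects.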
BERTopic: the modern default
Embeds each document with a sentence transformer, reduces dimensionality with UMAP, clusters with HDBSCAN, then represents each cluster with c-TF-IDF for keywords:
```python
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()  # one row per topic: size plus top keywords
```
Strength: handles small documents well, no hyperparameter tuning required for a baseline, comes with a great visualization toolkit. Weakness: stochastic (the randomness lives in the UMAP step, so pass a seeded UMAP model for reproducibility), and it needs a few hundred documents minimum to find stable topics.
LLM + clustering: the quiet winner for qualitative research
Two-step pipeline that has quietly taken over interview-transcript coding:
- For each transcript, ask an LLM to extract themes with a typed schema: { theme, supporting_quote, sentiment }.
- Embed the themes (not the transcripts) and cluster them. The clusters are your topics; each topic is grounded in real quotes.
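The two steps above can be sketched end to end. The LLM call and the embedding model are stubbed out here: in a real pipeline, the themes would come from an LLM returning the typed schema, and `embed` would be a sentence-transformer; both are placeholders, and the greedy similarity clustering stands in for HDBSCAN or k-means.

```python
import math
from dataclasses import dataclass

@dataclass
class Theme:
    # Mirrors the typed schema the LLM is asked to return.
    theme: str
    supporting_quote: str
    sentiment: str

def embed(text: str) -> dict[str, float]:
    # Toy bag-of-words vector; a stand-in for a sentence-transformer.
    vec: dict[str, float] = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_themes(themes: list[Theme], threshold: float = 0.5) -> list[list[Theme]]:
    # Greedy clustering: join the first cluster whose seed theme is
    # similar enough, else start a new cluster.
    clusters: list[list[Theme]] = []
    for t in themes:
        for c in clusters:
            if cosine(embed(t.theme), embed(c[0].theme)) >= threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters

# Themes as an LLM might return them across two transcripts.
themes = [
    Theme("pricing too high", "It's just too expensive for us", "negative"),
    Theme("pricing too high for small teams", "We can't afford the seats", "negative"),
    Theme("onboarding was smooth", "Setup took ten minutes", "positive"),
]
clusters = cluster_themes(themes)
```

Each cluster keeps its `supporting_quote` fields, which is what makes the topics traceable back to who said what.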
Beats LDA and BERTopic for qualitative work because the topics are human-readable from the start and traceable back to specific quotes — important for academic and consulting deliverables where someone will ask "who said this and where?"
Interview transcripts specifically
- Diarize first (split by speaker) — pyannote.audio handles this cleanly.
- Extract themes per speaker turn, not per whole transcript — preserves who said what.
- Watch for the moderator's questions framing the participant's answers; topics that come up only when the moderator prompts them are interesting in a different way than topics the participant raises.
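A sketch of the turn-splitting step, assuming the common `Speaker: text` line format that diarization and transcription exports tend to produce; adjust the regex to your tool's actual output.

```python
import re

TURN_RE = re.compile(r"^([A-Za-z ]+):\s*(.+)$")

def split_turns(transcript: str) -> list[tuple[str, str]]:
    """Split a diarized transcript into (speaker, turn) pairs."""
    turns: list[tuple[str, str]] = []
    for line in transcript.splitlines():
        m = TURN_RE.match(line.strip())
        if m:
            turns.append((m.group(1).strip(), m.group(2).strip()))
        elif turns and line.strip():
            # Continuation line: fold into the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, text + " " + line.strip())
    return turns

transcript = """Moderator: How do you track expenses today?
Participant: Mostly spreadsheets.
It gets messy at month end.
Moderator: What breaks first?"""

turns = split_turns(transcript)
participant_turns = [t for s, t in turns if s == "Participant"]
```

Feeding only `participant_turns` to the theme-extraction step is one way to keep moderator prompts from contaminating the topics.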
From a PDF
Extract clean text first (pdfplumber or pdfminer.six), strip headers/footers/page numbers, chunk long documents into sections. Run topic modeling on the chunks. Whole-PDF topic modeling on documents over ~30 pages usually returns topics that match the section structure rather than the actual themes.
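A sketch of the chunking step, assuming the text has already been extracted with pdfplumber or pdfminer.six and headers/footers stripped. The heading heuristic here (short, title-ish lines without ending punctuation) is illustrative, not robust.

```python
def chunk_by_headings(text: str, max_heading_len: int = 60) -> list[str]:
    """Split extracted PDF text into sections at heading-like lines."""
    chunks: list[list[str]] = [[]]
    for line in text.splitlines():
        stripped = line.strip()
        is_heading = (
            stripped
            and len(stripped) <= max_heading_len
            and not stripped.endswith((".", ",", ";"))
            and stripped[0].isupper()
        )
        if is_heading and chunks[-1]:
            chunks.append([stripped])  # start a new section at the heading
        else:
            chunks[-1].append(stripped)
    return ["\n".join(c).strip() for c in chunks if any(c)]

text = """Introduction
We surveyed 40 customers about billing.
Findings
Most complaints concerned invoice formatting."""

sections = chunk_by_headings(text)
```

The resulting `sections` list is what you'd pass to the topic model as `docs`, instead of one whole-PDF string.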
Choosing
- Big corpus, want repeatable analysis → LDA.
- Few hundred to few thousand docs, modest setup → BERTopic.
- Qualitative research, themes need to ground in quotes → LLM + clustering.
- Real-time / streaming → don't try; topic models stabilize on batch corpora.