Engineering · April 30, 2026 · 5 min read

How to extract topics from text or interview transcripts

BERTopic, LDA, and the LLM-with-clustering pattern that's quietly taken over qualitative research. What each one is best at, and the few lines of code to start.

By Dawid Sibinski

Topic modeling looks at a corpus of text and clusters it into themes. Used to be the domain of academics and statisticians; now used widely for customer feedback analysis, qualitative research coding, document categorization, and content audits.

LDA: the classic

Latent Dirichlet Allocation. Treats documents as mixtures of topics and topics as mixtures of words. Python: gensim's LdaModel:

```python
from gensim import corpora, models

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(d) for d in tokenized_docs]
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary)
lda.print_topics()
```

Strength: well-understood, deterministic with a fixed seed, fast on large corpora. Weakness: requires aggressive preprocessing (stopword removal, stemming/lemmatization, bigram detection), and the topics are bags of words you have to interpret yourself.
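That preprocessing step matters as much as the model. A minimal tokenize-and-filter pass can be sketched in plain Python — the stopword set here is a tiny illustrative subset (gensim and NLTK ship full lists, and gensim's Phrases class handles bigram detection):

```python
import re

# Tiny illustrative stopword set; use gensim's or NLTK's full list in practice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "are"}

def preprocess(doc: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords and very short tokens."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

tokenized_docs = [preprocess(d) for d in [
    "The pricing page is confusing and the pricing tiers overlap.",
    "Support response times are slow, and support tickets get lost.",
]]
```

Lemmatization (spaCy or NLTK) would normally follow, but even this much filtering removes most of the junk that otherwise dominates LDA's word distributions.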

BERTopic: the modern default

Embeds each document with a sentence transformer, reduces dimensionality with UMAP, clusters with HDBSCAN, then represents each cluster with c-TF-IDF for keywords:

```python
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()
```

Strength: handles short documents well, no hyperparameter tuning required for a baseline, comes with a great visualization toolkit. Weakness: stochastic — the randomness lives in the UMAP step, so pass a UMAP model with a fixed random_state for reproducibility — and it needs a few hundred documents minimum to find stable topics.
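The c-TF-IDF step is what turns a cluster into keywords, and the idea is simple enough to sketch. This toy version (not BERTopic's exact implementation) treats each cluster as one concatenated pseudo-document and scores term t in cluster c as tf(t, c) · log(1 + A / f(t)), where A is the average term count per cluster and f(t) is t's frequency across all clusters:

```python
import math
from collections import Counter

def c_tf_idf(clusters: dict[str, list[list[str]]]) -> dict[str, list[tuple[str, float]]]:
    """clusters maps a cluster id to its tokenized documents.
    Returns the top-5 terms per cluster under a toy c-TF-IDF score."""
    # Concatenate each cluster's documents into a single bag of words.
    bags = {c: Counter(t for doc in docs for t in doc) for c, docs in clusters.items()}
    corpus_freq = Counter()  # frequency of each term across all clusters
    for bag in bags.values():
        corpus_freq.update(bag)
    avg_len = sum(sum(b.values()) for b in bags.values()) / len(bags)
    scores = {}
    for c, bag in bags.items():
        ranked = sorted(
            ((t, tf * math.log(1 + avg_len / corpus_freq[t])) for t, tf in bag.items()),
            key=lambda x: -x[1],
        )
        scores[c] = ranked[:5]
    return scores
```

Terms that are frequent inside a cluster but rare across clusters float to the top, which is why BERTopic's keyword lists tend to read cleaner than raw LDA word bags.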

LLM + clustering: the quiet winner for qualitative research

Two-step pipeline that has quietly taken over interview-transcript coding:

  1. For each transcript, ask an LLM to extract themes with a typed schema: { theme, supporting_quote, sentiment }.
  2. Embed the themes (not the transcripts) and cluster them. The clusters are your topics; each topic is grounded in real quotes.
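The two steps above can be sketched end to end. Here llm_extract_themes is a placeholder for a real LLM call with a typed schema, and embed stands in for a sentence-embedding model (e.g. sentence-transformers); the clustering is a greedy cosine-similarity pass, the simplest stand-in for HDBSCAN or agglomerative clustering:

```python
def llm_extract_themes(transcript: str) -> list[dict]:
    """Placeholder for step 1: call an LLM with a typed schema and return
    [{"theme": ..., "supporting_quote": ..., "sentiment": ...}, ...]."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def cluster_themes(themes: list[dict], embed, threshold: float = 0.8) -> list[list[dict]]:
    """Step 2, sketched as greedy agglomeration: embed each theme string and
    attach it to the first cluster whose representative is similar enough.
    Each member keeps its supporting_quote, so every topic stays traceable."""
    clusters: list[list[dict]] = []
    vectors: list[list[float]] = []  # one representative vector per cluster
    for t in themes:
        v = embed(t["theme"])
        for i, rep in enumerate(vectors):
            if cosine(rep, v) >= threshold:
                clusters[i].append(t)
                break
        else:
            clusters.append([t])
            vectors.append(v)
    return clusters
```

In production you'd swap the greedy loop for HDBSCAN on the theme embeddings, but the shape of the output is the same: clusters of themes, each carrying its quotes.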

Beats LDA and BERTopic for qualitative work because the topics are human-readable from the start and traceable back to specific quotes — important for academic and consulting deliverables where someone will ask "who said this and where?"

Interview transcripts specifically

  • Diarize first (split by speaker) — pyannote.audio handles this cleanly.
  • Extract themes per speaker turn, not per whole transcript — preserves who said what.
  • Watch for the moderator's questions framing the participant's answers; topics that come up only when the moderator prompts them are interesting in a different way than topics the participant raises.
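Once you have a text transcript, splitting it into speaker turns is a small parsing job. A minimal sketch, assuming the common "Speaker: utterance" line format (formats vary; adjust the pattern to yours):

```python
import re

TURN = re.compile(r"^(?P<speaker>[^:]+):\s*(?P<text>.+)$")

def split_turns(transcript: str) -> list[dict]:
    """Split a 'Speaker: utterance' transcript into turns, merging
    continuation lines into the previous speaker's turn."""
    turns: list[dict] = []
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        m = TURN.match(line)
        if m:
            turns.append({"speaker": m["speaker"].strip(), "text": m["text"].strip()})
        elif turns:
            turns[-1]["text"] += " " + line
    return turns
```

Feed each turn (tagged with its speaker) to the theme-extraction step, and you can separate moderator-prompted topics from participant-raised ones downstream.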

From a PDF

Extract clean text first (pdfplumber or pdfminer.six), strip headers/footers/page numbers, chunk long documents into sections. Run topic modeling on the chunks. Whole-PDF topic modeling on documents over ~30 pages usually returns topics that match the section structure rather than the actual themes.
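A minimal heading-based chunker can be sketched in a few lines. The heading regex here is an assumption (numbered headings like "1. Introduction" or "2.3 Methods"); real documents need a pattern tuned to their layout:

```python
import re

# Assumes numbered headings such as "1. Introduction" or "2.3 Methods";
# adjust the pattern to your documents.
HEADING = re.compile(r"^\d+(\.\d+)*\.?\s+\S", re.MULTILINE)

def chunk_by_headings(text: str) -> list[str]:
    """Split extracted PDF text at heading lines; each chunk is one section."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        return [text]
    bounds = starts + [len(text)]
    chunks = []
    if starts[0] > 0:
        chunks.append(text[:starts[0]].strip())  # preamble before the first heading
    chunks += [text[a:b].strip() for a, b in zip(starts, bounds[1:])]
    return [c for c in chunks if c]
```

Run the topic model on these chunks rather than whole documents, and the topics track themes instead of section boundaries.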

Choosing

  • Big corpus, want repeatable analysis → LDA.
  • Few hundred to few thousand docs, modest setup → BERTopic.
  • Qualitative research, themes need to ground in quotes → LLM + clustering.
  • Real-time / streaming → don't try; topic models stabilize on batch corpora.


Stop reading, start extracting

Drop a PDF or image into ExtractFox and get structured data back in seconds.

Try a free extraction →