Gensim Topic Modeling — Deep Dive
Gensim’s architecture is designed around memory independence — the ability to process corpora larger than RAM by streaming documents one at a time. This guide covers the full pipeline from raw text to deployed topic model.
Corpus Construction
Streaming Corpus
For large datasets, implement a corpus as an iterable class:
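from gensim import corpora
class MyCorpus:
    """Streams one bag-of-words vector per line of a text file."""
    def __init__(self, dictionary, path):
        self.dictionary = dictionary
        self.path = path
    def __iter__(self):
        # Reopen the file on every pass; the context manager closes it afterwards
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())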
This never loads the full dataset into memory. Gensim’s algorithms iterate over the corpus multiple times (LDA does this for each training pass), so the __iter__ method gets called repeatedly.
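Repeated iteration is the reason the corpus is a class with __iter__ rather than a generator function: a generator is exhausted after one pass. A minimal stdlib-only sketch of the difference follows. The names ReiterableLines and one_shot_lines and the sample text are illustrative and not part of Gensim.

```python
import os
import tempfile


class ReiterableLines:
    """Yields tokenized lines; each call to __iter__ reopens the file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


def one_shot_lines(path):
    """Generator function: the returned generator can be consumed only once."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.lower().split()


# Write a tiny illustrative corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Topic models stream documents\nGensim iterates repeatedly\n")
    corpus_path = tmp.name

reiterable = ReiterableLines(corpus_path)
first_pass = list(reiterable)
second_pass = list(reiterable)  # works: __iter__ starts over

generator = one_shot_lines(corpus_path)
gen_first = list(generator)
gen_second = list(generator)    # empty: the generator is already exhausted

os.remove(corpus_path)
```

Passing a one-shot generator to a multi-pass algorithm would leave every pass after the first with no documents, which is why the iterable-class pattern is the safer default.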
Dictionary Filtering
Aggressive filtering is the single most impactful preprocessing step:
from gensim.corpora import Dictionary
# Build dictionary from tokenized documents
dictionary = Dictionary(tokenized_docs)
# Remove extremes
dictionary.filter_extremes(
    no_below=15,    # keep tokens that appear in at least 15 documents
    no_above=0.5,   # and in at most 50% of documents
    keep_n=100000,  # then keep the 100k most frequent of the survivors
)
Typical vocabulary drops from 500k+ tokens to 20-50k after filtering, which dramatically improves both training speed and topic quality.
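To make the filtering rule concrete, here is a pure-Python sketch of document-frequency filtering on a toy corpus. It approximates the rule described above and is not Gensim's implementation; the function name filter_extremes_sketch and the sample documents are hypothetical.

```python
from collections import Counter


def filter_extremes_sketch(tokenized_docs, no_below=2, no_above=0.5, keep_n=None):
    """Return the set of tokens that survive document-frequency filtering."""
    num_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each token at least once.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    survivors = [
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    ]
    # Most frequent first (ties broken alphabetically), then apply the optional cap.
    survivors.sort(key=lambda token: (-doc_freq[token], token))
    if keep_n is not None:
        survivors = survivors[:keep_n]
    return set(survivors)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
kept = filter_extremes_sketch(docs, no_below=2, no_above=0.5)
kept_capped = filter_extremes_sketch(docs, no_below=2, no_above=0.5, keep_n=1)
```

Here "the" is dropped for appearing in every document, and the tokens that appear only once are dropped by no_below, leaving just "cat" and "sat".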
TF-IDF Weighting
Raw bag-of-words treats all words equally. TF-IDF downweights common words and upweights distinctive ones:
from gensim.models import TfidfModel
tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus] # lazy transformation, no memory spike
LDA traditionally uses raw counts, but LSI benefits significantly from TF-IDF weighting.
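The weighting itself is simple arithmetic. The sketch below assumes the common scheme of raw term count times a base-2 log inverse document frequency, followed by L2 normalization per document. That appears to match TfidfModel's defaults, but treat the exact formula as an assumption and check the TfidfModel documentation for your version. The function name tfidf_sketch is hypothetical.

```python
import math
from collections import Counter


def tfidf_sketch(tokenized_docs):
    """Weight = term count * log2(N / document frequency), L2-normalized per document."""
    num_docs = len(tokenized_docs)
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    weighted_docs = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {
            token: count * math.log2(num_docs / doc_freq[token])
            for token, count in counts.items()
        }
        # Tokens that occur in every document get weight 0 and are dropped.
        weights = {token: w for token, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        weighted_docs.append(
            {token: w / norm for token, w in weights.items()} if norm else {}
        )
    return weighted_docs


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
weighted = tfidf_sketch(docs)
```

In the first document, "the" disappears entirely, and "cat" (one document) ends up with a larger weight than "sat" (two documents).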
Training LDA
Basic Training
from gensim.models import LdaMulticore
model = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=20,
    passes=10,            # full passes over the corpus
    iterations=50,        # maximum inference iterations per document
    chunksize=2000,       # documents per training chunk
    workers=3,            # worker processes (uses multiprocessing)
    random_state=42,
    per_word_topics=True  # enables per-word topic assignments
)
LdaMulticore uses Python multiprocessing for parallelism. On an 8-core machine with 100k documents and 20 topics, training takes 5-15 minutes depending on document length.
Hyperparameter Tuning
The three most impactful parameters:
num_topics: Start with a grid search:
from gensim.models import CoherenceModel
coherence_scores = []
for k in range(5, 50, 5):
    # Fix the seed so score differences reflect k rather than initialization
    model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=k, passes=10, workers=3, random_state=42)
    cm = CoherenceModel(model=model, texts=tokenized_docs, dictionary=dictionary, coherence='c_v')
    score = cm.get_coherence()  # compute once; c_v coherence is expensive
    coherence_scores.append((k, score))
    print(f"k={k}: coherence={score:.4f}")
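Once the grid search has filled coherence_scores, one reasonable way to choose k is to take the smallest topic count whose score is close to the best, since fewer topics are easier to interpret. This is a sketch of that heuristic, not a Gensim feature. The helper pick_num_topics, the tolerance value, and the example scores are all hypothetical.

```python
def pick_num_topics(coherence_scores, tolerance=0.01):
    """Return the smallest k whose coherence is within `tolerance` of the best score."""
    best_score = max(score for _, score in coherence_scores)
    candidates = [k for k, score in coherence_scores if best_score - score <= tolerance]
    return min(candidates)


# Hypothetical (k, coherence) pairs in the shape produced by the grid search loop.
example_scores = [(5, 0.41), (10, 0.48), (15, 0.525), (20, 0.53), (25, 0.51)]
chosen_k = pick_num_topics(example_scores)
strict_k = pick_num_topics(example_scores, tolerance=0.0)
```

With a tolerance of 0.01 the helper prefers k=15 over the marginally better k=20; with a tolerance of 0 it simply returns the top scorer.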
alpha and eta: These control the Dirichlet priors.
- alpha='asymmetric': allows some topics to be more prevalent than others. Often better than the default symmetric prior for real-world corpora. Note that alpha='auto' is not supported by LdaMulticore; it requires the single-process LdaModel.
- eta='auto': learns the word-topic prior from data. Slower but can improve topic specificity.
model = LdaMulticore(
    corpus=corpus, id2word=dictionary, num_topics=20,
    alpha='asymmetric', eta='auto',
    passes=15, workers=3
)
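To see what "asymmetric" means numerically, the sketch below builds both priors for four topics. The asymmetric formula, 1 / (i + sqrt(num_topics)) normalized to sum to 1, is an assumption based on a reading of Gensim's source and may differ between versions. The function names are hypothetical.

```python
import math


def symmetric_prior(num_topics):
    """Every topic gets the same prior weight."""
    return [1.0 / num_topics] * num_topics


def asymmetric_prior(num_topics):
    """Decreasing weights 1 / (i + sqrt(num_topics)), normalized to sum to 1."""
    raw = [1.0 / (i + math.sqrt(num_topics)) for i in range(num_topics)]
    total = sum(raw)
    return [value / total for value in raw]


sym = symmetric_prior(4)
asym = asymmetric_prior(4)
```

The symmetric prior is flat, while the asymmetric one gives earlier topics more prior mass, which lets a few topics absorb the most common themes.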
Convergence Monitoring
Track perplexity across passes to detect convergence:
model = LdaMulticore(
    corpus=corpus, id2word=dictionary, num_topics=20,
    passes=20,
    eval_every=2,  # estimate log perplexity every 2 updates (chunks), not passes
    workers=3
)
# Perplexity estimates are written to the log at INFO level
# Converged when perplexity stops decreasing between passes
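Perplexity values only show up if logging is configured before training, because Gensim reports them through the standard logging module. The sketch below enables INFO logging and adds a small helper that applies the "stops decreasing" rule to values read from the log. The helper has_converged, its thresholds, and the sample values are hypothetical.

```python
import logging

# Configure logging before training so INFO-level perplexity messages are visible.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)


def has_converged(perplexities, rel_tol=0.01, patience=2):
    """True if each of the last `patience` steps improved perplexity by less than rel_tol."""
    if len(perplexities) < patience + 1:
        return False
    recent = perplexities[-(patience + 1):]
    for previous, current in zip(recent, recent[1:]):
        improvement = (previous - current) / previous
        if improvement >= rel_tol:
            return False
    return True


# Hypothetical perplexity estimates copied from a training log.
logged = [412.0, 301.5, 262.3, 260.9, 260.1]
converged_now = has_converged(logged)
converged_early = has_converged(logged[:3])
```

Requiring several consecutive small improvements avoids stopping on a single noisy estimate.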
Evaluating and Interpreting Topics
Coherence Metrics
from gensim.models import CoherenceModel
# C_v coherence (requires original texts, not just corpus)
cm = CoherenceModel(model=model, texts=tokenized_docs, dictionary=dictionary, coherence='c_v')
print(f"C_v: {cm.get_coherence():.4f}")
# U_mass coherence (works with corpus alone)
cm_umass = CoherenceModel(model=model, corpus=corpus, dictionary=dictionary, coherence='u_mass')
print(f"U_mass: {cm_umass.get_coherence():.4f}")
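As a rough rule of thumb, C_v above 0.55 indicates useful topics, while values below 0.40 suggest problems with preprocessing or topic count. These thresholds vary by corpus, so compare scores across runs on the same data rather than treating them as absolute.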
Topic Inspection
for idx, topic in model.print_topics(num_words=10):
    print(f"Topic {idx}: {topic}")
# Per-document topic distribution
first_doc = next(iter(corpus))  # works for lists and streamed corpora, unlike corpus[0]
doc_topics = model.get_document_topics(first_doc, minimum_probability=0.05)
# e.g. [(3, 0.45), (7, 0.32), (12, 0.18)]: (topic_id, probability) pairs
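A common next step is to reduce each document's distribution to its single most probable topic, for example to label or group documents. A minimal sketch, assuming the list-of-pairs shape shown above (the helper name dominant_topic is hypothetical):

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability, or None."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])


# Illustrative distribution in the (topic_id, probability) format.
example = [(3, 0.45), (7, 0.32), (12, 0.18)]
top = dominant_topic(example)
empty = dominant_topic([])
```

The empty case matters in practice: with a minimum_probability cutoff, a document can come back with no topics at all.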
Visualization with pyLDAvis
import pyLDAvis.gensim_models
vis = pyLDAvis.gensim_models.prepare(model, corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_visualization.html')
The interactive visualization shows topic distances (should be spread apart, not clustered) and word relevance per topic. Overlapping circles suggest you have too many topics.
Online Learning for Growing Corpora
When new documents arrive continuously, retrain incrementally instead of from scratch:
# Initial training
model = LdaMulticore(corpus=initial_corpus, id2word=dictionary, num_topics=20, passes=10)
# Later, update with new documents
new_corpus = [dictionary.doc2bow(doc) for doc in new_tokenized_docs]
model.update(new_corpus)
The update method runs additional passes on the new data while preserving learned topic distributions. This is LDA’s online variational Bayes algorithm — it processes mini-batches and updates global parameters incrementally.
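Caveat: doc2bow silently ignores tokens that are not in the dictionary, so new vocabulary never reaches the model during an update. If new documents introduce important terms, you need to extend the dictionary and potentially retrain from scratch.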
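One way to decide when vocabulary drift justifies a retrain is to measure how many token occurrences in the new batch are unknown to the dictionary. The sketch below uses a plain dict in place of dictionary.token2id so it stays self-contained. The helper out_of_vocabulary_rate and any threshold you apply to its result are assumptions, not Gensim features.

```python
def out_of_vocabulary_rate(new_tokenized_docs, known_tokens):
    """Fraction of token occurrences in the new documents that are not in known_tokens."""
    total = 0
    unknown = 0
    for doc in new_tokenized_docs:
        for token in doc:
            total += 1
            if token not in known_tokens:
                unknown += 1
    return unknown / total if total else 0.0


# `known_tokens` stands in for `dictionary.token2id` in a real pipeline.
known_tokens = {"model": 0, "topic": 1, "corpus": 2}
new_docs = [["topic", "model", "transformer"], ["corpus", "embedding"]]
oov = out_of_vocabulary_rate(new_docs, known_tokens)
```

A rate that climbs over successive batches is a sign that incremental updates are discarding a growing share of the new data.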
LSI as an Alternative
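Latent Semantic Indexing (LSI) is usually faster to train than LDA because it reduces to a truncated SVD rather than iterative inference. Gensim's implementation relies on a randomized SVD, so do not assume runs are bit-for-bit identical: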
from gensim.models import LsiModel
lsi = LsiModel(corpus_tfidf, id2word=dictionary, num_topics=200)
LSI topics have positive and negative weights, making them harder to interpret than LDA. However, LSI excels at document similarity search:
from gensim.similarities import MatrixSimilarity
index = MatrixSimilarity(lsi[corpus_tfidf])
query = dictionary.doc2bow("machine learning neural network".split())
# Apply the same TF-IDF transform the LSI model was trained on before projecting
sims = index[lsi[tfidf[query]]]
# sims is an array of similarity scores, one per indexed document
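The raw score array is usually turned into a ranked list of the best-matching documents. A minimal sketch, assuming sims is any sequence of numeric scores indexed by document position (the helper top_matches and the example scores are hypothetical):

```python
def top_matches(sims, k=3):
    """Return the k (document_index, score) pairs with the highest similarity."""
    ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
    # Cast to float so NumPy scalars become plain Python numbers.
    return [(index, float(score)) for index, score in ranked[:k]]


# Hypothetical similarity scores, one per indexed document.
example_sims = [0.12, 0.87, 0.05, 0.64]
best = top_matches(example_sims, k=2)
```

The returned indices refer to document positions in the corpus that was used to build the index, so keep a parallel list of document IDs if you need to map back to the source texts.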
Saving, Loading, and Serving
# Save
model.save("models/lda_20topics.model")
dictionary.save("models/dictionary.dict")
# Load
from gensim.models import LdaMulticore
model = LdaMulticore.load("models/lda_20topics.model")
For serving in a web application:
from fastapi import FastAPI
from gensim.models import LdaMulticore
from gensim.corpora import Dictionary
app = FastAPI()
# Load once at startup, not per request
model = LdaMulticore.load("models/lda_20topics.model")
dictionary = Dictionary.load("models/dictionary.dict")
@app.post("/topics")
def get_topics(text: str):  # a bare str parameter is read from the query string
    bow = dictionary.doc2bow(text.lower().split())
    topics = model.get_document_topics(bow, minimum_probability=0.05)
    # Cast NumPy scalars to built-in types so the response is JSON-serializable
    return [{"topic": int(t), "probability": round(float(p), 3)} for t, p in topics]
Performance Benchmarks
| Corpus Size | Topics | Training Time | RAM Usage |
|---|---|---|---|
| 10k docs | 20 | ~30 sec | ~200 MB |
| 100k docs | 20 | ~8 min | ~500 MB |
| 1M docs | 50 | ~2 hours | ~2 GB (streaming) |
These assume LdaMulticore with 3 workers, 10 passes, on a modern 8-core CPU.
Common Pitfalls
- Skipping preprocessing. LDA on raw text produces topics dominated by stopwords and punctuation. Always tokenize, remove stopwords, and lemmatize first.
- Choosing topic count by gut feeling. Use coherence scores. The “right” number varies wildly — 10 topics for a focused corpus, 50+ for a broad one.
- Expecting reproducibility without random_state. LDA uses random initialization. Set random_state for consistent results across runs.
- Ignoring short documents. Documents with fewer than 20 words after preprocessing provide almost no signal. Filter them out or combine them.
- Using LDA for classification. LDA is exploratory, not discriminative. If you have labels, use a supervised classifier instead.
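The first and fourth pitfalls can be addressed in one small preprocessing step. The sketch below is stdlib-only: it lowercases, keeps alphabetic tokens, removes stopwords, and drops documents that end up too short. Lemmatization is omitted because it needs an external library. The stopword list, the function name preprocess, and the sample texts are illustrative assumptions.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}
TOKEN_PATTERN = re.compile(r"[a-z]+")


def preprocess(raw_docs, min_tokens=20):
    """Lowercase, keep alphabetic tokens, drop stopwords, and discard short documents."""
    cleaned = []
    for text in raw_docs:
        tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOPWORDS]
        if len(tokens) >= min_tokens:
            cleaned.append(tokens)
    return cleaned


raw = ["The cat sat on the mat.", "Topic models, in short: fun!"]
tokenized = preprocess(raw, min_tokens=3)
longer_only = preprocess(raw, min_tokens=4)
```

The default of 20 tokens mirrors the threshold mentioned in the list above; the example lowers it only so the toy sentences survive.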
The one thing to remember: Gensim’s streaming architecture lets you build topic models on corpora of any size — but the quality of those models depends far more on preprocessing and evaluation than on algorithm choice or hyperparameter tuning.