Gensim Topic Modeling — Deep Dive

Gensim’s architecture is designed around memory independence — the ability to process corpora larger than RAM by streaming documents one at a time. This guide covers the full pipeline from raw text to deployed topic model.

Corpus Construction

Streaming Corpus

For large datasets, implement a corpus as an iterable class:

from gensim import corpora

class MyCorpus:
    def __init__(self, dictionary, path):
        self.dictionary = dictionary
        self.path = path

    def __iter__(self):
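        # Reopen the file on every pass (one document per line, UTF-8 assumed)
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())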

This never loads the full dataset into memory. Gensim’s algorithms iterate over the corpus multiple times (LDA does this for each training pass), so the __iter__ method gets called repeatedly.
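The difference between a re-iterable class and a one-shot generator is easy to miss, so here is a minimal standard-library sketch of it. The names LineCorpus and line_generator are illustrative, and the example yields token lists rather than bag-of-words vectors to stay free of Gensim.

```python
import os
import tempfile


class LineCorpus:
    """Re-iterable corpus: each call to __iter__ reopens the file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.lower().split()


def line_generator(path):
    """One-shot generator: exhausted after the first full pass."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.lower().split()


# Write a tiny two-document file to iterate over
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Topic models find themes\nStreaming keeps memory flat\n")
    path = tmp.name

corpus = LineCorpus(path)
first_pass = list(corpus)
second_pass = list(corpus)   # same documents again

gen = line_generator(path)
gen_first = list(gen)
gen_second = list(gen)       # empty: the generator is already consumed

os.remove(path)
print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))  # 2 2 2 0
```

A multi-pass algorithm given the generator would see an empty corpus from the second pass onward, which is why the class form is used.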

Dictionary Filtering

Aggressive filtering is the single most impactful preprocessing step:

from gensim.corpora import Dictionary

# Build dictionary from tokenized documents
dictionary = Dictionary(tokenized_docs)

# Remove extremes
dictionary.filter_extremes(
    no_below=15,     # keep tokens that appear in at least 15 documents
    no_above=0.5,    # ...and in no more than 50% of documents
    keep_n=100000    # then keep only the 100k most frequent survivors
)

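On a large corpus, the vocabulary typically drops from 500k+ unique tokens to 20-50k after filtering. This speeds up training and usually improves topic quality, because rare noise terms and near-ubiquitous terms no longer compete for topic mass.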

TF-IDF Weighting

Raw bag-of-words weights each word by its count alone, so frequent but uninformative words dominate. TF-IDF downweights words that appear in many documents and upweights distinctive ones:

from gensim.models import TfidfModel

tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]  # lazy transformation, no memory spike

LDA traditionally uses raw counts, but LSI benefits significantly from TF-IDF weighting.

Training LDA

Basic Training

from gensim.models import LdaMulticore

model = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=20,
    passes=10,            # full passes over the corpus
    iterations=50,        # max inference iterations per document
    chunksize=2000,       # documents per training chunk
    workers=3,            # worker processes (physical cores - 1 is a common choice)
    random_state=42,
    per_word_topics=True  # enables per-word topic assignments
)

LdaMulticore uses Python multiprocessing for parallelism. On an 8-core machine with 100k documents and 20 topics, training takes 5-15 minutes depending on document length.

Hyperparameter Tuning

The three most impactful parameters:

num_topics: Start with a grid search:

from gensim.models import CoherenceModel

coherence_scores = []
for k in range(5, 50, 5):  # k = 5, 10, ..., 45
    model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=k,
                         passes=10, workers=3, random_state=42)
    cm = CoherenceModel(model=model, texts=tokenized_docs, dictionary=dictionary, coherence='c_v')
    score = cm.get_coherence()  # score once and reuse the value
    coherence_scores.append((k, score))
    print(f"k={k}: coherence={score:.4f}")
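Once the scores are collected, one possible selection rule is to take the smallest k whose coherence is close to the best score, since fewer topics are easier to interpret. The helper pick_num_topics, its tolerance parameter, and the scores below are all assumptions made up for this sketch, not part of Gensim.

```python
# Placeholder (k, coherence) pairs; these numbers are invented for illustration
coherence_scores = [(5, 0.41), (10, 0.48), (15, 0.53), (20, 0.52), (25, 0.49)]


def pick_num_topics(scores, tolerance=0.02):
    """Return the smallest k whose coherence is within `tolerance` of the best."""
    best = max(score for _, score in scores)
    candidates = [k for k, score in scores if best - score <= tolerance]
    return min(candidates)


best_k = pick_num_topics(coherence_scores)
print(best_k)  # 15
```

Setting tolerance to 0 reduces this to a plain argmax over the scores.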

alpha and eta: These control the Dirichlet priors.

  • alpha='asymmetric': allows some topics to be more prevalent than others. Often better than the default symmetric prior for real-world corpora. (alpha='auto' is only available in the single-core LdaModel, not in LdaMulticore.)
  • eta='auto': learns the word-topic prior from data. Slower but can improve topic specificity.

model = LdaMulticore(
    corpus=corpus, id2word=dictionary, num_topics=20,
    alpha='asymmetric', eta='auto',
    passes=15, workers=3
)
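
For intuition about what a Dirichlet prior controls, the sketch below samples document-topic mixtures directly with NumPy rather than through Gensim. The concentration values 0.1 and 10.0 and the helper mean_largest_share are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics = 10


def mean_largest_share(alpha_value, draws=500):
    """Average size of the biggest topic share under a symmetric Dirichlet prior."""
    samples = rng.dirichlet([alpha_value] * num_topics, size=draws)
    return samples.max(axis=1).mean()


sparse = mean_largest_share(0.1)   # small alpha: documents dominated by few topics
dense = mean_largest_share(10.0)   # large alpha: topics mixed almost evenly

print(f"alpha=0.1  -> mean largest share {sparse:.2f}")
print(f"alpha=10.0 -> mean largest share {dense:.2f}")
```

Small concentration values push each document toward a handful of topics, while large values spread probability mass across all of them.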

Convergence Monitoring

Track perplexity across passes to detect convergence:

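import logging

# Perplexity estimates are written to the log at INFO level
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

model = LdaMulticore(
    corpus=corpus, id2word=dictionary, num_topics=20,
    passes=20, eval_every=2,  # estimate perplexity every 2 model updates (not passes)
    workers=3
)
# Check the log output for the perplexity estimates
# Training has converged when perplexity stops decreasing between passes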
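
Reading the log by eye can be replaced with a simple stopping rule. The sketch below assumes you collect one per-word likelihood bound per pass, for example from model.log_perplexity on a held-out chunk. The helper names, the tolerance, and the sample bounds are hypothetical.

```python
def has_converged(bounds, tolerance=0.01, patience=2):
    """True when the last `patience` gains in the per-word bound are all below `tolerance`.

    `bounds` holds one per-word likelihood bound per pass (higher is better).
    """
    if len(bounds) <= patience:
        return False
    recent = bounds[-(patience + 1):]
    improvements = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(delta < tolerance for delta in improvements)


def to_perplexity(bound):
    """Convert a per-word bound into perplexity, as 2 ** (-bound)."""
    return 2 ** (-bound)


# Invented bounds for illustration, one per training pass
bounds = [-9.10, -8.40, -8.05, -8.046, -8.044]

print(has_converged(bounds[:3]))   # False: the bound is still rising quickly
print(has_converged(bounds))       # True: the last two gains are under 0.01
print(round(to_perplexity(-8.0)))  # 256
```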

Evaluating and Interpreting Topics

Coherence Metrics

from gensim.models import CoherenceModel

# C_v coherence (requires original texts, not just corpus)
cm = CoherenceModel(model=model, texts=tokenized_docs, dictionary=dictionary, coherence='c_v')
print(f"C_v: {cm.get_coherence():.4f}")

# U_mass coherence (works with corpus alone)
cm_umass = CoherenceModel(model=model, corpus=corpus, dictionary=dictionary, coherence='u_mass')
print(f"U_mass: {cm_umass.get_coherence():.4f}")

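As a rough rule of thumb, C_v above 0.55 indicates useful topics, while values below 0.40 suggest problems with preprocessing or topic count. Scores depend on the corpus, so compare models trained on the same data rather than treating these thresholds as absolute.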

Topic Inspection

for idx, topic in model.print_topics(num_words=10):
    print(f"Topic {idx}: {topic}")

# Per-document topic distribution
# (corpus[0] needs an indexable corpus such as a list; a streaming corpus has no indexing)
doc_topics = model.get_document_topics(corpus[0], minimum_probability=0.05)
# [(3, 0.45), (7, 0.32), (12, 0.18)]  — topic_id, probability
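
A common next step is to reduce the distribution to a single label per document. The helper dominant_topic below is a hypothetical convenience function, applied to the example distribution shown above.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability, or None."""
    if not doc_topics:
        return None  # every topic fell below minimum_probability
    return max(doc_topics, key=lambda pair: pair[1])


# The example distribution from the text
doc_topics = [(3, 0.45), (7, 0.32), (12, 0.18)]

topic_id, probability = dominant_topic(doc_topics)
print(topic_id, probability)  # 3 0.45
```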

Visualization with pyLDAvis

import pyLDAvis.gensim_models

vis = pyLDAvis.gensim_models.prepare(model, corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_visualization.html')

The interactive visualization shows topic distances (should be spread apart, not clustered) and word relevance per topic. Overlapping circles suggest you have too many topics.

Online Learning for Growing Corpora

When new documents arrive continuously, retrain incrementally instead of from scratch:

# Initial training
model = LdaMulticore(corpus=initial_corpus, id2word=dictionary, num_topics=20, passes=10)

# Later, update with new documents
new_corpus = [dictionary.doc2bow(doc) for doc in new_tokenized_docs]
model.update(new_corpus)

The update method runs additional passes on the new data while preserving learned topic distributions. This is LDA’s online variational Bayes algorithm — it processes mini-batches and updates global parameters incrementally.

Caveat: doc2bow silently ignores tokens that are not in the dictionary, and the model's vocabulary is fixed at training time. If new documents introduce important vocabulary, you need to rebuild the dictionary and retrain from scratch.

LSI as an Alternative

Latent Semantic Indexing is faster and deterministic (no random initialization):

from gensim.models import LsiModel

lsi = LsiModel(corpus_tfidf, id2word=dictionary, num_topics=200)

LSI topics have positive and negative weights, making them harder to interpret than LDA. However, LSI excels at document similarity search:

from gensim.similarities import MatrixSimilarity

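index = MatrixSimilarity(lsi[corpus_tfidf], num_features=lsi.num_topics)
query = dictionary.doc2bow("machine learning neural network".split())
sims = index[lsi[tfidf[query]]]  # apply the same TF-IDF step the LSI model was trained on
# sims is an array of similarity scores, one per indexed document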
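
To turn the score array into a ranked result list, sort the document indices by score. The helper top_matches and the scores below are made up for illustration; a plain list stands in for the array returned by the index.

```python
def top_matches(sims, n=3):
    """Return the n best (doc_index, score) pairs, highest score first."""
    ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
    return [(doc_index, float(score)) for doc_index, score in ranked[:n]]


# Invented similarity scores standing in for the index output
example_sims = [0.12, 0.87, 0.05, 0.64, 0.33]

print(top_matches(example_sims, n=3))  # [(1, 0.87), (3, 0.64), (4, 0.33)]
```

MatrixSimilarity also accepts a num_best argument that makes the index return only the top matches, which may be simpler when you never need the full array.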

Saving, Loading, and Serving

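# Save (the models/ directory must already exist)
model.save("models/lda_20topics.model")
dictionary.save("models/dictionary.dict")

# Load
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore
model = LdaMulticore.load("models/lda_20topics.model")
dictionary = Dictionary.load("models/dictionary.dict")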

For serving in a web application:

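from fastapi import FastAPI
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore
from pydantic import BaseModel

app = FastAPI()
model = LdaMulticore.load("models/lda_20topics.model")
dictionary = Dictionary.load("models/dictionary.dict")

# TopicRequest is an illustrative name for the JSON request body
class TopicRequest(BaseModel):
    text: str

@app.post("/topics")
def get_topics(request: TopicRequest):
    # Apply the same preprocessing that was used at training time
    bow = dictionary.doc2bow(request.text.lower().split())
    topics = model.get_document_topics(bow, minimum_probability=0.05)
    # Convert NumPy floats to plain Python floats so the response serializes to JSON
    return [{"topic": int(t), "probability": round(float(p), 3)} for t, p in topics]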

Performance Benchmarks

Corpus Size    Topics    Training Time    RAM Usage
10k docs       20        ~30 sec          ~200 MB
100k docs      20        ~8 min           ~500 MB
1M docs        50        ~2 hours         ~2 GB (streaming)

These assume LdaMulticore with 3 workers, 10 passes, on a modern 8-core CPU.

Common Pitfalls

  1. Skipping preprocessing. LDA on raw text produces topics dominated by stopwords and punctuation. Always tokenize, remove stopwords, and lemmatize first.
  2. Choosing topic count by gut feeling. Use coherence scores. The “right” number varies wildly — 10 topics for a focused corpus, 50+ for a broad one.
  3. Expecting reproducibility without random_state. LDA uses random initialization. Set random_state for consistent results across runs.
  4. Ignoring short documents. Documents with fewer than 20 words after preprocessing provide almost no signal. Filter them out or combine them.
  5. Using LDA for classification. LDA is exploratory, not discriminative. If you have labels, use a supervised classifier instead.
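
Pitfalls 1 and 4 can be addressed with a few lines of standard-library code. This is a minimal sketch under stated assumptions: the stopword list is deliberately tiny, lemmatization is left out, and the helper names preprocess and drop_short_docs are hypothetical.

```python
import re

# A deliberately tiny stopword list for illustration; real pipelines use a fuller one
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "on", "for"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords (no lemmatization)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def drop_short_docs(tokenized_docs, min_tokens=20):
    """Keep only documents with at least `min_tokens` tokens after preprocessing."""
    return [doc for doc in tokenized_docs if len(doc) >= min_tokens]


raw_docs = [
    "The model is trained on a corpus of news articles.",
    "Short note.",
]
tokenized = [preprocess(doc) for doc in raw_docs]
kept = drop_short_docs(tokenized, min_tokens=3)   # low threshold for the toy data

print(tokenized[0])  # ['model', 'trained', 'corpus', 'news', 'articles']
print(len(kept))     # 1
```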

The one thing to remember: Gensim’s streaming architecture lets you build topic models on corpora of any size — but the quality of those models depends far more on preprocessing and evaluation than on algorithm choice or hyperparameter tuning.
