Text Classification in Python — Core Concepts

Understand feature extraction, model selection, and evaluation metrics for building text classifiers that work on real data.

Text classification assigns predefined labels to documents. It is one of the most commercially valuable NLP tasks — behind spam filters, content moderation, customer support routing, sentiment analysis, and document triage systems.

The Pipeline

Most text classifiers follow the same four-step pipeline:

Preprocessing — clean and normalize text (lowercase, remove punctuation, handle special characters).
Feature extraction — convert text into numbers that a model can process.
Model training — fit a classifier on labeled examples.
Evaluation — measure how well the model performs on unseen data.

Each step matters. Skipping preprocessing produces noisy features. Choosing the wrong feature representation limits your model’s ceiling. Evaluating on training data gives misleading results.
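The preprocessing step can be sketched with the standard library alone. The function name normalize_text and its exact rules (Unicode NFKC folding, lowercasing, replacing punctuation with spaces) are illustrative assumptions rather than a fixed recipe; which cleaning steps help depends on the data and on the feature extractor that follows.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, fold Unicode variants, and strip punctuation (illustrative rules)."""
    # NFKC folds compatibility characters, e.g. full-width letters become ASCII.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("  Hello, WORLD!!  See you   tomorrow. "))
```

Aggressive cleaning is not always a win: for some tasks punctuation or casing carries signal, so it is worth checking each rule against validation results.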

Feature Extraction Methods

Bag of Words (BoW)

The simplest approach: count how many times each word appears in a document. A document becomes a vector where each dimension corresponds to a word in the vocabulary.

Limitation: “The cat sat on the mat” and “The mat sat on the cat” produce identical vectors. Word order is lost.
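The limitation is easy to reproduce. The sketch below builds count vectors with collections.Counter; the helper name bag_of_words and the whitespace tokenizer are simplifying assumptions, and a library vectorizer would tokenize more carefully.

```python
from collections import Counter


def bag_of_words(text: str, vocabulary: list[str]) -> list[int]:
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


doc_a = "The cat sat on the mat"
doc_b = "The mat sat on the cat"

# Build a shared, sorted vocabulary from both documents.
vocabulary = sorted(set((doc_a + " " + doc_b).lower().split()))

vec_a = bag_of_words(doc_a, vocabulary)
vec_b = bag_of_words(doc_b, vocabulary)

print(vocabulary)      # ['cat', 'mat', 'on', 'sat', 'the']
print(vec_a)           # [1, 1, 1, 1, 2]
print(vec_a == vec_b)  # True: the two sentences are indistinguishable
```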

TF-IDF

An improvement over raw counts. TF-IDF (Term Frequency–Inverse Document Frequency) downweights words that appear in many documents (like “the” and “is”) and upweights words that are distinctive to specific documents. This is the most common feature representation for traditional classifiers.
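To make the weighting concrete, here is a minimal sketch of the textbook formula, tf × log(N / df), on three made-up documents. The function name tf_idf and the sample sentences are assumptions for illustration. Library implementations usually apply smoothing and normalization (scikit-learn's TfidfVectorizer does by default), so their numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the invoice is overdue".split(),
    "the meeting is tomorrow".split(),
    "the invoice was paid".split(),
]
for doc_weights in tf_idf(docs):
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In this toy corpus "the" appears in every document, so its weight is zero, while "overdue" appears in only one document and gets the largest weight in the first document.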

Word Embeddings

Instead of sparse counts, represent each word as a dense vector (typically 100-300 dimensions) trained on large corpora. Document vectors are often computed as the average of their word embeddings. This captures some semantic meaning — “happy” and “joyful” get similar vectors — but loses word order.
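The averaging idea can be sketched without any pretrained model. The three-dimensional vectors below are hand-written, hypothetical values chosen only to illustrate the mechanics; real embeddings are learned from large corpora and loaded from a trained model.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
EMBEDDINGS = {
    "happy":  [0.9, 0.1, 0.0],
    "joyful": [0.8, 0.2, 0.1],
    "refund": [0.0, 0.1, 0.9],
    "very":   [0.1, 0.1, 0.1],
}


def document_vector(tokens):
    """Average the embeddings of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0] * 3
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sim_close = cosine(document_vector(["very", "happy"]), document_vector(["very", "joyful"]))
sim_far = cosine(document_vector(["very", "happy"]), document_vector(["refund"]))
print(f"happy vs joyful: {sim_close:.2f}")
print(f"happy vs refund: {sim_far:.2f}")

# Averaging ignores order: both token orders give the same document vector.
print(document_vector(["happy", "refund"]) == document_vector(["refund", "happy"]))
```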

Transformer Representations

Models like BERT produce context-aware embeddings where the same word gets different vectors depending on the surrounding words. These typically give the highest accuracy but need far more compute for both training and inference.

Common Algorithms

Algorithm	Strengths	Best When
Naive Bayes	Fast, works with small data	Quick baseline, high-dimensional features
Logistic Regression	Interpretable, handles TF-IDF well	Medium datasets, need to understand decisions
SVM (Linear)	Strong with high-dimensional sparse data	TF-IDF features, binary classification
Random Forest	Handles non-linear patterns	Mixed feature types
Fine-tuned BERT	State-of-the-art accuracy	Enough compute, need max performance

For most projects, start with Logistic Regression + TF-IDF. It trains in seconds, performs surprisingly well, and gives you a strong baseline to beat.
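A minimal sketch of that baseline with scikit-learn is below. The eight example messages and their labels are made up for illustration, and the settings (default TfidfVectorizer, max_iter=1000, a 25% stratified test split) are reasonable starting assumptions rather than tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset, for illustration only.
texts = [
    "win a free prize now", "claim your free reward today",
    "free cash click now", "urgent prize waiting claim now",
    "meeting moved to friday", "please review the attached report",
    "lunch tomorrow at noon", "notes from the project meeting",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Hold out unseen data; stratify keeps the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# TfidfVectorizer lowercases and tokenizes; LogisticRegression is the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test), zero_division=0))
print(model.predict(["free prize inside"]))
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary and IDF weights are learned from the training split only, which avoids leaking test data into the features. With a dataset this small the reported scores are not meaningful; the point is the shape of the code.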

Evaluation Metrics

Accuracy alone is misleading when classes are imbalanced. If 95% of emails are not spam, a model that always predicts “not spam” gets 95% accuracy while being completely useless.
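The arithmetic is quick to check. The sketch below assumes 1,000 emails with the 95/5 split from the example and a "model" that always predicts the majority class.

```python
# 1,000 emails, 5% spam.
y_true = ["spam"] * 50 + ["not spam"] * 950
# A "model" that ignores its input and always predicts the majority class.
y_pred = ["not spam"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_caught = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_pred))
spam_recall = spam_caught / y_true.count("spam")

print(f"accuracy:    {accuracy:.2f}")     # 0.95
print(f"spam recall: {spam_recall:.2f}")  # 0.00
```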

Better metrics:

Precision — of all items the model labeled as X, how many actually were X? High precision = few false alarms.
Recall — of all actual X items, how many did the model find? High recall = few missed cases.
F1 score — harmonic mean of precision and recall. Balances both concerns.
Confusion matrix — shows exactly where the model makes mistakes across all classes.
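These definitions translate directly into code. The sketch below computes them from scratch on ten made-up predictions; the function names are illustrative, and in practice scikit-learn's metrics module provides equivalent, better-tested functions.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs; this is the confusion matrix in sparse form."""
    return Counter(zip(y_true, y_pred))


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    cm = confusion_counts(y_true, y_pred)
    tp = cm[(positive, positive)]
    fp = sum(n for (actual, pred), n in cm.items() if pred == positive and actual != positive)
    fn = sum(n for (actual, pred), n in cm.items() if actual == positive and pred != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham", "ham"]

cm = confusion_counts(y_true, y_pred)
print(cm)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="spam")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here one real email was flagged as spam (a false alarm, lowering precision) and one spam message slipped through (a missed case, lowering recall), so both come out at 0.75.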

Which metric to prioritize depends on the cost of errors. Spam filtering prioritizes precision (do not put real emails in spam). Medical triage prioritizes recall (do not miss a sick patient).

Handling Imbalanced Data

Real-world datasets are rarely balanced. Approaches include:

Class weights: tell the model to penalize mistakes on rare classes more heavily. Many scikit-learn classifiers (for example LogisticRegression, LinearSVC, and RandomForestClassifier) accept a class_weight='balanced' parameter; the Naive Bayes classifiers do not.
Oversampling: duplicate examples from the minority class (SMOTE creates synthetic examples rather than exact copies).
Undersampling: reduce the majority class. Simple but throws away data.
Stratified splitting: ensure train/test splits maintain the original class ratio.
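Two of these approaches are small enough to sketch by hand. The first function computes weights with the n_samples / (n_classes × class_count) heuristic that scikit-learn documents for class_weight='balanced'; the second does plain random oversampling (exact copies, not SMOTE). Function names and the 9-to-1 toy data are assumptions for illustration.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


labels = ["ham"] * 9 + ["spam"] * 1
texts = [f"message {i}" for i in range(len(labels))]

weights = balanced_class_weights(labels)
print(weights)

resampled_texts, resampled_labels = random_oversample(texts, labels)
print(Counter(resampled_labels))
```

Resampling should be applied to the training split only, after the train/test split, so that copies of a minority example cannot end up in both sets and inflate the test scores.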

Multi-label vs Multi-class

Multi-class: each document gets exactly one label (spam/ham, topic A/B/C).
Multi-label: each document can get multiple labels (a news article can be both “politics” and “economy”).

Multi-label requires different approaches: binary relevance (one classifier per label), classifier chains, or models that natively support multi-label output.
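Binary relevance starts by turning each document's label set into one yes/no column per label. The sketch below shows only that transformation, with made-up label sets and an illustrative helper name; scikit-learn offers MultiLabelBinarizer and OneVsRestClassifier for the same purpose.

```python
def to_indicator_matrix(label_sets):
    """Turn per-document label sets into one binary column per label."""
    all_labels = sorted(set().union(*label_sets))
    matrix = [[int(label in labels) for label in all_labels] for labels in label_sets]
    return all_labels, matrix


# Made-up article labels, for illustration only.
label_sets = [
    {"politics", "economy"},
    {"sports"},
    {"economy"},
    set(),  # a document may also carry no labels at all
]

all_labels, matrix = to_indicator_matrix(label_sets)
print(all_labels)  # ['economy', 'politics', 'sports']
for row in matrix:
    print(row)

# Binary relevance: each column is the target of its own yes/no classifier.
for j, label in enumerate(all_labels):
    print(label, [row[j] for row in matrix])
```

Training one independent classifier per column is simple but ignores correlations between labels, which is the gap classifier chains try to close.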

Common Misunderstanding

People often jump to deep learning for text classification. In practice, a well-tuned TF-IDF + Logistic Regression model often beats a poorly configured BERT model. Deep learning shines when you have millions of labeled examples and need to capture subtle semantic nuances. For most business problems with thousands of labeled examples, traditional ML is faster to develop, easier to debug, and cheaper to run.

The one thing to remember: Text classification is a pipeline — preprocessing, features, model, evaluation — and the quality of each step matters more than which specific algorithm you choose at step three.
