Text Classification in Python — Deep Dive
Text classification is deceptively simple in concept but full of practical traps. This guide walks through building classifiers at multiple complexity levels, with code you can adapt to production systems.
Baseline: TF-IDF + Logistic Regression
Always start here. This combination trains in seconds, requires no GPU, and is hard to beat on datasets under 100k examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# texts: list of strings, labels: list of category names
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
vectorizer = TfidfVectorizer(
    max_features=50000,
    ngram_range=(1, 2),   # unigrams + bigrams
    min_df=3,             # ignore words in fewer than 3 docs
    max_df=0.95,          # ignore words in more than 95% of docs
    sublinear_tf=True     # apply log normalization to term frequency
)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
clf = LogisticRegression(
    C=1.0,
    max_iter=1000,
    class_weight='balanced',  # handles imbalanced classes
    solver='lbfgs',
    n_jobs=-1
)
clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))
Key tuning knobs:
- ngram_range=(1, 2) captures phrases like “not good” that unigrams miss.
- sublinear_tf=True applies 1 + log(tf) instead of raw counts, reducing the impact of word frequency outliers.
- C in LogisticRegression controls regularization strength. Lower C = more regularization. Grid search between 0.01 and 10.
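To make the sublinear_tf point concrete, the short sketch below compares raw term counts with the 1 + log(tf) scaling. The helper name sublinear_tf is only illustrative, the logarithm is assumed to be the natural log, and IDF weighting and normalization are left out.

```python
import math

def sublinear_tf(raw_count):
    """Return 1 + ln(tf) for a positive count, and 0 for an absent term."""
    return 1.0 + math.log(raw_count) if raw_count > 0 else 0.0

# Compare raw counts with their dampened values
for count in (1, 10, 100):
    print(count, round(sublinear_tf(count), 2))
```

Under this scaling, a term that appears 100 times carries about 5.6 times the weight of a term that appears once, rather than 100 times.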
Feature Engineering Beyond TF-IDF
Custom Features
Sometimes domain knowledge beats algorithmic sophistication:
import numpy as np
from scipy.sparse import hstack
def extract_meta_features(texts):
    """Return one row of simple style features per document."""
    features = []
    for text in texts:
        features.append([
            len(text),                                                 # document length
            text.count('!'),                                           # exclamation marks
            text.count('?'),                                           # question marks
            sum(1 for c in text if c.isupper()) / max(len(text), 1),   # caps ratio
            len(text.split()),                                         # word count
        ])
    return np.array(features)
meta_train = extract_meta_features(X_train)
meta_test = extract_meta_features(X_test)
# Combine with TF-IDF; request CSR output so the result supports row slicing.
# The raw counts are on a much larger scale than TF-IDF values, so consider scaling them first.
X_train_combined = hstack([X_train_tfidf, meta_train], format='csr')
X_test_combined = hstack([X_test_tfidf, meta_test], format='csr')
These meta-features help when the style of text matters (formal vs. informal, short vs. long).
Character N-grams
For tasks where spelling patterns matter (language detection, authorship attribution), character n-grams outperform word-level features:
char_vectorizer = TfidfVectorizer(
    analyzer='char_wb',   # character n-grams within word boundaries
    ngram_range=(3, 5),
    max_features=50000
)
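To see what features analyzer='char_wb' roughly produces, here is a pure-Python approximation. The function char_wb_ngrams is a hypothetical helper written for illustration, not scikit-learn's implementation: it pads each word with a space on both sides and slides a window over it, and as a simplification it skips window sizes longer than the padded word.

```python
def char_wb_ngrams(text, min_n=3, max_n=5):
    """Approximate word-boundary character n-grams for a piece of text."""
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "  # spaces mark the word boundaries
        for n in range(min_n, max_n + 1):
            if len(padded) < n:
                continue  # simplification: skip windows longer than the padded word
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams

print(char_wb_ngrams("The cat", 3, 3))
```

Because n-grams never cross a word boundary, features such as " th" and "he " capture prefixes and suffixes, which is often what language detection and authorship tasks rely on.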
Hyperparameter Optimization
Use RandomizedSearchCV for efficient tuning:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from scipy.stats import loguniform, randint
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])
param_dist = {
    'tfidf__max_features': randint(10000, 100000),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__min_df': randint(1, 10),
    'tfidf__sublinear_tf': [True, False],
    # C spans orders of magnitude, so sample it on a log scale over [0.01, 10].
    # scipy's uniform(0.01, 10) means loc=0.01, scale=10, i.e. [0.01, 10.01] on a linear scale.
    'clf__C': loguniform(0.01, 10),
}
search = RandomizedSearchCV(
    pipe, param_dist, n_iter=50, cv=5,
    scoring='f1_macro', n_jobs=-1, random_state=42
)
search.fit(X_train, y_train)
print(f"Best F1: {search.best_score_:.4f}")
print(search.best_params_)
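Since C typically matters across orders of magnitude, how it is sampled affects which regions the search explores. The sketch below uses only the standard library to show log-scale sampling over the 0.01 to 10 range; sample_log_uniform is a hypothetical helper, and scipy.stats.loguniform provides this kind of distribution in a form RandomizedSearchCV can consume.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
samples = [sample_log_uniform(0.01, 10, rng) for _ in range(1000)]

# Share of draws in the strongly regularized region C < 1
below_one = sum(s < 1.0 for s in samples) / len(samples)
print(f"share of samples below 1.0: {below_one:.2f}")
```

On a log scale about two thirds of the draws land below 1.0, because that region covers two of the three decades. A plain uniform distribution over the same range would put only about 10% of draws there.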
Multi-label Classification
When documents can have multiple labels:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train_bin = mlb.fit_transform(y_train_multilabel) # list of label lists
y_test_bin = mlb.transform(y_test_multilabel)
clf = OneVsRestClassifier(
    LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')
)
clf.fit(X_train_tfidf, y_train_bin)
y_pred_bin = clf.predict(X_test_tfidf)
predicted_labels = mlb.inverse_transform(y_pred_bin)
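The binarization step is easier to follow with a small example. The sketch below mimics, in plain Python, the kind of indicator matrix that MultiLabelBinarizer builds; binarize_labels and the label names are made up for illustration.

```python
def binarize_labels(label_lists):
    """Turn lists of labels into a sorted class list and a 0/1 indicator matrix."""
    classes = sorted({label for labels in label_lists for label in labels})
    index = {label: i for i, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        matrix.append(row)
    return classes, matrix

classes, matrix = binarize_labels([["billing", "urgent"], ["shipping"], []])
print(classes)
print(matrix)
```

Each column then becomes an independent binary problem for OneVsRestClassifier, and a document with no labels is simply a row of zeros.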
Transformer-Based Classification
When you need maximum accuracy and have GPU resources:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import Dataset
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder
import numpy as np
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The model expects integer class ids, so encode the string labels first
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train).tolist()
y_test_encoded = label_encoder.transform(y_test).tolist()
num_classes = len(label_encoder.classes_)
# Prepare dataset
train_dataset = Dataset.from_dict({"text": X_train, "label": y_train_encoded})
test_dataset = Dataset.from_dict({"text": X_test, "label": y_test_encoded})
def tokenize_fn(examples):
    # Truncate and pad every example to a fixed 256 tokens
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)
train_dataset = train_dataset.map(tokenize_fn, batched=True)
test_dataset = test_dataset.map(tokenize_fn, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_classes
)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",   # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
def compute_metrics(eval_pred):
    # Macro F1 over the argmax of the logits
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # Ideally pass a separate validation split here: picking the best checkpoint
    # on the test set makes the final test score optimistic.
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
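DistilBERT is a good starting point: it is about 40% smaller than BERT-base and, according to its authors, retains about 97% of BERT's performance on language-understanding benchmarks. For documents longer than the model's input limit (512 tokens for DistilBERT; the example above truncates at 256), consider Longformer or chunking strategies.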
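One possible chunking strategy is to split a long document into overlapping token windows, classify each window, and average the logits. The sketch below shows only the windowing and averaging logic on plain lists; chunk_token_ids and average_logits are hypothetical helpers, the window and overlap sizes are placeholders, and special tokens such as [CLS] and [SEP] are not handled. Hugging Face tokenizers also offer stride and return_overflowing_tokens options that cover the windowing part.

```python
def chunk_token_ids(token_ids, max_length=256, stride=64):
    """Split token ids into windows of max_length that overlap by `stride` tokens."""
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the current window already reaches the end of the document
    return chunks

def average_logits(per_chunk_logits):
    """Average class logits across chunks to get one document-level score per class."""
    n = len(per_chunk_logits)
    return [sum(column) / n for column in zip(*per_chunk_logits)]

chunks = chunk_token_ids(list(range(600)), max_length=256, stride=64)
print([len(c) for c in chunks])
print(average_logits([[1.0, 3.0], [3.0, 5.0]]))
```

Averaging is only one aggregation choice; taking the maximum per class can work better when a single passage determines the label.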
Deployment and Inference Optimization
Scikit-learn Models
import joblib
# Save
joblib.dump({'vectorizer': vectorizer, 'classifier': clf}, 'model.joblib')
# Load and predict
bundle = joblib.load('model.joblib')
text = "Your account has been suspended"
features = bundle['vectorizer'].transform([text])
prediction = bundle['classifier'].predict(features)[0]
confidence = bundle['classifier'].predict_proba(features).max()
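The confidence value computed above can drive a simple rejection rule, so that ambiguous inputs are routed to a fallback instead of receiving a forced label. This is a minimal sketch: predict_or_abstain is a hypothetical helper, the 0.7 threshold is a placeholder to tune on held-out data, and the probabilities are assumed to be reasonably calibrated.

```python
def predict_or_abstain(class_names, probabilities, threshold=0.7):
    """Return (label, confidence), with label None when confidence is below threshold."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    confidence = probabilities[best]
    if confidence < threshold:
        return None, confidence  # abstain: send to human review or a default path
    return class_names[best], confidence

print(predict_or_abstain(["spam", "ham"], [0.9, 0.1]))
print(predict_or_abstain(["spam", "ham"], [0.55, 0.45]))
```

With the scikit-learn bundle, class_names would correspond to bundle['classifier'].classes_ and probabilities to one row of predict_proba.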
Transformer Models with ONNX
For 3-5× inference speedup without GPU:
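pip install "optimum[onnxruntime]"
# ./results/checkpoint-best is a placeholder: point --model at the checkpoint directory your training run actually saved
optimum-cli export onnx --model ./results/checkpoint-best --task text-classification model_onnx/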
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
model = ORTModelForSequenceClassification.from_pretrained("model_onnx")
tokenizer = AutoTokenizer.from_pretrained("model_onnx")
inputs = tokenizer("Sample text", return_tensors="np")
outputs = model(**inputs)
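The model returns raw logits rather than probabilities, so a softmax is usually the next step. The sketch below applies a numerically stable softmax to one example's logits using only the standard library; the logit values are made up, and in practice they would come from outputs.logits.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5, -1.0, 0.1])  # illustrative logits for 4 classes
predicted_class = probs.index(max(probs))
print(predicted_class, [round(p, 3) for p in probs])
```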
Benchmarks: Accuracy vs. Speed
Tested on the AG News dataset (120k training, 7.6k test, 4 classes):
| Model | F1 (macro) | Training Time | Inference (1k docs) |
|---|---|---|---|
| TF-IDF + Logistic Regression | 0.92 | 8 sec | 0.1 sec |
| TF-IDF + SVM (linear) | 0.92 | 15 sec | 0.1 sec |
| TF-IDF + Random Forest | 0.88 | 45 sec | 0.5 sec |
| DistilBERT fine-tuned | 0.94 | 25 min (GPU) | 12 sec (CPU) |
| RoBERTa fine-tuned | 0.95 | 45 min (GPU) | 18 sec (CPU) |
The jump from TF-IDF + LR to DistilBERT is only 2 F1 points — often not worth the 100× increase in inference cost.
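Timings like those in the table depend heavily on hardware and batch size, so it is worth measuring on your own setup. Here is a small timing harness as a sketch: time_inference and dummy_predict are hypothetical names, and dummy_predict stands in for a real call such as a vectorizer transform followed by predict.

```python
import time

def time_inference(predict_fn, docs, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(docs)
        best = min(best, time.perf_counter() - start)
    return best

def dummy_predict(docs):
    # Placeholder model: replace with your own prediction function
    return [len(doc) % 4 for doc in docs]

docs = ["sample document"] * 1000
elapsed = time_inference(dummy_predict, docs)
print(f"{elapsed:.4f} sec per 1k docs")
```

Taking the best of several runs reduces noise from warm-up and background processes.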
Error Analysis
After training, always analyze errors:
import pandas as pd
errors = pd.DataFrame({
    'text': X_test,
    'true': y_test,
    'predicted': y_pred,
    'correct': y_test == y_pred
})
# Focus on misclassifications
misclassified = errors[~errors['correct']].sort_values('true')
print(misclassified.groupby(['true', 'predicted']).size())
Common patterns in errors:
- Short texts with ambiguous wording.
- Sarcasm or negation (“This is not great” classified as positive).
- Out-of-domain examples not represented in training data.
- Label noise — training examples that were labeled incorrectly.
Fix label noise first. It has more impact than changing algorithms.
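One way to find candidate label errors is to flag examples where the model confidently disagrees with the given label. The sketch below assumes you already have per-class probabilities for each training example, ideally from out-of-fold predictions so the model has not seen the example it is scoring. flag_suspect_labels, the 0.9 cutoff, and the sample data are illustrative.

```python
def flag_suspect_labels(given_labels, predicted_probs, class_names, min_confidence=0.9):
    """Return (index, given, predicted, confidence) for confident disagreements."""
    suspects = []
    for i, (given, probs) in enumerate(zip(given_labels, predicted_probs)):
        best = max(range(len(probs)), key=probs.__getitem__)
        if class_names[best] != given and probs[best] >= min_confidence:
            suspects.append((i, given, class_names[best], probs[best]))
    # Most confident disagreements first, so reviewers start with the likeliest errors
    return sorted(suspects, key=lambda s: s[3], reverse=True)

class_names = ["neg", "pos"]
given = ["pos", "neg", "pos"]
probs = [[0.05, 0.95], [0.03, 0.97], [0.60, 0.40]]
suspects = flag_suspect_labels(given, probs, class_names)
print(suspects)
```

The flagged rows are candidates for manual review, not confirmed errors: a confident disagreement can also mean the model is wrong.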
Common Pitfalls
- Data leakage. Fitting the vectorizer on the full dataset (including test) inflates scores. Always fit_transform on train, transform on test.
- Ignoring class distribution. Plot label frequencies before training. If the rarest class has fewer than 50 examples, consider collecting more data before building a model.
- Over-engineering preprocessing. Aggressive stemming, emoji removal, and case folding can hurt as much as help. Measure the impact of each step.
- Deploying without confidence thresholds. In production, reject predictions below a confidence threshold rather than forcing a label on ambiguous inputs.
- Not versioning data alongside models. When you retrain with new data, you need to reproduce previous results. Version your training data, not just your model files.
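The class-distribution check from the list above is quick to automate before any training run. This is a minimal sketch with the standard library: rare_classes is a hypothetical helper, the 50-example cutoff follows the rule of thumb above, and the label data is made up.

```python
from collections import Counter

def rare_classes(labels, min_examples=50):
    """Return classes with fewer than min_examples examples, with their counts."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_examples}

labels = ["sports"] * 400 + ["business"] * 120 + ["science"] * 12
flagged = rare_classes(labels)
print(flagged)
```

An empty result means every class clears the cutoff; a non-empty one is a prompt to collect more data or merge classes before tuning models.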
The one thing to remember: Start with TF-IDF + Logistic Regression, measure carefully, and only add complexity when the baseline’s errors point to a specific limitation that a more complex model would address.