Text Classification in Python — Deep Dive

Build, tune, and deploy text classifiers in Python — from TF-IDF baselines through transformer fine-tuning with real code and benchmarks.

Text classification is deceptively simple in concept but full of practical traps. This guide walks through building classifiers at multiple complexity levels, with code you can adapt to production systems.

Baseline: TF-IDF + Logistic Regression

Always start here. This combination trains in seconds, requires no GPU, and is hard to beat on datasets under 100k examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# texts: list of strings, labels: list of category names
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

vectorizer = TfidfVectorizer(
    max_features=50000,
    ngram_range=(1, 2),    # unigrams + bigrams
    min_df=3,              # ignore words in fewer than 3 docs
    max_df=0.95,           # ignore words in more than 95% of docs
    sublinear_tf=True      # apply log normalization to term frequency
)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

clf = LogisticRegression(
    C=1.0,
    max_iter=1000,
    class_weight='balanced',  # reweights classes inversely to their frequency
    solver='lbfgs'
)
clf.fit(X_train_tfidf, y_train)

y_pred = clf.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))

Key tuning knobs:

ngram_range=(1, 2) captures phrases like “not good” that unigrams miss.
sublinear_tf=True applies 1 + log(tf) instead of raw counts, reducing the impact of word frequency outliers.
C in LogisticRegression controls regularization strength. Lower C = more regularization. Search on a log scale between 0.01 and 10.

Feature Engineering Beyond TF-IDF

Custom Features

Sometimes domain knowledge beats algorithmic sophistication:

import numpy as np
from scipy.sparse import hstack

def extract_meta_features(texts):
    features = []
    for text in texts:
        features.append([
            len(text),                           # document length
            text.count('!'),                     # exclamation marks
            text.count('?'),                     # question marks
            sum(1 for c in text if c.isupper()) / max(len(text), 1),  # caps ratio
            len(text.split()),                   # word count
        ])
    return np.array(features)

meta_train = extract_meta_features(X_train)
meta_test = extract_meta_features(X_test)

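# Combine with TF-IDF. Raw counts such as len(text) are unscaled, so consider scaling them first.
# format='csr' keeps the combined matrix row-sliceable for cross-validation.
X_train_combined = hstack([X_train_tfidf, meta_train], format='csr')
X_test_combined = hstack([X_test_tfidf, meta_test], format='csr')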

These meta-features help when the style of text matters (formal vs. informal, short vs. long).
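One caveat worth sketching: raw counts such as document length sit on a much larger scale than TF-IDF weights, which lie between 0 and 1. That can slow solver convergence and let a single column dominate. A possible remedy, which is an assumption here and not part of the original recipe, is to standardize the meta-features before stacking. The tiny matrices below are stand-ins for the real ones.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Stand-ins for the real matrices: a tiny sparse "TF-IDF" block and raw meta-features
tfidf_train = csr_matrix(np.array([[0.0, 0.7], [0.5, 0.0], [0.3, 0.4]]))
tfidf_test = csr_matrix(np.array([[0.2, 0.1]]))
raw_meta_train = np.array([[120.0, 1.0], [3400.0, 0.0], [560.0, 4.0]])  # e.g. length, '!' count
raw_meta_test = np.array([[800.0, 2.0]])

# Fit the scaler on training data only, then reuse it for the test data
scaler = StandardScaler()
meta_train_scaled = scaler.fit_transform(raw_meta_train)
meta_test_scaled = scaler.transform(raw_meta_test)

# Stack the sparse text features with the scaled dense features
combined_train = hstack([tfidf_train, csr_matrix(meta_train_scaled)], format="csr")
combined_test = hstack([tfidf_test, csr_matrix(meta_test_scaled)], format="csr")
print(combined_train.shape, combined_test.shape)
```

Fitting the scaler on the training split only follows the same rule as fitting the vectorizer: nothing about the test set should influence preprocessing.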

Character N-grams

For tasks where spelling patterns matter (language detection, authorship attribution), character n-grams often outperform word-level features:

char_vectorizer = TfidfVectorizer(
    analyzer='char_wb',    # character n-grams within word boundaries
    ngram_range=(3, 5),
    max_features=50000
)
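To make the char_wb analyzer concrete, this small sketch (using trigrams only, to keep the output short) shows the n-grams produced for two spellings of the same word and which ones they share.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
analyze = char_vec.build_analyzer()

# With char_wb, each word is padded with a space on both sides before n-grams are cut
grams_color = set(analyze("color"))
grams_colour = set(analyze("colour"))

shared = sorted(grams_color & grams_colour)
print(shared)
```

Because the two spellings overlap in several trigrams, they land near each other in feature space, whereas a word-level vectorizer would treat them as unrelated tokens.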

Hyperparameter Optimization

Use RandomizedSearchCV for efficient tuning:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from scipy.stats import loguniform, randint

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

param_dist = {
    'tfidf__max_features': randint(10000, 100000),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__min_df': randint(1, 10),
    'tfidf__sublinear_tf': [True, False],
    'clf__C': loguniform(0.01, 10),   # sample C on a log scale between 0.01 and 10
}

search = RandomizedSearchCV(
    pipe, param_dist, n_iter=50, cv=5,
    scoring='f1_macro', n_jobs=-1, random_state=42
)
search.fit(X_train, y_train)
print(f"Best F1: {search.best_score_:.4f}")
print(search.best_params_)
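The choice of sampling distribution for C matters more than it looks. In SciPy, uniform(loc, scale) spreads draws evenly over [loc, loc + scale], so strongly regularized values are rarely tried, while loguniform(a, b) gives each order of magnitude equal weight. The sketch below compares the two; the sample size and seed are arbitrary.

```python
import numpy as np
from scipy.stats import loguniform, uniform

n_samples = 10_000

# uniform(loc, scale) covers [loc, loc + scale], here [0.01, 10.01]
uniform_draws = uniform(0.01, 10).rvs(size=n_samples, random_state=0)
# loguniform(a, b) covers [a, b] with equal mass per order of magnitude
log_draws = loguniform(0.01, 10).rvs(size=n_samples, random_state=0)

frac_uniform_small = float(np.mean(uniform_draws < 1.0))
frac_log_small = float(np.mean(log_draws < 1.0))
print(f"uniform:    {frac_uniform_small:.2f} of draws below 1.0")
print(f"loguniform: {frac_log_small:.2f} of draws below 1.0")
```

With a budget of 50 iterations, a uniform distribution would spend only a handful of trials on C < 1, which is often where the best value for sparse text features sits.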

Multi-label Classification

When documents can have multiple labels:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_train_bin = mlb.fit_transform(y_train_multilabel)  # list of label lists
y_test_bin = mlb.transform(y_test_multilabel)

clf = OneVsRestClassifier(
    LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')
)
clf.fit(X_train_tfidf, y_train_bin)

y_pred_bin = clf.predict(X_test_tfidf)
predicted_labels = mlb.inverse_transform(y_pred_bin)
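If the binarized target is unfamiliar, this toy example (labels invented for illustration) shows what MultiLabelBinarizer produces and how inverse_transform maps rows back to label tuples.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists, invented for illustration; the last document has no labels
toy_label_lists = [["billing", "urgent"], ["shipping"], ["billing"], []]

toy_mlb = MultiLabelBinarizer()
toy_bin = toy_mlb.fit_transform(toy_label_lists)

print(toy_mlb.classes_)   # columns follow the sorted label names
print(toy_bin)            # one row per document, one 0/1 column per label

# Round trip: each row maps back to a tuple of label names
recovered = toy_mlb.inverse_transform(toy_bin)
print(recovered)
```

Each column becomes an independent binary problem for OneVsRestClassifier, which is why a document can end up with several labels or none.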

Transformer-Based Classification

When you need maximum accuracy and have GPU resources:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import Dataset
import numpy as np

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

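# Encode string labels as integer ids, which the model expects
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train).tolist()
y_test_encoded = label_encoder.transform(y_test).tolist()
num_classes = len(label_encoder.classes_)

# Prepare dataset
train_dataset = Dataset.from_dict({"text": X_train, "label": y_train_encoded})
test_dataset = Dataset.from_dict({"text": X_test, "label": y_test_encoded})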

def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)

train_dataset = train_dataset.map(tokenize_fn, batched=True)
test_dataset = test_dataset.map(tokenize_fn, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_classes
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",       # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

def compute_metrics(eval_pred):
    from sklearn.metrics import f1_score
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,  # for checkpoint selection, prefer a separate validation split
    compute_metrics=compute_metrics,
)
trainer.train()

DistilBERT is a good starting point: it is about 40% smaller than BERT-base and retains roughly 97% of its language-understanding performance. For documents longer than the model's 512-token limit, consider Longformer or a chunking strategy.
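The chunking idea can be sketched without any model at all. The helper names below (chunk_tokens, average_scores) are hypothetical, and averaging per-chunk scores is only one of several possible aggregation choices; treat this as an illustration under those assumptions.

```python
from typing import List, Sequence

def chunk_tokens(token_ids: Sequence[int], max_length: int = 256, stride: int = 64) -> List[List[int]]:
    """Split a long token sequence into overlapping windows.

    Consecutive windows share `stride` tokens, so text cut at a
    boundary still appears intact in one of the neighbours.
    """
    if max_length <= 0 or not 0 <= stride < max_length:
        raise ValueError("need max_length > 0 and 0 <= stride < max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + max_length]))
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks

def average_scores(chunk_scores: List[List[float]]) -> List[float]:
    """Average per-class scores across chunks (one simple aggregation choice)."""
    n_chunks = len(chunk_scores)
    return [sum(column) / n_chunks for column in zip(*chunk_scores)]

# Hypothetical example: 10 token ids, windows of 4 with an overlap of 2
example_chunks = chunk_tokens(list(range(10)), max_length=4, stride=2)
print(example_chunks)

# Hypothetical per-chunk class probabilities for a two-class problem
doc_scores = average_scores([[0.2, 0.8], [0.6, 0.4]])
print(doc_scores)
```

In practice each chunk would be classified separately and the per-chunk probabilities combined. Taking the maximum instead of the mean is a reasonable alternative when a single passage is enough to decide the label.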

Deployment and Inference Optimization

Scikit-learn Models

import joblib

# Save
joblib.dump({'vectorizer': vectorizer, 'classifier': clf}, 'model.joblib')

# Load and predict
bundle = joblib.load('model.joblib')
text = "Your account has been suspended"
features = bundle['vectorizer'].transform([text])
prediction = bundle['classifier'].predict(features)[0]
confidence = bundle['classifier'].predict_proba(features).max()
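The confidence value computed above can drive a reject option. The sketch below wraps that idea in a hypothetical helper, predict_with_reject; the toy data and the 0.55 threshold are placeholders, and a real threshold should be chosen on validation data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def predict_with_reject(classifier, features, threshold=0.7):
    """Return a label per row, or None when the top probability is below threshold."""
    proba = classifier.predict_proba(features)
    top_idx = proba.argmax(axis=1)
    top_proba = proba.max(axis=1)
    labels = classifier.classes_[top_idx]
    return [label if p >= threshold else None for label, p in zip(labels, top_proba)]

# Toy data, invented for illustration
toy_texts = ["refund my order", "refund please", "reset my password", "password reset link"]
toy_labels = ["billing", "billing", "account", "account"]

toy_vectorizer = TfidfVectorizer()
toy_clf = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

queries = toy_vectorizer.transform(["refund", "hello there"])
print(predict_with_reject(toy_clf, queries, threshold=0.55))
```

Rows that come back as None can be routed to a fallback such as human review or a default queue instead of receiving a forced label.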

Transformer Models with ONNX

For a roughly 3-5× inference speedup on CPU:

pip install "optimum[onnxruntime]"
# Replace checkpoint-best with the checkpoint directory that Trainer actually wrote
optimum-cli export onnx --model ./results/checkpoint-best --task text-classification model_onnx/

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model = ORTModelForSequenceClassification.from_pretrained("model_onnx")
tokenizer = AutoTokenizer.from_pretrained("model_onnx")

inputs = tokenizer("Sample text", return_tensors="np")
outputs = model(**inputs)
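The model call returns raw logits, not a label. A minimal NumPy sketch for turning logits into a label and a probability follows. The example logits and the label mapping are hypothetical stand-ins for what would normally come from outputs.logits and the model's config.

```python
import numpy as np

def logits_to_prediction(logits, id2label):
    """Convert logits of shape (batch, num_classes) into (label, probability) pairs."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum before exponentiating for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    best = probs.argmax(axis=-1)
    return [(id2label[int(i)], float(row[i])) for i, row in zip(best, probs)]

# Hypothetical logits and label mapping, used here in place of real model output
example_logits = [[2.0, 0.5, -1.0, 0.1]]
example_id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

predictions = logits_to_prediction(example_logits, example_id2label)
print(predictions)
```

The softmax probability is what a confidence threshold would be applied to in a production setting.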

Benchmarks: Accuracy vs. Speed

Tested on the AG News dataset (120k training, 7.6k test, 4 classes):

Model	F1 (macro)	Training Time	Inference (1k docs)
TF-IDF + Logistic Regression	0.92	8 sec	0.1 sec
TF-IDF + SVM (linear)	0.92	15 sec	0.1 sec
TF-IDF + Random Forest	0.88	45 sec	0.5 sec
DistilBERT fine-tuned	0.94	25 min (GPU)	12 sec (CPU)
RoBERTa fine-tuned	0.95	45 min (GPU)	18 sec (CPU)

The jump from TF-IDF + LR to DistilBERT is only 2 F1 points (0.92 to 0.94), which is often not worth the roughly 100× increase in CPU inference time (0.1 sec vs. 12 sec per 1k docs).

Error Analysis

After training, always analyze errors:

import pandas as pd

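errors = pd.DataFrame({
    'text': X_test,
    'true': y_test,
    'predicted': y_pred,
})
# Compare the columns after construction so list and array inputs behave the same
errors['correct'] = errors['true'] == errors['predicted']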

# Focus on misclassifications
misclassified = errors[~errors['correct']].sort_values('true')
print(misclassified.groupby(['true', 'predicted']).size())

Common patterns in errors:

Short texts with ambiguous wording.
Sarcasm or negation (“This is not great” classified as positive).
Out-of-domain examples not represented in training data.
Label noise — training examples that were labeled incorrectly.

Fix label noise first. It has more impact than changing algorithms.
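One simple heuristic for surfacing likely label noise, offered here as a sketch and not as a definitive method, is to score every training example with a model that never saw it (out-of-fold prediction) and review the examples whose given label receives the lowest probability. The function name rank_suspect_labels and the toy data are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

def rank_suspect_labels(texts, labels, cv=3):
    """Return (indices sorted from most to least suspicious, probability of the given label)."""
    labels = np.asarray(labels)
    pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    # Out-of-fold probabilities: each example is scored by a model trained without it
    proba = cross_val_predict(pipe, texts, labels, cv=cv, method="predict_proba")
    classes = np.unique(labels)                 # sorted, matching predict_proba columns
    label_col = np.searchsorted(classes, labels)
    given_label_proba = proba[np.arange(len(labels)), label_col]
    return np.argsort(given_label_proba), given_label_proba

# Toy data, invented for illustration; the last label is deliberately wrong
noisy_texts = [
    "great product love it", "love this great item", "really great love it",
    "great would love another", "terrible broken waste", "broken and terrible",
    "waste of money terrible", "terrible broken on arrival", "love it great quality",
]
noisy_labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg"]

suspect_order, suspect_proba = rank_suspect_labels(noisy_texts, noisy_labels)
for idx in suspect_order[:3]:
    print(f"{suspect_proba[idx]:.2f}  {noisy_labels[idx]}  {noisy_texts[idx]}")
```

The ranking only proposes candidates for manual review; a low score can also mean the example is simply hard or unusual.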

Common Pitfalls

Data leakage. Fitting the vectorizer on the full dataset (including test) inflates scores. Always fit_transform on train, transform on test.
Ignoring class distribution. Plot label frequencies before training. If the rarest class has fewer than 50 examples, consider collecting more data before building a model.
Over-engineering preprocessing. Aggressive stemming, emoji removal, and case folding can hurt as much as help. Measure the impact of each step.
Deploying without confidence thresholds. In production, reject predictions below a confidence threshold rather than forcing a label on ambiguous inputs.
Not versioning data alongside models. When you retrain with new data, you need to reproduce previous results. Version your training data, not just your model files.
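The data-leakage pitfall is easiest to avoid structurally: keep the vectorizer inside a Pipeline so that cross-validation refits it on each fold's training portion. A minimal sketch with an invented toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration
cv_texts = [
    "stocks fell sharply today", "markets rally on earnings",
    "shares drop after report", "investors buy tech stocks",
    "team wins the final match", "striker scores twice in win",
    "coach praises the team", "match ends in a draw",
]
cv_labels = ["business"] * 4 + ["sports"] * 4

# The vectorizer is a pipeline step, so it never sees a fold's held-out documents
leak_free_pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(leak_free_pipe, cv_texts, cv_labels, cv=2, scoring="f1_macro")
print(cv_scores.mean())
```

The same pipeline object can be saved with joblib as a single artifact, which also removes the risk of pairing a classifier with the wrong vectorizer at inference time.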

The one thing to remember: Start with TF-IDF + Logistic Regression, measure carefully, and only add complexity when the baseline’s errors point to a specific limitation that a more complex model would address.
