Python Intent Classification — Deep Dive
Why Intent Classification Is the Bottleneck
In a task-oriented chatbot, the intent classifier is the single point of failure. A dialog manager can recover from a missing entity, but if the intent is wrong, the entire conversation derails. Production chatbot teams typically spend more time improving intent classification than any other component.
Building a Baseline with scikit-learn
Data Preparation
Start with a labeled dataset in a simple format:
import json
from sklearn.model_selection import train_test_split

# training_data.json: [{"text": "book a flight", "intent": "book_flight"}, ...]
with open("training_data.json") as f:
    data = json.load(f)

texts = [d["text"] for d in data]
labels = [d["intent"] for d in data]

# Stratify so each intent keeps the same share in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
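Before splitting, it can help to check how many examples each intent has: a stratified split needs at least two examples per class, and thin classes are harder to learn. The following is a minimal sketch, not part of the original pipeline. The helper name `sparse_intents`, the sample labels, and the `min_examples` cutoff are all assumptions chosen for illustration.

```python
from collections import Counter

def sparse_intents(labels, min_examples=50):
    """Return {intent: count} for intents with fewer than min_examples examples."""
    counts = Counter(labels)
    return {intent: n for intent, n in counts.items() if n < min_examples}

# Hypothetical labels standing in for the `labels` list loaded from JSON.
sample_labels = ["book_flight"] * 60 + ["cancel_booking"] * 12 + ["greet"]
thin = sparse_intents(sample_labels)
print(thin)
```

Intents that show up in this report are candidates for more data collection before any model comparison.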
TF-IDF + Logistic Regression Pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams and bigrams
        max_features=10000,
        sublinear_tf=True,    # use 1 + log(tf) instead of raw term counts
    )),
    ("clf", LogisticRegression(
        C=5.0,
        max_iter=1000,
        class_weight="balanced",  # weight classes inversely to their frequency
    )),
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
This baseline typically achieves 85-92% accuracy on well-separated intents with 50+ examples per class. It trains in seconds and serves predictions in under 1ms — making it a strong production choice for small to medium intent sets.
When the Baseline Fails
TF-IDF struggles with:
- Synonyms: “cancel” vs. “terminate” vs. “get rid of” are unrelated tokens.
- Short messages: One or two words provide minimal signal.
- Overlapping intents: “Change my booking” could be modify_booking or cancel_and_rebook.
These failures motivate embedding-based approaches.
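The synonym problem can be made concrete with a small sketch. It uses raw word counts rather than real TF-IDF weights (a simplifying assumption), but the point carries over: IDF weighting rescales shared tokens and cannot create overlap where none exists. The function name `bow_cosine` and the sample messages are illustrative.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw word-count vectors of two messages."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two messages sharing words versus two messages with the same meaning but no shared words.
same_words = bow_cosine("cancel my order", "cancel my booking")
synonyms = bow_cosine("cancel subscription", "terminate membership")
print(same_words, synonyms)
```

The second pair scores zero even though a human would read both messages as the same request.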
Embedding-Based Classifiers
Using Sentence Transformers
Pre-trained sentence encoders map semantically similar texts to nearby vectors:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
import numpy as np
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_train_emb = encoder.encode(X_train, show_progress_bar=True)
X_test_emb = encoder.encode(X_test, show_progress_bar=True)
clf = LogisticRegression(C=1.0, max_iter=500)
clf.fit(X_train_emb, y_train)
y_pred = clf.predict(X_test_emb)
This approach captures semantic similarity without fine-tuning. “I want to cancel” and “Please remove my order” produce similar embeddings even though they share no words.
Few-Shot Classification
When training data is scarce (under 20 examples per intent), sentence embeddings with a nearest-neighbor classifier can outperform fine-tuned models, which tend to overfit on so little data:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_train_emb, y_train)
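To show what the nearest-neighbor step does, here is a dependency-free sketch of cosine-similarity voting. The 2-d vectors are toy stand-ins for real sentence embeddings, and `knn_predict` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_predict(query, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 2-d "embeddings"; real sentence embeddings have hundreds of dimensions.
train_vecs = [(1.0, 0.1), (0.9, 0.2), (0.8, 0.0), (0.1, 1.0), (0.0, 0.9), (0.2, 0.8)]
train_labels = ["cancel", "cancel", "cancel", "book", "book", "book"]
prediction = knn_predict((0.95, 0.15), train_vecs, train_labels)
print(prediction)
```

Because no parameters are trained, adding a new intent only requires adding its labeled examples to the index.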
Rasa’s DIET Classifier
Rasa’s Dual Intent and Entity Transformer (DIET) handles intent classification and entity extraction jointly in a single model. Key design choices:
- Sparse + dense features: Combines bag-of-words features (for rare/domain-specific terms) with pre-trained embeddings (for semantic generalization).
- Transformer encoder: A lightweight Transformer (2 layers, 256 hidden units) processes the combined features.
- Multi-task heads: Separate output heads for intent classification and entity tagging share the same encoder, which improves both tasks.
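- StarSpace-style loss: Scores each intent by the similarity between the message embedding and a learned intent-label embedding instead of using a fixed softmax output layer, which scales better with large intent sets.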
Configuration in Rasa’s pipeline:
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: LanguageModelFeaturizer
    model_name: bert
    model_weights: bert-base-uncased
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
Fine-Tuning Transformers
DistilBERT for Intent Classification
For maximum accuracy on ambiguous intents, fine-tune a Transformer:
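from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import Dataset
from sklearn.preprocessing import LabelEncoder

# Map intent names to the integer ids the model expects.
label_encoder = LabelEncoder().fit(labels)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(label_encoder.classes_)
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

def to_dataset(texts, intents):
    ids = label_encoder.transform(intents).tolist()
    return Dataset.from_dict({"text": texts, "label": ids}).map(tokenize, batched=True)

train_ds = to_dataset(X_train, y_train)
# Per-epoch evaluation needs an eval set. The test split is reused here for
# monitoring only; hold out a separate validation split if you tune on it.
eval_ds = to_dataset(X_test, y_test)

args = TrainingArguments(
    output_dir="./intent_model",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()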
Fine-tuned DistilBERT typically achieves 94-98% accuracy on well-curated datasets, but inference takes 5-20ms per message (compared to sub-1ms for logistic regression).
Confidence Calibration
The Problem
Neural network confidence scores are often miscalibrated — a model that says 0.9 confidence might only be correct 75% of the time. This matters because dialog managers use confidence thresholds to decide when to ask for clarification.
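One way to see miscalibration is to group validation predictions by confidence and compare each group's average confidence with its actual accuracy. The sketch below does that and summarizes the gap as an expected calibration error. The function names, the bin count, and the sample values are assumptions for illustration; the numbers mirror the 0.9-versus-75% case described above.

```python
def reliability_bins(confidences, correct, n_bins=5):
    """Group predictions by confidence; return (mean confidence, accuracy, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    report = []
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            report.append((mean_conf, accuracy, len(items)))
    return report

def expected_calibration_error(confidences, correct, n_bins=5):
    """Count-weighted average gap between confidence and accuracy across bins."""
    total = len(confidences)
    return sum(n / total * abs(conf - acc)
               for conf, acc, n in reliability_bins(confidences, correct, n_bins))

# Made-up validation results: four predictions at 0.9 confidence, three of them correct.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [True, True, True, False]
ece = expected_calibration_error(confs, hits)
print(round(ece, 3))
```

A well-calibrated model has a gap near zero in every bin, so a threshold such as 0.7 then means roughly what it says.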
Temperature Scaling
The simplest calibration technique fits a single temperature parameter on a held-out validation set:
import torch
import torch.nn as nn
from torch.optim import LBFGS

class TemperatureScaler(nn.Module):
    def __init__(self):
        super().__init__()
        # Initial guess above 1.0; the optimizer adjusts it on validation data.
        self.temperature = nn.Parameter(torch.ones(1) * 1.5)

    def forward(self, logits):
        return logits / self.temperature

def calibrate(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit the temperature on held-out validation logits (detached from the model) and labels."""
    scaler = TemperatureScaler()
    criterion = nn.CrossEntropyLoss()
    optimizer = LBFGS([scaler.temperature], lr=0.01, max_iter=50)

    def eval_step():
        optimizer.zero_grad()
        loss = criterion(scaler(logits), labels)
        loss.backward()
        return loss

    optimizer.step(eval_step)
    return scaler.temperature.item()
After calibration, the confidence scores better reflect actual accuracy, making threshold-based fallback policies more reliable.
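The effect of the fitted temperature is easy to see on a single prediction. In this small sketch, with illustrative logits and a hand-picked temperature of 2.0 rather than a fitted one, dividing the logits by a temperature above 1 lowers the top probability without changing which intent ranks first, so accuracy is unaffected.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]            # illustrative logits for three intents
raw = softmax(logits)               # T = 1: the uncalibrated distribution
softened = softmax(logits, 2.0)     # T > 1 spreads probability more evenly
print(max(raw), max(softened))
```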
Multi-Intent Classification
Approach: Multi-Label with Sigmoid
Replace the final softmax with independent sigmoid activations. Each intent gets its own binary decision:
import torch
import torch.nn as nn

class MultiIntentHead(nn.Module):
    def __init__(self, hidden_size: int, num_intents: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_intents)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # One independent probability per intent; a row need not sum to 1.
        return torch.sigmoid(self.classifier(features))
At inference, any intent with a sigmoid score above a threshold (0.5 is typical) is considered active.
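Decoding the sigmoid outputs into a set of intents might look like the sketch below. The helper name `active_intents` and the scores are made up for illustration.

```python
def active_intents(scores, threshold=0.5):
    """Return intents whose sigmoid score exceeds the threshold, highest score first."""
    chosen = [(name, s) for name, s in scores.items() if s > threshold]
    return [name for name, _ in sorted(chosen, key=lambda item: item[1], reverse=True)]

# Made-up sigmoid outputs for one message.
scores = {"book_flight": 0.91, "book_hotel": 0.74, "cancel_booking": 0.08}
result = active_intents(scores)
print(result)
```

An empty result means no intent cleared the threshold, which a dialog manager could reasonably treat as a fallback case.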
Data Requirements
Multi-intent training data needs examples of combined intents:
{"text": "Book a flight and find me a hotel", "intents": ["book_flight", "book_hotel"]}
{"text": "Cancel my flight", "intents": ["cancel_booking"]}
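Records in this shape need to become fixed-length 0/1 target vectors before they can be paired with per-intent sigmoid outputs, for example under a binary cross-entropy loss. A minimal sketch of that encoding follows; the helper `multi_hot` and the alphabetical column order are assumptions.

```python
def multi_hot(intents, intent_index):
    """Encode a list of intent names as a 0/1 vector over a fixed intent order."""
    vector = [0.0] * len(intent_index)
    for name in intents:
        vector[intent_index[name]] = 1.0
    return vector

examples = [
    {"text": "Book a flight and find me a hotel", "intents": ["book_flight", "book_hotel"]},
    {"text": "Cancel my flight", "intents": ["cancel_booking"]},
]
# Fix the column order once so training and inference agree.
all_intents = sorted({name for ex in examples for name in ex["intents"]})
intent_index = {name: i for i, name in enumerate(all_intents)}
targets = [multi_hot(ex["intents"], intent_index) for ex in examples]
print(all_intents, targets)
```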
Evaluation Beyond Accuracy
Per-Intent Metrics
Overall accuracy hides problems. Always check per-intent precision, recall, and F1. A model with 95% overall accuracy might have 40% recall on your most important intent.
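The gap between overall accuracy and per-intent recall is easy to reproduce. The sketch below computes the metrics by hand on made-up predictions; in practice scikit-learn's `classification_report` reports the same numbers, and `per_intent_metrics` is a hypothetical helper written only to expose the arithmetic.

```python
from collections import Counter

def per_intent_metrics(y_true, y_pred):
    """Return {intent: (precision, recall, f1)} computed from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted intent gains a false positive
            fn[truth] += 1  # true intent gains a false negative
    metrics = {}
    for intent in set(y_true) | set(y_pred):
        predicted, actual = tp[intent] + fp[intent], tp[intent] + fn[intent]
        precision = tp[intent] / predicted if predicted else 0.0
        recall = tp[intent] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[intent] = (precision, recall, f1)
    return metrics

# Made-up predictions: 18 of 20 correct overall, but the rare intent is mostly missed.
y_true = ["greet"] * 17 + ["cancel_booking"] * 3
y_pred = ["greet"] * 17 + ["cancel_booking", "greet", "greet"]
metrics = per_intent_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, metrics["cancel_booking"])
```

Here accuracy is 90%, yet only one in three cancel_booking messages is recognized.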
Confusion Analysis
Build a confusion matrix and focus on the most confused intent pairs. If modify_booking and cancel_booking are frequently confused, consider merging them into a single intent and disambiguating downstream.
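For a quick look at the worst offenders without plotting a full matrix, the misclassified (true, predicted) pairs can be counted directly. This is a sketch with made-up labels; `most_confused_pairs` is an illustrative helper name.

```python
from collections import Counter

def most_confused_pairs(y_true, y_pred, top_n=3):
    """Count (true_intent, predicted_intent) errors and return the most frequent ones."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top_n)

y_true = ["modify_booking", "modify_booking", "cancel_booking", "greet", "modify_booking"]
y_pred = ["cancel_booking", "cancel_booking", "modify_booking", "greet", "modify_booking"]
pairs = most_confused_pairs(y_true, y_pred)
print(pairs)
```

Reading the misclassified messages behind the top pair usually shows whether the fix is more training data, clearer labeling guidelines, or a merged intent.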
Out-of-Scope Detection
Real users send messages that do not match any intent. An out-of-scope detector prevents the bot from confidently misclassifying these:
def is_out_of_scope(confidence: float, entropy: float,
                    conf_threshold: float = 0.4, ent_threshold: float = 1.5) -> bool:
    return confidence < conf_threshold or entropy > ent_threshold
Low top confidence and high entropy (uncertainty spread across many intents) each suggest that a message is out of scope, so the check flags a message when either condition holds. Seeing both together is the strongest signal.
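The two inputs to the check can be derived from the classifier's probability distribution. The sketch below assumes entropy is measured in nats (natural log); the source does not state the unit behind the 1.5 threshold, so treat that as an assumption. The helper `confidence_and_entropy` and the sample distributions are illustrative, and the check is repeated so the block runs on its own.

```python
import math

def confidence_and_entropy(probs):
    """Top probability and Shannon entropy (in nats) of a predicted distribution."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return max(probs), entropy

def is_out_of_scope(confidence, entropy, conf_threshold=0.4, ent_threshold=1.5):
    return confidence < conf_threshold or entropy > ent_threshold

confident = [0.92, 0.04, 0.02, 0.02]   # one clear winner
diffuse = [0.1] * 10                   # probability spread evenly over ten intents
in_scope_flag = is_out_of_scope(*confidence_and_entropy(confident))
diffuse_flag = is_out_of_scope(*confidence_and_entropy(diffuse))
print(in_scope_flag, diffuse_flag)
```

Both thresholds are best tuned on a validation set that includes real out-of-scope messages, since the maximum possible entropy grows with the number of intents.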
Production Deployment
| Model | Accuracy | Latency | Memory | Training Time |
|---|---|---|---|---|
| TF-IDF + LR | 85-92% | <1ms | ~10 MB | Seconds |
| Sentence-BERT + LR | 90-95% | ~15ms | ~100 MB | Minutes |
| DIET (Rasa) | 92-96% | ~10ms | ~200 MB | 10-30 min |
| Fine-tuned DistilBERT | 94-98% | 5-20ms | ~250 MB | 30-120 min |
Choose based on your constraints. Start with the simplest model that meets your accuracy requirement. Most teams are surprised by how far TF-IDF + logistic regression goes with clean, balanced training data.
The one thing to remember: Intent classification accuracy depends more on training data quality — balanced classes, diverse phrasings, clear intent boundaries — than on model sophistication. Start simple, measure per-intent metrics, and only upgrade models when the data proves the baseline insufficient.