Python Intent Classification — Deep Dive

Build and deploy intent classifiers in Python — from TF-IDF baselines through DIET to fine-tuned Transformers with confidence calibration.

Why Intent Classification Is the Bottleneck

In a task-oriented chatbot, the intent classifier is the single point of failure. A dialog manager can recover from a missing entity, but if the intent is wrong, the entire conversation derails. Production chatbot teams typically spend more time improving intent classification than any other component.

Building a Baseline with scikit-learn

Data Preparation

Start with a labeled dataset in a simple format:

import json
from sklearn.model_selection import train_test_split

# training_data.json: [{"text": "book a flight", "intent": "book_flight"}, ...]
with open("training_data.json") as f:
    data = json.load(f)

texts = [d["text"] for d in data]
labels = [d["intent"] for d in data]
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
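Because the split above stratifies on the label, it can help to check class counts first: scikit-learn's stratified split raises an error when an intent has only a single example. A minimal sketch of such a check follows; the helper name `find_sparse_intents`, the `min_examples` threshold, and the sample labels are illustrative assumptions, not part of the original pipeline.

```python
from collections import Counter

def find_sparse_intents(labels, min_examples=2):
    """Return intents that have fewer than `min_examples` labeled utterances."""
    counts = Counter(labels)
    return sorted(intent for intent, n in counts.items() if n < min_examples)

# Hypothetical labels, for illustration only.
sample_labels = ["book_flight", "book_flight", "cancel_booking", "greet", "greet"]
sparse = find_sparse_intents(sample_labels)
print(sparse)
```

Intents reported here are candidates for collecting more examples or merging with a neighbor before splitting.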

TF-IDF + Logistic Regression Pipeline

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=10000,
        sublinear_tf=True,
    )),
    ("clf", LogisticRegression(
        C=5.0,
        max_iter=1000,
        class_weight="balanced",
    )),
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

This baseline typically achieves 85-92% accuracy on well-separated intents with 50+ examples per class. It trains in seconds and serves predictions in under 1ms — making it a strong production choice for small to medium intent sets.

When the Baseline Fails

TF-IDF struggles with:

Synonyms: “cancel” vs. “terminate” vs. “get rid of” are unrelated tokens.
Short messages: One or two words provide minimal signal.
Overlapping intents: “Change my booking” could be modify_booking or cancel_and_rebook.

These failures motivate embedding-based approaches.
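The synonym problem can be made concrete with a small sketch. It uses raw word counts rather than full TF-IDF weights (an assumption made to keep the example dependency-free); IDF reweighting would not change the outcome, since two utterances with no shared tokens have zero overlap under any bag-of-words weighting. The example sentences are made up.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the raw word-count vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)  # missing words count as 0
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0

# Same meaning, no shared tokens: similarity is zero.
same_meaning = bow_cosine("cancel my subscription", "terminate the plan")
# Different meaning, two shared tokens: similarity is high.
shared_words = bow_cosine("cancel my subscription", "renew my subscription")
print(same_meaning, shared_words)
```

The lexical model rates the paraphrase as unrelated and the opposite request as close, which is exactly the failure an embedding model is meant to address.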

Embedding-Based Classifiers

Using Sentence Transformers

Pre-trained sentence encoders map semantically similar texts to nearby vectors:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")

X_train_emb = encoder.encode(X_train, show_progress_bar=True)
X_test_emb = encoder.encode(X_test, show_progress_bar=True)

clf = LogisticRegression(C=1.0, max_iter=500)
clf.fit(X_train_emb, y_train)
y_pred = clf.predict(X_test_emb)

This approach captures semantic similarity without fine-tuning. “I want to cancel” and “Please remove my order” produce similar embeddings even though they share no words.

Few-Shot Classification

When training data is scarce (under 20 examples per intent), sentence embeddings with a nearest-neighbor classifier often match or beat fine-tuned models, which tend to overfit on so few examples:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_train_emb, y_train)

Rasa’s DIET Classifier

Rasa’s Dual Intent and Entity Transformer (DIET) handles intent classification and entity extraction jointly in a single model. Key design choices:

Sparse + dense features: Combines bag-of-words features (for rare/domain-specific terms) with pre-trained embeddings (for semantic generalization).
Transformer encoder: A lightweight Transformer (2 layers, 256 hidden units) processes the combined features.
Multi-task heads: Separate output heads for intent classification and entity tagging share the same encoder, which improves both tasks.
StarSpace-style loss: Uses embedding similarity rather than softmax, which scales better with large intent sets.
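The idea of scoring intents by embedding similarity can be sketched in a few lines. This is a generic margin-ranking illustration under stated assumptions, not Rasa's actual DIET loss: the two-dimensional vectors, the `margin` value, and the function names are all hypothetical.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def margin_ranking_loss(utterance_vec, intent_vecs, target, margin=0.5):
    """Hinge loss that pushes the target intent's similarity above
    every other intent's similarity by at least `margin`."""
    pos = dot(utterance_vec, intent_vecs[target])
    return sum(
        max(0.0, margin - pos + dot(utterance_vec, vec))
        for name, vec in intent_vecs.items()
        if name != target
    )

# Hypothetical learned embeddings in a shared space.
intent_vecs = {"book_flight": [1.0, 0.0], "cancel_booking": [0.0, 1.0]}
utterance_vec = [0.9, 0.2]

similarities = {name: dot(utterance_vec, vec) for name, vec in intent_vecs.items()}
predicted = max(similarities, key=similarities.get)
loss_correct = margin_ranking_loss(utterance_vec, intent_vecs, "book_flight")
loss_wrong = margin_ranking_loss(utterance_vec, intent_vecs, "cancel_booking")
print(predicted, loss_correct, loss_wrong)
```

Prediction is a nearest-label lookup in the shared space, so adding an intent means adding one more label embedding rather than resizing an output layer.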

Configuration in Rasa’s pipeline:

pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: LanguageModelFeaturizer
    model_name: bert
    model_weights: bert-base-uncased
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true

Fine-Tuning Transformers

DistilBERT for Intent Classification

For maximum accuracy on ambiguous intents, fine-tune a Transformer:

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(set(labels))
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

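from sklearn.preprocessing import LabelEncoder

# Map string intent names to integer ids; the model expects integer labels.
label_encoder = LabelEncoder().fit(labels)

def to_dataset(texts, intents):
    ds = Dataset.from_dict({"text": texts, "label": label_encoder.transform(intents).tolist()})
    return ds.map(tokenize, batched=True)

train_ds = to_dataset(X_train, y_train)
# Per-epoch evaluation needs an eval dataset; ideally use a separate
# validation split rather than the final test set.
eval_ds = to_dataset(X_test, y_test)

args = TrainingArguments(
    output_dir="./intent_model",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)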
trainer.train()

Fine-tuned DistilBERT typically achieves 94-98% accuracy on well-curated datasets, but inference takes 5-20ms per message (compared to sub-1ms for logistic regression).

Confidence Calibration

The Problem

Neural network confidence scores are often miscalibrated — a model that says 0.9 confidence might only be correct 75% of the time. This matters because dialog managers use confidence thresholds to decide when to ask for clarification.
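One common way to quantify this gap is expected calibration error (ECE): bin predictions by confidence and average the difference between mean confidence and accuracy in each bin. A minimal pure-Python sketch follows; the bin count and the sample predictions are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical predictions: four at 0.9 confidence, three of them correct.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))
```

Here the model claims 0.9 but is right 75% of the time, so the ECE comes out at roughly 0.15; a well-calibrated model would score close to zero.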

Temperature Scaling

The simplest calibration technique fits a single temperature parameter on a held-out validation set:

import torch
import torch.nn as nn
from torch.optim import LBFGS

class TemperatureScaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1) * 1.5)

    def forward(self, logits):
        return logits / self.temperature

def calibrate(logits: torch.Tensor, labels: torch.Tensor) -> float:
    scaler = TemperatureScaler()
    criterion = nn.CrossEntropyLoss()
    optimizer = LBFGS([scaler.temperature], lr=0.01, max_iter=50)

    def eval_step():
        optimizer.zero_grad()
        loss = criterion(scaler(logits), labels)
        loss.backward()
        return loss

    optimizer.step(eval_step)
    return scaler.temperature.item()

After calibration, the confidence scores better reflect actual accuracy, making threshold-based fallback policies more reliable.
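Applying a fitted temperature at inference amounts to dividing the logits by it before the softmax. The sketch below shows the effect in plain Python; the logits and the temperature of 2.0 are made-up values, not results from a real model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by `temperature` (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]  # hypothetical logits for three intents
raw = softmax_with_temperature(logits)
calibrated = softmax_with_temperature(logits, temperature=2.0)  # hypothetical fitted T
print(max(raw), max(calibrated))
```

A temperature above 1 lowers the top confidence without changing which intent ranks first, so accuracy is unaffected while thresholds become more meaningful.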

Multi-Intent Classification

Approach: Multi-Label with Sigmoid

Replace the final softmax with independent sigmoid activations. Each intent gets its own binary decision:

import torch
import torch.nn as nn

class MultiIntentHead(nn.Module):
    def __init__(self, hidden_size: int, num_intents: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_intents)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.classifier(features))

At inference, any intent with a sigmoid score above a threshold (0.5 is typical) is considered active.
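The thresholding step can be sketched as follows. The function name `active_intents`, the intent names, and the scores are hypothetical; in practice the scores would come from the sigmoid head.

```python
def active_intents(scores, intent_names, threshold=0.5):
    """Return every intent whose sigmoid score exceeds the threshold."""
    return [name for name, score in zip(intent_names, scores) if score > threshold]

intent_names = ["book_flight", "book_hotel", "cancel_booking"]
scores = [0.91, 0.67, 0.08]  # hypothetical sigmoid outputs for one message
result = active_intents(scores, intent_names)
print(result)
```

An empty result means no intent cleared the threshold, which is a natural trigger for a clarification or fallback response.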

Data Requirements

Multi-intent training data needs examples of combined intents:

{"text": "Book a flight and find me a hotel", "intents": ["book_flight", "book_hotel"]}
{"text": "Cancel my flight", "intents": ["cancel_booking"]}
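Records like these are usually converted to multi-hot target vectors before training. A minimal sketch, assuming the record format shown above; the helper name `multi_hot` and the alphabetical ordering of intents are illustrative choices.

```python
def multi_hot(intents, intent_names):
    """Encode a list of active intents as a 0/1 vector over `intent_names`."""
    active = set(intents)
    return [1.0 if name in active else 0.0 for name in intent_names]

examples = [
    {"text": "Book a flight and find me a hotel", "intents": ["book_flight", "book_hotel"]},
    {"text": "Cancel my flight", "intents": ["cancel_booking"]},
]

# Fix a stable ordering of intents so each position has a consistent meaning.
intent_names = sorted({intent for ex in examples for intent in ex["intents"]})
targets = [multi_hot(ex["intents"], intent_names) for ex in examples]
print(intent_names)
print(targets)
```

The targets are floats because binary cross-entropy losses generally expect floating-point labels.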

Evaluation Beyond Accuracy

Per-Intent Metrics

Overall accuracy hides problems. Always check per-intent precision, recall, and F1. A model with 95% overall accuracy might have 40% recall on your most important intent.
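The gap between overall accuracy and per-intent recall is easy to reproduce on an imbalanced toy set. The labels and predictions below are fabricated for illustration, and `per_intent_recall` is a hypothetical helper; scikit-learn's `classification_report`, used earlier, gives the same numbers plus precision and F1.

```python
from collections import Counter

def per_intent_recall(y_true, y_pred):
    """Fraction of each intent's true examples that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {intent: hits[intent] / n for intent, n in totals.items()}

# Toy data: 95 easy "greet" messages, 5 "cancel_booking" messages (2 caught).
y_true = ["greet"] * 95 + ["cancel_booking"] * 5
y_pred = ["greet"] * 95 + ["cancel_booking"] * 2 + ["greet"] * 3

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_intent_recall(y_true, y_pred)
print(accuracy, recall)
```

Overall accuracy is 97% here, yet the rare intent is recognized only 40% of the time.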

Confusion Analysis

Build a confusion matrix and focus on the most confused intent pairs. If modify_booking and cancel_booking are frequently confused, consider merging them into a single intent and disambiguating downstream.
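For a quick look at the worst offenders, counting misclassified (true, predicted) pairs is often enough before building a full matrix. A minimal sketch with made-up labels; `most_confused_pairs` is a hypothetical helper name.

```python
from collections import Counter

def most_confused_pairs(y_true, y_pred, top_n=3):
    """Return the most frequent (true, predicted) error pairs with their counts."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top_n)

y_true = ["modify_booking", "modify_booking", "modify_booking", "cancel_booking", "greet"]
y_pred = ["cancel_booking", "cancel_booking", "modify_booking", "modify_booking", "greet"]
pairs = most_confused_pairs(y_true, y_pred)
print(pairs)
```

Pairs are directional, so a lopsided count (many A predicted as B, few the other way) hints that one intent's training examples are bleeding into the other's territory.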

Out-of-Scope Detection

Real users send messages that do not match any intent. An out-of-scope detector prevents the bot from confidently misclassifying these:

def is_out_of_scope(confidence: float, entropy: float, 
                     conf_threshold: float = 0.4, ent_threshold: float = 1.5) -> bool:
    return confidence < conf_threshold or entropy > ent_threshold

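High entropy (uncertainty spread across many intents) and low top confidence are both signals that a message is out of scope, and the check above flags a message when either condition holds. Tune both thresholds on held-out data that includes real out-of-scope messages.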
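Both inputs to the check can be derived from the classifier's probability vector. The sketch below computes them with natural-log entropy, which is an assumption, since the original does not state the log base; the helper name and the probability vectors are hypothetical.

```python
import math

def top_confidence_and_entropy(probs):
    """Return (highest probability, Shannon entropy in nats) for a distribution."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return max(probs), entropy

# A confident prediction versus a uniform spread over eight intents.
confident_conf, confident_ent = top_confidence_and_entropy([0.9, 0.05, 0.05])
uniform_conf, uniform_ent = top_confidence_and_entropy([0.125] * 8)
print(confident_conf, confident_ent)
print(uniform_conf, uniform_ent)
```

With entropy in nats, the maximum possible value for N intents is ln(N), so an entropy threshold of 1.5 can only fire when there are at least five intents; with fewer, the confidence threshold does all the work.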

Production Deployment

Model	Accuracy	Latency	Memory	Training Time
TF-IDF + LR	85-92%	<1ms	~10 MB	Seconds
Sentence-BERT + LR	90-95%	~15ms	~100 MB	Minutes
DIET (Rasa)	92-96%	~10ms	~200 MB	10-30 min
Fine-tuned DistilBERT	94-98%	5-20ms	~250 MB	30-120 min

Choose based on your constraints. Start with the simplest model that meets your accuracy requirement. Most teams are surprised by how far TF-IDF + logistic regression goes with clean, balanced training data.

The one thing to remember: Intent classification accuracy depends more on training data quality — balanced classes, diverse phrasings, clear intent boundaries — than on model sophistication. Start simple, measure per-intent metrics, and only upgrade models when the data proves the baseline insufficient.
