Content-Based Filtering in Python — Deep Dive

Engineer production content-based recommenders in Python with TF-IDF, sentence embeddings, and hybrid feature pipelines — including evaluation and scaling.

Content-based filtering becomes powerful when you combine classical feature engineering with modern embedding techniques and design for real-world constraints like latency, freshness, and the filter-bubble problem.

1) Multi-signal feature pipelines

Production recommenders rarely rely on a single feature type. A robust pipeline combines:

Text features — TF-IDF or embeddings from item descriptions, titles, reviews
Categorical features — genre, brand, author encoded as multi-hot vectors
Numerical features — price, duration, popularity score (normalized)
Temporal features — release date, trending score, seasonal relevance

Combine them by concatenating normalized vectors or using a learned fusion layer.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler

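# `items` is assumed to be a pandas DataFrame with 'description' (str),
# 'genres' (list of str per row), and numeric 'price', 'avg_rating', 'popularity' columns.

# Text features (dense here for simplicity; keep the matrix sparse for large catalogs)
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
text_vectors = tfidf.fit_transform(items['description']).toarray()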

# Categorical features
mlb = MultiLabelBinarizer()
genre_vectors = mlb.fit_transform(items['genres'])

# Numerical features
scaler = MinMaxScaler()
num_vectors = scaler.fit_transform(items[['price', 'avg_rating', 'popularity']])

# Combined feature matrix
item_features = np.hstack([text_vectors, genre_vectors, num_vectors])
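Plain concatenation lets the widest block dominate any similarity computed on the result: several thousand TF-IDF columns can drown out three numeric ones. One common remedy, sketched below under assumptions, is to L2-normalize each block per row and scale it by the square root of a block weight, so the dot product of two fused rows becomes a weighted sum of per-block cosine similarities. The helper names `l2_normalize_rows` and `fuse_blocks` and the weights 0.6 / 0.3 / 0.1 are illustrative, not part of the pipeline above.

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale each row to unit length; all-zero rows are left as zeros."""
    matrix = np.asarray(matrix, dtype=np.float64)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)

def fuse_blocks(blocks, weights):
    """Concatenate feature blocks after per-block normalization and weighting.

    blocks  : list of 2-D arrays with one row per item
    weights : one non-negative weight per block (hypothetical values, tune per domain)
    """
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()
    scaled = [np.sqrt(w) * l2_normalize_rows(b) for b, w in zip(blocks, weights)]
    return np.hstack(scaled)

# Toy data standing in for text_vectors, genre_vectors, and num_vectors
text = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
genre = np.array([[1, 1], [1, 0]])
nums = np.array([[0.2], [0.9]])

fused = fuse_blocks([text, genre, nums], weights=[0.6, 0.3, 0.1])
similarity = fused[0] @ fused[1]  # 0.6*cos(text) + 0.3*cos(genre) + 0.1*cos(nums)
```

With this scaling, raising a block's weight directly raises its share of the final similarity, which makes the weights easier to reason about and to tune offline.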

2) Embedding-based content filtering

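Sentence-transformers models produce dense vectors that capture semantic meaning, so two descriptions can match even when they share no keywords. They often outperform TF-IDF on paraphrased or nuanced content, at the cost of slower feature extraction.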

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(items['description'].tolist(), batch_size=256, show_progress_bar=True)

# User profile: weighted average of liked item embeddings
def build_user_profile(liked_item_indices, weights=None):
    if weights is None:
        weights = np.ones(len(liked_item_indices))
    weights = np.asarray(weights, dtype=float)  # also accept plain lists
    # np.average normalizes the weights itself
    profile = np.average(embeddings[liked_item_indices], axis=0, weights=weights)
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile  # avoid dividing by zero

# Score candidates
from sklearn.metrics.pairwise import cosine_similarity

profile = build_user_profile([10, 42, 87], weights=np.array([0.5, 0.3, 0.2]))
scores = cosine_similarity(profile.reshape(1, -1), embeddings).flatten()
top_k = np.argsort(scores)[::-1][:20]  # liked items are not excluded and will usually rank first
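Because the profile is built from the liked items, those same items tend to score highest against it. A minimal sketch of filtering them out before taking the top k follows; `top_k_unseen` is a hypothetical helper name. It masks seen items and uses `np.argpartition`, so only the k winners are fully sorted.

```python
import numpy as np

def top_k_unseen(scores, seen_indices, k=20):
    """Return indices of the k highest-scoring items the user has not seen."""
    masked = np.array(scores, dtype=np.float64)  # copy so the caller's scores stay untouched
    seen = np.asarray(list(seen_indices), dtype=int)
    if seen.size:
        masked[seen] = -np.inf
    k = min(k, int(np.isfinite(masked).sum()))
    if k == 0:
        return np.array([], dtype=int)
    # argpartition selects the top k in linear time; only those k are then sorted
    top = np.argpartition(-masked, k - 1)[:k]
    return top[np.argsort(-masked[top])]

scores = np.array([0.91, 0.15, 0.88, 0.40, 0.73])
recommended = top_k_unseen(scores, seen_indices=[0, 2], k=2)
print(recommended.tolist())  # [4, 3]
```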

3) Time-decay user profiles

User preferences evolve. A profile that weighs a purchase from three years ago equally with yesterday’s click misrepresents current taste.

import numpy as np
from datetime import datetime

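def time_decay_weights(interaction_dates, half_life_days=30):
    # Assumes naive datetimes in the same local time as datetime.now()
    now = datetime.now()
    # Fractional days; clip at 0 so future-dated events never get a weight above 1
    days_ago = np.array([(now - d).total_seconds() / 86400 for d in interaction_dates])
    days_ago = np.clip(days_ago, 0, None)
    return np.exp(-np.log(2) * days_ago / half_life_days)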

weights = time_decay_weights(user_interaction_dates)
profile = build_user_profile(user_liked_indices, weights=weights)

A half-life of 30 days means an interaction from 30 days ago contributes half as much as one from today, and one from 60 days ago a quarter as much. Tune this per domain: fashion needs shorter half-lives than book recommendations.
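To make the decay concrete, here is the same formula evaluated for a few ages. This is a standalone sketch; `decay_weight` is a hypothetical scalar version of the function above.

```python
import math

def decay_weight(days_ago, half_life_days=30):
    """Weight of an interaction that happened `days_ago` days in the past."""
    return math.exp(-math.log(2) * days_ago / half_life_days)

for days in (0, 30, 60, 90):
    print(days, round(decay_weight(days), 3))
# 0 1.0
# 30 0.5
# 60 0.25
# 90 0.125
```

Shortening the half-life steepens the curve: with `half_life_days=7`, a 30-day-old interaction keeps only about 5% of its original weight.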

4) Addressing the filter bubble

Pure content-based systems tend to create echo chambers because they keep recommending more of what the user already liked. Strategies to inject diversity:

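Maximal Marginal Relevance (MMR): Re-ranks candidates to balance relevance with diversity. At each step, pick the item with the best trade-off between relevance to the user and dissimilarity to the items already selected; lambda_param sets the balance (1.0 is pure relevance, 0.0 is pure diversity).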

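from sklearn.metrics.pairwise import cosine_similarity

def mmr_rerank(scores, item_vectors, k=10, lambda_param=0.5):
    # Cost grows with len(scores) * k, so call this on a pre-filtered shortlist, not the full catalog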
    selected = []
    candidates = list(range(len(scores)))

    for _ in range(k):
        if not candidates:
            break

        if not selected:
            best = max(candidates, key=lambda i: scores[i])
        else:
            selected_vecs = item_vectors[selected]
            mmr_scores = []
            for c in candidates:
                relevance = scores[c]
                max_sim = cosine_similarity(
                    item_vectors[c:c+1], selected_vecs
                ).max()
                mmr = lambda_param * relevance - (1 - lambda_param) * max_sim
                mmr_scores.append((c, mmr))
            best = max(mmr_scores, key=lambda x: x[1])[0]

        selected.append(best)
        candidates.remove(best)

    return selected

Exploration slots: Reserve 10-20% of recommendation slots for items outside the user’s usual profile — random popular items, trending content, or items from adjacent categories.
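One way to implement exploration slots is sketched below, assuming a ranked list from the recommender and a separate pool of exploration candidates (trending or adjacent-category items, for instance). The function name `mix_exploration`, its parameters, and the choice to place exploration items at the end of the list are all illustrative.

```python
import random

def mix_exploration(ranked_items, exploration_pool, k=10, explore_fraction=0.2, seed=None):
    """Fill k slots: mostly top-ranked items, plus a few drawn from an exploration pool."""
    rng = random.Random(seed)
    n_explore = int(round(k * explore_fraction))
    exploit = list(ranked_items[: k - n_explore])
    taken = set(exploit)

    # De-duplicate the pool and drop anything already chosen
    pool = [item for item in dict.fromkeys(exploration_pool) if item not in taken]
    explore = rng.sample(pool, min(n_explore, len(pool)))
    taken.update(explore)

    # If the pool was too small, backfill with the next-best ranked items
    result = exploit + explore
    backfill = [item for item in ranked_items[k - n_explore:] if item not in taken]
    result += backfill[: k - len(result)]
    return result

ranked = list(range(100, 120))   # item ids, best first
pool = [7, 8, 9, 100]            # 100 is already ranked and gets skipped
slots = mix_exploration(ranked, pool, k=10, explore_fraction=0.2, seed=42)
```

Logging which slots were exploratory makes it possible to measure later whether exploration items earn clicks, and to adjust the fraction accordingly.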

5) Evaluation strategies

Offline metrics

Precision@K / Recall@K — standard ranking metrics
Catalog coverage — percentage of items ever recommended. Low coverage signals over-specialization.
Intra-list diversity — average pairwise distance between recommended items. Higher means more diverse.
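The metrics above are simple enough to compute without a framework. The following is a dependency-free sketch; the function names are illustrative, and intra-list diversity here uses cosine distance (1 minus cosine similarity), which is one of several reasonable distance choices.

```python
import math
from itertools import combinations

def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Share of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def catalog_coverage(recommendation_lists, catalog_size):
    """Share of the catalog that appears in at least one user's list."""
    distinct = set()
    for recs in recommendation_lists:
        distinct.update(recs)
    return len(distinct) / catalog_size

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def intra_list_diversity(vectors):
    """Average pairwise cosine distance within one recommendation list."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(1 - cosine(u, v) for u, v in pairs) / len(pairs)

recommended = ["a", "b", "c", "d"]
relevant = {"a", "d", "z"}
p_at_2 = precision_at_k(recommended, relevant, k=2)   # 1 hit in 2 slots
r_at_4 = recall_at_k(recommended, relevant, k=4)      # 2 of 3 relevant items found
coverage = catalog_coverage([["a", "b"], ["b", "c"]], catalog_size=10)
diversity = intra_list_diversity([[1, 0], [0, 1], [1, 0]])
```

Relevance and diversity usually pull against each other, so it helps to report them side by side rather than optimizing either alone.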

A/B testing considerations

Content-based systems are easier to A/B test than collaborative ones because user A’s recommendations don’t depend on user B’s behavior. Change the feature pipeline for the treatment group without side effects.
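Splitting users into control and treatment is often done with a deterministic hash, so a user stays in the same group across sessions without any stored state. A minimal sketch, assuming string user IDs; the experiment name and the `assign_variant` helper are hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment="feature-pipeline-v2", treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

variant = assign_variant("user-123")
```

Including the experiment name in the hash key means a user's group in one experiment is independent of their group in another.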

Freshness measurement

Track how often new items (added in the last 7 days) appear in recommendations. Content-based systems have a natural advantage here: a new item can be recommended as soon as its features are extracted, with no item-side cold-start delay. New users with no interaction history still need a fallback.
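A freshness metric along these lines can be computed per recommendation list and then averaged across users. The sketch below assumes a mapping from item ID to a timezone-aware "added at" timestamp; `fresh_share` and its parameters are illustrative names.

```python
from datetime import datetime, timedelta, timezone

def fresh_share(recommended_ids, added_at, now=None, window_days=7):
    """Fraction of recommended items that were added within the last `window_days`."""
    if not recommended_ids:
        return 0.0
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    fresh = sum(1 for item in recommended_ids if added_at[item] >= cutoff)
    return fresh / len(recommended_ids)

utc = timezone.utc
added_at = {
    "a": datetime(2024, 1, 14, tzinfo=utc),
    "b": datetime(2024, 1, 1, tzinfo=utc),
    "c": datetime(2024, 1, 9, tzinfo=utc),
    "d": datetime(2023, 12, 1, tzinfo=utc),
}
share = fresh_share(["a", "b", "c", "d"], added_at, now=datetime(2024, 1, 15, tzinfo=utc))
print(share)  # 0.5
```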

6) Scaling with approximate nearest neighbors

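For catalogs beyond 100K items, computing cosine similarity against every item on each request becomes expensive. Pre-compute item vectors and index them. The flat index below still performs exact, brute-force search, but FAISS's optimized implementation makes it a useful baseline before moving to an approximate index: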

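import faiss
import numpy as np

# FAISS needs C-contiguous float32; np.array copies, so normalize_L2
# (which works in place) leaves the original embeddings untouched
item_matrix = np.array(embeddings, dtype='float32', order='C')
faiss.normalize_L2(item_matrix)

dimension = item_matrix.shape[1]
index = faiss.IndexFlatIP(dimension)  # exact inner product (= cosine on unit vectors)
index.add(item_matrix)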

# Query
query = profile.reshape(1, -1).astype('float32')
faiss.normalize_L2(query)
distances, indices = index.search(query, 100)

For 10M+ items, switch to an approximate index such as IndexIVFFlat or IndexHNSWFlat, which trade a small amount of recall for sub-linear search time.

7) Production architecture

A typical deployment:

Offline pipeline (daily/hourly): extract features, compute embeddings, build FAISS index, update user profiles from recent interactions.
Online serving: receive user ID → load cached user profile → query FAISS index → apply business rules (already-seen filter, MMR diversity) → return ranked list.
Feedback loop: log impressions and clicks, feed back into profile updates and model retraining.

Cache user profiles in Redis with a TTL matching your update frequency. Where the index type supports it, load the FAISS index memory-mapped so that multiple worker processes on one host share the same physical memory instead of each holding a private copy.
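The online serving path can be sketched in a few lines. In this illustration a plain dict stands in for the Redis profile cache and a brute-force dot product stands in for the FAISS query; `recommend`, its parameters, and the popular-items fallback are assumptions, not a prescribed design.

```python
import numpy as np

def recommend(user_id, profile_cache, item_vectors, seen_by_user, k=10, fallback=None):
    """Serve one request: cached profile -> similarity search -> business rules."""
    profile = profile_cache.get(user_id)
    if profile is None:
        # No cached profile (new user or expired entry): fall back, e.g. to popular items
        return list(fallback or [])[:k]

    scores = item_vectors @ profile   # rows assumed L2-normalized, so this is cosine
    ranked = np.argsort(-scores)

    # Business rule: drop items the user has already seen
    seen = seen_by_user.get(user_id, set())
    return [int(i) for i in ranked if int(i) not in seen][:k]

item_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]])
profile_cache = {"u1": np.array([1.0, 0.0])}
seen_by_user = {"u1": {0}}

known = recommend("u1", profile_cache, item_vectors, seen_by_user, k=2)
unknown = recommend("u2", profile_cache, item_vectors, seen_by_user, k=2, fallback=[1, 2, 3])
```

A diversity re-rank such as MMR would slot in after the already-seen filter, operating on the filtered shortlist rather than the whole catalog.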

One thing to remember: the quality ceiling of content-based filtering is set by your feature engineering — invest in embeddings that truly capture what makes items similar in your domain, and pair them with diversity mechanisms to avoid trapping users in filter bubbles.
