Content-Based Filtering in Python — Deep Dive
Content-based filtering becomes powerful when you combine classical feature engineering with modern embedding techniques and design for real-world constraints like latency, freshness, and the filter-bubble problem.
1) Multi-signal feature pipelines
Production recommenders rarely rely on a single feature type. A robust pipeline combines:
- Text features — TF-IDF or embeddings from item descriptions, titles, reviews
- Categorical features — genre, brand, author encoded as multi-hot vectors
- Numerical features — price, duration, popularity score (normalized)
- Temporal features — release date, trending score, seasonal relevance
Combine them by concatenating normalized vectors or using a learned fusion layer.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler
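# Assumes `items` is a pandas DataFrame with a 'description' column,
# a 'genres' column holding a list of labels per row, and the numeric columns below
# Text features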
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
text_vectors = tfidf.fit_transform(items['description']).toarray()
# Categorical features
mlb = MultiLabelBinarizer()
genre_vectors = mlb.fit_transform(items['genres'])
# Numerical features
scaler = MinMaxScaler()
num_vectors = scaler.fit_transform(items[['price', 'avg_rating', 'popularity']])
# Combined feature matrix
item_features = np.hstack([text_vectors, genre_vectors, num_vectors])
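Plain concatenation lets the widest block dominate similarity scores, because a 5,000-column text block and a 3-column numeric block contribute very differently to a dot product. One possible remedy, sketched below with toy arrays standing in for the real blocks, is to L2-normalize each block per row and scale it by a block weight. The helper names `l2_normalize_rows` and `fuse_blocks` and the weights 0.6/0.3/0.1 are illustrative assumptions, not part of the pipeline above.

```python
import numpy as np

def l2_normalize_rows(block):
    """Scale each row to unit L2 norm; all-zero rows are left as zeros."""
    block = np.asarray(block, dtype=float)
    norms = np.linalg.norm(block, axis=1, keepdims=True)
    return np.divide(block, norms, out=np.zeros_like(block), where=norms > 0)

def fuse_blocks(blocks, block_weights):
    """Concatenate row-normalized blocks, each scaled by sqrt(weight).

    With this scaling, the dot product of two fused rows equals the
    weighted sum of the per-block cosine similarities.
    """
    scaled = [np.sqrt(w) * l2_normalize_rows(b) for b, w in zip(blocks, block_weights)]
    return np.hstack(scaled)

# Toy stand-ins for the text, genre, and numeric blocks (2 items each)
text_block = np.array([[0.2, 0.0, 0.9], [0.1, 0.8, 0.0]])
genre_block = np.array([[1, 0, 1, 1], [0, 1, 0, 0]])
num_block = np.array([[0.3, 0.7], [0.9, 0.1]])

# Hypothetical weights: text counts most, numeric features least
fused = fuse_blocks([text_block, genre_block, num_block], [0.6, 0.3, 0.1])
```

Because the weights sum to 1 and every block row has unit norm, each fused row also has unit norm, so a plain dot product between fused rows behaves like a weighted cosine similarity. The weights themselves are a tuning knob to validate offline.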
2) Embedding-based content filtering
Sentence-transformers produce dense vectors that capture semantic meaning. They often outperform TF-IDF when two descriptions express the same idea in different words, since TF-IDF only matches on shared terms.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(items['description'].tolist(), batch_size=256, show_progress_bar=True)
# User profile: weighted average of liked item embeddings
def build_user_profile(liked_item_indices, weights=None):
    if weights is None:
        weights = np.ones(len(liked_item_indices))
    weights = weights / weights.sum()
    profile = np.average(embeddings[liked_item_indices], axis=0, weights=weights)
    return profile / np.linalg.norm(profile)
# Score candidates
from sklearn.metrics.pairwise import cosine_similarity
profile = build_user_profile([10, 42, 87], weights=np.array([0.5, 0.3, 0.2]))
scores = cosine_similarity(profile.reshape(1, -1), embeddings).flatten()
top_k = np.argsort(scores)[::-1][:20]
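Candidates scored this way include the very items the profile was built from, and those tend to rank near the top. A minimal sketch of filtering them out before taking the top K follows; the helper name `top_k_unseen` and the toy scores are assumptions for illustration.

```python
import numpy as np

def top_k_unseen(scores, seen_indices, k):
    """Return indices of the k highest-scoring items the user has not seen."""
    masked = np.array(scores, dtype=float)  # copy so the caller's scores stay untouched
    seen = np.asarray(list(seen_indices), dtype=int)
    masked[seen] = -np.inf  # seen items can never win
    k = min(k, int(np.isfinite(masked).sum()))
    order = np.argsort(masked)[::-1]  # best first
    return order[:k]

toy_scores = np.array([0.91, 0.15, 0.88, 0.40, 0.73])
# Items 0 and 2 were already liked, so they are excluded
recommended = top_k_unseen(toy_scores, seen_indices=[0, 2], k=2)
```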
3) Time-decay user profiles
User preferences evolve. A profile that weighs a purchase from three years ago equally with yesterday’s click misrepresents current taste.
import numpy as np
from datetime import datetime
def time_decay_weights(interaction_dates, half_life_days=30):
    now = datetime.now()
    days_ago = np.array([(now - d).days for d in interaction_dates])
    return np.exp(-np.log(2) * days_ago / half_life_days)
weights = time_decay_weights(user_interaction_dates)
profile = build_user_profile(user_liked_indices, weights=weights)
A half-life of 30 days means an interaction from 30 days ago contributes half as much as one from today, and one from 60 days ago a quarter as much. Tune this per domain: fashion typically calls for shorter half-lives than book recommendations.
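To make the decay concrete, the sketch below applies the same exponential formula to four hypothetical interactions at 0, 30, 60, and 90 days of age. The ages are made up for illustration.

```python
import numpy as np

half_life_days = 30
days_ago = np.array([0, 30, 60, 90])  # hypothetical interaction ages

# Same formula as time_decay_weights: every half-life halves the weight
raw_weights = np.exp(-np.log(2) * days_ago / half_life_days)

# Share of the profile contributed by each interaction after normalization
shares = raw_weights / raw_weights.sum()
```

The raw weights come out as 1, 0.5, 0.25, and 0.125, so after normalization the interaction from today supplies a little over half of the profile and the 90-day-old one about 7%.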
4) Addressing the filter bubble
Pure content-based systems tend to create echo chambers, because they only surface items similar to what the user already liked. Strategies to inject diversity:
Maximal Marginal Relevance (MMR): Re-ranks candidates to balance relevance with diversity. At each step, pick the item that is most relevant but also most different from items already selected.
def mmr_rerank(scores, item_vectors, k=10, lambda_param=0.5):
    selected = []
    candidates = list(range(len(scores)))
    for _ in range(k):
        if not candidates:
            break
        if not selected:
            best = max(candidates, key=lambda i: scores[i])
        else:
            selected_vecs = item_vectors[selected]
            mmr_scores = []
            for c in candidates:
                relevance = scores[c]
                max_sim = cosine_similarity(
                    item_vectors[c:c+1], selected_vecs
                ).max()
                mmr = lambda_param * relevance - (1 - lambda_param) * max_sim
                mmr_scores.append((c, mmr))
            best = max(mmr_scores, key=lambda x: x[1])[0]
        selected.append(best)
        candidates.remove(best)
    return selected
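The loop above calls a similarity function once per candidate per step, which can get slow for long candidate lists. A possible vectorized variant is sketched below. It assumes the item vectors are already L2-normalized, so a dot product equals cosine similarity, and it keeps a running maximum similarity per item instead of recomputing it. The function name `mmr_rerank_vectorized` and the toy data are illustrative.

```python
import numpy as np

def mmr_rerank_vectorized(scores, unit_vectors, k=10, lambda_param=0.5):
    """Greedy MMR over L2-normalized item vectors (rows of unit_vectors)."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    # max_sim[i]: highest cosine similarity between item i and any selected item
    max_sim = np.full(n, -np.inf)
    available = np.ones(n, dtype=bool)
    selected = []
    for _ in range(min(k, n)):
        if selected:
            mmr = lambda_param * scores - (1 - lambda_param) * max_sim
        else:
            mmr = scores.copy()  # first pick is purely by relevance
        mmr[~available] = -np.inf
        best = int(np.argmax(mmr))
        selected.append(best)
        available[best] = False
        # One matrix-vector product updates every item's similarity at once
        max_sim = np.maximum(max_sim, unit_vectors @ unit_vectors[best])
    return selected

# Toy data: items 0 and 1 are near-duplicates, item 2 is different
raw = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
toy_vectors = raw / np.linalg.norm(raw, axis=1, keepdims=True)
toy_scores = np.array([0.90, 0.85, 0.60])

diverse = mmr_rerank_vectorized(toy_scores, toy_vectors, k=2, lambda_param=0.5)
relevance_only = mmr_rerank_vectorized(toy_scores, toy_vectors, k=2, lambda_param=1.0)
```

With `lambda_param=0.5` the near-duplicate item 1 is skipped in favor of item 2, while `lambda_param=1.0` falls back to pure relevance ordering.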
Exploration slots: Reserve 10-20% of recommendation slots for items outside the user’s usual profile — random popular items, trending content, or items from adjacent categories.
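One way to implement exploration slots is sketched below: keep the top of the ranked list, then fill the reserved share from a separate pool. The function name `blend_with_exploration`, the pool contents, and the choice to place exploration items at the end of the list are all assumptions made for illustration.

```python
import random

def blend_with_exploration(ranked_items, exploration_pool, k=10,
                           explore_fraction=0.2, seed=None):
    """Fill k slots: mostly top-ranked items, plus a share from exploration_pool."""
    rng = random.Random(seed)
    n_explore = int(round(k * explore_fraction))
    exploit = list(ranked_items[: k - n_explore])
    exploit_set = set(exploit)
    # De-duplicate the pool and drop anything already chosen by relevance
    pool = [i for i in dict.fromkeys(exploration_pool) if i not in exploit_set]
    explore = rng.sample(pool, min(n_explore, len(pool)))
    result = exploit + explore
    # Backfill from the ranked list if the pool was too small
    backfill = [i for i in ranked_items[k - n_explore:] if i not in explore]
    result += backfill[: k - len(result)]
    return result

ranked = list(range(100, 120))       # hypothetical item ids, best first
trending = list(range(200, 230))     # hypothetical exploration pool
slate = blend_with_exploration(ranked, trending, k=10, explore_fraction=0.2, seed=0)
```

Here 8 of the 10 slots come from the ranked list and 2 from the pool. Where the exploration items sit in the final list (end, interleaved, fixed positions) is a product decision worth testing.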
5) Evaluation strategies
Offline metrics
- Precision@K / Recall@K — standard ranking metrics
- Catalog coverage — percentage of items ever recommended. Low coverage signals over-specialization.
- Intra-list diversity — average pairwise distance between recommended items. Higher means more diverse.
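The three offline metrics above can be computed in a few lines each. The sketch below uses toy inputs and assumes item vectors are L2-normalized, so that cosine distance is 1 minus the dot product. The function names are illustrative rather than taken from any library.

```python
import numpy as np

def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for one user."""
    top = list(recommended)[:k]
    hits = len(set(top) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def catalog_coverage(recommendation_lists, catalog_size):
    """Fraction of the catalog that appears in at least one recommendation list."""
    distinct = set()
    for rec in recommendation_lists:
        distinct.update(rec)
    return len(distinct) / catalog_size

def intra_list_diversity(unit_vectors):
    """Mean pairwise cosine distance within one recommendation list."""
    n = len(unit_vectors)
    if n < 2:
        return 0.0
    sims = unit_vectors @ unit_vectors.T
    upper = sims[np.triu_indices(n, k=1)]  # each unordered pair once
    return float(np.mean(1.0 - upper))

precision, recall = precision_recall_at_k([3, 7, 1, 9], relevant={7, 9, 4}, k=4)
coverage = catalog_coverage([[1, 2], [2, 3]], catalog_size=10)
diversity = intra_list_diversity(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]))
```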
A/B testing considerations
Content-based systems are easier to A/B test than collaborative ones because user A’s recommendations don’t depend on user B’s behavior. Change the feature pipeline for the treatment group without side effects.
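Because each user's recommendations are independent, group assignment can be a pure function of the user ID. A minimal sketch of deterministic, hash-based bucketing follows; the function name `assign_variant`, the experiment label, and the 50/50 split are assumptions.

```python
import hashlib

def assign_variant(user_id, experiment="feature-pipeline-v2", treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex characters -> integer in [0, 2**32), scaled to [0, 1)
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"

# The same user always lands in the same group for a given experiment
variant = assign_variant("user-42")
```

Including the experiment name in the hash key means a user's group in one experiment does not determine their group in the next.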
Freshness measurement
Track how often new items (added in the last 7 days) appear in recommendations. Content-based systems have a natural advantage here: a new item can be recommended as soon as its features are extracted, without waiting for interaction data, so there is no item cold-start delay.
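One simple way to quantify freshness is the share of recommendation slots filled by items added within the window. The sketch below assumes a hypothetical mapping from item ID to the date it was added; the function name and the toy data are illustrative.

```python
from datetime import datetime, timedelta

def fresh_item_share(recommendation_lists, item_added_at, now, window_days=7):
    """Fraction of recommended slots occupied by items added within the window."""
    cutoff = now - timedelta(days=window_days)
    total = 0
    fresh = 0
    for rec in recommendation_lists:
        for item_id in rec:
            total += 1
            if item_added_at[item_id] >= cutoff:
                fresh += 1
    return fresh / total if total else 0.0

now = datetime(2024, 1, 15)
item_added_at = {
    "a": datetime(2024, 1, 12),  # fresh
    "b": datetime(2023, 12, 1),  # old
    "c": datetime(2024, 1, 9),   # fresh
    "d": datetime(2024, 1, 1),   # old
}
share = fresh_item_share([["a", "b"], ["c", "d"], ["a", "d"]], item_added_at, now)
```

In this toy log, 3 of the 6 slots hold items added in the last 7 days.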
6) Scaling with approximate nearest neighbors
For catalogs beyond 100K items, computing cosine similarity against every item per request becomes expensive. Pre-compute item vectors and index them:
import faiss
dimension = embeddings.shape[1]
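index = faiss.IndexFlatIP(dimension) # exact inner-product search (equals cosine on L2-normalized vectors)
# FAISS expects contiguous float32 arrays; normalize_L2 works in place
embeddings = np.ascontiguousarray(embeddings, dtype='float32')
faiss.normalize_L2(embeddings)
index.add(embeddings)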
# Query
query = profile.reshape(1, -1).astype('float32')
faiss.normalize_L2(query)
distances, indices = index.search(query, 100)
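IndexFlatIP is still an exact linear scan, just a heavily optimized one. For 10M+ items, switch to an approximate index such as IndexIVFFlat or IndexHNSWFlat for sub-linear search time, at the cost of some recall.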
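When moving to an approximate index, it helps to have an exact baseline to measure how many true neighbors the approximate search misses. The NumPy-only sketch below computes an exact top-k and a recall figure against it. The function names and the random toy vectors are assumptions, and the "approximate" result here is simulated by truncating the exact one.

```python
import numpy as np

def exact_top_k(query, unit_vectors, k):
    """Brute-force top-k by inner product; equals cosine for unit-length rows."""
    sims = unit_vectors @ query
    k = min(k, len(sims))
    # argpartition finds the k largest in O(n); only those k are then sorted
    top = np.argpartition(-sims, k - 1)[:k]
    return top[np.argsort(-sims[top])]

def recall_at_k(approx_ids, exact_ids):
    """Share of the exact top-k that the approximate search also returned."""
    exact = {int(i) for i in exact_ids}
    approx = {int(i) for i in approx_ids}
    return len(exact & approx) / len(exact)

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = vectors[17]
exact_ids = exact_top_k(query, vectors, k=10)

# Stand-in for an approximate result that found only half of the true neighbors
simulated_approx_ids = exact_ids[:5]
recall = recall_at_k(simulated_approx_ids, exact_ids)
```

In practice the approximate IDs would come from the index being evaluated, with recall averaged over a sample of real query profiles.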
7) Production architecture
A typical deployment:
- Offline pipeline (daily/hourly): extract features, compute embeddings, build FAISS index, update user profiles from recent interactions.
- Online serving: receive user ID → load cached user profile → query FAISS index → apply business rules (already-seen filter, MMR diversity) → return ranked list.
- Feedback loop: log impressions and clicks, feed back into profile updates and model retraining.
Cache user profiles in Redis with TTL matching your update frequency. Serve the FAISS index from memory-mapped files so multiple workers share the same memory.
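The cache-with-TTL pattern can be sketched without any external service. The class below is an in-memory stand-in for a key-value store with per-key expiry, not a Redis client; `ProfileCache`, `load_profile`, and the `rebuild_profile` callback are hypothetical names introduced for illustration.

```python
import time

class ProfileCache:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # user_id -> (expires_at, profile)

    def set(self, user_id, profile):
        self._entries[user_id] = (self._clock() + self.ttl_seconds, profile)

    def get(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        expires_at, profile = entry
        if self._clock() >= expires_at:
            del self._entries[user_id]  # expired: treat as a miss
            return None
        return profile

def load_profile(user_id, cache, rebuild_profile):
    """Return the cached profile, rebuilding and re-caching it on a miss."""
    profile = cache.get(user_id)
    if profile is None:
        profile = rebuild_profile(user_id)
        cache.set(user_id, profile)
    return profile

# Usage with an hourly TTL, matching an hourly offline pipeline
cache = ProfileCache(ttl_seconds=3600)
profile = load_profile("user-42", cache, rebuild_profile=lambda uid: [0.1, 0.9])
```

Setting the TTL equal to the pipeline interval means a stale profile lives at most one cycle before it is rebuilt.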
One thing to remember: the quality ceiling of content-based filtering is set by your feature engineering — invest in embeddings that truly capture what makes items similar in your domain, and pair them with diversity mechanisms to avoid trapping users in filter bubbles.
See Also
- Python Collaborative Filtering: Discover how Python uses the tastes of thousands of people to guess what you'll love next, no mind-reading required.
- Python Hybrid Recommendation Systems: Find out why the best recommendation engines mix multiple strategies, like asking both a friend and a librarian for book picks.