Content-Based Filtering in Python — Core Concepts
Content-based filtering (CBF) recommends items by matching item features to a user’s preference profile. Unlike collaborative filtering, it needs zero data from other users — only the current user’s history and item metadata.
How it works
The process has three steps:
- Feature extraction — represent each item as a vector of features (genre flags, keywords, embeddings).
- Profile building — aggregate the feature vectors of items a user liked into a user-preference vector.
- Scoring — compute similarity between the user profile and every candidate item. Rank by similarity.
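The three steps can be sketched end to end with NumPy alone. This is a minimal illustration rather than a production design: the genre columns, the four items, and the user's history are all hypothetical.

```python
import numpy as np

# Step 1: feature extraction. Each row is one item, each column a genre flag.
# Hypothetical columns: [mystery, comedy, romance, thriller]
item_features = np.array([
    [1, 0, 0, 1],  # item 0: mystery thriller
    [0, 1, 1, 0],  # item 1: romantic comedy
    [1, 0, 0, 0],  # item 2: mystery
    [0, 1, 0, 0],  # item 3: comedy
], dtype=float)

# Step 2: profile building. Average the vectors of the items the user liked.
liked = [0]  # hypothetical history: the user liked item 0
user_profile = item_features[liked].mean(axis=0)

# Step 3: scoring. Cosine similarity between the profile and every item.
def cosine_scores(profile, items):
    norms = np.linalg.norm(items, axis=1) * np.linalg.norm(profile)
    # Guard against division by zero for all-zero vectors.
    return items @ profile / np.where(norms == 0, 1.0, norms)

scores = cosine_scores(user_profile, item_features)
scores[liked] = -np.inf          # never recommend items the user already has
ranking = np.argsort(-scores)    # best candidate first
print(ranking[0])                # item 2, the other mystery
```

Item 2 ranks first because it shares the "mystery" flag with the liked item, while the two comedies share nothing with the profile and score zero.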
Feature representation
TF-IDF for text
For text-heavy items like articles or product descriptions, TF-IDF (Term Frequency–Inverse Document Frequency) converts text into numerical vectors where important, distinctive words get high weights.
TF-IDF("neural") in an AI article = high (common in doc, rare across all docs)
TF-IDF("the") in any article = low (common everywhere)
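The contrast above can be checked with a small sketch, assuming scikit-learn's `TfidfVectorizer` with default settings. The three-sentence corpus is made up for illustration, and stop words are deliberately kept so that "the" stays in the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" appears in every document, "neural" in one.
docs = [
    "the neural network learns the neural weights",
    "the chef cooks the pasta",
    "the detective solves the case",
]

vectorizer = TfidfVectorizer()          # no stop-word removal on purpose
tfidf = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_          # maps term -> column index
first_doc = tfidf[0].toarray().ravel()  # weights for the first document

w_neural = first_doc[vocab["neural"]]
w_the = first_doc[vocab["the"]]
print(f"neural: {w_neural:.3f}, the: {w_the:.3f}")
```

Both words occur twice in the first document, yet "neural" receives the higher weight: "the" appears in every document, so its inverse document frequency is the lowest the vectorizer can assign.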
Categorical features
For structured metadata (genre, director, language), one-hot encoding or multi-hot encoding creates binary feature vectors. A movie tagged “comedy” and “romance” gets 1s in both columns.
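One possible way to build multi-hot vectors is scikit-learn's `MultiLabelBinarizer`. The sketch below assumes each movie comes with a list of genre tags; the three movies are hypothetical.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre tags, one list per movie.
movie_genres = [
    ["comedy", "romance"],
    ["thriller"],
    ["comedy"],
]

encoder = MultiLabelBinarizer()
genre_matrix = encoder.fit_transform(movie_genres)

print(encoder.classes_)  # column order: genres sorted alphabetically
print(genre_matrix)      # one row per movie, 1 where the genre applies
```

The first movie gets 1s in both the "comedy" and "romance" columns, matching the description above.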
Embeddings
Modern systems often use pre-trained language models to create dense vector representations. A sentence-transformer can encode a movie synopsis into a fixed-length dense vector (384 dimensions for some compact models) that captures semantic meaning keyword counts miss, so two synopses can match without sharing any exact words.
Similarity computation
Cosine similarity is the standard choice. It measures the angle between two vectors and ignores their magnitude, so a profile built from 100 rated items is scored on the same scale as one built from 10.
Euclidean distance works but is sensitive to vector scale. Normalize vectors first if using it.
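The difference can be seen with two vectors that point in the same direction but differ in length. This is a small NumPy sketch with made-up vectors; `cosine` and `l2_normalize` are helper names introduced here for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    return v / np.linalg.norm(v)

light_user = np.array([1.0, 2.0, 0.0])
heavy_user = 10 * light_user  # same tastes, ten times the magnitude

cos = cosine(light_user, heavy_user)
dist_raw = float(np.linalg.norm(light_user - heavy_user))
dist_norm = float(
    np.linalg.norm(l2_normalize(light_user) - l2_normalize(heavy_user))
)
print(cos, dist_raw, dist_norm)
```

Cosine similarity treats the two vectors as identical, while raw Euclidean distance reports them as far apart. After L2 normalization the distance collapses to roughly zero, and on unit-length vectors Euclidean distance produces the same ranking as cosine similarity.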
Python implementation pattern
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Item descriptions
descriptions = [
"A thrilling mystery set in Victorian London",
"A romantic comedy about two chefs in Paris",
"A dark detective story in foggy England",
]
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(descriptions)
# Similarity between item 0 and all others
similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix).flatten()
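# Roughly [1.0, 0.0, 0.0]: item 0 matches itself and nothing else.
# TF-IDF only matches exact words, and after stop-word removal item 0 shares none with items 1 or 2 ("London" and "England" are different tokens).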
Building a user profile
The simplest approach averages the TF-IDF vectors of items the user liked:
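liked_indices = [0, 2]  # hypothetical: rows of tfidf_matrix for items the user liked
user_profile = np.asarray(tfidf_matrix[liked_indices].mean(axis=0))  # needs `import numpy as np`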
A weighted version gives more recent interactions higher weight, so the profile tracks evolving tastes.
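Both variants can be sketched side by side. The example below is one possible approach, not a prescribed method: the two extra item descriptions, the `decay` value, and the `rank_unseen` helper are assumptions made for illustration, and the liked items are assumed to be listed oldest first.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "A thrilling mystery set in Victorian London",   # item 0
    "A romantic comedy about two chefs in Paris",    # item 1
    "A dark mystery in foggy London",                # item 2 (hypothetical)
    "A comedy about a chef who opens a restaurant",  # item 3 (hypothetical)
]
vectorizer = TfidfVectorizer(stop_words="english")
item_vectors = vectorizer.fit_transform(descriptions).toarray()  # dense for clarity

liked = [0, 1]  # hypothetical history, oldest first: mystery, then comedy

# Plain mean: every liked item counts equally.
mean_profile = item_vectors[liked].mean(axis=0)

# Recency weighting: exponential decay, the newest item gets weight 1.
decay = 0.25                             # hypothetical decay factor
ages = np.arange(len(liked))[::-1]       # newest item has age 0
weights = decay ** ages
weighted_profile = np.average(item_vectors[liked], axis=0, weights=weights)

def rank_unseen(profile, vectors, seen):
    """Return indices of unseen items, most similar to the profile first."""
    scores = cosine_similarity(profile.reshape(1, -1), vectors).ravel()
    candidates = [i for i in range(len(scores)) if i not in seen]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)

mean_ranking = rank_unseen(mean_profile, item_vectors, liked)
weighted_ranking = rank_unseen(weighted_profile, item_vectors, liked)
print(mean_ranking, weighted_ranking)
```

With the plain mean, the second mystery ranks first because it shares two words with the older liked item. With recency weighting, the comedy overtakes it, since the most recent like dominates the profile.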
Advantages and limitations
| Advantage | Limitation |
|---|---|
| Works for new users (with a few interactions) | Over-specialization (filter bubble) |
| No cold-start for new items (features are known) | Requires good feature engineering |
| Transparent — you can explain why an item was recommended | Misses serendipitous connections |
| Privacy-friendly — no cross-user data needed | Quality depends on metadata quality |
Common misconception
People think CBF is “simple” compared to collaborative filtering and therefore inferior. In practice, content-based approaches with modern embeddings can outperform traditional CF on domains with rich metadata — news articles, academic papers, and job listings all benefit because item text carries strong signal.
One thing to remember: content-based filtering treats each user as an island — it only needs to know what you liked and what items look like to make predictions.
See Also
- Python Collaborative Filtering: Discover how Python uses the tastes of thousands of people to guess what you'll love next, no mind-reading required.
- Python Hybrid Recommendation Systems: Find out why the best recommendation engines mix multiple strategies, like asking both a friend and a librarian for book picks.
- Activation Functions: Why neural networks need these tiny mathematical functions, and how ReLU's simplicity accidentally made deep learning possible.
- AI Agents Architecture: How AI systems go from answering questions to actually doing things, and the design patterns that turn language models into autonomous agents that browse, code, and plan.
- AI Agents: ChatGPT answers questions. AI agents actually do things: browse the web, write code, send emails, and keep going until the job is done. Here's the difference.