Content-Based Filtering in Python — Core Concepts

Understand TF-IDF profiles, cosine similarity, and how Python's scikit-learn powers content-based recommendation engines without needing crowd data.

Content-based filtering (CBF) recommends items by matching item features to a user’s preference profile. Unlike collaborative filtering, it needs zero data from other users — only the current user’s history and item metadata.

How it works

The process has three steps:

1. Feature extraction: represent each item as a vector of features (genre flags, keywords, embeddings).
2. Profile building: aggregate the feature vectors of items a user liked into a user-preference vector.
3. Scoring: compute similarity between the user profile and every candidate item, then rank by similarity.
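The three steps can be sketched end to end with NumPy alone. The genre columns, the toy catalogue, and the `recommend` helper are illustrative assumptions rather than part of any library:

```python
import numpy as np

# Step 1: feature extraction. Each row is one item, each column a genre flag
# (columns: action, comedy, romance, thriller). Hypothetical toy catalogue.
item_features = np.array([
    [1, 0, 0, 1],  # item 0: action thriller
    [0, 1, 1, 0],  # item 1: romantic comedy
    [1, 0, 0, 0],  # item 2: action
    [0, 0, 1, 0],  # item 3: romance
], dtype=float)


def recommend(item_features, liked, top_n=2):
    """Rank unseen items by cosine similarity to the mean of liked items."""
    # Step 2: profile building. Average the vectors of the liked items.
    profile = item_features[liked].mean(axis=0)

    # Step 3: scoring. Cosine similarity = dot product / product of norms.
    norms = np.linalg.norm(item_features, axis=1) * np.linalg.norm(profile)
    scores = item_features @ profile / np.where(norms == 0, 1.0, norms)

    scores[liked] = -np.inf  # do not re-recommend items already liked
    ranked = np.argsort(-scores, kind="stable")[:top_n]
    return [int(i) for i in ranked]


ranking = recommend(item_features, liked=[0])
print(ranking)
```

A user who liked only the action thriller gets the plain action item ranked first, because it is the only other item sharing a feature with the profile.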

Feature representation

TF-IDF for text

For text-heavy items like articles or product descriptions, TF-IDF (Term Frequency–Inverse Document Frequency) converts text into numerical vectors where important, distinctive words get high weights.

TF-IDF("neural") in an AI article = high (common in doc, rare across all docs)
TF-IDF("the") in any article = low (common everywhere)
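To see these weights concretely, the sketch below fits scikit-learn's `TfidfVectorizer` on a three-document toy corpus (invented for illustration) and compares the learned values. Stop-word removal is deliberately left off so that "the" stays in the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" appears in every document, "neural" in one.
corpus = [
    "the neural network learns the neural weights",
    "the recipe needs the fresh basil",
    "the match ended in the final minute",
]

vectorizer = TfidfVectorizer()  # no stop-word list, so "the" is kept
tfidf = vectorizer.fit_transform(corpus)
vocab = vectorizer.vocabulary_  # maps term -> column index

# IDF depends only on how many documents contain the term.
idf_the = vectorizer.idf_[vocab["the"]]
idf_neural = vectorizer.idf_[vocab["neural"]]

# Weight of each term inside document 0 (rows are L2-normalized by default).
weight_the = tfidf[0, vocab["the"]]
weight_neural = tfidf[0, vocab["neural"]]

print(f"idf:   the={idf_the:.3f}  neural={idf_neural:.3f}")
print(f"doc 0: the={weight_the:.3f}  neural={weight_neural:.3f}")
```

Both words occur twice in document 0, yet "neural" ends up with the larger weight because it appears in only one of the three documents.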

Categorical features

For structured metadata (genre, director, language), one-hot encoding or multi-hot encoding creates binary feature vectors. A movie tagged “comedy” and “romance” gets 1s in both columns.
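One way to build such multi-hot vectors is scikit-learn's `MultiLabelBinarizer`. The genre lists below are made-up sample data:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical catalogue: each movie carries one or more genre tags.
movie_genres = [
    ["comedy", "romance"],
    ["thriller"],
    ["comedy", "thriller"],
]

mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(movie_genres)  # shape: (n_movies, n_genres)

columns = list(mlb.classes_)  # column order, sorted alphabetically
first_movie = dict(zip(columns, genre_matrix[0].tolist()))
print(columns)
print(first_movie)
```

The first movie has 1s in the "comedy" and "romance" columns and a 0 under "thriller". The resulting rows can be fed to the same similarity functions used for TF-IDF vectors.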

Embeddings

Modern systems use pre-trained language models to create dense vector representations. A sentence-transformer can encode a movie synopsis into a 384-dimensional vector that captures semantic meaning far better than keyword counts.

Similarity computation

Cosine similarity is the standard choice. It measures the angle between two vectors and ignores their magnitude, so a profile built from 100 rated items is scored on the same scale as one built from 10.

Euclidean distance works but is sensitive to vector scale. Normalize vectors first if using it.
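A small NumPy check illustrates both points. The vectors are arbitrary example values, and `cosine` and `unit` are helper names chosen for this sketch:

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])


def cosine(u, v):
    """Cosine of the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def unit(u):
    """Scale u to length 1 (L2 normalization)."""
    return u / np.linalg.norm(u)


# Scaling one vector by 10 leaves the cosine unchanged...
cos_ab = cosine(a, b)
cos_scaled = cosine(10 * a, b)

# ...but changes the raw Euclidean distance.
dist_ab = float(np.linalg.norm(a - b))
dist_scaled = float(np.linalg.norm(10 * a - b))

# After L2 normalization, squared distance equals 2 - 2 * cosine,
# so ranking by distance and ranking by cosine agree.
dist_unit_sq = float(np.linalg.norm(unit(a) - unit(b)) ** 2)

print(cos_ab, cos_scaled)
print(dist_ab, dist_scaled)
print(dist_unit_sq, 2 - 2 * cos_ab)
```

Because of that identity, normalizing first makes Euclidean distance a monotone function of cosine similarity, which is why the two produce the same ranking on unit vectors.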

Python implementation pattern

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Item descriptions
descriptions = [
    "A thrilling mystery set in Victorian London",
    "A romantic comedy about two chefs in Paris",
    "A dark detective story in foggy England",
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(descriptions)

# Similarity between item 0 and all others
similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix).flatten()
# Approximately [1.0, 0.0, 0.0]: items 0 and 2 are both mysteries, but they share
# no terms once stop words are removed, so TF-IDF scores them as unrelated

Building a user profile

The simplest approach averages the TF-IDF vectors of items the user liked:

user_profile = mean(vectors of liked items)

A weighted version gives more recent interactions higher weight, so the profile tracks evolving tastes.
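A minimal sketch of both variants follows. The exponential `decay` scheme, the `recommend_next` helper, and the four descriptions are assumptions made for illustration. Setting `decay=1.0` reproduces the plain average:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue, written so that some items share terms.
descriptions = [
    "A thrilling mystery set in Victorian London",
    "A romantic comedy about two chefs in Paris",
    "A dark detective mystery in foggy London",
    "A cooking comedy in a Paris bakery",
]
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(descriptions)


def recommend_next(liked, decay):
    """Return the best unseen item for a history listed oldest first."""
    # Newest interaction gets weight 1, the one before it `decay`, and so on.
    ages = np.arange(len(liked))[::-1]
    weights = decay ** ages
    weights = weights / weights.sum()

    # Weighted average of the liked rows gives a 1 x n_terms profile.
    liked_rows = tfidf_matrix[liked].toarray()
    user_profile = (weights[:, None] * liked_rows).sum(axis=0, keepdims=True)

    scores = cosine_similarity(user_profile, tfidf_matrix).flatten()
    scores[liked] = -np.inf  # skip items the user has already seen
    return int(np.argmax(scores))


history = [1, 0]  # liked the romantic comedy first, the mystery most recently
plain = recommend_next(history, decay=1.0)   # equal weights
recent = recommend_next(history, decay=0.5)  # older likes count half as much
print(plain, recent)
```

On this toy data the plain average suggests the other comedy (item 3), while the recency-weighted profile shifts toward the other mystery (item 2), matching the user's latest interest.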

Advantages and limitations

Advantage	Limitation
Works for new users (with a few interactions)	Over-specialization (filter bubble)
No cold-start for new items (features are known)	Requires good feature engineering
Transparent — you can explain why an item was recommended	Misses serendipitous connections
Privacy-friendly — no cross-user data needed	Quality depends on metadata quality

Common misconception

A common assumption is that CBF is “simple” compared to collaborative filtering (CF) and therefore inferior. In practice, content-based approaches with modern embeddings can outperform traditional CF in domains with rich metadata: news articles, academic papers, and job listings all benefit because item text carries strong signal.

One thing to remember: content-based filtering treats each user as an island — it only needs to know what you liked and what items look like to make predictions.
