Content-Based Filtering in Python — Core Concepts

Content-based filtering (CBF) recommends items by matching item features to a user’s preference profile. Unlike collaborative filtering, it needs zero data from other users — only the current user’s history and item metadata.

How it works

The process has three steps:

  1. Feature extraction — represent each item as a vector of features (genre flags, keywords, embeddings).
  2. Profile building — aggregate the feature vectors of items a user liked into a user-preference vector.
  3. Scoring — compute similarity between the user profile and every candidate item. Rank by similarity.
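The three steps above can be sketched in a few lines of NumPy. This is only an illustration: the genre columns, the item vectors, and the `liked` list are made-up assumptions, not data from a real catalogue.

```python
import numpy as np

# Step 1: feature extraction. Hypothetical multi-hot genre vectors;
# columns are [mystery, comedy, romance, thriller].
item_features = np.array([
    [1, 0, 0, 1],  # item 0: mystery thriller
    [0, 1, 1, 0],  # item 1: romantic comedy
    [1, 0, 0, 0],  # item 2: mystery
    [0, 1, 0, 0],  # item 3: comedy
], dtype=float)

# Step 2: profile building. Average the vectors of the liked items.
liked = [0]  # assumed interaction history
user_profile = item_features[liked].mean(axis=0)

# Step 3: scoring. Cosine similarity between the profile and every item.
norms = np.linalg.norm(item_features, axis=1) * np.linalg.norm(user_profile)
scores = item_features @ user_profile / norms

# Rank items the user has not already liked, best first.
order = np.argsort(-scores, kind="stable")
candidates = [int(i) for i in order if i not in liked]
```

With this toy data the plain mystery (item 2) ranks first among the unseen items, because it shares a genre with the liked mystery thriller while the comedies share none.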

Feature representation

TF-IDF for text

For text-heavy items like articles or product descriptions, TF-IDF (Term Frequency–Inverse Document Frequency) converts text into numerical vectors where important, distinctive words get high weights.

TF-IDF("neural") in an AI article = high (common in doc, rare across all docs)
TF-IDF("the") in any article = low (common everywhere)
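To see this weighting on actual numbers, the sketch below fits scikit-learn's `TfidfVectorizer` on a made-up three-document corpus (the sentences are assumptions chosen for illustration) and compares the two terms. Stop-word removal is left off so that "the" stays in the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "the" appears in every document,
# "neural" only in the first.
docs = [
    "the neural network learns the neural weights",
    "the chef cooks the pasta",
    "the detective solves the case",
]

vectorizer = TfidfVectorizer()  # no stop-word list, so "the" is kept
weights = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_  # term -> column index
idf = vectorizer.idf_           # one IDF value per column

idf_the = idf[vocab["the"]]
idf_neural = idf[vocab["neural"]]

# TF-IDF weights of both terms inside the first document,
# where each of them occurs exactly twice.
w_the = weights[0, vocab["the"]]
w_neural = weights[0, vocab["neural"]]
```

Both terms occur twice in the first document, yet `w_neural` comes out larger than `w_the` because "neural" is rare across the corpus and therefore has the higher IDF.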

Categorical features

For structured metadata (genre, director, language), one-hot encoding or multi-hot encoding creates binary feature vectors. A movie tagged “comedy” and “romance” gets 1s in both columns.
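One way to build such multi-hot vectors is scikit-learn's `MultiLabelBinarizer`. The genre lists below are invented for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre tags, one list per movie.
movie_genres = [
    ["comedy", "romance"],
    ["thriller"],
    ["comedy"],
]

encoder = MultiLabelBinarizer()
genre_matrix = encoder.fit_transform(movie_genres)

# Columns follow the sorted label order: comedy, romance, thriller.
columns = list(encoder.classes_)
first_movie = dict(zip(columns, genre_matrix[0].tolist()))
```

Here `first_movie` maps "comedy" and "romance" to 1 and "thriller" to 0, which is the two-column pattern described above.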

Embeddings

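Modern systems use pre-trained language models to create dense vector representations. A sentence-transformer model such as all-MiniLM-L6-v2 encodes a movie synopsis into a 384-dimensional vector. Because these vectors place texts with related meaning close together, they can match items that share no exact words, which keyword counts cannot do.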

Similarity computation

Cosine similarity is the standard choice. It measures the angle between two vectors and ignores their magnitude, so a profile built from 100 rated items is scored on the same footing as one built from 10.

Euclidean distance works but is sensitive to vector scale. Normalize vectors first if using it.
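The difference is easy to check numerically. The vectors below are arbitrary assumptions; the sketch scales one item by 10 and compares how each measure reacts, then verifies that for unit-length vectors the squared Euclidean distance equals 2 − 2·cos, so both measures produce the same ranking.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def unit(v):
    """Scale v to length 1 (L2 normalization)."""
    return v / np.linalg.norm(v)

profile = np.array([1.0, 2.0, 0.0])
item = np.array([2.0, 1.0, 1.0])
scaled_item = 10 * item  # same direction, ten times the magnitude

# Cosine similarity is unchanged by the scaling.
cos_item = cosine(profile, item)
cos_scaled = cosine(profile, scaled_item)

# Euclidean distance grows with the magnitude of the item vector.
dist_item = float(np.linalg.norm(profile - item))
dist_scaled = float(np.linalg.norm(profile - scaled_item))

# After normalization, distance depends only on the angle.
dist_unit = float(np.linalg.norm(unit(profile) - unit(item)))
```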

Python implementation pattern

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Item descriptions
descriptions = [
    "A thrilling mystery set in Victorian London",
    "A romantic comedy about two chefs in Paris",
    "A dark detective story in foggy England",
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(descriptions)

# Similarity between item 0 and all others
similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix).flatten()
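# ≈ [1.0, 0.0, 0.0]: after stop-word removal the three descriptions share no terms,
# so TF-IDF scores item 2 as 0.0 even though it is thematically close to item 0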

Building a user profile

The simplest approach averages the TF-IDF vectors of items the user liked:

user_profile = mean(vectors of liked items)

A weighted version gives more recent interactions higher weight, so the profile tracks evolving tastes.
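Both variants can be sketched as follows. The descriptions, the interaction order, and the exponential decay factor of 0.5 are assumptions picked for illustration; real systems tune the weighting scheme.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue.
descriptions = [
    "A mystery about a detective in London",       # 0
    "A romantic comedy in a Paris kitchen",        # 1
    "A detective hunts a killer in foggy London",  # 2
    "A comedy about two rival chefs",              # 3
    "A London detective solves a murder mystery",  # 4
]
vectorizer = TfidfVectorizer(stop_words="english")
item_vectors = vectorizer.fit_transform(descriptions).toarray()

# Liked items, oldest first: a comedy long ago, a detective story recently.
liked = [1, 0]
liked_vectors = item_vectors[liked]

# Simple profile: unweighted mean of the liked vectors.
mean_profile = liked_vectors.mean(axis=0)

# Recency-weighted profile: weight = decay ** age, newest item has age 0.
decay = 0.5  # assumed decay factor
ages = np.arange(len(liked))[::-1]
recent_profile = np.average(liked_vectors, axis=0, weights=decay ** ages)

def score(profile):
    """Cosine similarity between a profile and every catalogue item."""
    return cosine_similarity(profile.reshape(1, -1), item_vectors).ravel()

mean_scores = score(mean_profile)
recent_scores = score(recent_profile)

# Recommend unseen items, best first, using the recency-weighted profile.
ranked = [int(i) for i in np.argsort(-recent_scores) if i not in liked]
```

Compared with the plain mean, the recency-weighted profile scores the second detective mystery (item 4) higher and the chef comedy (item 3) lower, reflecting the user's more recent taste.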

Advantages and limitations

Advantage | Limitation
Works for new users (after a few interactions) | Over-specialization (filter bubble)
No cold start for new items (features are known) | Requires good feature engineering
Transparent: you can explain why an item was recommended | Misses serendipitous connections
Privacy-friendly: no cross-user data needed | Quality depends on metadata quality

Common misconception

People think CBF is “simple” compared to collaborative filtering and therefore inferior. In practice, content-based approaches with modern embeddings can outperform traditional CF on domains with rich metadata — news articles, academic papers, and job listings all benefit because item text carries strong signal.

One thing to remember: content-based filtering treats each user as an island — it only needs to know what you liked and what items look like to make predictions.


See Also

  • Python Collaborative Filtering: Discover how Python uses the tastes of thousands of people to guess what you'll love next, no mind-reading required.
  • Python Hybrid Recommendation Systems: Find out why the best recommendation engines mix multiple strategies, like asking both a friend and a librarian for book picks.