Scikit-Learn Clustering Algorithms — Core Concepts

K-Means, DBSCAN, hierarchical, and more — when to use each scikit-learn clustering algorithm and how to evaluate results without labels.

Why clustering matters

Clustering is unsupervised learning — you have data but no labels. Nobody told you which customers are “high value” or which documents belong together. Clustering discovers this structure automatically, making it essential for:

Customer segmentation (marketing)
Anomaly detection (security)
Document organization (search engines)
Image compression (reducing colors)
Gene expression analysis (bioinformatics)

The main clustering families

K-Means

The most popular algorithm. You specify the number of clusters k. The algorithm:

1. Places k random center points (centroids)
2. Assigns each data point to its nearest centroid
3. Moves each centroid to the average of its assigned points
4. Repeats steps 2-3 until centroids stop moving
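
The loop above can be sketched in a few lines of NumPy. This is a minimal illustration rather than scikit-learn's implementation: the function name kmeans_sketch is hypothetical, the starting centroids are drawn from the data points themselves (one common way to "place" random centers), and the toy data is made up for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Hypothetical helper: a bare-bones version of the K-Means loop."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the starting centroids (an assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid that received no points stays where it is).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(10.0, 1.0, size=(50, 2)),
])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids)
```

In practice you would use scikit-learn's KMeans, which adds smarter initialization and multiple restarts on top of this loop.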

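Strengths: Fast, scales to millions of samples, easy to understand. Weaknesses: You must choose k in advance. Assumes clusters are roughly spherical and similar in size. Sensitive to initialization: scikit-learn's default init='k-means++' spreads out the starting centroids, and n_init=10 reruns the algorithm from ten different starts and keeps the best result.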
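
As a rough sketch of how this looks in scikit-learn, the example below fits KMeans to synthetic blobs. The dataset and the parameter values (three clusters, cluster_std=0.6, random_state=0) are assumptions chosen only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: three round blobs in two dimensions (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# n_init=10 reruns the algorithm from 10 different starts and keeps
# the run with the lowest inertia (within-cluster sum of squares).
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)  # (3, 2)
print(kmeans.inertia_)
```

Fixing random_state makes the result reproducible; without it, repeated runs may label the clusters differently.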

DBSCAN (Density-Based Spatial Clustering)

Groups together points that are densely packed, and marks isolated points as noise. Two parameters:

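eps: the maximum distance between two points for them to count as neighbors
min_samples: the minimum number of points within eps of a point (the point itself included) for that point to count as part of a dense region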

Strengths: Discovers clusters of arbitrary shape. Automatically detects noise/outliers. Doesn’t require specifying number of clusters. Weaknesses: Struggles with clusters of varying density. Sensitive to eps parameter. Performance degrades in high dimensions.
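
A small sketch of DBSCAN on two interleaving half-moons, a shape that centroid-based methods tend to split incorrectly. The dataset and the values eps=0.2 and min_samples=5 are assumptions picked for this toy example; real data usually needs tuning (and often feature scaling first).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped groups with a little noise (illustrative only).
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # points flagged as noise get the label -1

# Count clusters, ignoring the noise label if it is present.
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Note that the number of clusters is an output here, not an input: it falls out of the density parameters.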

Hierarchical (Agglomerative) Clustering

Builds a tree of clusters by progressively merging the closest pairs:

Start with each point as its own cluster
Merge the two closest clusters
Repeat until you reach the desired number of clusters

The merging history forms a dendrogram — a tree diagram you can cut at any level to get different numbers of clusters.

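Strengths: Produces a hierarchy (useful when you need multiple granularity levels). The number of clusters can be chosen after the tree is built by deciding where to cut it (scikit-learn's AgglomerativeClustering takes either n_clusters or distance_threshold). Works with most distance metrics, although Ward linkage requires Euclidean distance. Weaknesses: Slow for large datasets (O(n²) memory and up to O(n³) time for a naive implementation). Merge decisions are irreversible.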
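
The sketch below shows two ways to work with the hierarchy, under the assumption that SciPy is available alongside scikit-learn. AgglomerativeClustering stops merging at a fixed number of clusters, while SciPy's linkage and fcluster build the full merge history once and cut it at different levels. The dataset and Ward linkage are illustrative choices.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

# scikit-learn: merge until three clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# SciPy: record every merge, then cut the tree at two different levels.
Z = linkage(X, method="ward")  # one row per merge: shape (n_samples - 1, 4)
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_3 = fcluster(Z, t=3, criterion="maxclust")

print(len(set(labels.tolist())), len(set(labels_2.tolist())), len(set(labels_3.tolist())))
```

The matrix Z can also be passed to scipy.cluster.hierarchy.dendrogram to draw the tree.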

Mean Shift

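Finds cluster centers by iteratively shifting candidate centers toward the nearest region of highest density. It is like marbles rolling to the nearest valley on a bumpy surface, except that here the points climb toward the nearest density peak, and points that arrive at the same peak form one cluster.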

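Strengths: Automatically determines number of clusters. Finds clusters of arbitrary shape. Weaknesses: Computationally expensive. Results depend heavily on the bandwidth parameter, which you either set by hand or estimate from the data (scikit-learn provides estimate_bandwidth).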
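
A minimal Mean Shift sketch, assuming synthetic blob data and a bandwidth estimated from the data with quantile=0.2 (an illustrative value, not a recommendation). The number of clusters found depends on that bandwidth.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Estimate a kernel bandwidth from pairwise distances in the data.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)

# bin_seeding=True starts from a coarse grid of points instead of every sample,
# which speeds things up on larger datasets.
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)

print(bandwidth)
print(len(ms.cluster_centers_))  # number of clusters found
```

A larger bandwidth smooths the density more and tends to merge clusters; a smaller one tends to split them.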

Gaussian Mixture Models (GMM)

Assumes data comes from a mixture of several Gaussian (bell curve) distributions. Each cluster is one Gaussian with its own center, spread, and shape.

Strengths: Provides soft assignments (probability of belonging to each cluster). Models elliptical clusters of different sizes. Principled statistical framework. Weaknesses: Assumes Gaussian-shaped clusters. Sensitive to initialization. Must specify number of components.
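
To make the idea of soft assignments concrete, here is a short sketch with scikit-learn's GaussianMixture. The data, the choice of three components, and covariance_type="full" are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# "full" lets every component have its own elliptical shape;
# n_init=5 tries several initializations and keeps the best fit.
gmm = GaussianMixture(n_components=3, covariance_type="full", n_init=5, random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)   # most likely component per point
probs = gmm.predict_proba(X)   # one probability per component; each row sums to 1

print(probs.shape)  # (300, 3)
print(np.round(probs[:3], 3))
```

Points near a boundary between components show up as rows with probability split across two or more columns.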

Choosing the right algorithm

Scenario	Best Choice
Known number of round clusters	K-Means
Unknown number, irregular shapes	DBSCAN
Need hierarchy of granularity	Agglomerative
Soft (probabilistic) assignments	GMM
Very large datasets (>1M samples)	Mini-Batch K-Means
Data has noise/outliers	DBSCAN or HDBSCAN
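
Mini-Batch K-Means appears in the table but not in the sections above, so here is a brief sketch. It is available in scikit-learn as MiniBatchKMeans and updates the centroids from small random batches rather than the full dataset on every step. The dataset size, batch_size=1024, and n_init=3 are illustrative assumptions.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A moderately large synthetic dataset stands in for "very large" here.
X, _ = make_blobs(n_samples=20_000, centers=5, random_state=0)

mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)

print(mbk.cluster_centers_.shape)  # (5, 2)
```

The trade-off is a slightly worse inertia than full K-Means in exchange for much lower time and memory cost.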

Evaluating clusters without labels

Since clustering is unsupervised, you can’t check accuracy. Instead use:

Silhouette score — measures how similar a point is to its own cluster vs. the nearest other cluster. Range: -1 to 1, higher is better.
Calinski-Harabasz index — ratio of between-cluster variance to within-cluster variance. Higher is better.
Davies-Bouldin index — average similarity between each cluster and its most similar one. Lower is better.
Elbow method — plot K-Means inertia (within-cluster sum of squares) against number of clusters and look for the “elbow” where adding clusters stops helping.
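
The four measures above can be computed side by side. The sketch below runs K-Means for several values of k on synthetic data and records each metric; the dataset and the range of k are assumptions for illustration. Plotting the inertia column against k gives the elbow curve.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "inertia": km.inertia_,                                       # lower = tighter clusters
        "silhouette": silhouette_score(X, km.labels_),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, km.labels_),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, km.labels_),        # lower is better
    }

for k, m in results.items():
    print(k, round(m["inertia"], 1), round(m["silhouette"], 3),
          round(m["calinski_harabasz"], 1), round(m["davies_bouldin"], 3))

# One simple selection rule: the k with the highest mean silhouette score.
best_k = max(results, key=lambda k: results[k]["silhouette"])
print(best_k)
```

The metrics do not always agree with one another, so it helps to look at them together rather than trusting a single number.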

Common misconception

Clustering results are not ground truth. Different algorithms, different parameters, and (for randomized methods such as K-Means) even different random seeds can produce different groupings, all potentially valid. Clustering reveals one possible structure in your data, not the only structure. Always validate clusters against domain knowledge and downstream task performance.
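
One way to see this disagreement is to cluster the same data with two algorithms and measure how much the groupings overlap. The sketch below uses the adjusted Rand index from scikit-learn, where 1.0 means identical groupings and values near 0 mean chance-level agreement. The dataset and parameters are assumptions for illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Agreement between the two groupings, independent of how the labels are numbered.
ari = adjusted_rand_score(km_labels, db_labels)
print(round(ari, 3))
```

Neither result is "the" answer: K-Means cuts the data with a straight boundary, while DBSCAN follows the curved shapes. Which one is useful depends on what the clusters are for.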

One thing to remember: No single clustering algorithm works best for all data. The shape of your clusters, whether you know how many to expect, and how much noise exists in your data should drive your algorithm choice.
