Scikit-Learn Clustering Algorithms — Deep Dive
Technical foundation
Clustering algorithms optimize different objective functions, which is why they find different structures in the same data:
- K-Means minimizes within-cluster sum of squares (WCSS):
Σ Σ ||x - μ_k||² - DBSCAN maximizes density-connected components
- Agglomerative minimizes a linkage criterion (Ward’s = WCSS increment, complete = max inter-cluster distance)
- GMM maximizes the log-likelihood of a mixture of Gaussians
Understanding these objectives explains each algorithm’s biases: K-Means prefers spherical clusters because WCSS penalizes distance from centroids. DBSCAN prefers uniformly dense regions because its density criterion doesn’t adapt to varying densities.
K-Means: deep implementation
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# Generate clustered data
X, y_true = make_blobs(n_samples=10000, centers=5, n_features=15, random_state=42)
# ALWAYS scale before K-Means — it uses Euclidean distance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(
n_clusters=5,
init='k-means++', # smart initialization, avoids poor convergence
n_init=10, # run 10 times with different seeds, keep best
max_iter=300,
tol=1e-4, # convergence threshold
algorithm='lloyd', # 'lloyd' (standard) or 'elkan' (faster for low-d)
random_state=42,
)
kmeans.fit(X_scaled)
print(f"Inertia: {kmeans.inertia_:.0f}")
print(f"Iterations: {kmeans.n_iter_}")
print(f"Cluster sizes: {np.bincount(kmeans.labels_)}")
K-Means++ initialization
Standard random initialization can place centroids in the same cluster, causing poor convergence. K-Means++ selects initial centroids that are spread apart:
- Choose the first centroid randomly
- For each subsequent centroid, select a point with probability proportional to its squared distance from the nearest existing centroid
- Repeat until
kcentroids are placed
This guarantees an O(log k)-competitive solution and dramatically improves convergence in practice.
Optimal K selection
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import matplotlib.pyplot as plt
k_range = range(2, 15)
metrics = {'inertia': [], 'silhouette': [], 'calinski': [], 'davies_bouldin': []}
for k in k_range:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)
metrics['inertia'].append(km.inertia_)
metrics['silhouette'].append(silhouette_score(X_scaled, labels))
metrics['calinski'].append(calinski_harabasz_score(X_scaled, labels))
metrics['davies_bouldin'].append(davies_bouldin_score(X_scaled, labels))
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes[0, 0].plot(k_range, metrics['inertia'], 'bo-')
axes[0, 0].set_title('Elbow Method (Inertia)')
axes[0, 1].plot(k_range, metrics['silhouette'], 'go-')
axes[0, 1].set_title('Silhouette Score (higher = better)')
axes[1, 0].plot(k_range, metrics['calinski'], 'ro-')
axes[1, 0].set_title('Calinski-Harabasz (higher = better)')
axes[1, 1].plot(k_range, metrics['davies_bouldin'], 'mo-')
axes[1, 1].set_title('Davies-Bouldin (lower = better)')
for ax in axes.flat:
ax.set_xlabel('Number of clusters (k)')
plt.tight_layout()
Interpreting results: When metrics disagree on optimal K, prefer the K that makes domain sense. A retail customer segmentation with K=47 might score well mathematically but be useless for marketing.
Mini-Batch K-Means for scale
mbk = MiniBatchKMeans(
n_clusters=5,
batch_size=1024,
n_init=10,
max_iter=300,
random_state=42,
)
# 10-100x faster than standard K-Means on large datasets
# Slight quality tradeoff (typically <1% silhouette score difference)
mbk.fit(X_scaled)
Mini-Batch K-Means uses random mini-batches instead of the full dataset per iteration, reducing per-iteration cost from O(nkd) to O(bkd) where b << n.
DBSCAN: density-based clustering
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
# Heuristic for eps: k-distance plot
nn = NearestNeighbors(n_neighbors=10)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])
plt.figure(figsize=(10, 5))
plt.plot(k_distances)
plt.xlabel('Points (sorted by 10-NN distance)')
plt.ylabel('10-NN Distance')
plt.title('K-Distance Plot — Look for the "knee"')
plt.tight_layout()
# The "knee" in this plot suggests a good eps value
eps_value = 1.5 # read from the knee of the plot
dbscan = DBSCAN(
eps=eps_value,
min_samples=10, # rule of thumb: 2 × n_features
metric='euclidean',
algorithm='auto', # 'ball_tree', 'kd_tree', or 'brute'
n_jobs=-1,
)
labels = dbscan.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise} ({n_noise/len(labels)*100:.1f}%)")
DBSCAN parameter sensitivity
# Systematic exploration of eps and min_samples
from itertools import product
results = []
for eps in np.arange(0.3, 3.0, 0.2):
for min_samples in [5, 10, 15, 20, 30]:
db = DBSCAN(eps=eps, min_samples=min_samples)
labels = db.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_ratio = (labels == -1).sum() / len(labels)
if n_clusters >= 2:
non_noise = labels != -1
sil = silhouette_score(X_scaled[non_noise], labels[non_noise]) if non_noise.sum() > 1 else -1
else:
sil = -1
results.append({
'eps': eps, 'min_samples': min_samples,
'n_clusters': n_clusters, 'noise_ratio': noise_ratio,
'silhouette': sil
})
import pandas as pd
results_df = pd.DataFrame(results)
best = results_df.loc[results_df['silhouette'].idxmax()]
print(f"Best: eps={best['eps']:.1f}, min_samples={best['min_samples']:.0f}, "
f"clusters={best['n_clusters']:.0f}, silhouette={best['silhouette']:.4f}")
Agglomerative clustering
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Compute linkage for dendrogram
Z = linkage(X_scaled[:2000], method='ward', metric='euclidean')
plt.figure(figsize=(14, 6))
dendrogram(Z, truncate_mode='lastp', p=30, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.tight_layout()
# Fit with chosen number of clusters
agg = AgglomerativeClustering(
n_clusters=5,
linkage='ward', # 'ward', 'complete', 'average', 'single'
metric='euclidean',
connectivity=None, # optional sparse connectivity matrix for spatial constraints
)
labels = agg.fit_predict(X_scaled)
Linkage methods compared
| Linkage | Merging Criterion | Cluster Shape | Robustness |
|---|---|---|---|
| Ward | Min variance increase | Spherical, equal size | High |
| Complete | Max inter-cluster distance | Compact, similar diameter | Medium |
| Average | Mean inter-cluster distance | Moderate | Medium |
| Single | Min inter-cluster distance | Arbitrary (chaining risk) | Low |
Ward linkage is the default and most robust for general use. Single linkage can chain — creating long, thin clusters — but detects non-convex shapes that Ward misses.
Gaussian Mixture Models
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(
n_components=5,
covariance_type='full', # 'full', 'tied', 'diag', 'spherical'
n_init=5,
max_iter=200,
random_state=42,
)
gmm.fit(X_scaled)
labels = gmm.predict(X_scaled)
probabilities = gmm.predict_proba(X_scaled) # soft assignments
# Model selection with BIC/AIC
bic_scores = []
aic_scores = []
for k in range(2, 15):
gm = GaussianMixture(n_components=k, covariance_type='full', random_state=42)
gm.fit(X_scaled)
bic_scores.append(gm.bic(X_scaled))
aic_scores.append(gm.aic(X_scaled))
optimal_k_bic = np.argmin(bic_scores) + 2
print(f"Optimal components (BIC): {optimal_k_bic}")
Covariance types control model flexibility:
spherical— all clusters are round (like K-Means)diag— axis-aligned ellipsesfull— arbitrary ellipses (most flexible, most parameters)tied— all clusters share the same shape
Evaluation framework
def evaluate_clustering(X, labels, algorithm_name):
"""Comprehensive clustering evaluation."""
mask = labels != -1 # exclude noise points for DBSCAN
X_clean = X[mask]
labels_clean = labels[mask]
n_clusters = len(set(labels_clean))
if n_clusters < 2:
print(f"{algorithm_name}: Only {n_clusters} cluster(s) found — cannot evaluate")
return
metrics = {
'n_clusters': n_clusters,
'noise_points': (~mask).sum(),
'silhouette': silhouette_score(X_clean, labels_clean),
'calinski_harabasz': calinski_harabasz_score(X_clean, labels_clean),
'davies_bouldin': davies_bouldin_score(X_clean, labels_clean),
}
# Cluster size distribution
sizes = np.bincount(labels_clean)
metrics['min_cluster_size'] = sizes.min()
metrics['max_cluster_size'] = sizes.max()
metrics['size_ratio'] = sizes.max() / sizes.min()
print(f"\n{algorithm_name}:")
for k, v in metrics.items():
print(f" {k}: {v:.4f}" if isinstance(v, float) else f" {k}: {v}")
return metrics
Scaling to large datasets
| Algorithm | Time Complexity | Memory | Max Practical Size |
|---|---|---|---|
| K-Means | O(nkdi) | O(nk) | Millions |
| Mini-Batch K-Means | O(bkdi) | O(bk) | Billions |
| DBSCAN | O(n log n) with tree | O(n) | Hundreds of thousands |
| Agglomerative | O(n³) / O(n²) | O(n²) | ~50K |
| GMM | O(nk²d²i) | O(nk) | Hundreds of thousands |
For datasets exceeding these limits:
- Subsample: Cluster a representative sample, then assign remaining points to nearest cluster center
- Dimensionality reduction: Apply PCA or UMAP before clustering to reduce feature space
- Approximate algorithms: Use
MiniBatchKMeansor FAISS for K-Means at billion scale
# Two-stage approach for large datasets
from sklearn.decomposition import PCA
# Stage 1: Reduce dimensions
pca = PCA(n_components=10, random_state=42)
X_reduced = pca.fit_transform(X_scaled)
# Stage 2: Cluster in reduced space
kmeans = MiniBatchKMeans(n_clusters=5, batch_size=2048, random_state=42)
labels = kmeans.fit_predict(X_reduced)
Tradeoffs
| Algorithm | Assumptions | Handles Noise | Arbitrary Shapes | Soft Assignments | Scalability |
|---|---|---|---|---|---|
| K-Means | Spherical, equal size | No | No | No | Excellent |
| DBSCAN | Uniform density | Yes | Yes | No | Good |
| Agglomerative | Depends on linkage | No | Depends | No | Poor |
| GMM | Gaussian distributions | No | Elliptical | Yes | Good |
| Mean Shift | Smooth density | No | Yes | No | Poor |
One thing to remember: Clustering is exploratory, not definitive. Run multiple algorithms, compare their outputs, and validate against domain knowledge. The best clustering is the one that produces actionable insights for your specific use case.
See Also
- Activation Functions Why neural networks need these tiny mathematical functions — and how ReLU's simplicity accidentally made deep learning possible.
- Ai Agents Architecture How AI systems go from answering questions to actually doing things — the design patterns that turn language models into autonomous agents that browse, code, and plan.
- Ai Agents ChatGPT answers questions. AI agents actually do things — browse the web, write code, send emails, and keep going until the job is done. Here's the difference.
- Ai Ethics Why building AI fairly is harder than it sounds — bias, accountability, privacy, and who gets to decide what AI is allowed to do.
- Ai Hallucinations ChatGPT sometimes makes up facts with total confidence. Here's the weird reason why — and why it's not as simple as 'the AI lied.'