Scikit-Learn Clustering Algorithms — Deep Dive

Production clustering with scikit-learn — from K-Means internals and DBSCAN parameter tuning to evaluation frameworks and scaling strategies.

Technical foundation

Clustering algorithms optimize different objective functions, which is why they find different structures in the same data:

K-Means minimizes within-cluster sum of squares (WCSS): Σ Σ ||x - μ_k||²
DBSCAN maximizes density-connected components
Agglomerative minimizes a linkage criterion (Ward’s = WCSS increment, complete = max inter-cluster distance)
GMM maximizes the log-likelihood of a mixture of Gaussians

Understanding these objectives explains each algorithm’s biases: K-Means prefers spherical clusters because WCSS penalizes distance from centroids. DBSCAN prefers uniformly dense regions because its density criterion doesn’t adapt to varying densities.

K-Means: deep implementation

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Generate clustered data
X, y_true = make_blobs(n_samples=10000, centers=5, n_features=15, random_state=42)

# ALWAYS scale before K-Means — it uses Euclidean distance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(
    n_clusters=5,
    init='k-means++',    # smart initialization, avoids poor convergence
    n_init=10,           # run 10 times with different seeds, keep best
    max_iter=300,
    tol=1e-4,            # convergence threshold
    algorithm='lloyd',   # 'lloyd' (standard) or 'elkan' (faster for low-d)
    random_state=42,
)

kmeans.fit(X_scaled)

print(f"Inertia: {kmeans.inertia_:.0f}")
print(f"Iterations: {kmeans.n_iter_}")
print(f"Cluster sizes: {np.bincount(kmeans.labels_)}")

K-Means++ initialization

Standard random initialization can place centroids in the same cluster, causing poor convergence. K-Means++ selects initial centroids that are spread apart:

Choose the first centroid randomly
For each subsequent centroid, select a point with probability proportional to its squared distance from the nearest existing centroid
Repeat until k centroids are placed

This guarantees an O(log k)-competitive solution and dramatically improves convergence in practice.

Optimal K selection

from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import matplotlib.pyplot as plt

k_range = range(2, 15)
metrics = {'inertia': [], 'silhouette': [], 'calinski': [], 'davies_bouldin': []}

for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)

    metrics['inertia'].append(km.inertia_)
    metrics['silhouette'].append(silhouette_score(X_scaled, labels))
    metrics['calinski'].append(calinski_harabasz_score(X_scaled, labels))
    metrics['davies_bouldin'].append(davies_bouldin_score(X_scaled, labels))

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].plot(k_range, metrics['inertia'], 'bo-')
axes[0, 0].set_title('Elbow Method (Inertia)')

axes[0, 1].plot(k_range, metrics['silhouette'], 'go-')
axes[0, 1].set_title('Silhouette Score (higher = better)')

axes[1, 0].plot(k_range, metrics['calinski'], 'ro-')
axes[1, 0].set_title('Calinski-Harabasz (higher = better)')

axes[1, 1].plot(k_range, metrics['davies_bouldin'], 'mo-')
axes[1, 1].set_title('Davies-Bouldin (lower = better)')

for ax in axes.flat:
    ax.set_xlabel('Number of clusters (k)')

plt.tight_layout()

Interpreting results: When metrics disagree on optimal K, prefer the K that makes domain sense. A retail customer segmentation with K=47 might score well mathematically but be useless for marketing.

Mini-Batch K-Means for scale

mbk = MiniBatchKMeans(
    n_clusters=5,
    batch_size=1024,
    n_init=10,
    max_iter=300,
    random_state=42,
)

# 10-100x faster than standard K-Means on large datasets
# Slight quality tradeoff (typically <1% silhouette score difference)
mbk.fit(X_scaled)

Mini-Batch K-Means uses random mini-batches instead of the full dataset per iteration, reducing per-iteration cost from O(nkd) to O(bkd) where b << n.

DBSCAN: density-based clustering

from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Heuristic for eps: k-distance plot
nn = NearestNeighbors(n_neighbors=10)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])

plt.figure(figsize=(10, 5))
plt.plot(k_distances)
plt.xlabel('Points (sorted by 10-NN distance)')
plt.ylabel('10-NN Distance')
plt.title('K-Distance Plot — Look for the "knee"')
plt.tight_layout()

# The "knee" in this plot suggests a good eps value
eps_value = 1.5  # read from the knee of the plot

dbscan = DBSCAN(
    eps=eps_value,
    min_samples=10,      # rule of thumb: 2 × n_features
    metric='euclidean',
    algorithm='auto',    # 'ball_tree', 'kd_tree', or 'brute'
    n_jobs=-1,
)

labels = dbscan.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise} ({n_noise/len(labels)*100:.1f}%)")

DBSCAN parameter sensitivity

# Systematic exploration of eps and min_samples
from itertools import product

results = []
for eps in np.arange(0.3, 3.0, 0.2):
    for min_samples in [5, 10, 15, 20, 30]:
        db = DBSCAN(eps=eps, min_samples=min_samples)
        labels = db.fit_predict(X_scaled)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        noise_ratio = (labels == -1).sum() / len(labels)

        if n_clusters >= 2:
            non_noise = labels != -1
            sil = silhouette_score(X_scaled[non_noise], labels[non_noise]) if non_noise.sum() > 1 else -1
        else:
            sil = -1

        results.append({
            'eps': eps, 'min_samples': min_samples,
            'n_clusters': n_clusters, 'noise_ratio': noise_ratio,
            'silhouette': sil
        })

import pandas as pd
results_df = pd.DataFrame(results)
best = results_df.loc[results_df['silhouette'].idxmax()]
print(f"Best: eps={best['eps']:.1f}, min_samples={best['min_samples']:.0f}, "
      f"clusters={best['n_clusters']:.0f}, silhouette={best['silhouette']:.4f}")

Agglomerative clustering

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Compute linkage for dendrogram
Z = linkage(X_scaled[:2000], method='ward', metric='euclidean')

plt.figure(figsize=(14, 6))
dendrogram(Z, truncate_mode='lastp', p=30, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.tight_layout()

# Fit with chosen number of clusters
agg = AgglomerativeClustering(
    n_clusters=5,
    linkage='ward',          # 'ward', 'complete', 'average', 'single'
    metric='euclidean',
    connectivity=None,       # optional sparse connectivity matrix for spatial constraints
)

labels = agg.fit_predict(X_scaled)

Linkage methods compared

Linkage	Merging Criterion	Cluster Shape	Robustness
Ward	Min variance increase	Spherical, equal size	High
Complete	Max inter-cluster distance	Compact, similar diameter	Medium
Average	Mean inter-cluster distance	Moderate	Medium
Single	Min inter-cluster distance	Arbitrary (chaining risk)	Low

Ward linkage is the default and most robust for general use. Single linkage can chain — creating long, thin clusters — but detects non-convex shapes that Ward misses.

Gaussian Mixture Models

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=5,
    covariance_type='full',  # 'full', 'tied', 'diag', 'spherical'
    n_init=5,
    max_iter=200,
    random_state=42,
)

gmm.fit(X_scaled)
labels = gmm.predict(X_scaled)
probabilities = gmm.predict_proba(X_scaled)  # soft assignments

# Model selection with BIC/AIC
bic_scores = []
aic_scores = []
for k in range(2, 15):
    gm = GaussianMixture(n_components=k, covariance_type='full', random_state=42)
    gm.fit(X_scaled)
    bic_scores.append(gm.bic(X_scaled))
    aic_scores.append(gm.aic(X_scaled))

optimal_k_bic = np.argmin(bic_scores) + 2
print(f"Optimal components (BIC): {optimal_k_bic}")

Covariance types control model flexibility:

spherical — all clusters are round (like K-Means)
diag — axis-aligned ellipses
full — arbitrary ellipses (most flexible, most parameters)
tied — all clusters share the same shape

Evaluation framework

def evaluate_clustering(X, labels, algorithm_name):
    """Comprehensive clustering evaluation."""
    mask = labels != -1  # exclude noise points for DBSCAN
    X_clean = X[mask]
    labels_clean = labels[mask]

    n_clusters = len(set(labels_clean))
    if n_clusters < 2:
        print(f"{algorithm_name}: Only {n_clusters} cluster(s) found — cannot evaluate")
        return

    metrics = {
        'n_clusters': n_clusters,
        'noise_points': (~mask).sum(),
        'silhouette': silhouette_score(X_clean, labels_clean),
        'calinski_harabasz': calinski_harabasz_score(X_clean, labels_clean),
        'davies_bouldin': davies_bouldin_score(X_clean, labels_clean),
    }

    # Cluster size distribution
    sizes = np.bincount(labels_clean)
    metrics['min_cluster_size'] = sizes.min()
    metrics['max_cluster_size'] = sizes.max()
    metrics['size_ratio'] = sizes.max() / sizes.min()

    print(f"\n{algorithm_name}:")
    for k, v in metrics.items():
        print(f"  {k}: {v:.4f}" if isinstance(v, float) else f"  {k}: {v}")

    return metrics

Scaling to large datasets

Algorithm	Time Complexity	Memory	Max Practical Size
K-Means	O(nkdi)	O(nk)	Millions
Mini-Batch K-Means	O(bkdi)	O(bk)	Billions
DBSCAN	O(n log n) with tree	O(n)	Hundreds of thousands
Agglomerative	O(n³) / O(n²)	O(n²)	~50K
GMM	O(nk²d²i)	O(nk)	Hundreds of thousands

For datasets exceeding these limits:

Subsample: Cluster a representative sample, then assign remaining points to nearest cluster center
Dimensionality reduction: Apply PCA or UMAP before clustering to reduce feature space
Approximate algorithms: Use MiniBatchKMeans or FAISS for K-Means at billion scale

# Two-stage approach for large datasets
from sklearn.decomposition import PCA

# Stage 1: Reduce dimensions
pca = PCA(n_components=10, random_state=42)
X_reduced = pca.fit_transform(X_scaled)

# Stage 2: Cluster in reduced space
kmeans = MiniBatchKMeans(n_clusters=5, batch_size=2048, random_state=42)
labels = kmeans.fit_predict(X_reduced)

Tradeoffs

Algorithm	Assumptions	Handles Noise	Arbitrary Shapes	Soft Assignments	Scalability
K-Means	Spherical, equal size	No	No	No	Excellent
DBSCAN	Uniform density	Yes	Yes	No	Good
Agglomerative	Depends on linkage	No	Depends	No	Poor
GMM	Gaussian distributions	No	Elliptical	Yes	Good
Mean Shift	Smooth density	No	Yes	No	Poor

One thing to remember: Clustering is exploratory, not definitive. Run multiple algorithms, compare their outputs, and validate against domain knowledge. The best clustering is the one that produces actionable insights for your specific use case.

pythonmachine-learningscikit-learn