Scikit-Learn Clustering Algorithms — Deep Dive

Technical foundation

Clustering algorithms optimize different objective functions, which is why they find different structures in the same data:

  • K-Means minimizes within-cluster sum of squares (WCSS): Σ Σ ||x - μ_k||²
  • DBSCAN maximizes density-connected components
  • Agglomerative minimizes a linkage criterion (Ward’s = WCSS increment, complete = max inter-cluster distance)
  • GMM maximizes the log-likelihood of a mixture of Gaussians

Understanding these objectives explains each algorithm’s biases: K-Means prefers spherical clusters because WCSS penalizes distance from centroids. DBSCAN prefers uniformly dense regions because its density criterion doesn’t adapt to varying densities.

K-Means: deep implementation

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Generate clustered data
X, y_true = make_blobs(n_samples=10000, centers=5, n_features=15, random_state=42)

# ALWAYS scale before K-Means — it uses Euclidean distance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(
    n_clusters=5,
    init='k-means++',    # smart initialization, avoids poor convergence
    n_init=10,           # run 10 times with different seeds, keep best
    max_iter=300,
    tol=1e-4,            # convergence threshold
    algorithm='lloyd',   # 'lloyd' (standard) or 'elkan' (faster for low-d)
    random_state=42,
)

kmeans.fit(X_scaled)

print(f"Inertia: {kmeans.inertia_:.0f}")
print(f"Iterations: {kmeans.n_iter_}")
print(f"Cluster sizes: {np.bincount(kmeans.labels_)}")

K-Means++ initialization

Standard random initialization can place centroids in the same cluster, causing poor convergence. K-Means++ selects initial centroids that are spread apart:

  1. Choose the first centroid randomly
  2. For each subsequent centroid, select a point with probability proportional to its squared distance from the nearest existing centroid
  3. Repeat until k centroids are placed

This guarantees an O(log k)-competitive solution and dramatically improves convergence in practice.

Optimal K selection

from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import matplotlib.pyplot as plt

k_range = range(2, 15)
metrics = {'inertia': [], 'silhouette': [], 'calinski': [], 'davies_bouldin': []}

for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)

    metrics['inertia'].append(km.inertia_)
    metrics['silhouette'].append(silhouette_score(X_scaled, labels))
    metrics['calinski'].append(calinski_harabasz_score(X_scaled, labels))
    metrics['davies_bouldin'].append(davies_bouldin_score(X_scaled, labels))

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].plot(k_range, metrics['inertia'], 'bo-')
axes[0, 0].set_title('Elbow Method (Inertia)')

axes[0, 1].plot(k_range, metrics['silhouette'], 'go-')
axes[0, 1].set_title('Silhouette Score (higher = better)')

axes[1, 0].plot(k_range, metrics['calinski'], 'ro-')
axes[1, 0].set_title('Calinski-Harabasz (higher = better)')

axes[1, 1].plot(k_range, metrics['davies_bouldin'], 'mo-')
axes[1, 1].set_title('Davies-Bouldin (lower = better)')

for ax in axes.flat:
    ax.set_xlabel('Number of clusters (k)')

plt.tight_layout()

Interpreting results: When metrics disagree on optimal K, prefer the K that makes domain sense. A retail customer segmentation with K=47 might score well mathematically but be useless for marketing.

Mini-Batch K-Means for scale

mbk = MiniBatchKMeans(
    n_clusters=5,
    batch_size=1024,
    n_init=10,
    max_iter=300,
    random_state=42,
)

# 10-100x faster than standard K-Means on large datasets
# Slight quality tradeoff (typically <1% silhouette score difference)
mbk.fit(X_scaled)

Mini-Batch K-Means uses random mini-batches instead of the full dataset per iteration, reducing per-iteration cost from O(nkd) to O(bkd) where b << n.

DBSCAN: density-based clustering

from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Heuristic for eps: k-distance plot
nn = NearestNeighbors(n_neighbors=10)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])

plt.figure(figsize=(10, 5))
plt.plot(k_distances)
plt.xlabel('Points (sorted by 10-NN distance)')
plt.ylabel('10-NN Distance')
plt.title('K-Distance Plot — Look for the "knee"')
plt.tight_layout()

# The "knee" in this plot suggests a good eps value
eps_value = 1.5  # read from the knee of the plot

dbscan = DBSCAN(
    eps=eps_value,
    min_samples=10,      # rule of thumb: 2 × n_features
    metric='euclidean',
    algorithm='auto',    # 'ball_tree', 'kd_tree', or 'brute'
    n_jobs=-1,
)

labels = dbscan.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise} ({n_noise/len(labels)*100:.1f}%)")

DBSCAN parameter sensitivity

# Systematic exploration of eps and min_samples
from itertools import product

results = []
for eps in np.arange(0.3, 3.0, 0.2):
    for min_samples in [5, 10, 15, 20, 30]:
        db = DBSCAN(eps=eps, min_samples=min_samples)
        labels = db.fit_predict(X_scaled)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        noise_ratio = (labels == -1).sum() / len(labels)

        if n_clusters >= 2:
            non_noise = labels != -1
            sil = silhouette_score(X_scaled[non_noise], labels[non_noise]) if non_noise.sum() > 1 else -1
        else:
            sil = -1

        results.append({
            'eps': eps, 'min_samples': min_samples,
            'n_clusters': n_clusters, 'noise_ratio': noise_ratio,
            'silhouette': sil
        })

import pandas as pd
results_df = pd.DataFrame(results)
best = results_df.loc[results_df['silhouette'].idxmax()]
print(f"Best: eps={best['eps']:.1f}, min_samples={best['min_samples']:.0f}, "
      f"clusters={best['n_clusters']:.0f}, silhouette={best['silhouette']:.4f}")

Agglomerative clustering

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Compute linkage for dendrogram
Z = linkage(X_scaled[:2000], method='ward', metric='euclidean')

plt.figure(figsize=(14, 6))
dendrogram(Z, truncate_mode='lastp', p=30, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.tight_layout()

# Fit with chosen number of clusters
agg = AgglomerativeClustering(
    n_clusters=5,
    linkage='ward',          # 'ward', 'complete', 'average', 'single'
    metric='euclidean',
    connectivity=None,       # optional sparse connectivity matrix for spatial constraints
)

labels = agg.fit_predict(X_scaled)

Linkage methods compared

LinkageMerging CriterionCluster ShapeRobustness
WardMin variance increaseSpherical, equal sizeHigh
CompleteMax inter-cluster distanceCompact, similar diameterMedium
AverageMean inter-cluster distanceModerateMedium
SingleMin inter-cluster distanceArbitrary (chaining risk)Low

Ward linkage is the default and most robust for general use. Single linkage can chain — creating long, thin clusters — but detects non-convex shapes that Ward misses.

Gaussian Mixture Models

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=5,
    covariance_type='full',  # 'full', 'tied', 'diag', 'spherical'
    n_init=5,
    max_iter=200,
    random_state=42,
)

gmm.fit(X_scaled)
labels = gmm.predict(X_scaled)
probabilities = gmm.predict_proba(X_scaled)  # soft assignments

# Model selection with BIC/AIC
bic_scores = []
aic_scores = []
for k in range(2, 15):
    gm = GaussianMixture(n_components=k, covariance_type='full', random_state=42)
    gm.fit(X_scaled)
    bic_scores.append(gm.bic(X_scaled))
    aic_scores.append(gm.aic(X_scaled))

optimal_k_bic = np.argmin(bic_scores) + 2
print(f"Optimal components (BIC): {optimal_k_bic}")

Covariance types control model flexibility:

  • spherical — all clusters are round (like K-Means)
  • diag — axis-aligned ellipses
  • full — arbitrary ellipses (most flexible, most parameters)
  • tied — all clusters share the same shape

Evaluation framework

def evaluate_clustering(X, labels, algorithm_name):
    """Comprehensive clustering evaluation."""
    mask = labels != -1  # exclude noise points for DBSCAN
    X_clean = X[mask]
    labels_clean = labels[mask]

    n_clusters = len(set(labels_clean))
    if n_clusters < 2:
        print(f"{algorithm_name}: Only {n_clusters} cluster(s) found — cannot evaluate")
        return

    metrics = {
        'n_clusters': n_clusters,
        'noise_points': (~mask).sum(),
        'silhouette': silhouette_score(X_clean, labels_clean),
        'calinski_harabasz': calinski_harabasz_score(X_clean, labels_clean),
        'davies_bouldin': davies_bouldin_score(X_clean, labels_clean),
    }

    # Cluster size distribution
    sizes = np.bincount(labels_clean)
    metrics['min_cluster_size'] = sizes.min()
    metrics['max_cluster_size'] = sizes.max()
    metrics['size_ratio'] = sizes.max() / sizes.min()

    print(f"\n{algorithm_name}:")
    for k, v in metrics.items():
        print(f"  {k}: {v:.4f}" if isinstance(v, float) else f"  {k}: {v}")

    return metrics

Scaling to large datasets

AlgorithmTime ComplexityMemoryMax Practical Size
K-MeansO(nkdi)O(nk)Millions
Mini-Batch K-MeansO(bkdi)O(bk)Billions
DBSCANO(n log n) with treeO(n)Hundreds of thousands
AgglomerativeO(n³) / O(n²)O(n²)~50K
GMMO(nk²d²i)O(nk)Hundreds of thousands

For datasets exceeding these limits:

  1. Subsample: Cluster a representative sample, then assign remaining points to nearest cluster center
  2. Dimensionality reduction: Apply PCA or UMAP before clustering to reduce feature space
  3. Approximate algorithms: Use MiniBatchKMeans or FAISS for K-Means at billion scale
# Two-stage approach for large datasets
from sklearn.decomposition import PCA

# Stage 1: Reduce dimensions
pca = PCA(n_components=10, random_state=42)
X_reduced = pca.fit_transform(X_scaled)

# Stage 2: Cluster in reduced space
kmeans = MiniBatchKMeans(n_clusters=5, batch_size=2048, random_state=42)
labels = kmeans.fit_predict(X_reduced)

Tradeoffs

AlgorithmAssumptionsHandles NoiseArbitrary ShapesSoft AssignmentsScalability
K-MeansSpherical, equal sizeNoNoNoExcellent
DBSCANUniform densityYesYesNoGood
AgglomerativeDepends on linkageNoDependsNoPoor
GMMGaussian distributionsNoEllipticalYesGood
Mean ShiftSmooth densityNoYesNoPoor

One thing to remember: Clustering is exploratory, not definitive. Run multiple algorithms, compare their outputs, and validate against domain knowledge. The best clustering is the one that produces actionable insights for your specific use case.

pythonmachine-learningscikit-learn

See Also

  • Activation Functions Why neural networks need these tiny mathematical functions — and how ReLU's simplicity accidentally made deep learning possible.
  • Ai Agents Architecture How AI systems go from answering questions to actually doing things — the design patterns that turn language models into autonomous agents that browse, code, and plan.
  • Ai Agents ChatGPT answers questions. AI agents actually do things — browse the web, write code, send emails, and keep going until the job is done. Here's the difference.
  • Ai Ethics Why building AI fairly is harder than it sounds — bias, accountability, privacy, and who gets to decide what AI is allowed to do.
  • Ai Hallucinations ChatGPT sometimes makes up facts with total confidence. Here's the weird reason why — and why it's not as simple as 'the AI lied.'