Anomaly Detection with Python — Core Concepts

Understand the main anomaly detection algorithms, when to use each one, and how to implement them in Python.

What makes anomaly detection different from classification

In classification, you have labeled examples of each category — spam vs. not-spam, cat vs. dog. In anomaly detection, you typically have mountains of normal data and very few (or zero) examples of anomalies. This asymmetry changes the entire approach. You model what normal looks like and flag deviations.

Types of anomalies

Point anomalies: a single data point is abnormal (a $50,000 transaction on a card that usually sees $50 purchases).
Contextual anomalies: normal in one context, abnormal in another (80°F in July is fine; 80°F in January in Chicago is not).
Collective anomalies: a sequence of points that is abnormal as a group, even if each individual point seems fine (a server making 100 requests per second is normal, but making exactly 100 per second for 24 hours straight is not).
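The contextual case can be sketched in a few lines. The example below is an illustration only: the temperatures are synthetic, the two "contexts" are just month numbers, and the 3-standard-deviation cut-off is an assumption. It scores the same reading twice, once against all the data and once against its own month.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily highs in °F (made-up numbers): month 1 and month 7.
january = rng.normal(loc=28.0, scale=5.0, size=200)
july = rng.normal(loc=82.0, scale=4.0, size=200)
january[0] = 80.0  # planted contextual anomaly: a July-like reading in January

temps = np.concatenate([january, july])
months = np.concatenate([np.full(200, 1), np.full(200, 7)])

def zscores(x: np.ndarray) -> np.ndarray:
    """Absolute distance from the mean, in standard deviations."""
    return np.abs(x - x.mean()) / x.std()

# Global view: compare every reading with the whole year.
global_z = zscores(temps)

# Contextual view: compare every reading only with its own month.
context_z = np.empty_like(temps)
for m in np.unique(months):
    mask = months == m
    context_z[mask] = zscores(temps[mask])

print("flagged globally:  ", bool(global_z[0] > 3.0))
print("flagged in context:", bool(context_z[0] > 3.0))
```

The planted 80°F reading sits comfortably inside the yearly range, so the global score does not flag it; scored against January alone, it stands out.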

The main algorithms

Statistical methods — Z-score and IQR

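The simplest approach: measure how far each point lies from the mean, in units of standard deviation (the z-score). Points beyond a threshold (typically 3 standard deviations) are flagged. The IQR variant instead flags points that fall more than 1.5 times the interquartile range below the first quartile or above the third.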

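import numpy as np

def zscore_anomalies(data: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask of points more than `threshold` standard deviations from the mean."""
    mean = np.mean(data)
    std = np.std(data)
    if std == 0:
        # Constant data: nothing deviates from the mean, and dividing would give NaN.
        return np.zeros(data.shape, dtype=bool)
    z_scores = np.abs((data - mean) / std)
    return z_scores > threshold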

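Works well for simple, roughly normally distributed, one-dimensional data. It breaks down with skewed distributions, multiple modes, or high-dimensional data. Extreme values also inflate the very mean and standard deviation they are measured against, which can hide them in small samples.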
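The IQR rule named in the heading can be sketched the same way. This is a minimal version using Tukey's fences; the function name `iqr_anomalies`, the multiplier `k = 1.5`, and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def iqr_anomalies(data: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return (data < q1 - k * iqr) | (data > q3 + k * iqr)

# Made-up sample: seven ordinary values and one extreme one.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

iqr_flags = iqr_anomalies(values)

# For comparison: the 3-standard-deviation rule on the same sample.
z = np.abs(values - values.mean()) / values.std()
z_flags = z > 3.0

print("IQR flags:    ", iqr_flags)
print("z-score flags:", z_flags)
```

On this sample the IQR rule flags only the 95, while the z-score rule flags nothing: with eight points, the single extreme value inflates the standard deviation enough to keep its own z-score under 3. Quartiles are far less sensitive to one extreme value, which is the main reason to prefer IQR on small or skewed samples.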

Isolation Forest

Instead of modeling normal data, Isolation Forest isolates anomalies directly. The logic: anomalies are rare and different, so they require fewer random splits to isolate in a tree structure.

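from sklearn.ensemble import IsolationForest

# data: array-like of shape (n_samples, n_features)
# contamination=0.02 assumes roughly 2% of the rows are anomalous
model = IsolationForest(contamination=0.02, random_state=42)
model.fit(data)
predictions = model.predict(data)  # -1 = anomaly, 1 = normal
scores = model.decision_function(data)  # lower = more anomalous; negative = flagged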

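Isolation Forest handles high-dimensional data well and does not assume any particular distribution. It also trains quickly, because each tree is built on a small random subsample. That makes it a common first choice for tabular data in production systems.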

Local Outlier Factor (LOF)

LOF compares the density of points around each observation to the density around its neighbors. Points in sparse regions surrounded by dense neighborhoods are flagged as outliers.

from sklearn.neighbors import LocalOutlierFactor

# With the default novelty=False, LOF only labels the data it was fitted on.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
predictions = lof.fit_predict(data)  # -1 = anomaly, 1 = normal

LOF excels at detecting local anomalies — points that are normal globally but unusual in their local neighborhood. It struggles with very high-dimensional data due to the curse of dimensionality.
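To make "normal globally but unusual locally" concrete, here is a small sketch on synthetic data. The two clusters, the planted point at (1, 1), and the parameter values are all assumptions made up for the illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic data: a tight cluster and a loose one.
dense = rng.normal(loc=0.0, scale=0.1, size=(200, 2))
sparse = rng.normal(loc=10.0, scale=2.0, size=(200, 2))
local_outlier = np.array([[1.0, 1.0]])  # near the tight cluster, but not part of it
data = np.vstack([dense, sparse, local_outlier])

# Global view: per-feature z-scores of the planted point (last row).
global_z = np.abs(data[-1] - data.mean(axis=0)) / data.std(axis=0)

# Local view: LOF compares the point's density with that of its 20 nearest neighbors.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
predictions = lof.fit_predict(data)
lof_scores = -lof.negative_outlier_factor_  # near 1 = inlier, much larger = outlier

print("max per-feature z-score:", float(global_z.max()))
print("LOF label:", int(predictions[-1]), "LOF score:", float(lof_scores[-1]))
```

The planted point lies well within the overall range of both features, so a global z-score sees nothing unusual. Its nearest neighbors, however, all belong to the tight cluster and are packed far more densely than it is, which is exactly the situation LOF is built to catch.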

Autoencoders (deep learning)

Train a neural network to compress and reconstruct normal data. Anomalies produce high reconstruction error because the model never learned to handle them:

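import numpy as np

# Assumes `autoencoder` has already been trained on normal data only, that
# `test_data` is scaled the same way as the training data, and that `threshold`
# was chosen beforehand (for example from the errors seen on normal data).
reconstructed = autoencoder.predict(test_data)
reconstruction_error = np.mean((test_data - reconstructed) ** 2, axis=1)  # per-sample MSE
anomalies = reconstruction_error > threshold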

Autoencoders are powerful for complex, high-dimensional data (network traffic, sensor readings) but require more data and tuning than simpler methods.
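The reconstruction-error idea, including one way to pick the threshold, can be tried without a deep learning framework. The sketch below swaps the neural network for PCA, which behaves like a linear autoencoder; it is a stand-in for illustration, not a replacement for a real autoencoder. The synthetic data, the two-component bottleneck, and the 99th-percentile threshold are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data: 5 features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 5))
train = latent @ mixing + 0.05 * rng.normal(size=(1000, 5))

# Linear "encoder/decoder": the top-2 principal directions of the training data.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]  # shape (2, 5)

def reconstruction_error(x: np.ndarray) -> np.ndarray:
    """Per-sample mean squared error after compressing to 2 dimensions and back."""
    centered = x - mean
    reconstructed = centered @ components.T @ components
    return np.mean((centered - reconstructed) ** 2, axis=1)

# Threshold taken from the errors on normal data (99th percentile is an assumption).
threshold = np.percentile(reconstruction_error(train), 99)

# One point that follows the learned structure and one that breaks it.
normal_point = np.array([[0.5, -1.0]]) @ mixing
odd_point = normal_point + np.array([[3.0, -3.0, 3.0, -3.0, 3.0]])
test_data = np.vstack([normal_point, odd_point])

anomalies = reconstruction_error(test_data) > threshold
print(anomalies)
```

Deriving the threshold from a percentile of the training errors means about 1% of normal training points would themselves be flagged, so the percentile is effectively a false-positive budget. The same thresholding step applies unchanged when the reconstruction comes from a trained neural autoencoder.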

Choosing the right method

Scenario	Recommended approach	Why
Low-dimensional, clean data	Z-score or IQR	Simple, interpretable
Tabular data, unknown distribution	Isolation Forest	No distribution assumptions, fast
Density varies across regions	Local Outlier Factor	Captures local context
High-dimensional, complex patterns	Autoencoder	Learns nonlinear representations
Time series data	Statistical process control or LSTM	Respects temporal ordering

The contamination problem

Many implementations, including scikit-learn's IsolationForest and LocalOutlierFactor, take a contamination parameter: your estimate of the fraction of the data that is anomalous, which sets the cut-off on the anomaly score. If you guess 1% but the true rate is 5%, you will miss many anomalies. If you guess 10% when the true rate is 0.1%, you will drown in false positives.

When you do not know the contamination rate, start with the algorithm's raw anomaly scores and set the threshold interactively by examining the most anomalous points first (in scikit-learn, those are the ones with the lowest decision_function values).
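One way to work from scores rather than a guessed contamination rate is sketched below. The synthetic data, the five planted points, and the hand-picked cut-off `k` are assumptions for illustration; `score_samples` is negated so that a higher value means more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data: one cluster plus five planted points on a circle of radius 8.
normal = rng.normal(size=(500, 2))
angles = np.linspace(0.0, 2.0 * np.pi, 5, endpoint=False)
planted = 8.0 * np.column_stack([np.cos(angles), np.sin(angles)])
data = np.vstack([normal, planted])

# Fit without committing to a contamination estimate (default is "auto").
model = IsolationForest(random_state=42).fit(data)
anomaly_score = -model.score_samples(data)  # flipped: higher = more anomalous

# Review the most anomalous rows by hand before deciding where to cut.
top = np.argsort(anomaly_score)[::-1][:10]
for i in top:
    print(i, data[i].round(2), round(float(anomaly_score[i]), 3))

# After inspection, choose how many of the top rows look like real anomalies.
k = 5  # hypothetical choice made by the reviewer
threshold = anomaly_score[top[k - 1]]
flagged = anomaly_score >= threshold
```

The printed list is the useful part: a visible gap in the scores between the last convincing anomaly and the first ordinary-looking row is usually a better threshold than any rate guessed in advance.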

Common misconception

People assume anomaly detection is fully automated — deploy it and forget it. Real systems require constant tuning. What counts as “normal” drifts over time (concept drift). A system deployed in January may produce false positives by March because user behavior changed. Regular retraining and threshold adjustment are essential.
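The effect of drift on a fixed detector is easy to reproduce. The sketch below uses synthetic numbers: a metric whose typical level moves from about 100 to about 130, a simple z-score detector, and a 3-standard-deviation threshold. All of these are assumptions picked to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic metric: behavior shifts between the two periods.
january = rng.normal(loc=100.0, scale=5.0, size=1000)
march = rng.normal(loc=130.0, scale=5.0, size=1000)

def fit(x: np.ndarray) -> tuple[float, float]:
    """'Train' the detector: remember what normal looked like."""
    return float(x.mean()), float(x.std())

def flag_rate(x: np.ndarray, mean: float, std: float, threshold: float = 3.0) -> float:
    """Fraction of points flagged by a z-score rule with the given baseline."""
    return float(np.mean(np.abs(x - mean) / std > threshold))

# Detector trained in January and never updated.
stale_rate = flag_rate(march, *fit(january))

# Detector retrained on the first half of March, scored on the second half.
fresh_rate = flag_rate(march[500:], *fit(march[:500]))

print("flag rate with stale baseline:    ", stale_rate)
print("flag rate with retrained baseline:", fresh_rate)
```

With the stale baseline nearly every March reading is flagged, even though nothing is wrong; after retraining, the flag rate should fall back to a small fraction. Tracking the flag rate over time is itself a cheap drift alarm: a sustained jump usually means "normal" has moved, not that anomalies have multiplied.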

The one thing to remember: Anomaly detection algorithms model “normal” in different ways — statistical, density-based, isolation-based, or reconstruction-based — and the right choice depends on your data’s dimensionality, distribution, and whether anomalies are global or local.
