Anomaly Detection — Core Concepts

Isolation Forest, Autoencoders for anomaly scoring, statistical methods, the rare event problem, and how real fraud detection systems combine multiple approaches.

The Three Types of Anomaly Detection

Point anomalies: Individual data points that are unusual. A $50,000 transaction on an account that typically has $200 transactions.

Contextual anomalies: Data points that are unusual in context but not inherently. A 60°F temperature is normal in October but anomalous in July. An employee accessing files at 3am is unusual even if they accessed the same files at 2pm.

Collective anomalies: A sequence of individually normal data points that together form an anomalous pattern. No single network packet is suspicious, but 10,000 packets per second to an unusual IP at 3am is.

Most real-world problems involve contextual or collective anomalies, which makes the problem harder than just finding outliers in a feature space.

Statistical Methods

Z-score / Standard Deviation: Flag points more than $k\sigma$ from the mean. Works for Gaussian data; fails for heavy-tailed distributions. Appropriate for simple single-variable monitoring.
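As a minimal sketch of the z-score rule, the snippet below flags values more than $k$ standard deviations from the mean. The function name `zscore_flags`, the sample amounts, and the choice $k = 3$ are illustrative assumptions, not part of any particular system.

```python
import statistics

def zscore_flags(values, k=3.0):
    """Return indices of values more than k standard deviations from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > k]

# Hypothetical transaction amounts: eleven typical values and one large one.
amounts = [200, 210, 190, 205, 195, 198, 202, 207, 193, 201, 199, 50_000]
flagged = zscore_flags(amounts, k=3.0)
print(flagged)
```

Note that the outlier itself inflates both the mean and the standard deviation, which is one reason this rule degrades on heavy-tailed data.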

Mahalanobis distance: Multivariate generalization of z-score. Accounts for correlations between features: $$d(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

Where $\mu$ is the mean and $\Sigma$ is the covariance matrix of the training data. Mahalanobis distance is scale-invariant — if feature A has 100x more variance than B, it doesn’t dominate the distance metric.
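The scale-invariance claim can be checked numerically. The sketch below is restricted to two features so the covariance inverse can be written by hand with the standard library; the helper names and sample points are assumptions for illustration, and a real pipeline would more likely use a linear algebra library.

```python
import math

def fit_gaussian_2d(points):
    """Estimate the mean and sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(x, mean, cov):
    """d(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu)) for a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Inverse of [[sxx, sxy], [sxy, syy]] is (1/det) * [[syy, -sxy], [-sxy, sxx]].
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

# Hypothetical training points with correlated features.
pts = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 3.0), (5.0, 5.5), (6.0, 5.0)]
query = (2.0, 5.0)
mean, cov = fit_gaussian_2d(pts)
d = mahalanobis_2d(query, mean, cov)

# Rescale feature A by 100x: the distance should not change.
scaled = [(100 * a, b) for a, b in pts]
mean_s, cov_s = fit_gaussian_2d(scaled)
d_scaled = mahalanobis_2d((100 * query[0], query[1]), mean_s, cov_s)
print(d, d_scaled)
```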

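CUSUM (Cumulative Sum): For streaming time series, accumulates deviations from a baseline and raises an alarm once the running sum crosses a threshold, so it can detect small but persistent shifts that no single point would reveal. Appropriate for manufacturing quality control and system monitoring.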
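One common form is the two-sided tabular CUSUM, sketched below. The parameter names `target`, `slack`, and `threshold`, along with the sample readings, are assumptions chosen for illustration; in practice they would be tuned to the process being monitored.

```python
def cusum_alarm(xs, target, slack, threshold):
    """Return the index of the first CUSUM alarm, or None if none fires.

    s_hi accumulates upward drift beyond target + slack,
    s_lo accumulates downward drift beyond target - slack.
    """
    s_hi = 0.0
    s_lo = 0.0
    for t, x in enumerate(xs):
        s_hi = max(0.0, s_hi + (x - target - slack))
        s_lo = max(0.0, s_lo + (target - x - slack))
        if s_hi > threshold or s_lo > threshold:
            return t
    return None

# Hypothetical sensor readings: baseline near 10, then a shift to about 11.
readings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 11.0, 11.1, 10.9, 11.2, 11.0]
alarm_at = cusum_alarm(readings, target=10.0, slack=0.25, threshold=2.0)
print(alarm_at)
```

The slack term absorbs ordinary noise, so the sums stay at zero during the baseline and only grow once the mean has actually moved.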

Isolation Forest

Liu et al. (2008) proposed Isolation Forest — a tree-based algorithm with linear time complexity.

Key insight: Anomalies are rare and different. In a dataset where most points are clustered together, isolating an anomaly (separating it from all other points) requires fewer splits than isolating a normal point (which is surrounded by similar points).

Algorithm:

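1. Build many isolation trees (like decision trees, but splitting at random rather than on the best feature), each on a random subsample of the data.
2. Within each tree, at each node, pick a random feature and a random split value within that feature’s range.
3. Continue until each point is isolated or a depth limit is reached.
4. Compute the anomaly score from the average path length to isolation across all trees. The original paper normalizes it as $s(x) = 2^{-E[h(x)] / c(\psi)}$, where $h(x)$ is the path length, $\psi$ is the subsample size, and $c(\psi)$ is the average path length of an unsuccessful binary-search-tree lookup over $\psi$ points, so scores near 1 indicate anomalies.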

Shorter average path length → easier to isolate → more likely to be an anomaly.
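The steps above can be sketched in plain Python. This is a simplified illustration rather than a reference implementation: the function names and synthetic data are assumptions, and the score uses the normalization $2^{-E[h(x)]/c(\psi)}$ from the original paper, where $c(\psi)$ is the average path length of an unsuccessful binary-search-tree lookup over $\psi$ points.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split identical values
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(x, node, depth=0):
    """Depth at which x lands, plus an adjustment for unsplit leaf points."""
    if "split" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)

def fit_forest(points, n_trees=100, sample_size=256, seed=0):
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(points, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size

def anomaly_score(x, trees, size):
    """Score in (0, 1); shorter average paths give scores closer to 1."""
    mean_depth = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(size))

# Synthetic data: a Gaussian cluster plus one far-away point.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(500)]
data.append((8.0, 8.0))
trees, size = fit_forest(data, n_trees=100, sample_size=256, seed=1)
outlier_score = anomaly_score((8.0, 8.0), trees, size)
inlier_score = anomaly_score((0.0, 0.0), trees, size)
print(outlier_score, inlier_score)
```

The far-away point should receive the higher score because random splits separate it from the cluster in far fewer steps than a point in the dense center.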

Properties:

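- $O(t \psi \log \psi)$ training time for $t$ trees built on subsamples of size $\psi$, which does not grow with the dataset size $n$
- $O(t \log \psi)$ inference per sample, constant with respect to $n$
- Works well in high dimensions
- No distance or density estimation needed
- Doesn’t require choosing a distance metric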

Default hyperparameters (100 trees, sample size 256) work surprisingly well across datasets.

Autoencoder-Based Detection

Autoencoders learn a compressed representation (encoder) and reconstruction (decoder):

$$\hat{x} = \text{decoder}(\text{encoder}(x))$$

Trained on normal data only, the autoencoder learns to reconstruct normal patterns well. When given an anomalous input, it can’t reconstruct it accurately — the reconstruction error is high.

Anomaly score: $$s(x) = \|x - \hat{x}\|^2$$

Set a threshold $\tau$: flag points where $s(x) > \tau$.
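The scoring and thresholding step can be sketched without a trained network. In the snippet below, `toy_reconstruct` is a hypothetical stand-in for `decoder(encoder(x))` that only reproduces points with equal coordinates, and picking $\tau$ as a high quantile of the errors on held-out normal data is one common heuristic assumed here, not something the method prescribes.

```python
def reconstruction_error(x, x_hat):
    """Squared Euclidean distance between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def toy_reconstruct(x):
    # Hypothetical stand-in for decoder(encoder(x)): a one-number bottleneck
    # (the coordinate mean) that can only reproduce points with x1 == x2.
    m = sum(x) / len(x)
    return [m] * len(x)

def pick_threshold(scores, quantile=0.99):
    """Assumed heuristic: tau is a high quantile of scores on normal data."""
    ordered = sorted(scores)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx]

def is_anomaly(x, tau):
    return reconstruction_error(x, toy_reconstruct(x)) > tau

# Held-out normal data lying close to the line x1 == x2.
validation = [[i / 10, i / 10 + 0.01 * ((i % 3) - 1)] for i in range(100)]
tau = pick_threshold([reconstruction_error(x, toy_reconstruct(x))
                      for x in validation])

normal_flag = is_anomaly([2.0, 2.0], tau)     # lies on the learned pattern
anomaly_flag = is_anomaly([1.0, -1.0], tau)   # far from the learned pattern
print(tau, normal_flag, anomaly_flag)
```

Raising the quantile trades fewer false alarms for more missed anomalies, so the choice usually depends on the cost of each kind of error.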

Advantages: Works for complex data (images, time series, text) where defining “normal” via simple statistics is hard. End-to-end learned representations.

Disadvantages: Autoencoders can sometimes reconstruct anomalies well if they happen to lie in the learned manifold. A VAE (Variational Autoencoder) is often more stable.

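Deep SVDD (Ruff et al., 2018): Train a deep network to map normal data into a compact hypersphere in representation space. Anomalies map outside the sphere, and the distance from the sphere's center serves as the anomaly score. Unlike autoencoder reconstruction, the training objective is tied directly to the anomaly score.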

One-Class SVM

Schölkopf et al. (2001): Learn a decision boundary in feature space (via a kernel) that encloses the normal data. New points outside this boundary are anomalies.

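$$\min_{w,\, \rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu n} \sum_i \max(0, \rho - f(x_i)) - \rho$$

Where $f(x_i) = w^T \phi(x_i)$ is the score in the kernel-induced feature space, $\rho$ is the offset of the boundary, and $\nu \in (0, 1]$ is an upper bound on the fraction of training points treated as outliers (and a lower bound on the fraction of support vectors). A new point is flagged when $f(x) < \rho$. Works well for small datasets with clean normal data. Scales poorly to large datasets because the kernel matrix has $O(n^2)$ entries.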

Real-World Fraud Detection: Multi-Layer Systems

Production fraud detection at companies like PayPal, Visa, and Stripe combines:

Rule-based systems (fast, interpretable): “If transaction country ≠ card country AND amount > $500 → flag.” Runs in milliseconds. Catches known fraud patterns.

ML anomaly scoring (adaptive): Gradient boosting on user behavior features + isolation forest on transaction-level features. Learns patterns from historical fraud.

Graph-based fraud detection: Model the transaction network — users, merchants, devices, IP addresses as nodes. Fraud rings (organized groups making fraudulent transactions) appear as unusual community structures.

LLM/NLP analysis: Merchant name analysis, transaction description parsing, and analysis of communication content for account takeover detection.
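For the graph-based layer, a very rough first pass is to group accounts that are connected through shared devices or IP addresses. The sketch below uses union-find to compute connected components; treating any component of three or more accounts as suspicious is a simplifying assumption, and the identifiers are hypothetical. Real systems typically apply richer community-detection and graph-feature methods.

```python
from collections import defaultdict

def find(parent, x):
    """Find the root of x with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def account_clusters(links):
    """Group accounts connected through any chain of shared entities.

    links: iterable of (account_id, entity_id) pairs, where an entity
    is a device, IP address, or similar shared identifier.
    """
    parent = {}
    for account, entity in links:
        a, e = ("acct", account), ("ent", entity)
        parent.setdefault(a, a)
        parent.setdefault(e, e)
        root_a, root_e = find(parent, a), find(parent, e)
        if root_a != root_e:
            parent[root_a] = root_e
    groups = defaultdict(set)
    for node in parent:
        if node[0] == "acct":
            groups[find(parent, node)].add(node[1])
    return list(groups.values())

# Hypothetical observations: u1, u2, u3 are chained by a device and an IP.
links = [("u1", "dev_A"), ("u2", "dev_A"), ("u2", "ip_9"),
         ("u3", "ip_9"), ("u4", "dev_B"), ("u5", "dev_C")]
clusters = account_clusters(links)
suspicious = [sorted(c) for c in clusters if len(c) >= 3]
print(suspicious)
```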

The output is a fraud score. Scores above threshold → automatic decline. Middle range → additional authentication challenge (2FA). Low score → approve.
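The routing logic can be sketched as follows. The cutoffs `decline_at` and `challenge_at`, the transaction field names, and the way a rule hit lifts the score to the challenge level are all assumptions for illustration; real systems choose these from their own loss and friction trade-offs.

```python
def rule_flag(txn):
    """Example rule from the text: foreign transaction above $500."""
    return txn["txn_country"] != txn["card_country"] and txn["amount"] > 500

def route(ml_score, txn, decline_at=0.90, challenge_at=0.50):
    """Map a fraud score in [0, 1] to approve, challenge, or decline."""
    # Assumption: a rule hit raises the score to at least the challenge level.
    score = max(ml_score, challenge_at) if rule_flag(txn) else ml_score
    if score >= decline_at:
        return "decline"
    if score >= challenge_at:
        return "challenge"  # e.g. ask for 2FA
    return "approve"

# Hypothetical transactions.
txn_home = {"txn_country": "US", "card_country": "US", "amount": 40.0}
txn_abroad = {"txn_country": "BR", "card_country": "US", "amount": 900.0}

print(route(0.10, txn_home))    # low score, no rule hit
print(route(0.10, txn_abroad))  # low score, but the rule fires
print(route(0.95, txn_home))    # high score
```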

Visa processes ~200 million transactions daily with median decision time of 300ms. The ML scoring system contributes to each decision within this window.

One thing to remember: Production anomaly detection is almost never a single algorithm — it’s a layered system where rules catch known patterns quickly, ML scores novelty more generally, and humans review borderline cases and feed findings back to improve the system.
