Anomaly Detection — Core Concepts
The Three Types of Anomaly Detection
Point anomalies: Individual data points that are unusual. A $50,000 transaction on an account that typically has $200 transactions.
Contextual anomalies: Data points that are unusual only in a particular context, not in absolute terms. A 60°F temperature is normal in October but anomalous in July. An employee accessing files at 3am is unusual even if the same access at 2pm would be routine.
Collective anomalies: A sequence of individually normal data points that together form an anomalous pattern. No single network packet is suspicious, but 10,000 packets per second to an unusual IP at 3am is.
Most real-world problems involve contextual or collective anomalies, which makes the problem harder than just finding outliers in a feature space.
Statistical Methods
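Z-score / Standard Deviation: Flag points more than $k\sigma$ from the mean. Works when the data is roughly Gaussian; on heavy-tailed distributions the mean and standard deviation are themselves distorted by extreme values, so the threshold becomes unreliable. Appropriate for simple single-variable monitoring.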
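A minimal sketch of the z-score check, assuming the mean and standard deviation are estimated from a history of known-normal values and that $k = 3$ is an acceptable cutoff. The function names and the sample amounts are illustrative, not taken from any particular library.

```python
import statistics


def fit_baseline(history):
    """Estimate mean and sample standard deviation from known-normal values."""
    return statistics.fmean(history), statistics.stdev(history)


def z_score(value, mean, std):
    """Number of standard deviations between value and the baseline mean."""
    return (value - mean) / std


def is_point_anomaly(value, mean, std, k=3.0):
    """Flag a value that lies more than k standard deviations from the mean."""
    return abs(z_score(value, mean, std)) > k


# Hypothetical transaction amounts for an account that usually spends about $200.
history = [200, 210, 190, 205, 195, 198, 202, 207, 193]
mean, std = fit_baseline(history)

print(is_point_anomaly(50_000, mean, std))  # the large transaction
print(is_point_anomaly(215, mean, std))     # a slightly high but ordinary one
```

Fitting the baseline on history rather than on the batch being scored matters: an extreme value included in the estimate inflates the standard deviation and can hide itself.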
Mahalanobis distance: Multivariate generalization of z-score. Accounts for correlations between features: $$d(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$
where $\mu$ is the mean and $\Sigma$ is the covariance matrix of the training data (which must be invertible). Mahalanobis distance is scale-invariant: if feature A has 100x more variance than B, it doesn’t dominate the distance metric.
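The formula can be illustrated in two dimensions, where $\Sigma^{-1}$ has a closed form. This is a sketch under that restriction; the helper names and the toy data are assumptions for illustration, and a real system would use a linear algebra library for more features.

```python
import math


def mean_and_cov_2d(points):
    """Return the mean (mx, my) and sample covariance entries (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(x, mu, cov):
    """sqrt((x - mu)^T Sigma^{-1} (x - mu)) using the 2x2 inverse formula."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Sigma^{-1} = (1 / det) * [[syy, -sxy], [-sxy, sxx]]
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


# Toy training data: two strongly correlated features (y is roughly x).
train = [(0, 0.1), (1, 0.9), (2, 2.1), (3, 2.9), (4, 4.1), (5, 4.9), (6, 6.1)]
mu, cov = mean_and_cov_2d(train)

on_trend = mahalanobis_2d((5, 5), mu, cov)   # follows the correlation
off_trend = mahalanobis_2d((4, 2), mu, cov)  # breaks the correlation
print(on_trend, off_trend)
```

The off-trend point is closer to the mean in Euclidean terms but much farther in Mahalanobis terms, because it violates the correlation the covariance matrix encodes.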
CUSUM (Cumulative Sum): For streaming time series, detects when a process has shifted from baseline. Appropriate for manufacturing quality control and system monitoring.
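A sketch of a two-sided tabular CUSUM, one common formulation. The `target`, `slack`, and `threshold` parameter names and the values below are assumptions chosen for illustration; in practice they are tuned to the process being monitored.

```python
def cusum_alarm(values, target, slack, threshold):
    """Return the index of the first CUSUM alarm, or None if none is raised.

    s_hi accumulates deviations above target + slack, s_lo accumulates
    deviations below target - slack. Both reset to zero when they would go
    negative, so only sustained drift builds up.
    """
    s_hi = 0.0
    s_lo = 0.0
    for i, x in enumerate(values):
        s_hi = max(0.0, s_hi + (x - target - slack))
        s_lo = max(0.0, s_lo + (target - x - slack))
        if s_hi > threshold or s_lo > threshold:
            return i
    return None


# Hypothetical sensor readings: 20 in-control points, then a small upward shift.
stable = [10.1, 9.9, 10.0, 10.2, 9.8] * 4
shifted = [10.6, 10.4, 10.5, 10.7, 10.3] * 4

print(cusum_alarm(stable, target=10.0, slack=0.25, threshold=1.0))
print(cusum_alarm(stable + shifted, target=10.0, slack=0.25, threshold=1.0))
```

None of the shifted readings is extreme on its own; the alarm comes from the accumulated evidence of several small deviations in the same direction.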
Isolation Forest
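Liu et al. (2008) proposed Isolation Forest, a tree-based algorithm whose cost scales linearly with dataset size and which needs little memory because each tree is built on a small fixed-size subsample.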
Key insight: Anomalies are rare and different. In a dataset where most points are clustered together, isolating an anomaly (separating it from all other points) requires fewer splits than isolating a normal point (which is surrounded by similar points).
Algorithm:
- Build many isolation trees (like decision trees but splitting randomly, not on best feature)
- For each tree: at each node, pick a random feature and a random split value within that feature’s range
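- Continue until each point is isolated or a depth limit is reached
- Anomaly score is derived from the average path length to isolation across all trees; the original paper normalizes it as $s(x) = 2^{-E[h(x)]/c(\psi)}$, where $h(x)$ is the path length, $\psi$ is the subsample size, and $c(\psi)$ is the expected path length for a sample of that size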
Shorter average path length → easier to isolate → more likely to be an anomaly.
Properties:
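- $O(t\,\psi \log \psi)$ training time for $t$ trees built on subsamples of size $\psi$, so tree construction cost does not grow with the full dataset size $n$
- $O(t \log \psi)$ inference per sample, which is constant with respect to $n$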
- Works well in high dimensions
- No distance or density estimation needed, so there is no distance metric to choose
Default hyperparameters (100 trees, sample size 256) work surprisingly well across datasets.
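The algorithm can be sketched in plain Python. This is a teaching sketch, not a production implementation: the function names are hypothetical, the score normalization $2^{-E[h(x)]/c(\psi)}$ follows Liu et al. (2008), and the toy data (a Gaussian cluster plus one distant point) is made up for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Expected path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random value within its range."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split identical values
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, node):
    """Depth at which x lands in a leaf, plus a correction for unsplit leaf points."""
    depth = 0
    while "feature" in node:
        node = node["left"] if x[node["feature"]] < node["split"] else node["right"]
        depth += 1
    return depth + c_factor(node["size"])


def fit_forest(points, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample of the data."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    return trees, sample_size


def anomaly_score(x, trees, sample_size):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    mean_path = sum(path_length(x, tree) for tree in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(sample_size))


# Toy data: 300 points around the origin plus one far-away point.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(300)]
data.append((8.0, 8.0))

trees, psi = fit_forest(data)
outlier_score = anomaly_score((8.0, 8.0), trees, psi)
inlier_score = anomaly_score((0.0, 0.0), trees, psi)
print(outlier_score, inlier_score)
```

The distant point is usually cut off within a few random splits, while a point in the middle of the cluster tends to reach the depth limit, which is why its score is lower.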
Autoencoder-Based Detection
Autoencoders learn a compressed representation (encoder) and reconstruction (decoder):
$$\hat{x} = \text{decoder}(\text{encoder}(x))$$
Trained on normal data only, the autoencoder learns to reconstruct normal patterns well. When given an anomalous input, it typically reconstructs it poorly, so the reconstruction error is high.
Anomaly score: $$s(x) = \|x - \hat{x}\|^2$$
Set a threshold $\tau$: flag points where $s(x) > \tau$.
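The scoring and thresholding step can be sketched independently of any particular network. In the sketch below, `toy_reconstruct` is a hand-written stand-in for a trained `decoder(encoder(x))` (it projects 2-D points onto the line $y = x$, a one-dimensional bottleneck), and choosing $\tau$ as a high quantile of errors on held-out normal data is one common heuristic rather than the only option.

```python
def reconstruction_error(x, x_hat):
    """Squared Euclidean distance between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def threshold_from_normal(errors, quantile=0.99):
    """Pick tau as an empirical quantile of errors measured on normal data."""
    ordered = sorted(errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def is_anomaly(x, reconstruct, tau):
    """Flag x when its reconstruction error exceeds tau."""
    return reconstruction_error(x, reconstruct(x)) > tau


def toy_reconstruct(x):
    """Stand-in for a trained autoencoder: keeps only the component along y = x."""
    mid = (x[0] + x[1]) / 2.0
    return (mid, mid)


# Hypothetical held-out normal data, all lying close to the line y = x.
normal = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.1)]
errors = [reconstruction_error(x, toy_reconstruct(x)) for x in normal]
tau = threshold_from_normal(errors)

print(is_anomaly((5.0, 5.05), toy_reconstruct, tau))  # near the learned pattern
print(is_anomaly((1.0, -1.0), toy_reconstruct, tau))  # far from it
```

With a real model, only `toy_reconstruct` changes; the error computation and the choice of $\tau$ stay the same.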
Advantages: Works for complex data (images, time series, text) where defining “normal” via simple statistics is hard. End-to-end learned representations.
Disadvantages: Autoencoders can sometimes reconstruct anomalies well if they happen to lie on or near the learned manifold, in which case the reconstruction error stays low and the anomaly is missed. A VAE (Variational Autoencoder) is often more stable.
Deep SVDD (Ruff et al., 2018): Train a deep network to map normal data to a compact hypersphere in representation space. Anomalies map outside the sphere. More principled than autoencoder reconstruction.
One-Class SVM
Schölkopf et al. (2001): Learn a decision boundary in feature space (via a kernel) that encloses the normal data. New points outside this boundary are anomalies.
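$$\min_{w,\rho} \frac{1}{2}\|w\|^2 + \frac{1}{\nu n} \sum_i \max(0, \rho - \langle w, \phi(x_i) \rangle) - \rho$$
Here $\phi$ is the kernel feature map and the decision function is $f(x) = \langle w, \phi(x) \rangle - \rho$, which is negative for points outside the boundary. The parameter $\nu \in (0, 1]$ is an upper bound on the fraction of training points treated as outliers and a lower bound on the fraction of support vectors; in practice it is set to roughly the expected fraction of anomalies. Works well for small datasets with clean normal data. Scales poorly to large datasets due to the $O(n^2)$ kernel matrix.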
Real-World Fraud Detection: Multi-Layer Systems
Production fraud detection at companies like PayPal, Visa, and Stripe combines:
Rule-based systems (fast, interpretable): “If transaction country ≠ card country AND amount > $500 → flag.” Runs in milliseconds. Catches known fraud patterns.
ML anomaly scoring (adaptive): Gradient boosting on user behavior features + isolation forest on transaction-level features. Learns patterns from historical fraud.
Graph-based fraud detection: Model the transaction network — users, merchants, devices, IP addresses as nodes. Fraud rings (organized groups making fraudulent transactions) appear as unusual community structures.
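As a deliberately simplified illustration of the graph idea, the sketch below links accounts that share a device or IP address and reports connected groups above a size cutoff. Connected components are a much cruder signal than the community-structure analysis described above, and the function name, the `min_accounts` cutoff, and the sample events are all hypothetical.

```python
from collections import defaultdict


def find_linked_groups(events, min_accounts=3):
    """Group accounts connected through shared identifiers (devices, IPs).

    events is an iterable of (account_id, identifier) pairs. Returns a list of
    sorted account lists, one per connected group with at least min_accounts.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Accounts and identifiers are both graph nodes; each event is an edge.
    for account, identifier in events:
        union(("account", account), ("id", identifier))

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "account":
            groups[find(node)].add(node[1])
    return [sorted(g) for g in groups.values() if len(g) >= min_accounts]


# Hypothetical events: three accounts chained together by a device and an IP.
events = [
    ("alice", "device-1"),
    ("bob", "device-2"),
    ("m1", "device-9"),
    ("m2", "device-9"),
    ("m2", "ip-7"),
    ("m3", "ip-7"),
    ("carol", "device-3"),
]
print(find_linked_groups(events))
```

Note that `m1` and `m3` never share an identifier directly; they are linked only through `m2`, which is the kind of indirect relationship that per-transaction features cannot see.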
LLM/NLP analysis: Merchant name analysis, transaction description parsing, and analysis of communication content for account takeover detection.
The combined output is a fraud score. Scores above a high threshold → automatic decline. Middle range → additional authentication challenge (2FA). Low score → approve.
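The routing logic can be sketched as follows. The threshold values (0.9 and 0.5), the field names, and the choice to escalate a rule hit to a challenge rather than a decline are all assumptions made for illustration; real systems tune these against fraud losses and customer friction.

```python
def rule_flag(txn):
    """Example rule from the text: cross-border transaction above $500."""
    return txn["txn_country"] != txn["card_country"] and txn["amount"] > 500


def decide(score, rule_hit, decline_at=0.9, challenge_at=0.5):
    """Map a fraud score in [0, 1] and a rule result to an action."""
    if score >= decline_at:
        return "decline"
    if score >= challenge_at or rule_hit:
        return "challenge"  # e.g. ask for 2FA
    return "approve"


# Hypothetical transaction and model score.
txn = {"txn_country": "BR", "card_country": "US", "amount": 800}
print(decide(score=0.2, rule_hit=rule_flag(txn)))
```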
Visa processes ~200 million transactions daily with median decision time of 300ms. The ML scoring system contributes to each decision within this window.
One thing to remember: Production anomaly detection is almost never a single algorithm — it’s a layered system where rules catch known patterns quickly, ML scores novelty more generally, and humans review borderline cases and feed findings back to improve the system.