Anomaly Detection — Deep Dive
Extended Isolation Forest: Fixing Bias
Original Isolation Forest has a known bias toward the center of the data space and along the axes — points in the center or near the axes get lower anomaly scores regardless of their actual density.
The root cause: original Isolation Forest splits always parallel to feature axes. This creates rectangular isolation regions that don’t fit the actual data distribution well.
Extended Isolation Forest (Hariri et al., 2019) uses random hyperplane cuts:
- Pick a random slope vector $n \sim \mathcal{N}(0, I)$ (a normal vector in $\mathbb{R}^d$)
- Pick a random intercept $p$ uniformly between $\min(X \cdot n)$ and $\max(X \cdot n)$
- Split: points with $x \cdot n < p$ go left, others go right
The random hyperplane cuts produce isolation regions aligned with the actual data geometry, not just the axes. This removes the bias toward the center and toward axis-aligned features.
Performance comparison on synthetic anomaly benchmarks: EIF consistently outperforms IF on datasets with clusters of anomalies or anomalies near the data boundary. For high-dimensional data, EIF’s advantage is more pronounced.
Normalizing Flows for Exact Density Estimation
Generative models like normalizing flows can compute the exact likelihood $p(x)$ of any data point. Low-likelihood points are anomalous.
A normalizing flow transforms a simple distribution $p_z(z)$ (e.g., Gaussian) through a sequence of invertible transformations $f_1, f_2, …, f_k$:
$$x = f_k \circ … \circ f_1(z), \quad z \sim p_z(z)$$
Since all transformations are invertible, the data likelihood is:
$$\log p(x) = \log p_z(z) - \sum_{i=1}^k \log \left|\det \frac{\partial f_i}{\partial f_{i-1}}\right|$$
The log-Jacobian term corrects for volume changes during transformations.
RealNVP / GLOW: Use coupling layers — invertible transformations with tractable Jacobians. GLOW (Kingma & Dhariwal, 2018) achieves high-quality image generation while enabling exact likelihood computation.
Anomaly detection application: Train on normal data. At inference, compute $-\log p(x)$ — high negative log-likelihood = anomalous. Unlike autoencoders, this provides a proper probabilistic interpretation.
Limitation: Flows model the full data density, not just the support boundary. Anomalies in the training distribution’s low-density regions (outliers that are still normal) may get high anomaly scores even if they’re not truly anomalous.
Time Series Anomaly Detection with LSTMs
LSTM-based prediction error: Train an LSTM to predict the next value in a time series from recent history. At each timestep: $$\hat{x}_{t+1} = \text{LSTM}(x_1, …, x_t)$$
The prediction error $e_t = |x_t - \hat{x}_t|$ is the anomaly score. A point is anomalous if its actual value deviates significantly from what the model expected.
For multivariate time series, use the full prediction error vector and apply Mahalanobis distance: $$s_t = (e_t - \mu_e)^T \Sigma_e^{-1} (e_t - \mu_e)$$
LSTMAE (LSTM Autoencoder): Encoder LSTM reads the sequence and produces a latent vector; decoder LSTM reconstructs the sequence. Reconstruction error per timestep is the anomaly score. Particularly effective for detecting anomalous patterns (collective anomalies) rather than individual point anomalies.
Transformer-based: TranAD (Tuli et al., 2022) uses a transformer with attention over time — anomalies disrupt the temporal attention patterns in a detectable way. Outperforms LSTM-based methods on several time series benchmarks.
Evaluation Under Extreme Class Imbalance
Standard accuracy is misleading for anomaly detection: if 99.9% of data is normal, a model that predicts “normal” for everything achieves 99.9% accuracy while being completely useless.
Precision and Recall:
- Precision = TP / (TP + FP): Among all flagged anomalies, what fraction were real?
- Recall = TP / (TP + FN): Among all real anomalies, what fraction were caught?
F1 score: Harmonic mean $F1 = 2 \times \text{Precision} \times \text{Recall} / (\text{Precision} + \text{Recall})$
AUPRC (Area Under Precision-Recall Curve): More informative than AUROC for imbalanced problems. AUROC can be high even when precision is very low. AUPRC is directly proportional to the practical utility of the model.
At 1% fraud rate: a model with AUROC=0.95 might have AUPRC=0.3 (poor) because its high recall comes at the cost of extremely low precision (most fraud alerts are false positives).
Average Precision (AP): Scalar summary of AUPRC: $$AP = \sum_k (R_k - R_{k-1}) P_k$$
The business metric: False Positive Rate (FPR) at fixed True Positive Rate (TPR). For credit card fraud, the FPR at 80% TPR matters — if you catch 80% of fraud, how many legitimate transactions do you block?
Explainability in Anomaly Detection
SHAP for isolation forests: TreeSHAP computes exact SHAP values for tree-based models including Isolation Forest. For each flagged anomaly, SHAP shows which features contributed most to the low isolation path length.
Contrastive explanations: “This transaction is anomalous because: amount is 10x higher than typical, country differs from residence, merchant category is unusual.” Generated by comparing the anomalous point to the centroid of normal points and identifying the features that differ most.
LLM-powered anomaly explanation (2023-2024 pattern): Feed anomaly features + contextual information to an LLM, ask it to generate a human-readable explanation. For cybersecurity:
System: You are a security analyst. Below is network event data.
Event: Source IP: 10.1.1.5, Dest IP: 203.0.113.100 (external),
Port: 4444, Bytes: 1.2GB, Duration: 3h, Time: 02:30 AM
Anomaly score: 0.97 (top features: port, time, bytes)
Question: Why might this be suspicious?
The LLM combines domain knowledge with the specific anomaly features to produce actionable explanations — useful for security operations centers where analysts need to decide whether to escalate an alert.
One thing to remember: The practical challenge in anomaly detection is almost never the algorithm — it’s defining what “anomalous” means for your use case, collecting appropriate normal data, and setting thresholds that balance false positives against false negatives given the cost of each type of error in your specific context.
See Also
- Python Anomaly Detection How Python spots the weird stuff hiding in your data, explained with simple examples anyone can follow.
- Activation Functions Why neural networks need these tiny mathematical functions — and how ReLU's simplicity accidentally made deep learning possible.
- Ai Agents Architecture How AI systems go from answering questions to actually doing things — the design patterns that turn language models into autonomous agents that browse, code, and plan.
- Ai Agents ChatGPT answers questions. AI agents actually do things — browse the web, write code, send emails, and keep going until the job is done. Here's the difference.
- Ai Ethics Why building AI fairly is harder than it sounds — bias, accountability, privacy, and who gets to decide what AI is allowed to do.