Feature Engineering — Deep Dive
Feature Leakage: The Silent Validity Killer
Feature leakage occurs when a feature contains information that wouldn’t be available at prediction time — causing artificially inflated validation metrics and catastrophic production performance.
Direct leakage: The feature directly encodes the target or is computed from the target.
- Including “hospitalized” as a feature when predicting “hospital admission”
- Including post-event data: “amount_refunded” in a churn prediction model (you only know this after the churn event)
- Row index or ID that correlates with data collection batches which correlate with the target
Time leakage: Using data from the future relative to the prediction point.
- Using `total_purchases_lifetime` when predicting whether a customer will buy: this includes future purchases made after the event you’re predicting
- Correct: `total_purchases_prior_to_event`
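As a rough illustration of the difference, the sketch below builds both versions of the feature with pandas. The `purchases` and `events` tables and their column names are hypothetical, invented for this example.

```python
import pandas as pd

# Hypothetical purchase log: one row per purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchased_at": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-20", "2024-01-15", "2024-04-01"]
    ),
})

# Hypothetical prediction points: one row per customer to score.
events = pd.DataFrame({
    "customer_id": [1, 2],
    "event_at": pd.to_datetime(["2024-03-01", "2024-03-01"]),
})

# Leaky: counts every purchase, including those made after the event.
lifetime = purchases.groupby("customer_id").size().rename("total_purchases_lifetime")

# Point-in-time: keep only purchases strictly before the event.
joined = events.merge(purchases, on="customer_id", how="left")
prior = (
    joined[joined["purchased_at"] < joined["event_at"]]
    .groupby("customer_id")
    .size()
    .rename("total_purchases_prior_to_event")
)

features = (
    events.set_index("customer_id")
    .join(lifetime)
    .join(prior)
    .fillna({"total_purchases_prior_to_event": 0})
)
print(features)
```

Customer 1 has three lifetime purchases but only two before the event date; the third is information from the future.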
Leakage through preprocessing: Computing normalization statistics (mean, std) on the full dataset including test data. The test data’s statistics influence the training normalization, giving the model indirect information about test examples.
Detection:
- High feature importance for a feature that shouldn’t logically be predictive: suspect leakage
- Validation performance much higher than expected: suspect leakage
- Drop in performance when model is deployed: strong indicator of leakage in validation
Prevention:
- Time-based splits (not random splits) for temporal data
- Fit all preprocessing transformers on training data only, apply to test data
- Careful chronological auditing of each feature’s data collection timeline
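The first two prevention steps can be sketched with scikit-learn as follows. The synthetic data, the 80/20 cutoff, and the column names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic daily data with an upward drift (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="D"),
    "feature": rng.normal(loc=np.linspace(0, 5, 100), scale=1.0),
})

# Time-based split: train on the earliest 80%, test on the most recent 20%.
df = df.sort_values("timestamp")
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

# Fit on training rows only; the test rows never reach fit().
scaler = StandardScaler().fit(train[["feature"]])
train_scaled = scaler.transform(train[["feature"]])
test_scaled = scaler.transform(test[["feature"]])
```

Because the scaler only ever sees training rows, the scaled test values do not have zero mean here. That is expected, and it reflects the drift the model would also face in production.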
Target Encoding with Regularization
Target encoding replaces each category value with the mean target value for that category: $$\text{TE}(c) = \frac{\sum_{i: x_i = c} y_i}{|\{i : x_i = c\}|}$$
The problem: rare categories with few samples have unreliable estimates. A category with 2 samples and 2 positive examples gets TE = 1.0, but this is likely noise.
Additive smoothing (Bayesian target encoding): $$\text{TE}_{\text{smooth}}(c) = \frac{n_c \bar{y}_c + \alpha \bar{y}}{n_c + \alpha}$$
Here $n_c$ is the number of training rows in category $c$, $\bar{y}_c$ is the mean target for $c$, $\bar{y}$ is the global mean, and $\alpha$ is a smoothing factor that acts like a pseudo-count of observations at the global mean. For small $n_c$, the estimate is pulled toward the global mean. For large $n_c$, the category-specific estimate dominates.
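A minimal pandas sketch of the smoothed estimate, applied to the two-sample case described above. The function name, the toy data, and the choice of $\alpha = 10$ are illustrative assumptions.

```python
import pandas as pd


def smoothed_target_encoding(categories, targets, alpha=10.0):
    """Return one smoothed target mean per category."""
    df = pd.DataFrame({"c": categories, "y": targets})
    global_mean = df["y"].mean()
    stats = df.groupby("c")["y"].agg(["sum", "count"])
    # The per-category sum equals n_c * mean_c in the formula.
    return (stats["sum"] + alpha * global_mean) / (stats["count"] + alpha)


# "rare" has 2 samples, both positive; "common" has 98 samples, 28 positive.
cats = ["rare", "rare"] + ["common"] * 98
ys = [1, 1] + [1] * 28 + [0] * 70

enc = smoothed_target_encoding(cats, ys, alpha=10.0)
print(enc)
```

With a global mean of 0.3, the rare category moves from a raw estimate of 1.0 to $(2 + 10 \cdot 0.3) / (2 + 10) = 5/12 \approx 0.417$, while the common category barely changes.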
K-fold cross-validated target encoding (prevents each training row’s own target from leaking into its encoded value):
- Split training data into K folds
- For each fold, compute target encoding statistics from the other K-1 folds
- Apply those statistics to encode the held-out fold
- Aggregate across folds to encode all training data
At prediction time, use statistics computed from the full training set.
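The procedure above could look roughly like this with pandas and scikit-learn’s `KFold`. The helper name, the toy data, and the fallback for categories missing from the other folds are assumptions of this sketch, not a fixed recipe.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def kfold_target_encode(train, col, target, n_splits=5, seed=0):
    """Out-of-fold target encoding for the training rows."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, holdout_idx in kf.split(train):
        fit_part = train.iloc[fit_idx]
        means = fit_part.groupby(col)[target].mean()
        holdout_cats = train.iloc[holdout_idx][col]
        # Categories unseen in the other folds fall back to those folds' mean
        # (an assumption of this sketch; smoothing is another option).
        values = holdout_cats.map(means).fillna(fit_part[target].mean())
        encoded.iloc[holdout_idx] = values.to_numpy()
    return encoded


train = pd.DataFrame({
    "city": ["a"] * 10 + ["b"] * 9 + ["z"],
    "y": [1, 0] * 10,
})
train["city_te"] = kfold_target_encode(train, "city", "y")

# At prediction time, use statistics from the full training set.
full_means = train.groupby("city")["y"].mean()
global_mean = train["y"].mean()
new_rows = pd.Series(["a", "z", "unseen"])
new_encoded = new_rows.map(full_means).fillna(global_mean)
```

The single `"z"` row never sees its own target during training-time encoding, because it is always in the held-out fold when its statistics are computed.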
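CatBoost (Yandex) uses a variant called “ordered target statistics” that processes the dataset in the order of a random permutation, which acts as an artificial timeline (or in true temporal order when the data has one). Each row’s target encoding uses only the targets of rows that came before it, so a row’s own target never leaks into its encoded feature.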
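The ordering idea can be sketched in a few lines of NumPy. This is a simplified illustration, not CatBoost’s actual implementation: the function name, the `prior` and `a` parameters, and the use of a single permutation are assumptions of the sketch. Here the ordering is a random permutation; for temporal data the rows’ time order could be used instead.

```python
import numpy as np


def ordered_target_encode(categories, targets, prior, a=1.0, seed=0):
    """Encode each row using only rows visited earlier in the ordering."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (s + a * prior) / (n + a)
        # Only now does this row's own target enter the running statistics.
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded


enc_single = ordered_target_encode(["x", "x", "x"], [1, 1, 1], prior=0.5)
enc_distinct = ordered_target_encode(["a", "b", "c"], [1, 0, 1], prior=0.3)
```

The first row visited in each category always receives the prior, since no earlier targets exist for it.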
Entity Embeddings for High-Cardinality Categoricals
For features with thousands or millions of unique values (user IDs, product IDs, zip codes), one-hot encoding is impractical. Entity embeddings learn dense vector representations as part of model training.
Guo & Berkhahn (2016), “Entity Embeddings of Categorical Variables,” showed that entity embeddings learned by a neural network on structured tabular data capture meaningful structure. In their Rossmann store-sales experiments, the learned embeddings of German states placed geographically neighboring states close together, recovering geographic similarity without explicit geographic features.
Implementation:
```python
import torch
import torch.nn as nn


class EmbeddingModel(nn.Module):
    def __init__(self, n_users, n_items, embed_dim):
        super().__init__()
        # One learnable embed_dim-sized vector per user ID and per item ID
        self.user_emb = nn.Embedding(n_users, embed_dim)
        self.item_emb = nn.Embedding(n_items, embed_dim)

    def forward(self, user_ids, item_ids):
        u = self.user_emb(user_ids)  # (batch, embed_dim)
        v = self.item_emb(item_ids)  # (batch, embed_dim)
        # Concatenate or dot product with other features
        return torch.cat([u, v], dim=1)  # (batch, 2 * embed_dim)
```
Embedding dimension heuristic: $\min(50, \lceil n_{categories} / 2 \rceil)$. A categorical with 1000 values → 50-dimensional embedding. Not a rule but a reasonable starting point.
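Written as code, the heuristic is a one-liner. The function name and the `cap` parameter are illustrative.

```python
import math


def embedding_dim(n_categories, cap=50):
    """Starting-point embedding size: min(cap, ceil(n_categories / 2))."""
    return min(cap, math.ceil(n_categories / 2))


dims = {n: embedding_dim(n) for n in (2, 7, 1000)}
print(dims)
```

Small categoricals get roughly half as many dimensions as they have values, and anything with 100 or more values hits the cap of 50.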
Transfer learning for embeddings: Embeddings learned for one task (e.g., user behavior prediction) can be transferred to related tasks (e.g., churn prediction). Companies like Spotify and Netflix maintain “universal embeddings” for users and content that are trained on large-scale data and used across many downstream models.
SHAP for Feature Selection and Engineering Insights
SHAP (SHapley Additive exPlanations) values don’t just explain models — they can guide feature engineering.
Feature importance vs. feature interaction detection:
```python
import shap

# Assumes `model` is a fitted tree-based model and `X_train` is its training DataFrame
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_train)

# Summary plot: which features matter most
shap.summary_plot(shap_values, X_train)

# Interaction plot: how feature A's impact depends on feature B
shap.dependence_plot(
    "age", shap_values.values, X_train, interaction_index="income"
)
```
When the SHAP dependence plot for “income” shows a non-linear relationship (concave, S-shaped), a log transform may help. When the plot for “age” shows a cliff at age 65, a binary feature `is_retirement_age` may be more useful than age itself.
SHAP interaction values: $\Phi_{ij}$ measures the interaction effect between features $i$ and $j$ — how much the effect of feature $i$ depends on the value of feature $j$. Features with high interaction values are candidates for explicit interaction feature engineering.
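Once the plots suggest a transformation or an interaction, the engineered features themselves are short to write. The sketch below assumes a small hypothetical DataFrame with `age` and `income` columns; the threshold of 65, the choice of `log1p`, and the product form of the interaction are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the two features discussed above.
X = pd.DataFrame({
    "age": [30, 64, 65, 72],
    "income": [0.0, 25_000.0, 60_000.0, 250_000.0],
})

# Log transform for a concave income effect; log1p handles zero income.
X["log_income"] = np.log1p(X["income"])

# Binary flag for the cliff at age 65.
X["is_retirement_age"] = (X["age"] >= 65).astype(int)

# Explicit interaction for a feature pair with high SHAP interaction values.
X["age_x_log_income"] = X["age"] * X["log_income"]
```

Whether such features help is an empirical question, so it is worth re-running validation after adding each one.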
Neural Feature Learning: When to Stop Engineering
Deep learning models, particularly for images, text, and audio, learn features from raw data automatically. But for structured/tabular data, the picture is less clear.
Tabular benchmarks (Grinsztajn et al., 2022, “Why do tree-based models still outperform deep learning on typical tabular data?”): On a benchmark of 45 tabular datasets, tree-based models (XGBoost, random forests, gradient boosting) consistently outperformed deep learning models, including MLPs, ResNet, FT-Transformer, and SAINT, on medium-sized datasets (around 10k training samples).
Key findings and related rules of thumb:
- Tree models benefit from engineered features; deep models learn more of their features automatically
- Tree models are robust to uninformative features; MLP-like networks degrade when such features are present
- As a rule of thumb (the benchmark itself focuses on medium-sized datasets): for <100k rows, tree models nearly always win; for >1M rows, the gap narrows
TabPFN (Hollmann et al., 2023): In-context learning for tabular data — a transformer trained on millions of synthetic tabular datasets can classify new small datasets in one forward pass (no training!). Few-shot learning for tabular ML. Competitive with XGBoost for small datasets.
Practical guidance:
- < 100k rows, tabular structure: XGBoost/LightGBM + manual feature engineering
- > 1M rows, tabular structure: deep tabular models become more competitive
- Images, text, audio: deep learning with minimal feature engineering
- Complex relational data: deep learning with entity embeddings
One thing to remember: Feature engineering is the process of encoding domain knowledge into a form the model can use — and the degree to which you need to do it manually vs. let the model discover it is determined by your data modality, dataset size, and whether you have relevant domain knowledge to encode.