Feature Engineering — Deep Dive

Target encoding with regularization, feature leakage detection, embeddings for high-cardinality categoricals, SHAP-based feature selection, and neural feature learning vs. manual engineering.

Feature Leakage: The Silent Validity Killer

Feature leakage occurs when a feature contains information that wouldn’t be available at prediction time — causing artificially inflated validation metrics and catastrophic production performance.

Direct leakage: The feature directly encodes the target or is computed from the target.

Including “hospitalized” as a feature when predicting “hospital admission”
Including post-event data: “amount_refunded” in a churn prediction model (you only know this after the churn event)
Row index or ID that correlates with data collection batches which correlate with the target

Time leakage: Using data from the future relative to the prediction point.

Using total_purchases_lifetime when predicting whether a customer will buy — this includes future purchases made after the event you’re predicting
Correct: total_purchases_prior_to_event
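The difference between the two features can be sketched with pandas. This is a minimal illustration, assuming a hypothetical purchase log with columns `customer_id`, `ts`, and `amount` (none of these names come from a real schema), and using summed amounts as the "total".

```python
import pandas as pd

# Hypothetical purchase log; column names are assumptions for illustration.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01",
                          "2024-01-15", "2024-02-15"]),
    "amount": [10.0, 20.0, 30.0, 5.0, 7.0],
}).sort_values(["customer_id", "ts"])

# Leaky: the lifetime total includes purchases made after each row's timestamp.
purchases["total_purchases_lifetime"] = (
    purchases.groupby("customer_id")["amount"].transform("sum")
)

# Point-in-time: running total of strictly earlier purchases
# (cumulative sum minus the current row's own amount).
purchases["total_purchases_prior_to_event"] = (
    purchases.groupby("customer_id")["amount"].cumsum() - purchases["amount"]
)
```

Every row of the leaky column carries the customer's final total, while the point-in-time column only grows as earlier purchases accumulate.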

Leakage through preprocessing: Computing normalization statistics (mean, std) on the full dataset including test data. The test data’s statistics influence the training normalization, giving the model indirect information about test examples.

Detection:

High feature importance for a feature that shouldn’t logically be predictive: suspect leakage
Validation performance much higher than expected: suspect leakage
Drop in performance when model is deployed: strong indicator of leakage in validation
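One cheap screen for the first two symptoms is to score each feature on its own. The sketch below assumes a binary target and numeric features; `flag_suspicious_features` and the 0.95 threshold are hypothetical choices, not a standard API, and a flagged feature still needs a manual audit.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def flag_suspicious_features(X, y, threshold=0.95):
    """Return {feature: AUC} for features whose single-feature AUC is near-perfect."""
    flagged = {}
    for col in X.columns:
        auc = roc_auc_score(y, X[col])
        auc = max(auc, 1.0 - auc)  # treat strong negative association the same way
        if auc >= threshold:
            flagged[col] = auc
    return flagged

# Synthetic data: one pure-noise feature and one that is almost a copy of the target.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = pd.DataFrame({
    "noise": rng.normal(size=500),
    "leaky": y + rng.normal(scale=0.05, size=500),
})
suspects = flag_suspicious_features(X, y)
```

A single feature that separates the classes almost perfectly is not proof of leakage, but it is a good reason to check when and how that feature is recorded.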

Prevention:

Time-based splits (not random splits) for temporal data
Fit all preprocessing transformers on training data only, apply to test data
Careful chronological auditing of each feature’s data collection timeline
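The first two prevention steps can be sketched with scikit-learn. This is a minimal example on synthetic data, assuming the rows are already sorted by time; wrapping the scaler in a pipeline means its statistics come from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; rows are assumed to be in chronological order.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 5.0).astype(int)

# Time-based split: the earliest 80% of rows train, the latest 20% test (no shuffling).
cut = int(0.8 * len(X))
X_train, X_test = X[:cut], X[cut:]
y_train, y_test = y[:cut], y[cut:]

# fit() computes the scaler's mean/std from X_train only;
# score() merely applies those statistics to X_test.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

scaler = model.named_steps["standardscaler"]
```

The same pipeline object can be passed to cross-validation utilities, which refit the scaler inside each training fold.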

Target Encoding with Regularization

Target encoding replaces each category value with the mean target value for that category: $$\text{TE}(c) = \frac{\sum_{i: x_i = c} y_i}{|\{i: x_i = c\}|}$$

The problem: rare categories with few samples have unreliable estimates. A category with 2 samples and 2 positive examples gets TE = 1.0, but this is likely noise.

Additive smoothing (Bayesian target encoding): $$\text{TE}_{smooth}(c) = \frac{n_c \bar{y}_c + \alpha \bar{y}}{n_c + \alpha}$$

Where $n_c$ is count of category $c$, $\bar{y}_c$ is mean target for $c$, $\bar{y}$ is global mean, and $\alpha$ is a smoothing factor. For small $n_c$, the estimate pulls toward the global mean. For large $n_c$, the category-specific estimate dominates.
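A small worked example of the smoothing formula, as a sketch: the function name `smoothed_target_encoding` and the toy data are made up for illustration, and $\alpha = 10$ is an arbitrary choice.

```python
import pandas as pd

def smoothed_target_encoding(categories, target, alpha=10.0):
    """Map each category to (n_c * mean_c + alpha * global_mean) / (n_c + alpha)."""
    global_mean = target.mean()
    # The per-category sum equals n_c * mean_c, so the formula needs only sum and count.
    stats = target.groupby(categories).agg(["sum", "count"])
    return (stats["sum"] + alpha * global_mean) / (stats["count"] + alpha)

# Category "A" has 2 samples, both positive; "B" has 98 samples with 29 positives.
df = pd.DataFrame({
    "city": ["A", "A"] + ["B"] * 98,
    "y": [1, 1] + [1] * 29 + [0] * 69,
})
encoding = smoothed_target_encoding(df["city"], df["y"], alpha=10.0)
```

With a global mean of 0.31, the rare category "A" moves from a raw estimate of 1.0 to (2 + 3.1) / 12 = 0.425, while the well-populated "B" barely changes from its raw mean of about 0.296.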

K-fold cross-validated target encoding (prevents each training row's own target from leaking into its encoded value):

Split training data into K folds
For each fold, compute target encoding statistics from the other K-1 folds
Apply those statistics to encode the held-out fold
Aggregate across folds to encode all training data

At prediction time, use statistics computed from the full training set.
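The fold procedure can be sketched as follows. This is one possible implementation, not a reference one: `kfold_target_encode` is a hypothetical helper, it combines the out-of-fold scheme with additive smoothing, and it falls back to the fold's mean target for categories unseen in the other folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(categories, target, n_splits=5, alpha=10.0, seed=0):
    """Out-of-fold smoothed target encoding for the training rows."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, holdout_idx in kf.split(categories):
        fit_cat, fit_y = categories.iloc[fit_idx], target.iloc[fit_idx]
        prior = fit_y.mean()
        stats = fit_y.groupby(fit_cat).agg(["sum", "count"])
        mapping = (stats["sum"] + alpha * prior) / (stats["count"] + alpha)
        # Encode the held-out fold with statistics from the other folds only.
        encoded.iloc[holdout_idx] = (
            categories.iloc[holdout_idx].map(mapping).fillna(prior).to_numpy()
        )
    return encoded

# Synthetic training data for illustration.
rng = np.random.default_rng(0)
cats = pd.Series(rng.choice(list("abcd"), size=200))
y = pd.Series(rng.integers(0, 2, size=200))
train_encoded = kfold_target_encode(cats, y)
```

Because each fold is encoded with slightly different statistics, rows in the same category can receive different encoded values during training; that variation is expected.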

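CatBoost (Yandex) uses a variant called “ordered target encoding” (ordered target statistics) that processes rows in the order of a random permutation, which acts as an artificial timeline. Each row’s target encoding uses only the rows that come before it in that ordering, so a row’s own target never enters its encoding and target leakage is avoided by construction.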
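The ordered idea can be illustrated in a few lines of pandas. This is a simplified sketch, not CatBoost's implementation: it treats the existing row order as the ordering, and the `prior` and `alpha` parameters are assumptions chosen for the example.

```python
import pandas as pd

def ordered_target_encode(categories, target, prior, alpha=1.0):
    """Encode each row using only the targets of earlier rows in the same category."""
    grouped = target.groupby(categories)
    prev_sum = grouped.cumsum() - target  # sum of targets strictly before this row
    prev_count = grouped.cumcount()       # number of earlier rows in the category
    return (prev_sum + alpha * prior) / (prev_count + alpha)

cats = pd.Series(["a", "b", "a", "a", "b"])
y = pd.Series([1, 0, 0, 1, 1])
encoded = ordered_target_encode(cats, y, prior=0.5, alpha=1.0)
```

The first row of each category has no history, so it receives the prior; later rows get progressively better-informed estimates.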

Entity Embeddings for High-Cardinality Categoricals

For features with thousands or millions of unique values (user IDs, product IDs, zip codes), one-hot encoding is impractical. Entity embeddings learn dense vector representations as part of model training.

Guo & Berkhahn (2016) “Entity Embeddings of Categorical Variables” showed that entity embeddings learned by a neural network for structured tabular data capture meaningful semantic structure. In their store sales experiments, the learned embeddings of German states, projected to two dimensions, roughly reproduced the states’ relative positions on the map, capturing geographic proximity without any explicit geographic features.

Implementation:

import torch
import torch.nn as nn

class EmbeddingModel(nn.Module):
    def __init__(self, n_users, n_items, embed_dim):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, embed_dim)
        self.item_emb = nn.Embedding(n_items, embed_dim)
    
    def forward(self, user_ids, item_ids):
        u = self.user_emb(user_ids)  # (batch, embed_dim)
        v = self.item_emb(item_ids)  # (batch, embed_dim)
        # Concatenate or dot product with other features
        return torch.cat([u, v], dim=1)

Embedding dimension heuristic: $\min(50, \lceil n_{categories} / 2 \rceil)$. A categorical with 1000 values → 50-dimensional embedding. Not a rule but a reasonable starting point.
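The heuristic is easy to wrap in a helper. The function name `embedding_dim` and the `max_dim` parameter are illustrative choices.

```python
import math

def embedding_dim(n_categories, max_dim=50):
    """Heuristic from the text: min(max_dim, ceil(n_categories / 2))."""
    return min(max_dim, math.ceil(n_categories / 2))

# Day of week, hour of day, and a 1000-value categorical.
dims = {n: embedding_dim(n) for n in (7, 24, 1000)}
```

For 7, 24, and 1000 categories this gives 4, 12, and 50 dimensions respectively; the cap only starts to bind above 100 categories.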

Transfer learning for embeddings: Embeddings learned for one task (e.g., user behavior prediction) can be transferred to related tasks (e.g., churn prediction). Companies like Spotify and Netflix maintain “universal embeddings” for users and content that are trained on large-scale data and used across many downstream models.

SHAP for Feature Selection and Engineering Insights

SHAP (SHapley Additive exPlanations) values don’t just explain models — they can guide feature engineering.

Feature importance vs. feature interaction detection:

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer(X_train)

# Summary plot: which features matter most
shap.summary_plot(shap_values, X_train)

# Interaction plot: how feature A's impact depends on feature B
shap.dependence_plot("age", shap_values.values, X_train, 
                     interaction_index="income")

When the SHAP dependence plot for “income” shows a non-linear but monotonic relationship (concave or saturating at high values), a log transform may help. When the plot for “age” shows a cliff at age 65, a binary feature is_retirement_age may be more useful than age itself.
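Acting on those two observations takes only a couple of lines. A minimal sketch with made-up data; `log1p` is used instead of a plain log on the assumption that income can be zero, and the cutoff of 65 simply mirrors the cliff in the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [30, 64, 65, 80],
    "income": [0.0, 20_000.0, 55_000.0, 250_000.0],
})

# log1p handles zero incomes and compresses the long right tail.
df["log_income"] = np.log1p(df["income"])

# Binary flag at the threshold where the dependence plot shows a cliff.
df["is_retirement_age"] = (df["age"] >= 65).astype(int)
```

After adding such features, re-running the SHAP analysis shows whether the model actually uses them in place of the raw columns.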

SHAP interaction values: $\Phi_{ij}$ measures the interaction effect between features $i$ and $j$ — how much the effect of feature $i$ depends on the value of feature $j$. Features with high interaction values are candidates for explicit interaction feature engineering.
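For reference, the interaction value for a pair $i \neq j$ is usually written as below. The notation is an assumption of this sketch: $N$ is the set of all $M$ features and $f_x(S)$ is the model's expected output when only the features in $S$ are known.

$$\Phi_{ij} = \sum_{S \subseteq N \setminus \{i, j\}} \frac{|S|!\,(M - |S| - 2)!}{2\,(M - 1)!} \, \nabla_{ij}(S), \qquad \nabla_{ij}(S) = f_x(S \cup \{i, j\}) - f_x(S \cup \{i\}) - f_x(S \cup \{j\}) + f_x(S)$$

The term $\nabla_{ij}(S)$ is zero whenever adding $i$ changes the output by the same amount with or without $j$, so a large $|\Phi_{ij}|$ indicates that the two features are not acting additively.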

Neural Feature Learning: When to Stop Engineering

Deep learning models, particularly for images, text, and audio, learn features from raw data automatically. But for structured/tabular data, the picture is less clear.

Tabular benchmarks (Grinsztajn et al., 2022, “Why do tree-based models still outperform deep learning on typical tabular data?”): On a benchmark of 45 tabular datasets, tree-based models (XGBoost, random forests, gradient boosting trees) outperformed the deep learning models tested, including MLPs, ResNet, FT-Transformer, and SAINT.

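Key findings:

Tree models are robust to uninformative features; MLP-like networks are hurt by them
Neural networks are biased toward overly smooth functions, while tree models fit irregular target patterns more easily
The benchmark targets medium-sized datasets (around 10k training samples), where tree models win most comparisons

Rules of thumb beyond the paper: tree models still benefit from engineered features, whereas deep models learn more of their features automatically; below roughly 100k rows tree models nearly always win, and above 1M rows the gap narrows.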

TabPFN (Hollmann et al., 2023): In-context learning for tabular data. A transformer pre-trained on millions of synthetic tabular datasets classifies a new small dataset in a single forward pass, with no gradient training on that dataset. It is competitive with XGBoost on small datasets.

Practical guidance:

< 100k rows, tabular structure: XGBoost/LightGBM + manual feature engineering
> 1M rows, tabular structure: deep tabular models become more competitive
Images, text, audio: deep learning with minimal feature engineering
Complex relational data: deep learning with entity embeddings

One thing to remember: Feature engineering is the process of encoding domain knowledge into a form the model can use — and the degree to which you need to do it manually vs. let the model discover it is determined by your data modality, dataset size, and whether you have relevant domain knowledge to encode.
