Feature Engineering — Core Concepts

The techniques that turn raw data into machine learning signal: normalization, encoding categoricals, interaction features, temporal features, and automated feature engineering.

Why Raw Data Is Usually Not Model-Ready

Most machine learning models expect numeric inputs, and many behave best when those inputs sit in comparable ranges. Real-world data rarely arrives this way:

Categorical variables: “Monday”, “Tuesday”, “Wednesday” — can’t subtract or multiply strings
Different scales: Income ($0–$500,000) and age (0–100) span very different ranges, so income dominates the distance calculations in distance-based models
Skewed distributions: House prices with a long tail break many model assumptions
Temporal data: A timestamp means nothing without extracting “day of week”, “hour”, “time since last event”
Missing values: Many models can’t handle NaN directly (some gradient-boosted tree libraries are an exception)

Feature engineering addresses all of these systematically.

Core Transformations

Handling Numeric Features

Normalization/Standardization: Bring features to comparable scales.

Min-max scaling: $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$ → range [0, 1]. Sensitive to outliers.
Z-score standardization: $x' = (x - \mu) / \sigma$ → mean 0, std 1. Better for Gaussian-like distributions.
Robust scaling: Use median and IQR instead of mean and std. Better for skewed distributions.
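
The three scalers above can be compared side by side on a small array. This is a minimal sketch using NumPy; the income values are made up for illustration, and the z-score uses the population standard deviation.

```python
import numpy as np

# Hypothetical income values with one large outlier
income = np.array([30_000.0, 45_000.0, 52_000.0, 61_000.0, 500_000.0])

# Min-max scaling: maps the minimum to 0 and the maximum to 1
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score standardization: zero mean, unit (population) standard deviation
z_score = (income - income.mean()) / income.std()

# Robust scaling: center on the median, divide by the interquartile range
q1, median, q3 = np.percentile(income, [25, 50, 75])
robust = (income - median) / (q3 - q1)
```

With this data the single outlier pushes the four typical incomes below 0.07 after min-max scaling, while robust scaling keeps them spread between roughly -1.4 and 0.6.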

Log transformation: For right-skewed, non-negative distributions (income, house prices, web traffic): $x' = \log(x + 1)$. The $+1$ keeps $x = 0$ defined. The transform compresses the tail and brings the distribution closer to normal, which helps many algorithms.
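
As a short illustration, NumPy provides `log1p` for exactly this transform and `expm1` for its inverse. The page-view numbers below are hypothetical.

```python
import numpy as np

# Hypothetical right-skewed values (e.g. daily page views)
views = np.array([0.0, 9.0, 99.0, 999.0, 99_999.0])

# log1p computes log(x + 1) accurately even for small x, and maps 0 to 0
log_views = np.log1p(views)

# expm1 inverts the transform, which is needed when the target itself was transformed
restored = np.expm1(log_views)
```

Values that span five orders of magnitude end up within a range of about 0 to 11.5 after the transform.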

Winsorization: Clip extreme values to chosen percentiles, commonly the 1st and 99th. Prevents a few outliers from dominating scaling statistics and the model fit.
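
One way to sketch winsorization is with `np.percentile` and `np.clip`. The data here is synthetic, and in a real pipeline the bounds would be computed on the training split and reused on new data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 1,000 well-behaved values plus two extreme outliers
values = np.append(rng.normal(loc=50.0, scale=5.0, size=1_000), [500.0, -400.0])

# Clipping bounds at the 1st and 99th percentiles
lower, upper = np.percentile(values, [1, 99])

# Values outside [lower, upper] are replaced by the nearest bound
winsorized = np.clip(values, lower, upper)
```

Unlike dropping outliers, clipping keeps every row, so the array length is unchanged.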

Encoding Categorical Variables

One-hot encoding: Create a binary column for each category. “Color: red” → [red=1, blue=0, green=0]. Works when categories are unordered and few in number. Problem: high cardinality (1000 unique cities → 1000 new columns) creates dimensionality explosion.
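
A minimal one-hot example, assuming pandas is available; the `color` column is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)
```

Each row has exactly one 1 across the new `color_*` columns.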

Label encoding: Map categories to integers. “red”→0, “blue”→1, “green”→2. Only appropriate for ordinal categories (S < M < L < XL). For nominal categories (colors, countries), implies false ordering.
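
For ordinal categories, the order has to be stated explicitly, otherwise most tools fall back to alphabetical order. A small sketch with a pandas ordered categorical (the size values are illustrative):

```python
import pandas as pd

sizes = pd.Series(["M", "S", "XL", "L", "M"])

# Explicit order so the integer codes respect S < M < L < XL
size_order = ["S", "M", "L", "XL"]
size_codes = pd.Categorical(sizes, categories=size_order, ordered=True).codes
```

Here `S` maps to 0 and `XL` to 3, so numeric comparisons between codes match the real ordering.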

Target encoding: Replace category with the mean target value for that category. “City: New York” → mean conversion rate for New York customers (0.23). Powerful but must be computed from training data only; using test data causes leakage. Even within the training set, a row’s own target leaks into its encoding, so K-fold (out-of-fold) target encoding computes each row’s value from the other folds.
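
The K-fold idea can be sketched as follows. This is one possible implementation, not a reference one: the function name `out_of_fold_target_encode` and the fallback to the fitting-fold mean for unseen categories are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(categories, target, n_splits=5, seed=0):
    """Encode each row with its category's mean target computed on the other folds."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, encode_idx in folds.split(categories):
        # Category means come only from rows outside the fold being encoded
        fit_target = target.iloc[fit_idx]
        fold_means = fit_target.groupby(categories.iloc[fit_idx]).mean()
        fold_encoded = categories.iloc[encode_idx].map(fold_means)
        # Categories unseen in the fitting folds fall back to the fitting-fold mean
        encoded.iloc[encode_idx] = fold_encoded.fillna(fit_target.mean()).to_numpy()
    return encoded

# Hypothetical training data: city per customer and whether they converted
cities = ["NY", "NY", "NY", "SF", "SF", "SF", "LA", "LA", "LA", "LA"]
converted = [1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
city_encoded = out_of_fold_target_encode(cities, converted, n_splits=5)
```

At prediction time, test rows would be encoded with category means computed on the full training set.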

Embeddings: For high-cardinality categoricals (user IDs, product IDs), learn a dense embedding vector as part of model training. Widely used in recommendation systems (Netflix, Spotify) and fraud detection.
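
Mechanically, an embedding is a lookup table with one row per ID. The sketch below shows only the lookup, with randomly initialized weights; in a real model the table would be a trainable parameter in a deep learning framework, and the sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_products, embedding_dim = 1_000, 8

# Randomly initialized here; in a real model these weights are learned during training
embedding_table = rng.normal(scale=0.1, size=(n_products, embedding_dim))

product_ids = np.array([3, 17, 3, 999])

# An embedding layer is a row lookup: each ID selects one dense vector
vectors = embedding_table[product_ids]

# Equivalent to multiplying a one-hot matrix by the table, without needing to build it
one_hot = np.eye(n_products)[product_ids]
```

The lookup gives the same result as `one_hot @ embedding_table`, but with 8 columns per row instead of 1,000.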

Temporal Features

Raw timestamps contain minimal signal. Extracting components reveals patterns:

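Cyclic time: hour_of_day, day_of_week, month_of_year. Use sine/cosine encoding to capture cyclicity: hour_sin = sin(2π × hour / 24), hour_cos = cos(2π × hour / 24). Both components are needed, since sine alone maps distinct hours (for example 3 and 9) to the same value. The pair makes hour=23 and hour=0 similar (both near midnight), which integer encoding doesn’t.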
Lag features: Value at T-1, T-7, T-30. Essential for time series prediction. “What was the value last week at this time?”
Rolling statistics: Mean/max/std over the last N periods. Captures recent trends and volatility.
Time since event: Days since last purchase, hours since last login. Captures recency.
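
The four feature types listed above can be sketched with pandas on a synthetic hourly series. The column names and the `last_login` timestamp are hypothetical. The rolling windows are shifted by one step so that a row never sees its own value.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series covering two weeks
n_hours = 24 * 14
index = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(n_hours), unit="h")
df = pd.DataFrame({"value": np.arange(n_hours, dtype=float)}, index=index)

# Cyclic encoding: sine and cosine together place each hour on a circle
hour = df.index.hour.to_numpy()
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)

# Lag features: the value one hour and one week (168 hours) earlier
df["lag_1h"] = df["value"].shift(1)
df["lag_1w"] = df["value"].shift(24 * 7)

# Rolling statistics over the previous 24 hours; shift(1) excludes the current row
df["roll_mean_24h"] = df["value"].shift(1).rolling(window=24).mean()
df["roll_std_24h"] = df["value"].shift(1).rolling(window=24).std()

# Time since event: hours since a hypothetical last-login timestamp
last_login = pd.Timestamp("2023-12-31 18:00")
df["hours_since_login"] = ((df.index - last_login) / pd.Timedelta(hours=1)).to_numpy()
```

The first rows of the lag and rolling columns are NaN because there is no history yet; they are usually dropped or imputed before training.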

Interaction Features

Model the relationship between two variables explicitly: income × years_employed, room_count / house_sqft, price / competitor_price.

Linear models can’t discover these without being told — a polynomial feature expansion creates all pairwise interactions:

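from sklearn.preprocessing import PolynomialFeatures
# X: numeric feature matrix of shape (n_samples, n_features)
# interaction_only=True keeps pairwise products and drops squared terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
# Output holds the original columns followed by the pairwise products
X_interactions = poly.fit_transform(X)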

For 10 original features, this creates 45 interaction features (10 × 9 / 2), in addition to the original columns. For 100 features, it creates 4,950, so feature selection is usually needed afterwards to keep dimensionality in check.
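
The counts follow from the number of unordered pairs, n(n - 1)/2. A quick check using only the standard library (the feature names are placeholders):

```python
from itertools import combinations
from math import comb

feature_names = [f"x{i}" for i in range(10)]

# Every unordered pair of distinct features yields one interaction term
pairs = list(combinations(feature_names, 2))

# The count is "n choose 2" = n * (n - 1) / 2
n_pairs_10 = comb(10, 2)
n_pairs_100 = comb(100, 2)
```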

Feature Selection

Not all engineered features are useful. Including irrelevant or redundant features hurts model performance (noise, slower training) and interpretability.

Filter methods: Score each feature independently of the model.

Correlation: When two features have an absolute correlation above a threshold (0.9 is a common choice), drop one of them (redundancy)
Mutual information: Measures statistical dependence between feature and target. Works for non-linear relationships.
Chi-squared test: For categorical features and a categorical target, measures statistical association between the two.
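
A sketch of two of these filters on synthetic data, assuming pandas and scikit-learn. The column names, the 0.9 threshold, and the rule of dropping the later column of each correlated pair are choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({"a": rng.normal(size=n), "noise": rng.normal(size=n)})
X["a_copy"] = X["a"] * 2.0 + rng.normal(scale=0.01, size=n)  # near-duplicate of "a"
y = (X["a"].abs() > 1.0).astype(int)  # non-linear dependence on "a"

# Correlation filter: look only above the diagonal so each pair is checked once,
# then drop the later column of every highly correlated pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Mutual information between each remaining feature and the target
mi = pd.Series(
    mutual_info_classif(X_reduced, y, random_state=0), index=X_reduced.columns
)
```

Because `y` depends on the magnitude of `a` rather than its sign, the linear correlation between `a` and `y` is close to zero, yet mutual information still ranks `a` well above `noise`.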

Wrapper methods: Use a model to evaluate feature subsets.

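Recursive Feature Elimination (RFE): Train model, remove least important features, repeat until the desired number of features remains. A cross-validated variant instead picks the subset size where performance stops improving. Computationally expensive but effective.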
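
A minimal RFE sketch with scikit-learn on synthetic data. The choice of logistic regression as the base model and of three features to keep is arbitrary here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of which carry signal
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Refit the model repeatedly, dropping the weakest feature (step=1) each round
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
selector.fit(X, y)

selected_mask = selector.support_     # True for the features that survived
elimination_rank = selector.ranking_  # 1 = kept; larger = eliminated earlier
X_selected = selector.transform(X)
```

scikit-learn also provides `RFECV`, which chooses the number of features by cross-validation instead of taking it as a parameter.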

Embedded methods: Feature importance from the model itself.

Tree-based importance: Gradient boosted trees (XGBoost, LightGBM) output feature importance scores based either on how often each feature is used for splits or on how much those splits reduce the loss (gain). The two measures can rank features differently.
L1 regularization (LASSO): Drives irrelevant feature coefficients to zero.
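
The L1 effect is easy to see on synthetic data where only two of six features matter. This sketch assumes scikit-learn; the penalty strength `alpha=0.1` is picked by hand for the example, whereas in practice it would be tuned by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
# Only the first two columns influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

# Standardize first: the L1 penalty is applied equally to every coefficient,
# so features should be on the same scale
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1).fit(X_scaled, y)
kept = np.flatnonzero(lasso.coef_)  # indices of features with non-zero coefficients
```

The coefficients of the four irrelevant columns come out as exactly zero, not merely small, which is what makes LASSO usable as a selection step.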

Automated Feature Engineering

Featuretools (Alteryx, open-source): Automatically generates features by applying mathematical primitives across relational tables. Given tables for customers, transactions, and products, it can automatically generate: customer.transactions.MEAN(amount), customer.last_transaction.category, etc. Used by some Kaggle competitors.

Deep feature synthesis (DFS): The algorithm underlying Featuretools — recursively applies “feature primitives” (aggregation functions like mean/max/count and transformation functions like log/month) to entity relationships.
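
To make the idea concrete, here is a hand-written pandas version of one DFS step: aggregation primitives applied across a customer-to-transactions relationship. This is not the Featuretools API, and the tables and generated column names are illustrative only.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 15.0, 40.0],
})

# Aggregation primitives applied to the child table, grouped by the parent key
aggregated = transactions.groupby("customer_id")["amount"].agg(["mean", "max", "count"])
aggregated.columns = [f"{name.upper()}(transactions.amount)" for name in aggregated.columns]

# Left join keeps customers with no transactions; their aggregates are NaN
customer_features = customers.merge(
    aggregated.reset_index(), on="customer_id", how="left"
)
```

DFS automates this pattern: it enumerates the primitives and relationships, and stacks them to a chosen depth instead of requiring each aggregation to be written by hand.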

At companies like Uber (Michelangelo’s feature platform) and Airbnb (Chronon), feature engineering is productized: features are computed once and stored in a feature store, then retrieved by any model that needs them.

One thing to remember: The practical rule in data science is “features > models” — a good feature set with a simple model often outperforms a sophisticated model with poor features, because the model can only learn what the features allow it to learn.
