Feature Engineering — Core Concepts
Why Raw Data Is Usually Not Model-Ready
Most machine learning models expect numerical inputs, and many behave best when features share comparable ranges. Real-world data rarely arrives this way:
- Categorical variables: “Monday”, “Tuesday”, “Wednesday” — can’t subtract or multiply strings
- Different scales: Income ($0–$500,000) and age (0–100) span very different ranges, so the larger-valued feature dominates distance-based models unless both are rescaled
- Skewed distributions: House prices with a long tail break many model assumptions
- Temporal data: A timestamp means nothing without extracting “day of week”, “hour”, “time since last event”
- Missing values: Most models can’t handle NaN directly
Feature engineering addresses all of these systematically.
Core Transformations
Handling Numeric Features
Normalization/Standardization: Bring features to comparable scales.
- Min-max scaling: $x' = (x - x_{min}) / (x_{max} - x_{min})$ → range [0, 1]. Sensitive to outliers.
- Z-score standardization: $x' = (x - \mu) / \sigma$ → mean 0, std 1. Better for Gaussian-like distributions.
- Robust scaling: Use median and IQR instead of mean and std. Better for skewed distributions.
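The three scalers above can be compared on a small example. This is a minimal sketch with NumPy; the `income` values are hypothetical and chosen so that one outlier is present.

```python
import numpy as np

# Hypothetical income column with one extreme value.
income = np.array([30_000.0, 45_000.0, 52_000.0, 61_000.0, 500_000.0])

# Min-max scaling: the minimum maps to 0 and the maximum to 1.
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score standardization: subtract the mean, divide by the standard deviation.
z_score = (income - income.mean()) / income.std()

# Robust scaling: subtract the median, divide by the interquartile range (IQR).
q1, median, q3 = np.percentile(income, [25, 50, 75])
robust = (income - median) / (q3 - q1)

print(min_max.round(3))
print(z_score.round(3))
print(robust.round(3))
```

With min-max scaling the single outlier pushes the four typical incomes into a narrow band near 0, which is the outlier sensitivity noted above. Robust scaling keeps them spread out because the median and IQR barely move.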
Log transformation: For right-skewed, non-negative distributions (income, house prices, web traffic): $x' = \log(x + 1)$. The +1 keeps zero values defined. The transform compresses the tail and makes the distribution closer to normal, which helps many algorithms.
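A short sketch of the log transform, assuming the natural logarithm and hypothetical `prices` values. NumPy's `log1p` computes log(1 + x) directly, and `expm1` inverts it when predictions need to be mapped back to the original scale.

```python
import numpy as np

# Hypothetical right-skewed values, including a zero.
prices = np.array([0.0, 9.0, 99.0, 999.0, 99_999.0])

# log1p(x) = log(1 + x); zero stays at zero and the tail is compressed.
log_prices = np.log1p(prices)

# expm1 is the inverse: exp(x) - 1.
restored = np.expm1(log_prices)

print(log_prices.round(2))
```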
Winsorization: Clip values below the 1st percentile and above the 99th percentile to those percentile values. This keeps a handful of extreme observations from dominating the fitted model.
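Winsorization can be sketched with `np.percentile` and `np.clip`. The data here is synthetic (an assumption for illustration), and in practice the two thresholds would be computed on training data and reused on new data.

```python
import numpy as np

# Synthetic data: 1,000 typical values plus one extreme outlier.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(100.0, 10.0, size=1_000), [10_000.0]])

# Thresholds at the 1st and 99th percentiles.
low, high = np.percentile(values, [1, 99])

# Values outside [low, high] are replaced by the nearest threshold.
winsorized = np.clip(values, low, high)

print(values.max(), winsorized.max())
```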
Encoding Categorical Variables
One-hot encoding: Create a binary column for each category. “Color: red” → [red=1, blue=0, green=0]. Works when categories are unordered and few in number. Problem: high cardinality (1000 unique cities → 1000 new columns) creates dimensionality explosion.
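As an illustration, one-hot encoding can be done with pandas `get_dummies`. The `color` column is a hypothetical example.

```python
import pandas as pd

# Hypothetical unordered categorical column.
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)

print(one_hot)
```

Each row has exactly one 1 across the new columns, and the number of columns equals the number of distinct categories, which is why high cardinality becomes a problem.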
Label encoding: Map categories to integers. “red”→0, “blue”→1, “green”→2. Only appropriate for ordinal categories (S < M < L < XL). For nominal categories (colors, countries), implies false ordering.
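For an ordinal category, an explicit mapping keeps the intended order under your control. A minimal sketch with a hypothetical `sizes` column:

```python
import pandas as pd

# Hypothetical ordinal column.
sizes = pd.Series(["M", "S", "XL", "L", "M"])

# Explicit mapping that preserves the real order S < M < L < XL.
size_order = {"S": 0, "M": 1, "L": 2, "XL": 3}
size_encoded = sizes.map(size_order)

print(size_encoded.tolist())
```

Writing the mapping by hand avoids relying on alphabetical order, which would place "L" before "M" and "S".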
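Target encoding: Replace category with the mean target value for that category. “City: New York” → mean conversion rate for New York customers (0.23). Powerful, but the category means must be computed from training data only; using validation or test rows causes leakage. Even within the training set, encoding a row with a mean that includes its own target leaks the label, so K-fold (out-of-fold) target encoding computes each row’s value from the other folds.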
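One way to implement K-fold target encoding is sketched below. The function name `out_of_fold_target_encode`, its parameters, and the sample data are hypothetical, and the fallback to the fitting folds' mean for unseen categories is an assumption rather than a fixed rule.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(categories, target, n_splits=5, seed=0):
    """Encode each row with its category's target mean computed on the other folds."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index)

    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, encode_idx in folds.split(categories):
        fit_target = target.iloc[fit_idx]
        # Category means come only from rows outside the fold being encoded.
        means = fit_target.groupby(categories.iloc[fit_idx]).mean()
        fold_values = categories.iloc[encode_idx].map(means)
        # Categories absent from the fitting folds fall back to their overall mean.
        fold_values = fold_values.fillna(fit_target.mean())
        encoded.iloc[encode_idx] = fold_values.to_numpy()
    return encoded


# Hypothetical training data: city and whether the customer converted.
cities = ["NY", "NY", "SF", "SF", "NY", "LA", "SF", "NY", "LA", "SF"]
converted = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
city_encoded = out_of_fold_target_encode(cities, converted, n_splits=5)
print(city_encoded.round(2).tolist())
```

For validation and test rows, the category means would be computed once on the full training set and then applied unchanged.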
Embeddings: For high-cardinality categoricals (user IDs, product IDs), learn a dense embedding vector as part of model training. Widely used in recommendation systems (Netflix, Spotify) and fraud detection.
Temporal Features
Raw timestamps contain minimal signal. Extracting components reveals patterns:
- Cyclic time: hour_of_day, day_of_week, month_of_year. Use paired sine/cosine encoding to capture cyclicity: hour_sin = sin(2π × hour / 24), hour_cos = cos(2π × hour / 24). This places hour=23 and hour=0 next to each other (both near midnight), which integer encoding doesn’t. Both components are needed because sine alone maps distinct hours, such as 0 and 12, to the same value.
- Lag features: Value at T-1, T-7, T-30. Essential for time series prediction. “What was the value last week at this time?”
- Rolling statistics: Mean/max/std over the last N periods. Captures recent trends and volatility.
- Time since event: Days since last purchase, hours since last login. Captures recency.
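The temporal features listed above can be sketched with pandas. The tables and column names are hypothetical. The `hour_cos` column is included as an assumption so that each hour gets a unique position on the circle, and the rolling window is shifted by one step so that a row never sees its own value.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering three days.
ts = pd.DataFrame({
    "timestamp": pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(72), unit="h"),
    "value": np.arange(72, dtype=float),
})

# Cyclic encoding: the hour becomes a point on the unit circle.
hour = ts["timestamp"].dt.hour
ts["hour_sin"] = np.sin(2 * np.pi * hour / 24)
ts["hour_cos"] = np.cos(2 * np.pi * hour / 24)

# Lag feature: the value at the same hour one day earlier.
ts["lag_24"] = ts["value"].shift(24)

# Rolling statistic over the previous 6 hours, excluding the current row.
ts["rolling_mean_6"] = ts["value"].shift(1).rolling(window=6).mean()

# Time since event: days since each customer's previous purchase.
purchases = pd.DataFrame({
    "customer": ["a", "a", "b", "a"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-03", "2024-01-10"]),
})
purchases["days_since_last"] = purchases.groupby("customer")["date"].diff().dt.days

print(ts.loc[[0, 12, 23, 24], ["hour_sin", "hour_cos", "lag_24"]].round(3))
print(purchases)
```

Lag and rolling features leave missing values at the start of the series, and the first purchase of each customer has no previous event, so these rows need an explicit fill or drop policy.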
Interaction Features
Model the relationship between two variables explicitly: income × years_employed, room_count / house_sqft, price / competitor_price.
Linear models can’t discover these without being told — a polynomial feature expansion creates all pairwise interactions:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_interactions = poly.fit_transform(X)
For 10 original features, this creates 45 interaction features (10 × 9 / 2), in addition to the original columns. For 100 features, it creates 4,950, so you will usually need feature selection afterward to keep the dimensionality manageable.
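The counts follow from the number of unordered pairs, n(n − 1)/2. The sketch below checks this on random data (an assumption for illustration) and sets `include_bias=False` so the output holds only the original columns followed by the pairwise products.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: 5 rows, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))

# interaction_only=True skips squared terms; include_bias=False drops the constant column.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)

n_original = X.shape[1]
n_pairwise = X_interactions.shape[1] - n_original

print(n_pairwise, comb(10, 2), comb(100, 2))
```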
Feature Selection
Not all engineered features are useful. Including irrelevant or redundant features hurts model performance (noise, slower training) and interpretability.
Filter methods: Score each feature independently of the model.
- Correlation: When two features have an absolute correlation above a threshold such as 0.9, drop one of the pair (redundancy)
- Mutual information: Measures statistical dependence between feature and target. Works for non-linear relationships.
- Chi-squared test: For categorical (or non-negative count) features and a categorical target, measures the statistical association between the two.
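Two of the filter methods can be sketched on synthetic data. The feature names, the 0.9 threshold, and the label construction are assumptions for illustration: the label depends on `signal` only through its absolute value, a relationship that linear correlation misses but mutual information picks up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
features = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + rng.normal(scale=0.01, size=n),  # near-duplicate feature
    "noise": rng.normal(size=n),
})
# Non-linear dependence: the label is 1 when |signal| is large.
label = (np.abs(signal) > 1.0).astype(int)

# Redundancy filter: look at the upper triangle of the absolute correlation matrix
# and drop the later column of every pair above the threshold.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = features.drop(columns=to_drop)

# Relevance filter: mutual information between each remaining feature and the label.
mi = pd.Series(
    mutual_info_classif(reduced, label, random_state=0),
    index=reduced.columns,
)

print(to_drop)
print(mi.round(3))
```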
Wrapper methods: Use a model to evaluate feature subsets.
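- Recursive Feature Elimination (RFE): Train the model, remove the least important features, and repeat until a target number of features remains. Cross-validated variants choose that number by tracking where performance starts to degrade. Computationally expensive but effective.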
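A minimal RFE sketch with scikit-learn follows. The synthetic dataset, the logistic regression estimator, and the choice of keeping 3 features are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Each round refits the estimator and removes the weakest feature (step=1)
# until only n_features_to_select remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
selector.fit(X, y)

selected = selector.support_          # boolean mask of kept features
X_selected = selector.transform(X)    # data restricted to the kept features

print(selected)
print(selector.ranking_)              # 1 marks a kept feature
```

When the right number of features is not known in advance, scikit-learn's `RFECV` selects it by cross-validation at the cost of more model fits.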
Embedded methods: Feature importance from the model itself.
- Tree-based importance: Gradient boosted trees (XGBoost, LightGBM) output feature importance scores based on how often each feature is used for splits or how much those splits reduce the training loss (gain).
- L1 regularization (LASSO): Drives irrelevant feature coefficients to zero.
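The effect of L1 regularization can be seen on synthetic data where only two of six features influence the target. The data and the value `alpha=0.1` are assumptions for illustration; in practice the penalty strength is tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression: only features 0 and 1 affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks all coefficients and sets weak ones exactly to zero.
model = Lasso(alpha=0.1).fit(X, y)

# Indices of features whose coefficient survived.
kept = np.flatnonzero(model.coef_)

print(model.coef_.round(3))
print(kept)
```

The surviving coefficients are also shrunk toward zero, so a common follow-up is to refit an unpenalized model on the selected features.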
Automated Feature Engineering
Featuretools (Alteryx, open-source): Automatically generates features by applying mathematical primitives across relational tables. Given tables for customers, transactions, and products, it can automatically generate: customer.transactions.MEAN(amount), customer.last_transaction.category, etc. Used by some Kaggle competitors.
Deep feature synthesis (DFS): The algorithm underlying Featuretools — recursively applies “feature primitives” (aggregation functions like mean/max/count and transformation functions like log/month) to entity relationships.
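To make the idea concrete, the sketch below hand-rolls a few DFS-style features with pandas. It does not use the Featuretools API. The tables, column names, and feature names are hypothetical.

```python
import pandas as pd

# Hypothetical parent table (customers) and child table (transactions).
customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 15.0, 40.0],
    "time": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-01-25", "2024-03-01"]
    ),
})

# Transformation primitive: derive a new column within a single table.
transactions["month"] = transactions["time"].dt.month

# Aggregation primitives: summarize child rows per parent across the relationship.
aggregated = transactions.groupby("customer_id").agg(
    mean_amount=("amount", "mean"),
    max_amount=("amount", "max"),
    count_transactions=("amount", "count"),
    num_unique_months=("month", "nunique"),  # aggregation stacked on a transformation
).reset_index()

# Attach the generated features to the parent table.
customer_features = customers.merge(aggregated, on="customer_id", how="left")
print(customer_features)
```

The `num_unique_months` column is the "deep" part: an aggregation applied to the output of a transformation. DFS automates this stacking across every primitive and relationship instead of writing each combination by hand.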
At companies like Uber (Michelangelo’s feature platform) and Airbnb (Chronon), feature engineering is productized: features are computed once and stored in a feature store, then retrieved by any model that needs them.
One thing to remember: The practical rule in data science is “features > models” — a good feature set with a simple model often outperforms a sophisticated model with poor features, because the model can only learn what the features allow it to learn.