Scikit-Learn Feature Selection — Core Concepts

Three strategies for picking the right features in scikit-learn — filter, wrapper, and embedded methods explained with practical guidance.

Why feature selection matters

High-dimensional data creates problems that are hard to fix just by collecting more data. The curse of dimensionality means that as the number of features grows, the volume of the feature space grows exponentially and data points become sparse. Keeping the same sampling density would take exponentially more samples, so the practical alternative is to reduce the number of features.

Beyond statistical concerns, feature selection improves model interpretability, reduces training and inference time, decreases storage costs, and can even improve accuracy by removing noisy or redundant features.

Three categories of feature selection

Filter methods

Filter methods score each feature independently using a statistical test, then keep the top-scoring ones. They’re fast because they don’t involve training a model.

Common approaches in scikit-learn:

SelectKBest — keep the K highest-scoring features
SelectPercentile — keep the top X% of features
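VarianceThreshold — remove features whose variance falls below a threshold (the default of 0 removes only constant columns); unlike the two selectors above, it never looks at the target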

Scoring functions depend on the problem type:

Classification: chi2 (requires non-negative features, such as counts), f_classif, mutual_info_classif
Regression: f_regression, mutual_info_regression

Strength: Fast, model-agnostic, good for initial dimensionality reduction. Weakness: Ignores feature interactions — a feature useless alone might be powerful in combination.
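
As a rough illustration of the filter approach, the sketch below drops constant columns with VarianceThreshold and then keeps the five best features by ANOVA F-score. The synthetic dataset, k=5, and the variable names are assumptions chosen for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic data: 20 features, 5 of them informative (illustrative values).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# Drop constant columns first; threshold=0.0 is the default.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Score each remaining feature on its own with the ANOVA F-test, keep the top 5.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_var, y)

# Column indices (within X_var) of the features that survived.
selected_idx = selector.get_support(indices=True)
print(X_selected.shape)
print(selected_idx)
```

Swapping f_classif for mutual_info_classif would also pick up non-linear relationships, at a higher computational cost.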

Wrapper methods

Wrapper methods evaluate feature subsets by training and scoring a model for each subset. They capture interactions but are computationally expensive.

Scikit-learn provides:

RFE (Recursive Feature Elimination) — trains a model that exposes coefficients or feature importances, removes the least important feature (or several at once, via the step parameter), and repeats until reaching the desired count
RFECV — RFE with cross-validation to automatically find the optimal number of features
SequentialFeatureSelector — adds or removes features one at a time, evaluating each step with cross-validation

Strength: Considers feature interactions, adapts to the specific model being used. Weakness: Slow for many features, results are model-specific.
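
The three wrapper classes above share a similar interface. The following is a minimal sketch on synthetic data; the estimator, the target of 4 features, and cv=5 are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_redundant=0, random_state=0
)
estimator = LogisticRegression(max_iter=1000)

# RFE: refit the model and drop one feature per iteration until 4 remain.
rfe = RFE(estimator, n_features_to_select=4, step=1).fit(X, y)

# RFECV: same elimination, but the feature count is chosen by CV score.
rfecv = RFECV(estimator, step=1, cv=5, scoring="accuracy").fit(X, y)

# Forward sequential selection: add the feature that most improves the
# cross-validated score, one at a time, until 4 are selected.
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=4, direction="forward", cv=5
).fit(X, y)

print("RFE kept:  ", rfe.get_support(indices=True))
print("RFECV kept:", rfecv.n_features_, "features")
print("SFS kept:  ", sfs.get_support(indices=True))
```

RFE relies on the estimator's coef_ or feature_importances_, while SequentialFeatureSelector only needs a score, so it also works with models that expose neither.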

Embedded methods

Embedded methods perform feature selection as part of model training. The model itself learns which features matter.

Key examples:

L1 regularization (Lasso) — drives unimportant feature coefficients to exactly zero
Tree-based feature importance — Random Forests and gradient boosting rank features by their contribution to splits
SelectFromModel — wraps any estimator that exposes an importance attribute such as coef_ or feature_importances_ and keeps the features whose importance is above a threshold

Strength: Efficient, considers interactions, integrated with training. Weakness: Selection depends on the model — features important for a tree might not be important for a linear model.
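
One way to use tree-based importances for selection is to wrap a forest in SelectFromModel. This is a sketch under assumed settings: the synthetic data, 100 trees, and the "median" threshold are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# Fit the forest, then keep features whose impurity-based importance
# is at least the median importance across all features.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="median").fit(X, y)

X_reduced = selector.transform(X)
importances = selector.estimator_.feature_importances_

print(X_reduced.shape)
print(selector.get_support(indices=True))
```

The fitted forest is available as selector.estimator_, so the raw importances can be inspected alongside the selection mask.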

Choosing a strategy

Many features (hundreds+), need quick reduction: Start with filter methods to eliminate obvious noise, then apply wrapper or embedded methods to the reduced set.

Moderate features (10-100), need precision: Use RFECV or SequentialFeatureSelector to find the optimal subset.

Using tree-based models: Leverage built-in feature importance via SelectFromModel. It’s fast because it reuses the importances your model already computes during training.

Need interpretable results: L1 regularization produces sparse models where you can directly read which features have non-zero coefficients.
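
Reading the selected features off an L1-regularized model can look like the sketch below. The regression dataset and alpha=1.0 are assumptions for illustration; in practice alpha would be tuned, for example with LassoCV.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(
    n_samples=200, n_features=15, n_informative=4, noise=5.0, random_state=0
)

# L1 penalties are sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X_scaled, y)

# Features with a non-zero coefficient are the ones the model kept.
kept = np.flatnonzero(lasso.coef_)
print("Kept feature indices:", kept)
print("Their coefficients:  ", lasso.coef_[kept])
```

A larger alpha zeroes out more coefficients, so the penalty strength directly controls how many features survive.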

Common misconception

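It is tempting to run feature selection once on the full dataset and only then cross-validate the model, but this causes data leakage. Feature scores computed on all data (including validation samples) produce optimistically biased results. Always perform feature selection inside the cross-validation loop, most easily by making the selector a pipeline step. RFECV cross-validates the choice of feature count internally, but because its scores were used to make that choice, final performance should still be estimated with an outer cross-validation loop or a held-out test set.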
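
The leakage described above can be made visible on pure noise, where no honest model should beat chance. The sketch below compares selecting on the full data with selecting inside a pipeline; the array sizes, k=20, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))  # pure noise features
y = rng.integers(0, 2, size=60)  # labels unrelated to X

# Leaky: the selector sees every sample, including future validation folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Correct: the selector is a pipeline step, so it is refit on the
# training portion of each fold and never sees the validation samples.
pipeline = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest_scores = cross_val_score(pipeline, X, y, cv=5)

print(f"selection before CV: {leaky_scores.mean():.2f}")
print(f"selection inside CV: {honest_scores.mean():.2f}")
```

On data like this the first number typically comes out well above chance even though there is no signal, while the pipeline version stays close to 0.5.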

Practical workflow

Remove constant and near-constant features (VarianceThreshold)
Check for highly correlated feature pairs — keep one from each pair
Apply filter methods to reduce to a manageable set
Use embedded or wrapper methods for fine selection
Validate that selected features generalize with cross-validation
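
The five steps above can be sketched end to end as follows. drop_correlated is a hypothetical helper written for this example (scikit-learn has no built-in correlation filter), and the dataset, thresholds, and k are assumptions. Steps 1 and 2 never look at y, so for brevity this sketch runs them up front; a stricter setup would wrap them in pipeline steps as well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    SelectFromModel, SelectKBest, VarianceThreshold, f_classif,
)
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


def drop_correlated(X, threshold=0.95):
    """Hypothetical helper: drop any column highly correlated with an earlier one."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    upper = np.triu(corr, k=1)  # each pair once, diagonal excluded
    keep = ~(upper > threshold).any(axis=0)
    return X[:, keep], keep


X_raw, y = make_classification(
    n_samples=300, n_features=30, n_informative=6, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)
near_copy = X_raw[:, [0]] + rng.normal(scale=0.01, size=(300, 1))
constant = np.ones((300, 1))
X_raw = np.hstack([X_raw, near_copy, constant])  # 32 columns

# Step 1: remove constant features.
X_var = VarianceThreshold().fit_transform(X_raw)

# Step 2: keep one feature from each highly correlated pair.
X_uncorr, kept = drop_correlated(X_var, threshold=0.95)

# Steps 3 and 4: a filter, then an embedded selector, both inside the pipeline.
pipeline = Pipeline([
    ("filter", SelectKBest(f_classif, k=15)),
    ("embedded", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="median",
    )),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Step 5: validate the whole chain with cross-validation.
scores = cross_val_score(pipeline, X_uncorr, y, cv=5)
print(X_raw.shape[1], "->", X_var.shape[1], "->", X_uncorr.shape[1], "features")
print(f"mean CV accuracy: {scores.mean():.2f}")
```

Because both supervised selectors live inside the pipeline, each cross-validation fold refits them on its own training portion.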

One thing to remember: The best feature selection strategy depends on your dataset size, feature count, and model type. Start simple with filters, then refine with model-based methods — and always do selection inside your cross-validation loop.
