Scikit-Learn Feature Selection — Core Concepts
Why feature selection matters
High-dimensional data creates problems that more data alone can’t solve. The curse of dimensionality means that as the number of features increases, the volume of the feature space grows exponentially and data points become sparse. To maintain performance you need either exponentially more samples or fewer features.
Beyond statistical concerns, feature selection improves model interpretability, reduces training and inference time, decreases storage costs, and can even improve accuracy by removing noisy or redundant features.
Three categories of feature selection
Filter methods
Filter methods score each feature independently using a statistical test, then keep the top-scoring ones. They’re fast because they don’t involve training a model.
Common approaches in scikit-learn:
- SelectKBest: keep the K highest-scoring features
- SelectPercentile: keep the top X% of features
- VarianceThreshold: remove features whose variance does not exceed a threshold (by default, only constant columns)
Scoring functions depend on the problem type:
- Classification: chi2, f_classif, mutual_info_classif
- Regression: f_regression, mutual_info_regression
Strength: Fast, model-agnostic, good for initial dimensionality reduction.
Weakness: Ignores feature interactions. A feature that is useless alone might be powerful in combination.
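A minimal sketch of a filter step, assuming synthetic data from make_classification and an arbitrary choice of k=5 (neither comes from the text above). It removes constant columns, then scores each feature with the ANOVA F-test and keeps the highest-scoring ones:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic data (an assumption for illustration): 20 features, 5 informative.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# Drop constant columns first; threshold=0.0 is the default.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Score each remaining feature independently and keep the 5 best.
selector = SelectKBest(score_func=f_classif, k=5)
X_top = selector.fit_transform(X_var, y)

print(X_top.shape)                         # (200, 5)
print(selector.get_support(indices=True))  # column indices of the kept features
```

Swapping f_classif for chi2 or mutual_info_classif changes only the scoring function. Note that chi2 expects non-negative feature values.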
Wrapper methods
Wrapper methods evaluate feature subsets by training and scoring a model for each subset. They capture interactions but are computationally expensive.
Scikit-learn provides:
- RFE (Recursive Feature Elimination) — trains a model, removes the least important feature, repeats until reaching the desired count
- RFECV — RFE with cross-validation to automatically find the optimal number of features
- SequentialFeatureSelector — adds or removes features one at a time, evaluating each step with cross-validation
Strength: Considers feature interactions, adapts to the specific model being used.
Weakness: Slow for many features, results are model-specific.
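As a rough illustration of a wrapper method, the sketch below runs RFE around a logistic regression. The synthetic dataset, the estimator, and the target of 3 features are assumptions made for the example:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# RFE refits the estimator repeatedly, dropping `step` features per round
# (ranked by the magnitude of the fitted coefficients) until 3 remain.
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    step=1,
)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # 1 marks a selected feature; larger means dropped earlier
```

RFECV accepts the same estimator plus a cv argument and chooses the number of features itself, at the cost of many more model fits.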
Embedded methods
Embedded methods perform feature selection as part of model training. The model itself learns which features matter.
Key examples:
- L1 regularization (Lasso): drives unimportant feature coefficients to exactly zero
- Tree-based feature importance: Random Forests and gradient boosting rank features by their contribution to splits
- SelectFromModel: wraps any estimator that exposes coef_ or feature_importances_ after fitting and keeps the features whose importance exceeds a threshold
Strength: Efficient, considers interactions, integrated with training.
Weakness: Selection depends on the model. Features important for a tree might not be important for a linear model.
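One way to sketch embedded selection is to wrap a Lasso model in SelectFromModel. The synthetic regression data and the penalty strength alpha=1.0 are illustrative assumptions; in practice alpha would be tuned, and features are usually standardized first:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): 30 features, 4 informative.
X, y = make_regression(
    n_samples=300, n_features=30, n_informative=4, noise=10.0, random_state=0
)

# alpha=1.0 is an arbitrary illustrative penalty strength.
selector = SelectFromModel(Lasso(alpha=1.0))
X_sel = selector.fit_transform(X, y)

# The fitted Lasso is available as selector.estimator_.
n_zero = int((selector.estimator_.coef_ == 0).sum())
print(f"{n_zero} coefficients driven to zero, {X_sel.shape[1]} features kept")
```

Replacing the Lasso with a RandomForestRegressor would switch the selection criterion from coefficients to feature_importances_ without changing the rest of the code.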
Choosing a strategy
Many features (hundreds+), need quick reduction: Start with filter methods to eliminate obvious noise, then apply wrapper or embedded methods to the reduced set.
Moderate features (10-100), need precision: Use RFECV or SequentialFeatureSelector to find the optimal subset.
Using tree-based models: Use the built-in feature importance via SelectFromModel. It is fast and reflects what the model itself relies on.
Need interpretable results: L1 regularization produces sparse models where you can directly read which features have non-zero coefficients.
Common misconception
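A common mistake is to run feature selection on the full dataset and only then cross-validate the model. This causes data leakage: feature scores computed on all data (including validation samples) produce optimistically biased results. Always perform feature selection inside the cross-validation loop, most simply by making the selector a pipeline step. RFECV cross-validates internally to pick the number of features, but if you report a final score on the same data, wrap RFECV itself in a pipeline and evaluate that with an outer cross-validation.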
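The effect can be sketched on pure noise, where no feature carries real signal. The data shape, k=20, and the classifier below are assumptions chosen for the demonstration. The leaky variant selects features using every label, while the pipeline variant refits the selector on the training part of each fold:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features
y = rng.integers(0, 2, size=100)  # labels unrelated to X

# Leaky: features are chosen using all labels, including validation folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
)

# Correct: the selector is a pipeline step, so it is refit per training fold.
pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_scores = cross_val_score(pipe, X, y, cv=5)

print("leaky mean accuracy: ", leaky_scores.mean())
print("honest mean accuracy:", honest_scores.mean())
```

Since the labels are random, an unbiased estimate should sit near chance level. The leaky estimate typically lands well above it, which is the optimistic bias described above.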
Practical workflow
- Remove constant and near-constant features (VarianceThreshold)
- Check for highly correlated feature pairs and keep one from each pair
- Apply filter methods to reduce to a manageable set
- Use embedded or wrapper methods for fine selection
- Validate that selected features generalize with cross-validation
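The steps above can be sketched as a single pipeline so that every selection stage is refit inside each cross-validation fold. CorrelationFilter is a hypothetical helper written for this example (scikit-learn has no built-in transformer with that name), and the data, thresholds, and k are illustrative assumptions:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    SelectFromModel, SelectKBest, VarianceThreshold, f_classif,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class CorrelationFilter(TransformerMixin, BaseEstimator):
    """Hypothetical helper: keep the first column of each correlated pair."""

    def __init__(self, limit=0.95):
        self.limit = limit

    def fit(self, X, y=None):
        corr = np.atleast_2d(np.abs(np.corrcoef(X, rowvar=False)))
        keep = []
        for j in range(corr.shape[0]):
            # Keep column j only if it is not too correlated with a kept one.
            if all(corr[j, i] < self.limit for i in keep):
                keep.append(j)
        self.keep_ = np.array(keep)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.keep_]


# Synthetic data: 30 independent columns, one near-copy, one constant column.
rng = np.random.default_rng(0)
X_base = rng.normal(size=(300, 30))
y = (X_base[:, 0] + X_base[:, 1] - X_base[:, 2] > 0).astype(int)
near_copy = X_base[:, [0]] + 0.01 * rng.normal(size=(300, 1))
X = np.hstack([X_base, near_copy, np.ones((300, 1))])

pipe = Pipeline([
    ("variance", VarianceThreshold()),          # step 1: constant columns
    ("correlation", CorrelationFilter(0.95)),   # step 2: correlated pairs
    ("filter", SelectKBest(f_classif, k=15)),   # step 3: coarse filter
    ("embedded", SelectFromModel(               # step 4: fine selection
        RandomForestClassifier(n_estimators=100, random_state=0))),
    ("model", LogisticRegression(max_iter=1000)),
])

# Step 5: cross-validate the whole pipeline, selection included.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

pipe.fit(X, y)
print(pipe[:2].transform(X).shape)  # (300, 30): constant and near-copy dropped
```

Keeping the unsupervised steps in the pipeline as well is a design choice. They do not look at the labels, but fitting them per fold keeps the evaluation fully separated from the validation data.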
One thing to remember: The best feature selection strategy depends on your dataset size, feature count, and model type. Start simple with filters, then refine with model-based methods — and always do selection inside your cross-validation loop.