Scikit-Learn Feature Selection — Deep Dive
Technical foundation
Feature selection approaches the bias-variance tradeoff through the number of features. Adding irrelevant features increases model variance (more parameters to estimate from the same data) without reducing bias. Removing them constrains the hypothesis space, reducing variance at minimal bias cost, provided you don’t remove informative features.
Formally, for ordinary least squares with p features and n samples, the variance component of the expected in-sample prediction error is σ²p/n, where σ² is the noise variance. Reducing p directly reduces this term, which is why feature selection often improves generalization even when the dropped features carry weak signal.
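As a rough check on the σ²p/n term, the sketch below simulates ordinary least squares on synthetic data and measures how far the fitted values drift from the true signal. The function name `ols_excess_error`, the sample sizes, and the setup with a single informative feature are illustrative assumptions, not something the analysis above prescribes.

```python
import numpy as np

def ols_excess_error(n, p, sigma=1.0, n_trials=200, seed=0):
    """Mean squared gap between OLS fitted values and the true signal."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))      # fixed design, reused across trials
    beta = np.zeros(p)
    beta[0] = 1.0                    # only the first feature is informative
    signal = X @ beta
    gaps = []
    for _ in range(n_trials):
        y = signal + rng.normal(scale=sigma, size=n)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        gaps.append(np.mean((X @ beta_hat - signal) ** 2))
    return float(np.mean(gaps))

n = 200
for p in (5, 20, 50):
    print(f"p={p:2d}  simulated={ols_excess_error(n, p):.4f}  sigma^2*p/n={p / n:.4f}")
```

The simulated values should track p/n closely here because σ is 1; every extra noise column adds roughly σ²/n of variance without improving the fit to the signal.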
Filter methods in depth
VarianceThreshold — the first pass
from sklearn.feature_selection import VarianceThreshold
# Remove features with variance below 0.01 (constant or near-constant columns).
# For binary features, variance = p(1-p), so a threshold of 0.8*(1-0.8)=0.16
# would instead remove binary features where >80% of values share one class.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(f"Features: {X.shape[1]} → {X_reduced.shape[1]}")
print(f"Removed: {(~selector.get_support()).sum()} constant/near-constant features")
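To make the p(1-p) rule for binary features concrete, here is a small sketch with three hand-built binary columns. The array `X_bin` and its class shares are made up for illustration; the 0.16 threshold is the one mentioned in the comment above.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three hypothetical binary columns with different shares of ones.
X_bin = np.column_stack([
    np.r_[np.ones(95), np.zeros(5)],    # 95% ones -> variance 0.0475
    np.r_[np.ones(50), np.zeros(50)],   # 50% ones -> variance 0.25
    np.r_[np.ones(70), np.zeros(30)],   # 70% ones -> variance 0.21
])

selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))
selector.fit(X_bin)
print("variances:", selector.variances_)
print("kept:", selector.get_support())
```

Only the 95/5 column falls below the threshold, so it is the only one dropped.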
Statistical tests with SelectKBest
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
# ANOVA F-test: checks whether a feature's mean differs across classes (a linear effect)
selector_f = SelectKBest(score_func=f_classif, k=20)
X_f = selector_f.fit_transform(X, y)
# Mutual information — captures nonlinear dependencies
selector_mi = SelectKBest(score_func=mutual_info_classif, k=20)
X_mi = selector_mi.fit_transform(X, y)
# Compare selected features
import numpy as np
f_selected = set(np.where(selector_f.get_support())[0])
mi_selected = set(np.where(selector_mi.get_support())[0])
print(f"Overlap: {len(f_selected & mi_selected)} / 20")
print(f"Only in F-test: {f_selected - mi_selected}")
print(f"Only in MI: {mi_selected - f_selected}")
When they disagree: Features selected by mutual information but not by the F-test likely have a nonlinear relationship with the target, one that leaves the class means nearly equal. Features selected by the F-test but not by MI may have weak linear effects that MI’s estimation noise obscures. In practice, taking the union is a conservative default when you can afford the extra features.
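The disagreement is easy to reproduce on synthetic data. In this sketch (all feature names and distributions are invented for illustration), the target depends on the magnitude of one feature but not its sign, so both classes have roughly the same mean and the F-test has little to work with, while mutual information still picks up the dependence.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
x_nonlinear = rng.uniform(-1, 1, size=n)
y = (np.abs(x_nonlinear) > 0.5).astype(int)    # depends on |x|, not on x's sign
x_noise = rng.normal(size=n)                   # unrelated to y
x_linear = y + rng.normal(scale=0.5, size=n)   # class means differ

X_demo = np.column_stack([x_nonlinear, x_noise, x_linear])
f_scores, _ = f_classif(X_demo, y)
mi_scores = mutual_info_classif(X_demo, y, random_state=0)
for name, f, mi in zip(["nonlinear", "noise", "linear"], f_scores, mi_scores):
    print(f"{name:9s}  F={f:8.2f}  MI={mi:.3f}")
```

The linear feature should dominate the F ranking, while the nonlinear feature should score well above the noise feature only under MI.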
Mutual information estimation details
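mutual_info_classif estimates MI with a k-nearest-neighbor method and adds a small amount of random noise to continuous features, so scores can differ across runs unless random_state is fixed. Fixing the seed makes a run reproducible; averaging over several seeds also dampens the estimation noise: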
from sklearn.feature_selection import mutual_info_classif
# Average across multiple runs to stabilize MI estimates
mi_scores = np.array([
    mutual_info_classif(X, y, random_state=seed)
    for seed in range(10)
]).mean(axis=0)
Wrapper methods
Recursive Feature Elimination with CV
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,  # remove 1 feature per iteration
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring='f1_weighted',
    min_features_to_select=5,
    n_jobs=-1,
)
rfecv.fit(X, y)
print(f"Optimal features: {rfecv.n_features_}")
print(f"Selected: {np.where(rfecv.support_)[0]}")
# Plot number of features vs CV score (this x-axis assumes step=1)
plt.figure(figsize=(10, 5))
n_features_range = range(rfecv.min_features_to_select, X.shape[1] + 1)
plt.plot(n_features_range, rfecv.cv_results_['mean_test_score'])
plt.fill_between(
    n_features_range,
    rfecv.cv_results_['mean_test_score'] - rfecv.cv_results_['std_test_score'],
    rfecv.cv_results_['mean_test_score'] + rfecv.cv_results_['std_test_score'],
    alpha=0.2,
)
plt.xlabel('Number of Features')
plt.ylabel('CV F1 Score')
plt.title('RFECV Feature Selection')
plt.tight_layout()
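Performance tip: For datasets with 100+ features, set step=0.1 to remove 10% of the original feature count per iteration (scikit-learn rounds the product down and removes at least one feature) instead of one at a time. With 100 features and min_features_to_select=5, this cuts elimination from 95 rounds to 10, usually with little impact on selection quality, though the search over feature counts becomes coarser.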
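The effect of step on the number of elimination rounds can be estimated without fitting anything. The helper below, `rfe_rounds`, is a hypothetical function written for this article; it assumes a fractional step is applied to the original feature count, rounded down, with a floor of one feature, which matches my reading of scikit-learn's RFE but is worth checking against your installed version.

```python
def rfe_rounds(n_features, min_features_to_select, step):
    """Count elimination rounds for one RFE run.

    Assumption: a float step in (0, 1) is a fraction of the ORIGINAL
    feature count, rounded down, and at least one feature is removed.
    """
    if 0.0 < step < 1.0:
        per_round = int(max(1, step * n_features))
    else:
        per_round = int(step)
    remaining, rounds = n_features, 0
    while remaining > min_features_to_select:
        # The last round removes only what is needed to hit the minimum.
        remaining -= min(per_round, remaining - min_features_to_select)
        rounds += 1
    return rounds

for step in (1, 5, 0.1):
    print(f"step={step}: {rfe_rounds(100, 5, step)} rounds")
```

RFECV repeats these rounds once per fold and once more for the final fit, so the saving multiplies accordingly.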
Sequential Feature Selector
from sklearn.feature_selection import SequentialFeatureSelector
# Forward selection: starts empty, adds the best feature each step
sfs_forward = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=15,
    direction='forward',
    scoring='f1_weighted',
    cv=5,
    n_jobs=-1,
)
# Backward selection: starts full, removes the worst feature each step
sfs_backward = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=15,
    direction='backward',
    scoring='f1_weighted',
    cv=5,
    n_jobs=-1,
)
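Forward selection is cheaper when selecting few features from many. Backward selection is cheaper when removing few features. Counted in model fits, the break-even point is roughly k = p/2, where k is the target number of features and p is the total. Backward fits train on larger feature sets, so each one tends to cost more, which nudges the practical break-even toward forward selection.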
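The break-even claim follows from counting candidate evaluations. The sketch below uses a hypothetical helper, `sfs_model_fits`, which assumes each step scores every remaining candidate with a full round of cross-validation; it counts fits only and ignores that fits on more features take longer.

```python
def sfs_model_fits(p, k, direction, cv=5):
    """Model fits needed to sequentially select k of p features."""
    n_steps = k if direction == "forward" else p - k
    # At step i, p - i candidates remain to be added (forward)
    # or considered for removal (backward).
    candidates = sum(p - i for i in range(n_steps))
    return candidates * cv

p = 100
for k in (10, 50, 90):
    fwd = sfs_model_fits(p, k, "forward")
    bwd = sfs_model_fits(p, k, "backward")
    print(f"k={k:2d}  forward={fwd:6d}  backward={bwd:6d}")
```

For p = 100 the two directions need the same number of fits at k = 50, and the gap widens quickly on either side.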
Embedded methods
L1 selection with SelectFromModel
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# L1 regularization drives unimportant coefficients to exactly zero
l1_selector = SelectFromModel(
    LogisticRegression(penalty='l1', C=0.1, solver='saga', max_iter=5000, random_state=42),
    threshold='mean',  # keep features whose |coefficient| is at least the mean |coefficient|
)
l1_selector.fit(X, y)
X_l1 = l1_selector.transform(X)
print(f"L1 selected {X_l1.shape[1]} features")
# Inspect which features survived
selected_mask = l1_selector.get_support()
coefficients = l1_selector.estimator_.coef_[0]  # first row; binary problems have only one
for i, (selected, coef) in enumerate(zip(selected_mask, coefficients)):
    if selected:
        print(f"  Feature {i}: coefficient = {coef:.4f}")
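How many features survive L1 selection depends heavily on C. The sketch below sweeps a few values on a synthetic dataset; the dataset, the C grid, and the choice of the liblinear solver (picked here only for speed) are assumptions for illustration. It leaves threshold at its default, which for L1-penalized estimators keeps essentially every non-zero coefficient.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)  # L1 penalties are scale-sensitive

counts = []
for C in (0.001, 0.01, 0.1, 1.0):
    selector = SelectFromModel(
        LogisticRegression(penalty="l1", C=C, solver="liblinear")
    )
    selector.fit(X_demo, y_demo)
    counts.append(int(selector.get_support().sum()))
    print(f"C={C:<6}  features kept: {counts[-1]}")
```

Smaller C means a stronger penalty and fewer surviving features, so C is worth tuning inside cross-validation rather than fixing by hand.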
Tree-based importance selection
from sklearn.ensemble import GradientBoostingClassifier
gb_selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=200, max_depth=5, random_state=42),
    threshold='1.5*median',  # keep features above 1.5x median importance
    prefit=False,
)
gb_selector.fit(X, y)
X_gb = gb_selector.transform(X)
print(f"GB selected {X_gb.shape[1]} features")
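String thresholds such as '1.5*median' are resolved against the fitted importances, and the resolved number is exposed as threshold_. The sketch below checks this on synthetic data; the dataset and the smaller random forest are stand-ins chosen to keep the example quick, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, n_informative=3, random_state=0
)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="1.5*median",
)
selector.fit(X_demo, y_demo)

importances = selector.estimator_.feature_importances_
print("median importance:", np.median(importances))
print("resolved threshold:", selector.threshold_)
print("kept:", np.where(selector.get_support())[0])
```

Printing threshold_ alongside the importances is a quick way to see whether a scaling factor like 1.5 is cutting too deep for a given model.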
Leak-free pipeline integration
Feature selection must happen inside cross-validation to prevent data leakage:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# CORRECT: selection inside pipeline, inside CV
pipe = Pipeline([
    ('variance', VarianceThreshold(threshold=0.01)),
    ('kbest', SelectKBest(f_classif, k=20)),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1_weighted')
print(f"Leak-free CV: {scores.mean():.4f} ± {scores.std():.4f}")
# WRONG: selection before CV leaks validation data into feature scores
selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X, y) # LEAK: sees all data including future validation
scores = cross_val_score(RandomForestClassifier(), X_selected, y, cv=5)
# These scores are optimistically biased
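The size of the bias is easiest to see on data with no signal at all. In the sketch below, both features and labels are random noise (an artificial setup chosen for illustration), so an honest estimate should sit near 50% accuracy. Logistic regression is used in place of a random forest only to keep the run short.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # features carry no information
y_noise = rng.integers(0, 2, size=100)   # labels are coin flips

# Leaky: pick the 20 "best" features using every label, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5)

# Leak-free: selection is refit on the training part of each fold.
pipe_demo = Pipeline([
    ("kbest", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
clean = cross_val_score(pipe_demo, X_noise, y_noise, cv=5)

print(f"leaky CV accuracy:     {leaky.mean():.2f}")
print(f"leak-free CV accuracy: {clean.mean():.2f}")
```

With many more features than samples, some noise columns correlate with the labels by chance, and selecting them on the full dataset carries that chance correlation into every validation fold.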
Correlation-based redundancy removal
Scikit-learn doesn’t provide this directly, but it’s a useful preprocessing step when several features carry nearly the same information:
import pandas as pd
def remove_correlated_features(X, threshold=0.95):
    """Remove features with pairwise correlation above threshold."""
    if isinstance(X, np.ndarray):
        X = pd.DataFrame(X)
    corr_matrix = X.corr().abs()
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    to_drop = [col for col in upper_triangle.columns
               if any(upper_triangle[col] > threshold)]
    print(f"Removing {len(to_drop)} highly correlated features")
    return X.drop(columns=to_drop), to_drop
X_clean, dropped = remove_correlated_features(pd.DataFrame(X), threshold=0.95)
Wrap this as a custom transformer (see python-sklearn-custom-transformers) to include it in pipelines.
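A minimal sketch of such a transformer follows. The class name `CorrelationFilter` and the attribute `support_` are hypothetical choices for this article, not scikit-learn API. It applies the same upper-triangle rule as the function above, but learns which columns to drop in fit so the decision is made on training folds only.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Drop each feature that correlates above `threshold` with an earlier one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        corr = np.atleast_2d(np.abs(np.corrcoef(X, rowvar=False)))
        upper = np.triu(corr, k=1)  # entry (i, j) with i < j: earlier vs later column
        # NaN correlations (constant columns) compare as False, so they are kept.
        self.support_ = ~(upper > self.threshold).any(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]

rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
X_demo = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=200), b])
y_demo = (a + b > 0).astype(int)

pipe_corr = Pipeline([
    ("decorrelate", CorrelationFilter(threshold=0.95)),
    ("model", LogisticRegression()),
])
pipe_corr.fit(X_demo, y_demo)
print("kept columns:", np.where(pipe_corr.named_steps["decorrelate"].support_)[0])
```

Because the second column is almost a rescaled copy of the first, it is the one dropped; which member of a correlated pair survives depends only on column order.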
Stability selection
A robust approach that runs feature selection on many bootstrap samples and keeps features that are consistently selected:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
def stability_selection(X, y, n_bootstrap=100, threshold=0.6, random_state=42):
    """Select features chosen in at least `threshold` fraction of bootstrap runs."""
    X, y = np.asarray(X), np.asarray(y)  # positional indexing below needs arrays
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    selection_counts = np.zeros(n_features)
    for i in range(n_bootstrap):
        # Bootstrap sample
        indices = rng.choice(n_samples, size=n_samples, replace=True)
        X_boot, y_boot = X[indices], y[indices]
        # L1 selection on bootstrap
        selector = SelectFromModel(
            LogisticRegression(penalty='l1', C=0.1, solver='saga', max_iter=3000, random_state=i)
        )
        selector.fit(X_boot, y_boot)
        selection_counts += selector.get_support().astype(int)
    selection_freq = selection_counts / n_bootstrap
    stable_features = np.where(selection_freq >= threshold)[0]
    return stable_features, selection_freq
stable_features, freqs = stability_selection(X, y)
print(f"Stable features: {stable_features}")
Features that survive stability selection are less likely to be artifacts of a particular data split, which makes them more trustworthy than features chosen from a single fit.
Tradeoffs summary
| Method | Speed | Interactions | Leak Risk | Best For |
|---|---|---|---|---|
| VarianceThreshold | O(np) | No | None | Quick cleanup |
| SelectKBest | O(np) | No | If outside CV | Initial reduction |
| RFECV | O(p² × CV) | Yes | Low (built-in CV) | Finding optimal count |
| SequentialFeatureSelector | O(pk × CV) | Yes | Low | Small target feature sets |
| SelectFromModel (L1) | O(model fit) | Partial | If outside CV | Sparse interpretable models |
| SelectFromModel (trees) | O(model fit) | Yes | If outside CV | Tree-based pipelines |
One thing to remember: Feature selection is itself a modeling decision that can overfit. Always evaluate selection within cross-validation, and consider stability selection when you need features you can trust across different data samples.
See Also
- Python Sklearn Custom Transformers: How to teach scikit-learn new tricks by building your own data transformation steps, no PhD required.