Scikit-Learn Feature Selection — Deep Dive

Production feature selection pipelines in scikit-learn — from statistical filters to recursive elimination, with code for leak-free evaluation.

Technical foundation

Feature selection addresses the bias-variance tradeoff from the feature dimension. Adding irrelevant features increases model variance (more parameters to estimate from the same data) without reducing bias. Removing them constrains the hypothesis space, reducing variance at minimal bias cost — provided you don’t remove informative features.

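Formally, for a linear model fit by ordinary least squares with p features and n samples, the variance component of the expected prediction error scales as σ²p/n, where σ² is the noise variance. Reducing p directly reduces this term, which is why feature selection often improves generalization even when the dropped features carry weak signal: the variance saved can outweigh the small bias introduced.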
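The σ²p/n term can be checked numerically. The sketch below assumes a fixed design, Gaussian noise, and ordinary least squares, and measures the in-sample error of the fitted values against the noise-free signal. The function name, sizes, and seed are arbitrary choices for illustration.

```python
import numpy as np

def ols_variance_term(n=200, p=20, sigma=1.0, n_reps=2000, seed=0):
    """Monte Carlo estimate of E[(1/n) * ||X b_hat - X b||^2] for OLS."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))          # fixed design, reused in every replicate
    beta = rng.normal(size=p)            # arbitrary true coefficients
    signal = X @ beta                    # noise-free targets
    # One column of noisy targets per replicate; lstsq solves them all at once
    Y = signal[:, None] + sigma * rng.normal(size=(n, n_reps))
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # shape (p, n_reps)
    fitted = X @ beta_hat                              # shape (n, n_reps)
    return np.mean((fitted - signal[:, None]) ** 2)

simulated = ols_variance_term(n=200, p=20, sigma=1.0)
theoretical = 1.0**2 * 20 / 200
print(f"simulated: {simulated:.4f}, sigma^2 * p / n: {theoretical:.4f}")
```

Halving p in this setup should roughly halve the simulated value, which is the variance saving that feature selection trades against any bias it introduces.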

Filter methods in depth

VarianceThreshold — the first pass

from sklearn.feature_selection import VarianceThreshold

# Remove features with variance below 0.01.
# Variance depends on each feature's scale, so apply this before standardizing.
# For binary features, variance = p(1-p); threshold=0.8*(1-0.8)=0.16 would instead
# remove binary features where more than 80% of values are the same.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(f"Features: {X.shape[1]} → {X_reduced.shape[1]}")
print(f"Removed: {(~selector.get_support()).sum()} constant/near-constant features")
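The binary-feature rule in the comments can be seen on a tiny array. This is a minimal sketch with made-up data; the names X_demo and demo_selector are only for this example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three binary features; column 0 is 0 in five of six rows (about 83%)
X_demo = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
])

demo_selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_demo_reduced = demo_selector.fit_transform(X_demo)

print(demo_selector.variances_)      # per-feature variance p(1-p)
print(demo_selector.get_support())   # boolean mask of kept features
```

Column 0 has variance (1/6)(5/6), which is below 0.16, so it is the only one dropped.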

Statistical tests with SelectKBest

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# ANOVA F-test: checks whether a feature's mean differs across classes (univariate, linear signal)
selector_f = SelectKBest(score_func=f_classif, k=20)
X_f = selector_f.fit_transform(X, y)

# Mutual information — captures nonlinear dependencies
selector_mi = SelectKBest(score_func=mutual_info_classif, k=20)
X_mi = selector_mi.fit_transform(X, y)

# Compare selected features
import numpy as np
f_selected = set(np.where(selector_f.get_support())[0])
mi_selected = set(np.where(selector_mi.get_support())[0])
print(f"Overlap: {len(f_selected & mi_selected)} / 20")
print(f"Only in F-test: {f_selected - mi_selected}")
print(f"Only in MI: {mi_selected - f_selected}")

When they disagree: Features selected by mutual information but not by the F-test likely have a nonlinear relationship with the target. Features selected by the F-test but not by MI may have weak linear effects that MI’s estimation noise obscures. If the downstream model can tolerate a few extra features, taking the union of both sets is a conservative default.
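The disagreement is easy to reproduce on synthetic data. In this sketch (all data and names are invented for illustration) one feature determines the class only through its absolute value, one shifts its mean with the class, and one is noise.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
x_nonlinear = rng.normal(size=n)
y_demo = (np.abs(x_nonlinear) > 1).astype(int)   # class depends on |x| only
x_linear = y_demo + rng.normal(size=n)            # class shifts the mean
x_noise = rng.normal(size=n)                      # unrelated to the class
X_demo = np.column_stack([x_nonlinear, x_linear, x_noise])

f_scores, _ = f_classif(X_demo, y_demo)
mi_scores_demo = mutual_info_classif(X_demo, y_demo, random_state=0)

for name, f, mi in zip(["nonlinear", "linear", "noise"], f_scores, mi_scores_demo):
    print(f"{name:>9}: F = {f:8.2f}   MI = {mi:.3f}")
```

The nonlinear feature has nearly the same mean in both classes, so the F-test has little to detect, while mutual information should rank it clearly above the noise feature.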

Mutual information estimation details

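mutual_info_classif uses a k-nearest-neighbor estimator and adds a tiny amount of random noise to continuous features to break ties, so scores can differ between runs unless random_state is set. Fixing the seed makes a run reproducible; averaging over several seeds smooths out the seed-dependent variation, though not the sampling noise of the estimate itself: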

from sklearn.feature_selection import mutual_info_classif

# Average across multiple runs to stabilize MI estimates
mi_scores = np.array([
    mutual_info_classif(X, y, random_state=seed)
    for seed in range(10)
]).mean(axis=0)

Wrapper methods

Recursive Feature Elimination with CV

from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,               # remove 1 feature per iteration
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring='f1_weighted',
    min_features_to_select=5,
    n_jobs=-1,
)

rfecv.fit(X, y)

print(f"Optimal features: {rfecv.n_features_}")
print(f"Selected: {np.where(rfecv.support_)[0]}")

# Plot number of features vs CV score
# (this x-axis assumes step=1, so there is one score per feature count)
plt.figure(figsize=(10, 5))
n_features_range = range(rfecv.min_features_to_select, X.shape[1] + 1)
plt.plot(n_features_range, rfecv.cv_results_['mean_test_score'])
plt.fill_between(
    n_features_range,
    rfecv.cv_results_['mean_test_score'] - rfecv.cv_results_['std_test_score'],
    rfecv.cv_results_['mean_test_score'] + rfecv.cv_results_['std_test_score'],
    alpha=0.2
)
plt.xlabel('Number of Features')
plt.ylabel('CV F1 Score')
plt.title('RFECV Feature Selection')
plt.tight_layout()
plt.show()

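Performance tip: For datasets with 100+ features, set step to a value larger than 1. A float in (0, 1) is interpreted as a fraction of the total feature count (rounded down) to remove per iteration, so with 100 features step=0.1 removes 10 features per round and reaches min_features_to_select=5 in about 10 elimination rounds instead of 95. Larger steps give a coarser estimate of the optimal feature count, so treat the speedup as a tradeoff.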

Sequential Feature Selector

from sklearn.feature_selection import SequentialFeatureSelector

# Forward selection — starts empty, adds best feature each step
sfs_forward = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=15,
    direction='forward',
    scoring='f1_weighted',
    cv=5,
    n_jobs=-1,
)

# Backward selection — starts full, removes worst feature each step
sfs_backward = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=15,
    direction='backward',
    scoring='f1_weighted',
    cv=5,
    n_jobs=-1,
)

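Forward selection is cheaper when selecting few features from many; backward selection is cheaper when removing only a few. Counting candidate subsets evaluated, the break-even point is roughly k = p/2, where k is the target number of features and p is the total. Backward steps also fit on larger feature sets, so each of their fits tends to cost more.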
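The break-even claim follows from counting fits. The helper below is a hypothetical back-of-the-envelope calculator, assuming a plain greedy search that cross-validates every remaining candidate once per step; it does not call scikit-learn.

```python
def candidate_fits(p, k, direction="forward", cv=5):
    """Count model fits for greedy sequential selection of k out of p features.

    Assumes every remaining candidate is cross-validated once per step.
    """
    n_steps = k if direction == "forward" else p - k
    # Step i evaluates the p - i features not yet added (or not yet removed)
    candidates = sum(p - i for i in range(n_steps))
    return candidates * cv

p = 100
for k in (15, 50, 85):
    fwd = candidate_fits(p, k, "forward")
    bwd = candidate_fits(p, k, "backward")
    print(f"k={k:>2}: forward={fwd:>6} fits, backward={bwd:>6} fits")
```

For p = 100 and k = 15, as in the selectors above, forward needs 6,975 fits against 24,650 for backward; the two counts coincide at k = 50.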

Embedded methods

L1 selection with SelectFromModel

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# L1 regularization drives unimportant coefficients to exactly zero.
# saga converges much faster on standardized features, so scale X first.
l1_selector = SelectFromModel(
    LogisticRegression(penalty='l1', C=0.1, solver='saga', max_iter=5000, random_state=42),
    threshold='mean',  # keep features with |coef| >= mean |coef|; stricter than keeping every nonzero coefficient
)

l1_selector.fit(X, y)
X_l1 = l1_selector.transform(X)
print(f"L1 selected {X_l1.shape[1]} features")

# Inspect which features survived (coef_[0] assumes a binary target;
# multiclass models have one coefficient row per class)
selected_mask = l1_selector.get_support()
coefficients = l1_selector.estimator_.coef_[0]
for i, (selected, coef) in enumerate(zip(selected_mask, coefficients)):
    if selected:
        print(f"  Feature {i}: coefficient = {coef:.4f}")

Tree-based importance selection

from sklearn.ensemble import GradientBoostingClassifier

gb_selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=200, max_depth=5, random_state=42),
    threshold='1.5*median',  # keep features above 1.5x median importance
    prefit=False,
)

gb_selector.fit(X, y)
X_gb = gb_selector.transform(X)
print(f"GB selected {X_gb.shape[1]} features")

Leak-free pipeline integration

Feature selection must happen inside cross-validation to prevent data leakage:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# CORRECT: selection inside pipeline, inside CV
pipe = Pipeline([
    ('variance', VarianceThreshold(threshold=0.01)),
    ('kbest', SelectKBest(f_classif, k=20)),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring='f1_weighted')
print(f"Leak-free CV: {scores.mean():.4f} ± {scores.std():.4f}")

# WRONG: selection before CV leaks validation data into feature scores
selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X, y)  # LEAK: sees all data including future validation
scores = cross_val_score(RandomForestClassifier(), X_selected, y, cv=5)
# These scores are optimistically biased
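The size of the bias is easiest to see when there is nothing to learn. The sketch below uses pure-noise features and labels that are independent of them, so the true accuracy is 0.5; the data sizes and variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # pure noise: no feature carries signal
y_noise = np.tile([0, 1], 50)            # balanced labels, independent of X_noise

# Leaky: pick the 20 "best" features using all rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5).mean()

# Leak-free: selection is refit on the training part of every fold
honest_pipe = Pipeline([
    ("kbest", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(honest_pipe, X_noise, y_noise, cv=5).mean()

print(f"selection before CV: {leaky:.2f}")
print(f"selection inside CV: {honest:.2f}")
```

Because the labels carry no information about the features, any accuracy well above 0.5 in the first number is produced by the leak alone.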

Correlation-based redundancy removal

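Scikit-learn has no built-in transformer for this, but dropping near-duplicate features is a useful preprocessing step, especially before linear models and importance-based selection: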

import pandas as pd

def remove_correlated_features(X, threshold=0.95):
    """Drop the later feature of every pair whose absolute correlation exceeds threshold."""
    if isinstance(X, np.ndarray):
        X = pd.DataFrame(X)

    corr_matrix = X.corr().abs()
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )

    to_drop = [col for col in upper_triangle.columns
                if any(upper_triangle[col] > threshold)]

    print(f"Removing {len(to_drop)} highly correlated features")
    return X.drop(columns=to_drop), to_drop

X_clean, dropped = remove_correlated_features(X, threshold=0.95)

Wrap this as a custom transformer (see python-sklearn-custom-transformers) to include it in pipelines.
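A minimal sketch of such a transformer, assuming numeric input with at least two columns that converts to a NumPy array. The class name CorrelationFilter and the keep_mask_ attribute are hypothetical names chosen for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Drop the later feature of each pair with |correlation| above threshold.

    Hypothetical helper for illustration; not part of scikit-learn.
    """

    def __init__(self, threshold=0.95):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        corr = np.abs(np.corrcoef(X, rowvar=False))
        upper = np.triu(corr, k=1)            # keep each pair once (row < column)
        # A column is dropped if it correlates strongly with any earlier column
        self.keep_mask_ = ~(upper > self.threshold).any(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.keep_mask_]

rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
X_demo = np.column_stack([a, b, 2 * a + 0.01 * rng.normal(size=200)])

corr_filter = CorrelationFilter(threshold=0.95).fit(X_demo)
print(corr_filter.keep_mask_)
print(corr_filter.transform(X_demo).shape)
```

Because the mask is learned in fit, using this as a Pipeline step means correlations are computed on the training folds only.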

Stability selection

A robust approach that runs feature selection on many bootstrap samples and keeps features that are consistently selected:

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

def stability_selection(X, y, n_bootstrap=100, threshold=0.6, random_state=42):
    """Select features chosen in at least `threshold` fraction of bootstrap runs.

    X and y must be NumPy arrays (rows are indexed positionally).
    """
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    selection_counts = np.zeros(n_features)

    for i in range(n_bootstrap):
        # Bootstrap sample
        indices = rng.choice(n_samples, size=n_samples, replace=True)
        X_boot, y_boot = X[indices], y[indices]

        # L1 selection on bootstrap
        selector = SelectFromModel(
            LogisticRegression(penalty='l1', C=0.1, solver='saga', max_iter=3000, random_state=i)
        )
        selector.fit(X_boot, y_boot)
        selection_counts += selector.get_support().astype(int)

    selection_freq = selection_counts / n_bootstrap
    stable_features = np.where(selection_freq >= threshold)[0]
    return stable_features, selection_freq

stable_features, freqs = stability_selection(X, y)
print(f"Stable features: {stable_features}")

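Features that survive stability selection are less likely to be artifacts of one particular sample. A high selection frequency is evidence of a stable signal, not proof: among strongly correlated features, L1 may split its selections between them and push each one below the threshold.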

Tradeoffs summary

Method	Speed	Interactions	Leak Risk	Best For
VarianceThreshold	O(np)	No	None	Quick cleanup
SelectKBest	O(np)	No	If outside CV	Initial reduction
RFECV	O(p² × CV)	Yes	Low for selection (built-in CV); report scores from an outer CV	Finding optimal count
SequentialFeatureSelector	O(pk × CV)	Yes	Low	Small target feature sets
SelectFromModel (L1)	O(model fit)	Partial	If outside CV	Sparse interpretable models
SelectFromModel (trees)	O(model fit)	Yes	If outside CV	Tree-based pipelines

One thing to remember: Feature selection is itself a modeling decision that can overfit. Always evaluate selection within cross-validation, and consider stability selection when you need features you can trust across different data samples.
