Feature Selection with RFECV in Python

Introduction

One of the most common questions raised by users of Recursive Feature Elimination with Cross-Validation (RFECV), an algorithm available in Python's Scikit-learn library, concerns why the length of cv_results_['mean_test_score'] does not match the difference between the initial and final number of features selected. Through two specific examples, this article clarifies this point and illustrates how RFECV operates.

Decoding RFECV Operations

In a typical case where a user begins with 174 features and ends up with an optimal number of 89 features, they might notice that the length of cv_results_['mean_test_score'] is 145, which doesn't align with their expectation of 85 (174 - 89). The question arises: why is there a difference?

To clarify this, it's crucial to remember that RFECV eliminates features in batches of size step (one at a time by default) and performs cross-validation at each elimination step to estimate the model's performance with the remaining subset of features. It records a score for every candidate subset size, from the full feature set all the way down to min_features_to_select, not just for the sizes between the full set and the chosen optimum. This is why the length of cv_results_['mean_test_score'] exceeds the difference between the initial and final feature counts.

As the RFECV algorithm proceeds, it evaluates the model's performance using cross-validation and records the mean test score at each subset size. Elimination always continues until only min_features_to_select features remain; there is no early stopping based on performance. Once every subset size has been scored, RFECV selects the size whose mean cross-validated score is highest, which is how the "optimal" number of features (89 in this example) is determined.

In this way, the length of cv_results_['mean_test_score'] equals the number of feature-subset sizes evaluated during elimination. With step=1, that is n_features - min_features_to_select + 1, a quantity unrelated to the difference between the initial and final number of features.
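The bookkeeping above can be sketched in a few lines. Note that the exact settings behind the 174-feature case are not stated in the question, so the min_features_to_select=30 used below is an assumption that happens to reproduce the reported length of 145:

```python
import math

def n_scores(n_features, min_features_to_select=1, step=1):
    """Expected length of cv_results_['mean_test_score'] for RFECV:
    one score per candidate subset size, from the full set down to
    min_features_to_select (a sketch of the bookkeeping, not
    scikit-learn's actual source)."""
    return math.ceil((n_features - min_features_to_select) / step) + 1

# 145 is consistent with e.g. min_features_to_select=30 and step=1 (assumption)
print(n_scores(174, min_features_to_select=30))  # 145
print(n_scores(150, min_features_to_select=3))   # 148, matching the second example
```

The same formula predicts a length of 8 for the 10-feature example later in this article.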

Example of Discrepancies in RFECV Output

To better illustrate this process, let's consider a second example where a user starts with 150 features. The user sets min_features_to_select=3, expecting RFECV to reduce the feature set to 3 features. However, the result indicates the selection of 4 features, while the length of cv_results_['std_test_score'] is 148.

Here, the length of cv_results_['std_test_score'] is exactly 150 - 3 + 1 = 148: one entry per subset size evaluated, from 3 features up to the full 150. As for the extra feature, min_features_to_select is a lower bound, not a target; RFECV keeps the subset size with the best mean cross-validated score, which in this case happened to be 4 features.

An Illustrative Example with RFECV

To further demystify RFECV's operations, I created a simple example using 10 features. The code, as presented below, provides an illustration of the RFECV process:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Generate synthetic data with 10 features and 100 samples
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Define the estimator and RFECV parameters
estimator = LogisticRegression(solver='lbfgs')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)
step_size = 1
min_features_to_select = 3

# Create the RFECV object and fit it to the data
rfecv = RFECV(estimator=estimator, step=step_size, cv=cv, scoring='accuracy', 
              min_features_to_select=min_features_to_select, verbose=0)
rfecv.fit(X, y)

# Get the optimal number of features selected
optimal_num_features = rfecv.n_features_

# Get the mean test scores during the feature selection process
mean_test_scores = rfecv.cv_results_['mean_test_score']

# Print the results
print("Optimal number of features selected:", optimal_num_features)
print("Number of steps in RFECV:", len(mean_test_scores))

The results reveal that RFECV selects at least three features, as enforced by min_features_to_select=3. The number of steps is fully determined by the settings: with 10 features, step=1, and min_features_to_select=3, the length of mean_test_scores is 10 - 3 + 1 = 8, one score per candidate subset size.

The code begins by generating synthetic data with 10 features and 100 samples using make_classification. A Logistic Regression estimator and a Stratified K-Fold cross-validator are then defined, and RFECV is initialized with them, a step size of 1, and a minimum feature count of 3. After fitting, the optimal number of features and the mean test scores recorded during elimination are extracted and printed to the console, giving a compact demonstration of RFECV's functionality and the number of steps it takes during feature selection.

Conclusion

By addressing the concerns surrounding the operation of RFECV in Python, this article aims to clarify the algorithm's behaviour. The length of cv_results_['mean_test_score'] is not the difference between the initial and final feature counts but the number of feature-subset sizes evaluated during elimination. Furthermore, RFECV selecting more features than min_features_to_select simply reflects that this parameter is a floor, not a target: the algorithm returns the subset size with the best cross-validated score. Understanding these aspects of RFECV can significantly aid users in applying this powerful tool for feature selection in machine learning.

