Recursive Feature Elimination with Scikit Learn
Datasets used to train classification and regression algorithms are often high dimensional — that is, they contain many features or attributes. In textual datasets, for example, each feature is a word, and as you can imagine the vocabulary of a dataset can be very large. Not all features, however, contribute to the prediction. Removing features of low importance can improve accuracy and reduce both model complexity and overfitting. Training time can also be reduced for very large datasets. This blog post covers performing Recursive Feature Elimination (RFE) with Scikit Learn.
Not all features are created equal, so rank and eliminate!
Recursive Feature Elimination
Recursive Feature Elimination (RFE), as its name suggests, recursively removes the least important features, refits the model on those that remain, and repeats until the desired number of features is reached. In this way RFE works out the combination of attributes that contributes most to predicting the target variable (or class). Scikit Learn does most of the heavy lifting: simply import RFE from sklearn.feature_selection and pass any classifier model to RFE() along with the number of features to select. Using familiar Scikit Learn syntax, the .fit() method must then be called.
In the example code below, the iris dataset is used to illustrate the use of RFE. The iris dataset has 4 features (or attributes), namely ‘sepal length (cm)’, ‘sepal width (cm)’, ‘petal length (cm)’ and ‘petal width (cm)’. rfe.support_ returns an array of boolean values indicating whether each attribute was selected by RFE, e.g. for the iris dataset this array is [False True True True]. rfe.ranking_ returns an array of positive integers giving each attribute’s ranking, with a lower value indicating a higher ranking, e.g. the array for the iris dataset is [2 1 1 1], which means that sepal width, petal length and petal width all rank higher than sepal length.
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn import datasets

dataset = datasets.load_iris()
svm = LinearSVC()
# create the RFE model for the svm classifier
# and select 3 attributes
rfe = RFE(svm, n_features_to_select=3)
rfe = rfe.fit(dataset.data, dataset.target)
# print summaries for the selection of attributes
print(rfe.support_)
print(rfe.ranking_)
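The boolean mask in rfe.support_ can also be mapped back to human-readable feature names via dataset.feature_names. As a quick sketch (refitting RFE on the iris dataset exactly as above; the variable name selected is just for illustration):

from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn import datasets

dataset = datasets.load_iris()

# fit RFE as above, keeping 3 of the 4 features
# (max_iter raised to help LinearSVC converge without warnings)
rfe = RFE(LinearSVC(max_iter=10000), n_features_to_select=3)
rfe.fit(dataset.data, dataset.target)

# pair each feature name with its support flag and keep the selected ones
selected = [name for name, keep in zip(dataset.feature_names, rfe.support_) if keep]
print(selected)

This prints the names of the three retained features rather than a bare boolean array, which is handy when the dataset has many columns.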
Scikit Learn makes everything really simple.
RFE is really that simple with Scikit Learn; however, it may take a while to run if the dataset has many attributes. In future blog posts, feature ranking with information gain and performing RFE with cross-validation will be covered. Stay tuned…