Feature Selection

Feature selection is the process of automatically selecting the features in your data that contribute most to the prediction variable or output you are interested in.

Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

Three benefits of performing feature selection before modeling your data are:

  • Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.
  • Improves accuracy: less misleading data means modeling accuracy improves.
  • Reduces training time: less data means that algorithms train faster.

You can learn more about feature selection with scikit-learn in the article Feature selection.

import pandas as pd
from sklearn.model_selection import train_test_split
from dataidea.datasets import loadDataset
# load the demo dataset and one-hot encode gender, keeping a single gender_m column
data = loadDataset('../assets/demo_cleaned.csv', 
                    inbuilt=False, file_type='csv')
data = pd.get_dummies(data, columns=['gender'], 
                      dtype='int', drop_first=True)
data.head(n=5)
   age  marital_status  address  income  income_category  job_category  gender_m
0   55               1       12    72.0              3.0             3         0
1   56               0       29   153.0              4.0             3         1
2   24               1        4    26.0              2.0             1         1
3   45               0        9    76.0              4.0             2         1
4   44               1       17   144.0              4.0             3         1

Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

Many different statistical tests can be used with this selection method. For example, the ANOVA F-value method is appropriate for numerical inputs and a categorical output, and is available via the f_classif() function. In the example below we select the 2 best numerical features using this method.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import chi2
# predict marital_status from the numeric inputs age, income and address
X = data.drop('marital_status', axis=1)
y = data.marital_status
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
X_train_numeric = X_train[['age', 'income', 'address']].copy()
# score the numeric features with the ANOVA F-test and keep the 2 best
test = SelectKBest(score_func=f_classif, k=2)
fit = test.fit(X_train_numeric, y_train)
scores = fit.scores_
features = fit.transform(X_train_numeric)
selected_indices = fit.get_support(indices=True)

print('Feature Scores: ', scores)
print('Selected Features Indices: ', selected_indices)
Feature Scores:  [3.73495613 0.40565654 0.50368697]
Selected Features Indices:  [0 2]
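The indices on their own are easy to misread, so it helps to map them back to column names with get_support. A small sketch, assuming the fit and X_train_numeric objects from the cell above are still in scope:

# translate the selected indices into the corresponding column names
selected_columns = X_train_numeric.columns[fit.get_support(indices=True)]
print('Selected Features: ', list(selected_columns))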
X = data[['age', 'address']].copy()
y = data.income
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
test = SelectKBest(score_func=f_regression, k=1) # select the single best feature, adjust k as needed

# Fit the selector to the training data
fit = test.fit(X_train, y_train)

# get scores
test_scores = fit.scores_

# summarize selected features
features = fit.transform(X_train)

# Get the selected feature indices
selected_indices = fit.get_support(indices=True)

print('Feature Scores: ', test_scores)
print('Selected Features Indices: ', selected_indices)
Feature Scores:  [0.00660376 0.0464015  2.0207761 ]
Selected Features Indices:  [2]
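Instead of calling transform by hand, the selector can be wrapped in a pipeline so the features chosen on the training data are reused automatically at prediction time. A minimal sketch, assuming the X_train/X_test split above and an ordinary LinearRegression model (which is not part of the original example):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# selection and modeling combined into a single estimator
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_regression, k=1)),
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # R^2 on the held-out split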
X = data[['gender_m', 'income_category', 'job_category']].copy()
y = data.marital_status
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
test = SelectKBest(score_func = chi2, k=2)
fit = test.fit(X_train, y_train)
scores = fit.scores_
features = fit.transform(X_train)
selected_indices = fit.get_support(indices=True)

print('Feature Scores: ', scores)
print('Selected Features Indices: ', selected_indices)
Feature Scores:  [0.00660376 0.0464015  2.0207761 ]
Selected Features Indices:  [0 1]
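The chi-squared scores are easier to read when paired with their column names. A small sketch, assuming fit and X_train from the cell above (note that chi2 only accepts non-negative feature values, which holds for these encoded columns):

# pair each chi-squared score with its column name
for name, score in zip(X_train.columns, fit.scores_):
    print(f'{name}: {score:.3f}')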

Recursive Feature Elimination

Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on the attributes that remain.

It uses the model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.

You can learn more about the RFE class in the scikit-learn documentation.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# feature extraction: rank the features with RFE wrapped around logistic regression
# (X and y here are still the categorical features and marital_status from above)
model = LogisticRegression()
rfe = RFE(model)  # by default RFE keeps half of the features
fit = rfe.fit(X, y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
Num Features: 1
Selected Features: [False  True False]
Feature Ranking: [3 1 2]
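If you would rather not fix the number of features up front, scikit-learn also provides RFECV, which chooses it by cross-validation. A brief sketch on the same X and y; the cv=5 and max_iter=1000 settings are assumptions rather than part of the original example:

from sklearn.feature_selection import RFECV

# let cross-validation decide how many features to keep
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=5)
rfecv.fit(X, y)
print("Optimal number of features: %d" % rfecv.n_features_)
print("Selected Features: %s" % rfecv.support_)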

Feature Importance

Tree ensembles like Random Forest and Extra Trees can be used to estimate the importance of features.

In the example below we fit an ExtraTreesClassifier on the same features and target used above. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
# fit an extra-trees ensemble and read off the impurity-based importances
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X, y)
print(model.feature_importances_)
[0.10237303 0.52467525 0.37295172]
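The raw importance array is easier to interpret with the column names attached. A small sketch, assuming model and X from the cell above:

# label the importances with their column names and sort them
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))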
# train a random forest on the same train/test split and check its accuracy
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)
0.46938775510204084
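This accuracy will move around from run to run because both train_test_split and RandomForestClassifier are randomized. A minimal sketch that fixes the forest's seed (the value 42 is an arbitrary choice) so a single run is at least reproducible:

# fixing random_state makes the forest, and hence the score, reproducible
rfc_seeded = RandomForestClassifier(random_state=42)
rfc_seeded.fit(X_train, y_train)
print(rfc_seeded.score(X_test, y_test))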
X.head(n=3)
   gender_m  income_category  job_category
0         0              3.0             3
1         1              4.0             3
2         1              2.0             1
# refit on the same features listed in a different column order
rfc.fit(X_train[['income_category', 'job_category', 'gender_m']], y_train)
rfc.score(X_test[['income_category', 'job_category', 'gender_m']], y_test)
0.4489795918367347
  • f_classif is most applicable where the input features are continuous and the outcome is categorical.
  • f_regression is most applicable where the input features are continuous and the outcome is continuous.
  • chi2 is best when both the inputs and the outcome are categorical (in scikit-learn, chi2 also requires non-negative feature values); a small helper codifying these rules is sketched below.
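As a quick reference, that guidance can be written down as a small helper. The pick_score_func function below is purely illustrative and is not part of scikit-learn or the original example:

# hypothetical helper that maps the input/outcome types above to a score function
def pick_score_func(continuous_inputs: bool, continuous_outcome: bool):
    if continuous_inputs and continuous_outcome:
        return f_regression   # continuous inputs, continuous outcome
    if continuous_inputs:
        return f_classif      # continuous inputs, categorical outcome
    return chi2               # categorical (non-negative) inputs, categorical outcome

test = SelectKBest(score_func=pick_score_func(True, False), k=2)  # equivalent to the f_classif example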