Identifying important features using Python

Deepanjan Kundu
10 min read · Jul 30, 2023

Introduction

Features are the foundation on which every machine-learning model is built. Different machine-learning paradigms use different terminology for features, such as annotations, attributes, or auxiliary information. Nonetheless, features are an essential ingredient in building an ML model, whether the setting is unsupervised, supervised, self-supervised, decision-making, or even graph ML. All of these techniques use features to reduce the value of a loss function when predicting their targets. In this article, we will look at features through the lens of supervised models, such as classification and regression.

Anything in the input data, except for the label, can be treated as a feature. But which features help reduce the loss function the most? This is where feature importance comes into play. In the following sections, we will define feature importance and discuss the various use cases in which recognizing important features is beneficial. We will then examine a range of feature-importance techniques, from simple ones like correlation and ablation to more sophisticated ones like Gini importance and SHAP values.

What is feature importance?

Feature importance is a concept in machine learning that refers to the relative significance or contribution of each input variable (also known as features or predictors) in predicting the target variable or outcome. It provides insights into which features influence the model’s predictions most. Several different methods can be used to calculate feature importance; what we want from any of these algorithms is a ranked list of features along with their corresponding importance values.
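For illustration only (these feature names and scores are hypothetical, not the output of any real model), such a result might look like the following:

importances = {"bmi": 0.42, "s5": 0.38, "bp": 0.11, "age": 0.03}

# Rank features by decreasing importance score
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:<8}{score:.3f}")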

Why do you need to identify important features?

As we saw in the introduction, the key reason to identify critical features is to understand the proper set of features that would help achieve the best performance. A key benefit of solving this problem is that it can reduce the number of features in the model by letting us focus on the features with high enough importance. Reducing the number of features directly reduces training and inference cost and time.
Another critical reason to evaluate the importance of features is to increase the explainability of models. With most ML use cases moving to deep learning, models’ opacity has increased significantly. Feature importance and feature coverage provide a deeper understanding of the why and how of ML models. Such reverse engineering into the model’s workings is helpful for debugging and executive reporting.

Types of features/models

Features can be of different types. Let’s start with dense features, which are numeric, i.e., integers and floats. The number of words, characters, likes, or replies, as well as weight and height, are good examples of dense features. Next, there are categorical features, usually represented as small one-hot vectors; features such as topic, gender, and age group fall into this category. Then there are sparse features such as language id embeddings, position id embeddings, and text embeddings. Most feature-importance algorithms deal well with dense and categorical features but struggle with sparse features. In the next section, where we describe the different feature-importance algorithms, we will note how well each works with different types of features. Similarly, as we will see, only some methods are model-agnostic: some work only for particular model families, such as linear or tree-based models, while others generalize well to all models.
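To make these feature types concrete, here is a small illustrative sketch; the names and values are hypothetical and only meant to show how each type is typically represented:

import numpy as np

# Dense features: plain numeric values (counts, measurements, ratios)
dense_features = np.array([120.0, 4.0, 0.75])

# Categorical features: small one-hot vectors over a fixed vocabulary
topics = ["sports", "politics", "tech"]
topic_one_hot = np.array([1, 0, 0])  # encodes "sports"

# Sparse features: high-dimensional learned embeddings (e.g., a text embedding)
text_embedding = np.random.rand(768)  # hypothetical 768-dimensional embedding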

How can we calculate the importance of features?

In this section, we will cover the different methods of calculating the importance of features. We will cover very rudimentary methods, along with quite sophisticated algorithms. We will also look at different ways to implement feature importance using Python libraries. We will be using the diabetes dataset from sklearn to demonstrate the different algorithms listed below. The dataset has 10 dense features. The goal is to identify the order of importance of the different features available.
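As a quick orientation before the individual methods (each later snippet reloads the data on its own), here is what the dataset looks like:

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
print(diabetes.feature_names)  # ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
print(diabetes.data.shape)     # (442, 10): 442 samples, 10 dense features
print(diabetes.target.shape)   # (442,): disease progression target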

Regression coefficients

The idea here is to utilize the coefficients of the linear models to capture the value each feature adds to the prediction. In this approach, the features would be fit to a linear regression model to predict the target label. The absolute values of the coefficients generated by the linear model for each feature indicate the importance of the corresponding features. Larger absolute values suggest a more substantial influence. Here is an example implementation of ridge regression on the diabetes prediction task from sklearn with different numerical features. Coefficient-based importance is used to calculate the importance of the different features.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

diabetes = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(
    diabetes.data, diabetes.target, random_state=0)

model = Ridge(alpha=1e-2).fit(X_train, y_train)

# Fetch the importances from model.coef_
for i in range(len(model.coef_)):
    print(f"{diabetes.feature_names[i]:<8}"
          f"{abs(model.coef_[i]):.3f}")
The results show that BMI and s5 were identified as the most important features by the regression coefficient method.
[Figure: histogram of coefficient-based feature importance values for each feature]

As highlighted, this method calculates feature importance for linear models. It has largely the same pros and cons as correlation-based importance (covered next): explainability is a pro, but it does not handle complex feature types, such as sparse embeddings, well.

Correlation with label

The most straightforward way to evaluate the significance of a feature is its correlation with the target variable. The higher the absolute magnitude of the correlation, the more important that feature is. There are multiple ways to calculate the correlation. The formula and code we have covered in our examples are based on Pearson correlation.

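For reference, the Pearson correlation between two variables X and Y over n samples is:

r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}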
In this formula, we calculate the correlation of two variables, X and Y, where \bar{X} and \bar{Y} are the means of the corresponding variables and n is the number of samples. Here is a sample implementation of calculating importance from the relationship with the target label; note that the code computes the covariance with the target, which, because the sklearn diabetes features are standardized to the same scale, produces the same ranking as the Pearson correlation.

from sklearn.datasets import load_diabetes
from numpy import cov
from numpy import transpose


diabetes = load_diabetes()
for i in range(len(transpose(diabetes.data))):
    covariance = cov(transpose(diabetes.data)[i], diabetes.target)
    # Fetch the importance from covariance[0][1]
    print(f"{diabetes.feature_names[i]:<8}"
          f"{abs(covariance[0][1]):.3f}")
The results show that BMI and s5 have the highest correlation with the target label and are the most critical features per the Pearson correlation method.
[Figure: histogram of correlation-based feature importance values for each feature]

The main benefits of correlation-based importance are the simplicity and explainability of the method, and it is independent of the type of model you will build. A major caveat is that it does not account for overlap (redundancy) between features, since each feature is evaluated in isolation. Another con is that it does not work very well for sparse features.

Feature Ablation

Feature ablation refers to removing a set of features from the model. A simple way to determine the importance of a feature is to measure the drop in the model’s performance (as captured by target metrics such as AUC-ROC, AUC-PR, precision, and recall, or by the loss itself) when the feature is removed. The feature whose removal leads to the largest drop in performance is the most important. Features can therefore be ranked by sorting the increase in loss in decreasing order.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
import numpy as np

diabetes = load_diabetes()


def loss_calc(data):
    X_train, X_val, y_train, y_val = train_test_split(
        data, diabetes.target, test_size=0.33, random_state=0)
    model = Ridge(alpha=1e-2).fit(X_train, y_train)
    y_predict = model.predict(X_val)
    # Sum of absolute errors on the validation set
    return np.sum(np.absolute(y_predict - y_val))


# Baseline loss with all features included
all_loss = loss_calc(diabetes.data)


for i in range(len(diabetes.feature_names)):
    # Rebuild the data matrix with feature i removed
    data = []
    for j in range(len(np.transpose(diabetes.data))):
        if i != j:
            data.append(np.transpose(diabetes.data)[j])
    loss_feature = loss_calc(np.transpose(np.array(data)))
    # The increase in loss when feature i is removed is its importance
    print(f"{diabetes.feature_names[i]:<8}"
          f"{loss_feature - all_loss:.3f}")
The results show that BMI has the highest loss increase, i.e., the highest importance, in the feature ablation method with ridge regression.
[Figure: histogram of the loss difference for each feature]

There are several pros to this method: it works well for all types of features and models, you don’t have to worry about feature overlap, and it is easy to understand and explain. However, a major con is its computational cost: you need to run a full training for each feature ablation, and with many features this means significant resource and time consumption.

Permutation Importance

This technique involves randomly shuffling the values of a feature and measuring the resulting decrease in model performance. The more significant the drop in performance, the more critical the feature. This approach is model-agnostic and can be used with various machine-learning algorithms. Here is an example of implementing ridge regression on the diabetes prediction task with different numerical features. Permutation importance is used to calculate the importance of the different features.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.inspection import permutation_importance


diabetes = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(
    diabetes.data, diabetes.target, random_state=0)


model = Ridge(alpha=1e-2).fit(X_train, y_train)
model.score(X_val, y_val)


# Fetch the importances
r = permutation_importance(model, X_val, y_val, n_repeats=30, random_state=0)


for i in r.importances_mean.argsort()[::-1]:
    print(f"{diabetes.feature_names[i]:<8}"
          f"{r.importances_mean[i]:.3f}"
          f" +/- {r.importances_std[i]:.3f}")
The results show feature s5 with the highest importance per the permutation importance algorithm, with BMI second.
[Figure: histogram of permutation feature importance values for each feature]

This approach shares most of the pros and cons of feature ablation, although it is cheaper: the model does not need to be retrained for each feature, only re-evaluated on shuffled data. Most feature-importance libraries ship a version of it, so it is easy to import and use in Python.

Tree-based importance

Tree-based feature importance is a technique used to determine the importance of features in tree-based machine learning models, such as random forests and gradient boosting algorithms (e.g., XGBoost, LightGBM). The critical component in the calculation is Gini importance, also called Mean Decrease in Impurity. This method measures the importance of a feature based on how much it reduces the impurity in the tree nodes. The Gini importance of a feature is computed as the sum of the impurity decreases across all nodes in the tree that split on that feature (for regression trees, the impurity is the variance/MSE rather than the Gini index, but the mechanism is the same). Features with higher Gini importance have a greater ability to differentiate the target variable. Below, we demonstrate how to fetch impurity-based importance for a tree-based model using sklearn.
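For reference, a standard formulation of this quantity (the one sklearn’s feature_importances_ exposes, up to normalization) is the weighted impurity decrease contributed by a node t that splits on feature f:

\Delta I(t) = \frac{N_t}{N} \left[ I(t) - \frac{N_{t_L}}{N_t} I(t_L) - \frac{N_{t_R}}{N_t} I(t_R) \right]

where N is the total number of training samples, N_t is the number of samples reaching node t, and t_L, t_R are its left and right children. The importance of f sums \Delta I(t) over all nodes that split on f and, in a forest, averages the result over all trees.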

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import numpy as np


diabetes = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(
diabetes.data, diabetes.target, random_state=0)


# Fit a Random Forest Model
forest = RandomForestRegressor(random_state=0)
forest.fit(X_train, y_train)


# Fetch the importances
importances = forest.feature_importances_


for i in range(len(importances)):
    print(f"{diabetes.feature_names[i]:<8}"
          f"{importances[i]:.3f}")
The results show BMI and s5 as the top two features per the random-forest Gini importance method.
[Figure: histogram of Gini (impurity-based) feature importance values for each feature]

As described above, this feature-importance technique is inherently tied to tree-based models. It also does not work well for embedding and sparse features, but works fine for dense and categorical features.

SHAP

SHAP (SHapley Additive exPlanations) is a unified framework for interpreting the predictions of machine learning models. SHAP values assign a contribution to each feature in a prediction, indicating how much that feature moves the final prediction away from the average prediction. For a given feature, different coalitions are formed from the remaining features, and the model’s prediction is evaluated for each coalition with and without the given feature. These differences in prediction with and without the feature under consideration are its marginal contributions, and the SHAP value is a weighted sum of such marginal contributions. SHAP-based feature importance provides a comprehensive and model-agnostic approach to understanding the contribution of each feature to the model’s predictions. It lets you identify which features are important and how they interact with other features, providing a more nuanced understanding of the model’s behavior.
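In standard notation, the SHAP value of feature i is the Shapley value of a cooperative game whose value function v(S) is the model’s prediction using only the feature subset S:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ v(S \cup \{i\}) - v(S) \right]

where F is the full feature set and v(S \cup \{i\}) - v(S) is the marginal contribution of feature i to the coalition S. In practice, the shap library approximates this quantity efficiently (for example, via TreeExplainer for tree ensembles) instead of enumerating every subset, as in the example below.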

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import shap # v0.39.0
import pandas as pd
import numpy as np


shap.initjs()
diabetes = load_diabetes(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    diabetes['data'].iloc[:, :10], diabetes['target'], test_size=0.2, random_state=1)


model = RandomForestRegressor(random_state=42).fit(X_train, y_train)


explainer = shap.Explainer(model)
shap_test = explainer(X_test)
shap_df = pd.DataFrame(shap_test.values, columns=shap_test.feature_names,
                       index=X_test.index)
# Global importance per feature = mean absolute SHAP value across the test set
shap_df = shap_df.apply(np.abs).mean().sort_values(ascending=False)
print(shap_df)
Similar to the other feature-importance methods, SHAP also shows that BMI and s5 are the two most important features, with importance values much larger than those of the other features.
[Figure: histogram of SHAP-based feature importance values for each feature]

Conclusion

Based on the experiments run with the diabetes data case study, BMI and s5 have the highest feature importance across multiple methods. On the other hand, the ablation study shows that removing the age and s3 features leads to decreased loss, implying better performance without these features. However, the other algorithms still assign the remaining features a smaller but non-trivial importance, indicating that all of the features contribute to performance. SHAP, which performs a weighted, ablation-style estimation across feature coalitions, assigns non-zero importance to all of the features. Each method listed in this article has its pros and cons, and it is always advisable to try out different feature-importance strategies to understand the model and data in depth. Hopefully, you now have a thorough understanding of how to identify valuable features, why doing so is helpful, and how you can use these techniques.

References

  1. Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.
  2. Menze, B.H., Kelm, B.M., Masuch, R. et al. A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10, 213 (2009). https://doi.org/10.1186/1471-2105-10-213
  3. Lundberg, S.M. and Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30 (2017). arXiv:1705.07874 [cs.AI]
