A library for feature selection for gradient boosting models using regression on feature Shapley values

transferwise/shap-select

Overview

shap-select implements a heuristic for fast feature selection, for tabular regression and classification models.

The basic idea is to run a linear or logistic regression of the target on the Shapley values of the original features, on the validation set, discard the features with negative coefficients, and rank/filter the rest according to their statistical significance. For motivation and details, refer to our research paper or see the example notebook.
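The selection step described above can be sketched in plain Python for the regression case. This is an illustrative sketch, not the library's implementation: the helper names (`invert`, `ols_t_stats`, `select`) are hypothetical, the "Shapley value" columns are synthetic random numbers rather than the output of a fitted model, and the p-value uses a normal approximation to the t distribution. The 1 / 0 / -1 coding is an assumption modelled on the `selected` column shown under Usage.

```python
import math
import random


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_t_stats(columns, y):
    """Fit y ~ intercept + columns by OLS; return (coefficients, t-values), intercept dropped."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [y[r] - sum(X[r][j] * beta[j] for j in range(k)) for r in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    t = [beta[i] / math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return beta[1:], t[1:]


def select(names, shap_columns, y, threshold=0.05):
    """Regress the target on Shapley-value columns and flag each feature."""
    coefs, t_values = ols_t_stats(shap_columns, y)
    rows = []
    for name, coef, t in zip(names, coefs, t_values):
        # Two-sided p-value, normal approximation (assumption of this sketch).
        p = math.erfc(abs(t) / math.sqrt(2))
        if coef < 0:
            selected = -1  # negative coefficient: discard
        else:
            selected = 1 if p < threshold else 0
        rows.append((name, t, p, coef, selected))
    return sorted(rows, key=lambda r: r[1], reverse=True)  # rank by t-value


# Synthetic stand-ins for validation-set Shapley values (not real SHAP output).
random.seed(0)
n = 500
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]
c = [random.gauss(0, 1) for _ in range(n)]  # unrelated to the target
d = [random.gauss(0, 1) for _ in range(n)]  # anti-correlated with the target
y = [a[i] + b[i] - 0.5 * d[i] + random.gauss(0, 0.5) for i in range(n)]

rows = select(["a", "b", "c", "d"], [a, b, c, d], y)
for name, t, p, coef, selected in rows:
    print(f"{name}: t={t:.2f} p={p:.4f} coef={coef:.3f} selected={selected}")
```

For binary and multiclass tasks the same structure would apply with a logistic regression in place of OLS.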

Earlier packages using Shapley values for feature selection exist; the advantages of this one are:

  • Regression on the validation set to combat overfitting
  • Only a single fit of the original model needed
  • A single intuitive hyperparameter for feature selection: statistical significance
  • Bonferroni correction for multiclass classification
  • Collinearity of (Shapley value) features addressed by repeated (linear/logistic) regression
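To illustrate the Bonferroni bullet, here is a minimal sketch of one way a Bonferroni correction could combine per-class p-values in a multiclass setting. It assumes one regression per class, takes each feature's smallest p-value across classes, and multiplies it by the number of classes (capped at 1). The function name `bonferroni_min_p` and the input numbers are hypothetical, and the library's exact procedure may differ.

```python
def bonferroni_min_p(p_values_by_class):
    """Combine per-class p-values: smallest p per feature, scaled by the class count."""
    m = len(p_values_by_class)  # number of classes, i.e. number of tests per feature
    features = p_values_by_class[0].keys()
    return {
        f: min(1.0, m * min(p[f] for p in p_values_by_class))
        for f in features
    }


# Made-up p-values from three hypothetical one-vs-rest regressions.
per_class = [
    {"x1": 0.01, "x2": 0.30},
    {"x1": 0.20, "x2": 0.04},
    {"x1": 0.50, "x2": 0.60},
]
corrected = bonferroni_min_p(per_class)
threshold = 0.05
kept = [f for f, p in corrected.items() if p < threshold]
print(corrected, kept)
```

Without the correction, `x2` would pass a 0.05 threshold on the strength of a single class; after scaling by the three tests it no longer does.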

Usage

from shap_select import shap_select
# Here model is any model supported by the shap library, fitted on a different (train) dataset
# Task can be regression, binary, or multiclass
selected_features_df = shap_select(model, X_val, y_val, task="multiclass", threshold=0.05)
   feature name    t-value  stat.significance  coefficient  selected
0            x5  20.211299           0.000000     1.052030         1
1            x4  18.315144           0.000000     0.952416         1
2            x3   6.835690           0.000000     1.098154         1
3            x2   6.457140           0.000000     1.044842         1
4            x1   5.530556           0.000000     0.917242         1
5            x6   2.390868           0.016827     1.497983         1
6            x7   0.901098           0.367558     2.865508         0
7            x8   0.563214           0.573302     1.933632         0
8            x9  -1.607814           0.107908    -4.537098        -1
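A short sketch of how the returned table might be used to keep only the selected features. It assumes the result is a pandas DataFrame whose column names match the printed header (`feature name`, `selected`) and that `selected == 1` marks features to keep, as the output above suggests (0 for not significant, -1 for a negative coefficient). The DataFrame below is a hand-built stand-in with a few of the printed rows, not actual library output.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by shap_select,
# copied from a subset of the rows printed above.
selected_features_df = pd.DataFrame({
    "feature name": ["x5", "x6", "x7", "x9"],
    "stat.significance": [0.000000, 0.016827, 0.367558, 0.107908],
    "coefficient": [1.052030, 1.497983, 2.865508, -4.537098],
    "selected": [1, 1, 0, -1],
})

# Keep the names of features flagged as selected (assumed meaning of 1).
keep = selected_features_df.loc[
    selected_features_df["selected"] == 1, "feature name"
].tolist()
print(keep)

# The reduced feature set could then be used to refit the model, e.g.
# X_train[keep] and X_val[keep] (X_train is an assumed name, not shown above).
```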

Citation

If you use shap-select in your research, please cite our paper:

@misc{kraev2024shapselectlightweightfeatureselection,
      title={Shap-Select: Lightweight Feature Selection Using SHAP Values and Regression}, 
      author={Egor Kraev and Baran Koseoglu and Luca Traverso and Mohammed Topiwalla},
      year={2024},
      eprint={2410.06815},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.06815}, 
}
