All Relevant Feature Selection Methods and Applications

  • Witold R. Rudnicki
  • Mariusz Wrzesień
  • Wiesław Paja
Chapter
Part of the Studies in Computational Intelligence book series (SCI, volume 584)

Abstract

All-relevant feature selection is a relatively new sub-field in the domain of feature selection. This chapter gives a short review of the field and presents a representative algorithm. The problem of all-relevant feature selection is first defined, and the key algorithms are then described. Finally, the Boruta algorithm, under development at ICM, University of Warsaw, is explained in greater detail and applied to a collection of both synthetic and real-world data sets. It is shown that the algorithm is both sensitive and selective. The level of falsely discovered relevant variables is low: on average, less than one falsely relevant variable is discovered for each set. The sensitivity of the algorithm is nearly 100% for data sets for which classification is easy, but may be lower for data sets for which classification is difficult. Nevertheless, it is possible to increase the sensitivity of the algorithm, at the cost of increased computational effort, without adversely affecting the false discovery level. This is achieved by increasing the number of trees in the random forest that delivers the importance estimate used by Boruta.
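As a minimal sketch of the kind of run discussed above, the R snippet below applies the Boruta package (which implements the algorithm this chapter presents) to a benchmark problem. It assumes the Boruta and mlbench packages are installed; the Sonar data set, the random seed, and the parameter values are illustrative choices, not the settings used in the chapter's experiments.

  library(Boruta)
  library(mlbench)

  data(Sonar)    # an example benchmark classification problem from mlbench
  set.seed(17)   # illustrative seed, chosen only for reproducibility

  # Extra arguments are forwarded to the importance source (the randomForest
  # package in the version of Boruta described here). Raising the number of
  # trees increases sensitivity at extra computational cost, without
  # adversely affecting the false discovery level.
  result <- Boruta(Class ~ ., data = Sonar, maxRuns = 200, ntree = 1000)

  print(result)                    # confirmed / tentative / rejected counts
  getSelectedAttributes(result)    # names of the confirmed relevant attributes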

Keywords

All-relevant feature selection · Strong and weak relevance · Feature importance · Boruta · Random forest

Notes

Acknowledgments

Computations were partially performed at the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Poland, grant G34-5. The authors would like to thank Mr. Rafał Niemiec for technical help.

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Witold R. Rudnicki (1)
  • Mariusz Wrzesień (2)
  • Wiesław Paja (2)

  1. Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland
  2. Faculty of Applied IT, University of Information Technology and Management, Rzeszów, Poland