A Statistical Method for Determining Importance of Variables in an Information System

  • Witold R. Rudnicki
  • Marcin Kierczak
  • Jacek Koronacki
  • Jan Komorowski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4259)

Abstract

A new method for estimating the importance of attributes in supervised classification, based on the random forest approach, is presented. The method is iterative, with each step consisting of several runs of the random forest program. Each run is performed on a suitably modified data set: the values of each attribute found unimportant at earlier steps are randomly permuted between objects. At each step, the apparent importance of each attribute is calculated, and an attribute is declared unimportant if its importance is not uniformly better than that of the attributes found unimportant earlier. The procedure is repeated until only attributes scoring better than the randomized ones are retained. The statistical significance of the results is then verified. The method has been applied to 12 data sets of biological origin and was shown to be more reliable than the standard application of a random forest for assessing attribute importance.
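
The iterative scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: the scorer `class_gap_importance` is a simple stand-in for a random forest importance measure; the first step is bootstrapped with one synthetic noise column, since the abstract does not say how the initial reference set is obtained; and "uniformly better" is read as "scores above every permuted attribute in every run of the step". All function and parameter names are hypothetical, and the significance check mentioned in the abstract is not covered.

```python
import random
from statistics import mean

def class_gap_importance(rows, labels):
    """Stand-in scorer (NOT a random forest): absolute gap between the
    two class means of each column. Any per-attribute scorer fits here."""
    scores = []
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return scores

def select_attributes(rows, labels, importance=class_gap_importance,
                      runs=5, seed=0):
    """Return the indices of attributes that survive the iterative scheme."""
    rng = random.Random(seed)
    n_attrs = len(rows[0])
    # Assumption: append one synthetic noise column to seed the reference set.
    data = [list(r) + [rng.random()] for r in rows]
    unimportant = {n_attrs}  # index of the synthetic column
    while True:
        # Candidates still considered important at the start of this step.
        wins = {j: True for j in range(n_attrs) if j not in unimportant}
        for _ in range(runs):
            run_data = [list(r) for r in data]
            # Permute every attribute found unimportant so far between objects.
            for j in unimportant:
                col = [r[j] for r in run_data]
                rng.shuffle(col)
                for r, v in zip(run_data, col):
                    r[j] = v
            scores = importance(run_data, labels)
            threshold = max(scores[j] for j in unimportant)
            # A candidate must beat the permuted attributes in every run.
            for j in wins:
                if scores[j] <= threshold:
                    wins[j] = False
        newly = {j for j, ok in wins.items() if not ok}
        if not newly:
            return sorted(wins)
        unimportant |= newly

# Toy data: column 0 separates the classes; columns 1 and 2 are
# balanced across both classes and therefore carry no class signal.
labels = [0] * 20 + [1] * 20
rows = [[2.0 * labels[i], float(i % 2), float((i // 2) % 2)]
        for i in range(40)]
kept = select_attributes(rows, labels)
print(kept)  # → [0]
```

In this toy example the two balanced columns score exactly zero and are rejected at the first step, while the informative column keeps scoring above every permuted column and is retained. With a real random forest the scores vary between runs, which is why each step repeats the run several times.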

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Witold R. Rudnicki (1)
  • Marcin Kierczak (2)
  • Jacek Koronacki (3)
  • Jan Komorowski (1, 2)
  1. ICM, Warsaw University, Warsaw, Poland
  2. The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden
  3. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
