Feature Selection and Classification for Small Gene Sets

  • Gregor Stiglic
  • Juan J. Rodriguez
  • Peter Kokol
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5265)


Random Forests, Support Vector Machines and k-Nearest Neighbors are proven classification techniques that are widely used for many kinds of classification problems. One such problem is the classification of genomic and proteomic data, which is known for its extremely high dimensionality and therefore demands suitable classification techniques. In this domain these classifiers are usually combined with gene selection techniques to achieve optimal classification accuracy. Another reason for reducing the dimensionality of such datasets is interpretability: it is much easier to interpret a small set of ranked genes than 20,000 or 30,000 unordered ones. In this paper we present a classification ensemble of decision trees called Rotation Forest and evaluate its classification performance on small subsets of ranked genes for 14 genomic and proteomic classification problems. The experiments demonstrate an important property of Rotation Forest: robustness and high classification accuracy when using small sets of genes.
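The workflow the abstract describes — rank genes with a filter, keep a small subset, and train an ensemble of decision trees on rotated feature subspaces — can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn is available, substitutes synthetic data for the real microarray sets, and approximates Rotation Forest by applying PCA to random disjoint feature subsets before training each tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in for a microarray dataset: few samples, many "genes".
X, y = make_classification(n_samples=120, n_features=500, n_informative=20,
                           random_state=0)

# Step 1: rank genes with a univariate filter and keep only a small subset.
X_small = SelectKBest(f_classif, k=30).fit_transform(X, y)

X_tr, X_te, y_tr, y_te = train_test_split(X_small, y, test_size=0.3,
                                          random_state=0, stratify=y)

# Step 2: Rotation-Forest-style ensemble. For each tree, split the features
# into random disjoint subsets, run PCA on each subset to build a
# block-diagonal rotation matrix, and train the tree on the rotated data.
def fit_rotation_ensemble(X, y, n_trees=10, n_subsets=3):
    models = []
    n_feat = X.shape[1]
    for t in range(n_trees):
        subsets = np.array_split(rng.permutation(n_feat), n_subsets)
        R = np.zeros((n_feat, n_feat))
        for s in subsets:
            pca = PCA().fit(X[:, s])           # axis rotation for this subset
            R[np.ix_(s, s)] = pca.components_.T
        tree = DecisionTreeClassifier(random_state=t).fit(X @ R, y)
        models.append((R, tree))
    return models

def predict(models, X):
    votes = np.stack([tree.predict(X @ R) for R, tree in models])
    return (votes.mean(axis=0) > 0.5).astype(int)  # majority vote, binary labels

models = fit_rotation_ensemble(X_tr, y_tr)
accuracy = (predict(models, X_te) == y_te).mean()
print(f"test accuracy on 30 selected genes: {accuracy:.2f}")
```

The published Rotation Forest additionally draws a bootstrap sample of the training data before each PCA to diversify the rotations; the sketch omits this step for brevity.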


Keywords: Gene expression analysis · Machine learning · Feature selection · Ensemble of classifiers


References

  1. Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001)
  2. Vapnik, V.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
  3. Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), pp. 161–168 (2006)
  4. Wang, L., Chu, F., Xie, W.: Accurate cancer classification using expressions of very few genes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 4(1), 40–53 (2007)
  5. Rodríguez, J.J., Kuncheva, L.I., Alonso, C.J.: Rotation Forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(10), 1619–1630 (2006)
  6. John, G.H., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 121–129 (1994)
  7. Symons, S., Nieselt, K.: Data mining microarray data – Comprehensive benchmarking of feature selection and classification methods. Pre-print
  8. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco (2005)
  9. Kira, K., Rendell, L.A.: A practical approach to feature selection. In: Proceedings of the International Conference on Machine Learning (ICML 1992), pp. 249–256 (1992)
  10. Kononenko, I.: Estimating attributes: Analysis and extensions of Relief. In: Proceedings of the European Conference on Machine Learning (ECML 1994), pp. 171–182 (1994)
  11. Robnik-Šikonja, M., Kononenko, I.: An adaptation of Relief for attribute estimation in regression. In: Machine Learning: Proceedings of the Fourteenth International Conference (ICML 1997), pp. 296–304 (1997)
  12. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine Learning 46, 389–422 (2002)
  13. Díaz-Uriarte, R., Alvarez de Andrés, S.: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7, 3 (2006)
  14. Dietterich, T.G.: Ensemble learning. In: Arbib, M.A. (ed.) The Handbook of Brain Theory and Neural Networks, pp. 405–408. The MIT Press, Cambridge (2002)
  15. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods – Support Vector Learning (1998)
  16. Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murthy, K.R.K.: Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13(3), 637–649 (2001)
  17. Mitchell, T.: Machine Learning. McGraw Hill, New York (1997)
  18. Kent Ridge Biomedical Data Set Repository
  19. Ambroise, C., McLachlan, G.J.: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA 99, 6562–6566 (2002)
  20. Friedman, M.: A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics 11(1), 86–92 (1940)
  21. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1, 80–83 (1945)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Gregor Stiglic 1, 2
  • Juan J. Rodriguez 3
  • Peter Kokol 1, 2

  1. Faculty of Health Sciences, University of Maribor, Maribor, Slovenia
  2. Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
  3. University of Burgos, Burgos, Spain
