Knowledge and Information Systems

Volume 51, Issue 3, pp 1067–1090

Can classification performance be predicted by complexity measures? A study using microarray data

  • L. Morán-Fernández
  • V. Bolón-Canedo
  • A. Alonso-Betanzos
Regular Paper


Data complexity analysis makes it possible to determine whether classification performance is limited not by the algorithm but by intrinsic characteristics of the data. Microarray datasets, which combine very high numbers of gene expression features with small sample sizes, pose a particular challenge for machine learning researchers. This type of data also has other particularities that may negatively affect the generalization capacity of classifiers, such as overlap between classes and class imbalance. Using several complexity measures, we analyzed the intrinsic complexity of a collection of microarray datasets, with and without feature selection, and then explored the connection with the empirical results obtained by four widely used classifiers. Experimental results on 21 binary and multiclass datasets demonstrate that a correlation exists between microarray data complexity and classification error rates.
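As an illustration (not code from the paper itself), one of the classical data complexity measures in the literature, Fisher's maximum discriminant ratio (usually denoted F1), captures how well a single feature separates two classes. A minimal sketch in pure Python, assuming a binary problem with samples given as feature lists:

```python
from statistics import fmean, pvariance

def fisher_f1(X, y):
    """Fisher's maximum discriminant ratio (F1) for a two-class dataset.

    For each feature f: (mu1 - mu2)^2 / (var1 + var2), where the means
    and variances are computed per class. F1 is the maximum over all
    features; a larger value means at least one feature separates the
    classes well, i.e. the data are intrinsically less complex.
    """
    labels = sorted(set(y))
    assert len(labels) == 2, "F1 as defined here is for binary problems"
    a = [x for x, lbl in zip(X, y) if lbl == labels[0]]
    b = [x for x, lbl in zip(X, y) if lbl == labels[1]]
    best = 0.0
    for f in range(len(X[0])):
        va = [row[f] for row in a]
        vb = [row[f] for row in b]
        num = (fmean(va) - fmean(vb)) ** 2
        den = pvariance(va) + pvariance(vb)
        if den > 0:  # skip features with zero variance in both classes
            best = max(best, num / den)
    return best
```

In the microarray setting studied here, measures of this kind are computed on the original and on the feature-selected datasets, and their values are then contrasted with the error rates of the classifiers.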


Data complexity measures · Classification · Microarray data · Feature selection · Filters



This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project TIN2015-65069-C2-1-R), by European Union FEDER funds and by the Consellería de Industria of the Xunta de Galicia (research project GRC2014/035). V. Bolón-Canedo acknowledges Xunta de Galicia postdoctoral funding (grant ED481B 2014/164-0).



Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  • L. Morán-Fernández (1)
  • V. Bolón-Canedo (1)
  • A. Alonso-Betanzos (1)

  1. Laboratory for Research and Development in Artificial Intelligence (LIDIA), Computer Science Department, University of A Coruña, A Coruña, Spain
