Using Supervised Complexity Measures in the Analysis of Cancer Gene Expression Data Sets

  • Ivan G. Costa
  • Ana C. Lorena
  • Liciana R. M. P. y Peres
  • Marcilio C. P. de Souto
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5676)


Supervised Machine Learning methods have been successfully applied to gene expression based cancer diagnosis. Characteristics intrinsic to cancer gene expression data sets, such as high dimensionality, a small number of samples and the presence of noise, make the classification task very difficult. Furthermore, limitations in classifier performance may often be attributed to characteristics intrinsic to a particular data set.

This paper presents an analysis of gene expression data sets for cancer diagnosis using classification complexity measures. Such measures consider data geometry, distribution and linear separability as indications of the complexity of the classification task. The results obtained indicate that the cancer data sets investigated are formed mostly by linearly separable, non-overlapping classes, which supports the good predictive performance of robust linear classifiers, such as SVMs, on the given data sets. Furthermore, we found two complexity indices that were good indicators of the difficulty of gene expression based cancer diagnosis.
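The abstract does not say which complexity measures were computed. As an illustration only, the sketch below implements one well-known measure of this kind, the maximum Fisher discriminant ratio of Ho and Basu (2002), for a two-class problem. The function name `fisher_ratio` and the toy expression values are assumptions made for this example and are not taken from the paper. Larger values suggest that at least one gene separates the two classes well.

```python
from statistics import mean, pvariance


def fisher_ratio(class_a, class_b):
    """Maximum Fisher discriminant ratio over all features (two classes).

    class_a, class_b: lists of samples; each sample is a list of
    expression values with one entry per gene (feature).
    """
    best = 0.0
    # zip(*samples) transposes the data so we iterate gene by gene.
    for feat_a, feat_b in zip(zip(*class_a), zip(*class_b)):
        gap = (mean(feat_a) - mean(feat_b)) ** 2
        spread = pvariance(feat_a) + pvariance(feat_b)
        if spread == 0:
            # Constant feature in both classes: perfectly separating
            # if the means differ, uninformative otherwise.
            ratio = float("inf") if gap > 0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best


# Hypothetical toy data: 3 samples per class, 2 genes each.
tumor = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]
normal = [[1.0, 1.2], [2.0, 1.8], [3.0, 1.4]]
print(fisher_ratio(tumor, normal))
```

In this toy data the first gene has well-separated class means and dominates the ratio, while the second gene contributes almost nothing.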


Cancer gene expression classification, Machine Learning, data set complexity





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Ivan G. Costa (1)
  • Ana C. Lorena (2)
  • Liciana R. M. P. y Peres (2)
  • Marcilio C. P. de Souto (3)

  1. Center of Informatics, Federal University of Pernambuco, Recife, Brazil
  2. Center of Mathematics, Computation and Cognition, ABC Fed. Univ., Brazil
  3. Dept. of Informatics and Applied Mathematics, Fed. Univ. of Rio Grande do Norte, Brazil
