Conquering the Curse of Dimensionality in Gene Expression Cancer Diagnosis: Tough Problem, Simple Models

  • Minca Mramor
  • Gregor Leban
  • Janez Demšar
  • Blaž Zupan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3581)

Abstract

In this paper, we study the properties of cancer gene expression data sets from the perspective of classification and tumor diagnosis. Our findings and case studies are based on several recently published data sets. We find that these data sets typically include a subset of about 100 highly discriminating features whose predictive power can be further enhanced by exploring their interactions. This finding speaks against the often-used univariate feature selection methods and may explain the superior performance of support vector machines recently reported in related work. We argue that a much simpler technique, one that directly finds visualizations with clear separation of diagnostic classes, may be used instead. Furthermore, it may perform better at inferring an understandable classifier that includes only a few relevant features.
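
The argument that feature interactions limit univariate selection can be illustrated with a small sketch. This is not the method used in the paper; it is a toy example on assumed, synthetic binary data in which the class is the XOR of two features, and the helper name `majority_accuracy` is hypothetical. It scores each feature alone and each pair of features by the training accuracy of a simple lookup-table classifier.

```python
from collections import Counter, defaultdict
from itertools import combinations, product


def majority_accuracy(rows, labels, feature_idxs):
    """Training accuracy of a lookup-table classifier that predicts the
    majority class for each combination of values of the chosen features."""
    cells = defaultdict(Counter)
    for row, y in zip(rows, labels):
        key = tuple(row[i] for i in feature_idxs)
        cells[key][y] += 1
    correct = sum(max(counts.values()) for counts in cells.values())
    return correct / len(rows)


# Synthetic data (an assumption for illustration): three binary features,
# class = feature 0 XOR feature 1; feature 2 carries no class information.
rows = list(product((0, 1), repeat=3))
labels = [a ^ b for a, b, _ in rows]

# Univariate view: each feature scored on its own.
single = {i: majority_accuracy(rows, labels, (i,)) for i in range(3)}

# Pairwise view: each pair of features scored jointly.
pairs = {p: majority_accuracy(rows, labels, p)
         for p in combinations(range(3), 2)}

print("single-feature accuracy:", single)
print("feature-pair accuracy:  ", pairs)
```

In this toy setting every single feature scores at chance level, so a univariate filter would have no reason to prefer features 0 and 1, while the pair (0, 1) separates the classes perfectly. Real expression data are continuous and noisy, so the effect there would be a matter of degree rather than this clean extreme.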

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Minca Mramor (1)
  • Gregor Leban (1)
  • Janez Demšar (1)
  • Blaž Zupan (1, 2)
  1. Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
  2. Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, U.S.A.
