Conquering the Curse of Dimensionality in Gene Expression Cancer Diagnosis: Tough Problem, Simple Models
In this paper, we study the properties of cancer gene expression data sets from the perspective of classification and tumor diagnosis. Our findings and case studies are based on several recently published data sets. We find that these data sets typically include a subset of about 100 highly discriminating features whose predictive power can be further enhanced by exploring their interactions. This finding argues against the commonly used univariate feature selection methods, and may explain the superior performance of support vector machines recently reported in related work. We argue that a much simpler technique, one that directly finds visualizations with a clear separation of diagnostic classes, may be used instead. Furthermore, it may perform better at inferring an understandable classifier that includes only a few relevant features.
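The argument about feature interactions can be illustrated with a minimal sketch. It uses hypothetical toy data (two binary "genes" whose XOR determines the class) and a simple mean-difference score as a stand-in for a univariate filter. Neither the data nor the score is taken from the paper; they are assumptions chosen only to show how a univariate ranking can miss features that are informative jointly.

```python
from itertools import product


def univariate_score(values, labels):
    """Absolute difference between the class means of a single feature.

    This is an illustrative univariate filter score, not the paper's method.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


# Hypothetical toy data: every combination of two binary features, repeated.
samples = list(product([0, 1], repeat=2)) * 5
labels = [a ^ b for a, b in samples]  # class depends only on the interaction

gene_a = [a for a, _ in samples]
gene_b = [b for _, b in samples]
# Pairwise interaction term: product of the two features after centering.
interaction = [(a - 0.5) * (b - 0.5) for a, b in samples]

score_a = univariate_score(gene_a, labels)
score_b = univariate_score(gene_b, labels)
score_ab = univariate_score(interaction, labels)

print(score_a, score_b, score_ab)
```

In this toy setting each feature scores zero on its own, so a univariate filter would discard both, while the interaction term separates the two classes.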