Ensembles of Nearest Neighbors for Gene Expression Based Cancer Classification
Gene expression levels are useful for discriminating between cancer and normal examples and/or between different types of cancer. In this chapter, ensembles of k-nearest neighbor (k-NN) classifiers are employed for gene expression based cancer classification. The ensembles are created by randomly sampling subsets of genes, assigning each subset to a k-NN classifier, and combining the k-NN predictions by majority vote. Selection of subsets is governed by the statistical dependence between dataset complexity and classification error, confirmed by the copula method: the least complex subsets are preferred because they are associated with more accurate predictions. Experiments carried out on six gene expression datasets show that our ensemble scheme outperforms both the single best classifier in the ensemble and a redundancy-based filter designed specifically to remove irrelevant genes.
Keywords: Ensemble of classifiers, k-nearest neighbor, gene expression, cancer classification, dataset complexity, copula, bolstered error
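The ensemble scheme described in the abstract (random gene subsets, one k-NN per subset, majority vote, preference for low-complexity subsets) can be illustrated with a minimal sketch. It is not the chapter's implementation. The function names, parameter defaults, and toy data are hypothetical, and the complexity score used here (leave-one-out 1-NN error on the training set) is only a stand-in, since the abstract does not name the complexity measure. The copula analysis and bolstered error estimation are not shown.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, genes, k=3):
    """Classify x by k-NN using only the gene indices in `genes`."""
    # Squared Euclidean distance restricted to the chosen genes.
    scored = [(sum((row[g] - x[g]) ** 2 for g in genes), label)
              for row, label in zip(train_X, train_y)]
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def loo_complexity(train_X, train_y, genes):
    """Stand-in complexity score (an assumption, not the chapter's measure):
    leave-one-out 1-NN error on the training data for this gene subset."""
    errors = 0
    for i, (x, y) in enumerate(zip(train_X, train_y)):
        rest_X = train_X[:i] + train_X[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        if knn_predict(rest_X, rest_y, x, genes, k=1) != y:
            errors += 1
    return errors / len(train_X)


def select_subsets(train_X, train_y, n_candidates=10, n_members=5,
                   subset_size=4, seed=0):
    """Sample random gene subsets and keep the least complex ones."""
    rng = random.Random(seed)
    n_genes = len(train_X[0])
    candidates = [rng.sample(range(n_genes), subset_size)
                  for _ in range(n_candidates)]
    candidates.sort(key=lambda genes: loo_complexity(train_X, train_y, genes))
    return candidates[:n_members]


def ensemble_predict(train_X, train_y, subsets, x, k=3):
    """Combine the per-subset k-NN predictions by majority vote."""
    votes = Counter(knn_predict(train_X, train_y, x, genes, k)
                    for genes in subsets)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): genes 0-2 separate the classes, genes 3-5 are noise.
data_rng = random.Random(42)


def make_sample(label):
    informative = [10 * label + data_rng.random() for _ in range(3)]
    noise = [data_rng.random() for _ in range(3)]
    return informative + noise


train_y = [0] * 5 + [1] * 5
train_X = [make_sample(y) for y in train_y]

subsets = select_subsets(train_X, train_y)
query = [10.5, 10.5, 10.5, 0.5, 0.5, 0.5]
print(ensemble_predict(train_X, train_y, subsets, query))
```

Using an odd number of ensemble members avoids tied votes in the two-class case. With real expression data the subset size, the number of candidates, and k would need tuning, and the selection score would be replaced by whichever complexity measure is actually adopted.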