Statistical Analysis on Microarray Data: Selection of Gene Prognosis Signatures

Part of the Applied Bioinformatics and Biostatistics in Cancer Research book series (ABB)


Microarrays are increasingly used in cancer research to better understand the molecular variations among tumours or other biological conditions. They allow tens of thousands of transcripts to be measured simultaneously in a single experiment. Analysing these data sets is a non-standard problem and a challenge for both statisticians and biologists, because the dimension of the feature space (the number of genes or transcripts) is much greater than the number of tissue samples. Selecting marker genes among thousands to diagnose a cancer type is therefore of crucial importance and can help clinicians develop gene-expression-based diagnostic tests to guide therapy in cancer patients. In this chapter, we focus on the classification and prediction of a sample given some carefully chosen gene expression profiles. We review some state-of-the-art machine learning approaches to gene selection: recursive feature elimination, nearest shrunken centroids and random forests. We discuss the difficulties that can be encountered when dealing with microarray data, such as selection bias, multiclass problems and unbalanced classes. The three approaches are then applied and compared on a typical cancer gene expression study.
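To make the selection bias issue concrete, here is a minimal sketch in plain Python. It is not the chapter's procedure: the data are synthetic pure noise, and the scoring rule, the nearest-centroid classifier and all function names (`separation_score`, `select_genes`, `predict_nearest_centroid`, `loo_error`) are assumptions chosen for illustration. It compares a leave-one-out error estimate when genes are selected once on all samples with the estimate obtained when selection is repeated inside every fold.

```python
import random

def separation_score(values, labels):
    """Absolute difference of the two class means over the pooled standard deviation."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    pooled_ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    pooled_sd = (pooled_ss / (len(values) - 2)) ** 0.5
    return abs(mean_a - mean_b) / (pooled_sd + 1e-12)

def select_genes(X, y, k):
    """Return the indices of the k genes with the highest separation score."""
    p = len(X[0])
    scores = [separation_score([row[j] for row in X], y) for j in range(p)]
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]

def predict_nearest_centroid(X_train, y_train, x_new, genes):
    """Assign x_new to the class whose centroid (on the selected genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [row for row, y in zip(X_train, y_train) if y == label]
        centroid = [sum(row[j] for row in rows) / len(rows) for j in genes]
        dist = sum((x_new[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_error(X, y, k, select_inside_cv):
    """Leave-one-out error; genes are chosen per fold only if select_inside_cv is True."""
    fixed_genes = None if select_inside_cv else select_genes(X, y, k)
    errors = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_train, y_train, k) if select_inside_cv else fixed_genes
        if predict_nearest_centroid(X_train, y_train, X[i], genes) != y[i]:
            errors += 1
    return errors / len(X)

# Synthetic noise: 20 samples, 200 "genes", labels unrelated to the expression values.
random.seed(0)
n_samples, n_genes, k = 20, 200, 10
y = [0] * 10 + [1] * 10
X = [[random.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]

biased = loo_error(X, y, k, select_inside_cv=False)   # selection saw the held-out sample
honest = loo_error(X, y, k, select_inside_cv=True)    # selection redone in each fold
print(f"selection outside CV: {biased:.2f}")
print(f"selection inside CV:  {honest:.2f}")
```

Because the labels carry no signal here, an unbiased error estimate should sit near chance. Selecting genes on the full data before cross-validation typically produces a much more optimistic figure, which is the effect the abstract refers to as selection bias.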


Keywords: Feature Selection; Random Forest; Acute Myeloid Leukaemia; Acute Lymphoblastic Leukaemia; Feature Selection Method



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Department of Mathematics and Institute for Molecular Bioscience, University of Queensland, Queensland, Australia
