Gene Selection Strategies in Microarray Expression Data: Applications to Case-Control Studies

  • Gustavo A. Stolovitzky
Part of the Topics in Biomedical Engineering International Book Series book series (ITBE)


Over the last decade we have witnessed the rise of the gene expression array assay as a new experimental paradigm to study the cellular state at the whole genome scale. This technology has allowed considerable progress in the identification of markers associated with human disease mechanisms, and in the molecular characterization of diseases such as cancer, by careful characterization of genes involved directly or indirectly in the disease. A typical gene expression experiment provides scientists with an enormous amount of data. Analysis of these data, and interpretation of the ensuing results, have attracted the attention of many researchers, who have developed new ways of interrogating the expression data. In this chapter we will review some of these recent efforts, emphasizing the need to make use of batteries of methods rather than one method in particular, as well as the need to properly validate results with independent data sets. The application of DNA array technology for use in disease diagnostics will be exemplified in the case of chronic lymphocytic leukemia.


Keywords: Support Vector Machine; Chronic Lymphocytic Leukemia; Singular Value Decomposition; Follicular Lymphoma; Gene Selection



Copyright information

© Springer Inc. 2006

Authors and Affiliations

  1. IBM Computational Biology Center, Yorktown Heights
