Abstract
Progress in mapping the genome and developments in array technologies have provided large amounts of information for delineating the roles of genes involved in complex diseases and quantitative traits. Because complex phenotypes are determined by a network of interrelated biological traits, typically involving multiple inter-correlated genetic and environmental factors that interact in a hierarchical fashion, microarrays hold tremendous latent information. The analysis of microarray data is, however, still a bottleneck. In this paper, we review recent advances in statistical analyses for associating phenotypes with molecular events underpinning microarray experiments. Classical statistical procedures for analyzing phenotypes in genetics are reviewed first, followed by descriptions of the statistical procedures for linking molecular events to measured gene expression phenotypes (microarray-based gene expression) and to observed phenotypes such as disease status. These statistical procedures include (1) prior analysis, such as data quality control and normalization, for minimizing the effects of experimental artifacts and random noise; (2) gene selection and differentiation procedures based on inferential statistics for class comparisons; (3) analysis of dynamic temporal patterns through exploratory statistics, such as unsupervised clustering, and through supervised classification and prediction; and (4) assessment of the reliability of microarray studies using real-time PCR, together with reproducibility issues across many studies and multiple platforms. In addition, post analysis that associates the discovered patterns of gene expression with pathway and functional analysis for selected genes is also considered, in order to increase our understanding of interconnected gene processes.
Acknowledgements
The authors thank Rudi Appels for the critical reading of, and comments on, this manuscript. This work was supported in part by the National Institute of General Medical Sciences Grant (no. 5P20GM67650-02).
Cite this article
Liang, Y., Kelemen, A. Associating phenotypes with molecular events: recent statistical advances and challenges underpinning microarray experiments. Funct Integr Genomics 6, 1–13 (2006). https://doi.org/10.1007/s10142-005-0006-z
DOI: https://doi.org/10.1007/s10142-005-0006-z