Abstract
An exciting biological advancement over the past few years is the use of microarray technologies to measure the expression levels of thousands of genes simultaneously. The bottleneck now is extracting useful information from the resulting large amounts of data. An important and common task in analyzing microarray data is to identify genes with altered expression under two experimental conditions. We propose a nonparametric statistical approach, called the mixture model method (MMM), to handle this problem when there is only a small number of replicates under each experimental condition. Specifically, we propose estimating the distributions of a t-type test statistic and its null statistic using finite normal mixture models. Comparing these two distributions by means of a likelihood ratio test, or simply using the tail distribution of the null statistic, identifies genes with significantly changed expression. Several methods are proposed to control false positives effectively. The methodology is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle ear infection.
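The abstract does not give the form of the statistics or the fitting procedure, so the following Python sketch illustrates the general idea only under stated assumptions. It assumes a Welch-type statistic Z, a null statistic z obtained by flipping the sign of half the replicates within each condition (so that a true mean difference cancels), a plain EM fit with a fixed number of mixture components, and a two-sided cut-off taken from the tail of the fitted null mixture. All function names, the simulated data, and the parameter choices are hypothetical.

```python
import math
import random
import statistics


def gene_statistics(y1, y2):
    """Return (Z, z) for one gene; assumes an even number of replicates per condition."""
    n1, n2 = len(y1), len(y2)
    se = math.sqrt(max(statistics.variance(y1) / n1 + statistics.variance(y2) / n2, 1e-12))
    big_z = (statistics.mean(y1) - statistics.mean(y2)) / se
    # Assumed null construction: half the replicates get +1, half get -1.
    u = [1] * (n1 // 2) + [-1] * (n1 - n1 // 2)
    v = [1] * (n2 // 2) + [-1] * (n2 - n2 // 2)
    small_z = (sum(a * s for a, s in zip(y1, u)) / n1
               - sum(b * s for b, s in zip(y2, v)) / n2) / se
    return big_z, small_z


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_normal_mixture(data, k, iters=100, seed=0):
    """Fit a k-component 1-D normal mixture by EM; returns (weights, means, variances)."""
    n = len(data)
    centre = sum(data) / n
    spread = sum((x - centre) ** 2 for x in data) / n
    weights, variances = [1.0 / k] * k, [spread] * k
    means = random.Random(seed).sample(data, k)
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(sum(r[j] * (x - means[j]) ** 2
                                   for r, x in zip(resp, data)) / nj, 1e-6)
    return weights, means, variances


def mixture_pdf(x, params):
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(*params))


def two_sided_tail(c, params):
    """P(|z| > c) under a fitted normal mixture."""
    def cdf(x, m, s):
        return 0.5 * (1.0 + math.erf((x - m) / math.sqrt(2.0 * s)))
    return sum(w * (cdf(-c, m, s) + 1.0 - cdf(c, m, s)) for w, m, s in zip(*params))


def null_cutoff(params, alpha, upper=50.0):
    """Bisection for the c with P(|z| > c) = alpha under the null mixture."""
    lower = 0.0
    for _ in range(60):
        mid = (lower + upper) / 2.0
        if two_sided_tail(mid, params) > alpha:
            lower = mid
        else:
            upper = mid
    return upper


# Simulated toy data (not from the paper): 300 genes, 4 replicates per condition,
# with the first 15 genes shifted by 3 units under the first condition.
rng = random.Random(1)
stats = []
for g in range(300):
    shift = 3.0 if g < 15 else 0.0
    stats.append(gene_statistics([rng.gauss(shift, 1.0) for _ in range(4)],
                                 [rng.gauss(0.0, 1.0) for _ in range(4)]))
f = fit_normal_mixture([s[0] for s in stats], k=2)   # fitted distribution of Z
f0 = fit_normal_mixture([s[1] for s in stats], k=1)  # fitted distribution of null z
cut = null_cutoff(f0, alpha=0.01)
flagged = [g for g, (big_z, _) in enumerate(stats) if abs(big_z) > cut]
# Alternative comparison: a small ratio f0(Z)/f(Z) also points to altered expression.
ratios = [mixture_pdf(big_z, f0) / mixture_pdf(big_z, f) for big_z, _ in stats]
print(len(flagged), "genes flagged at |Z| >", round(cut, 2))
```

In this sketch the number of mixture components is fixed by hand; in practice it would need to be chosen from the data, for example with an information criterion.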
Acknowledgements
W.P. was supported by an NIH grant (R01-HL65462) and a Minnesota Medical Foundation grant. The authors are grateful to two referees for many helpful comments and suggestions.
Cite this article
Pan, W., Lin, J. & Le, C.T. A mixture model approach to detecting differentially expressed genes with microarray data. Funct Integr Genomics 3, 117–124 (2003). https://doi.org/10.1007/s10142-003-0085-7