Functional & Integrative Genomics, Volume 3, Issue 3, pp 117–124

A mixture model approach to detecting differentially expressed genes with microarray data

Original Paper


An exciting biological advancement over the past few years is the use of microarray technologies to measure the expression levels of thousands of genes simultaneously. The bottleneck now is how to extract useful information from the resulting large amounts of data. An important and common task in analyzing microarray data is to identify genes whose expression differs between two experimental conditions. We propose a nonparametric statistical approach, called the mixture model method (MMM), to handle this problem when only a small number of replicates is available under each experimental condition. Specifically, we propose estimating the distributions of a t-type test statistic and its null statistic using finite normal mixture models. A comparison of these two distributions by means of a likelihood ratio test, or simply using the tail distribution of the null statistic, can identify genes with significantly changed expression. Several methods are proposed to control the number of false positives effectively. The methodology is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle ear infection.
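The idea of comparing two fitted mixture densities can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_normal_mixture`, `likelihood_ratio`), the quantile-based initialisation, the fixed iteration count, and the simulated statistics are all assumptions made for illustration. The sketch fits one-dimensional finite normal mixtures by EM to a set of null statistics and a set of observed statistics, then forms the ratio of the null density to the observed density, so that small ratios point to statistics that are unlikely under the null.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Density of a finite normal mixture at x."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def fit_normal_mixture(data, n_components, n_iter=100, min_var=1e-6):
    """Fit a 1-D normal mixture by EM (illustrative: fixed iterations, no restarts)."""
    xs = sorted(data)
    n = len(xs)
    # Assumed initialisation: means at evenly spaced quantiles, pooled variance.
    means = [xs[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    centre = sum(xs) / n
    pooled = max(sum((x - centre) ** 2 for x in xs) / n, min_var)
    variances = [pooled] * n_components
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances from the posteriors.
        for k in range(n_components):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, min_var)  # floor avoids degenerate components
    return weights, means, variances


def likelihood_ratio(z, null_fit, observed_fit):
    """Ratio of null density to observed density at z; small values favour change."""
    return mixture_pdf(z, *null_fit) / mixture_pdf(z, *observed_fit)


if __name__ == "__main__":
    random.seed(0)
    # Simulated statistics (made up): nulls are N(0, 1); 10% of observed shift to N(4, 1).
    null_stats = [random.gauss(0, 1) for _ in range(500)]
    observed_stats = ([random.gauss(0, 1) for _ in range(450)]
                      + [random.gauss(4, 1) for _ in range(50)])
    null_fit = fit_normal_mixture(null_stats, n_components=1)
    observed_fit = fit_normal_mixture(observed_stats, n_components=2)
    for z in (0.0, 2.0, 4.0):
        print(z, likelihood_ratio(z, null_fit, observed_fit))
```

In practice the number of mixture components would have to be chosen by some model selection criterion, and the cut-off applied to the ratio would have to be set with the false positive rate in mind; neither is addressed in this sketch.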


Keywords: Likelihood ratio, Permutation, Normal mixtures, SAM



W.P. was supported by an NIH grant (R01-HL65462) and a Minnesota Medical Foundation grant. The authors are grateful to two referees for many helpful comments and suggestions.



Copyright information

© Springer-Verlag 2003

Authors and Affiliations

  1. Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo, MMC 303, Minneapolis, USA
  2. Department of Otolaryngology, School of Medicine, University of Minnesota, Minneapolis, USA
