
A mixture model approach to detecting differentially expressed genes with microarray data

  • Original Paper
  • Functional & Integrative Genomics

Abstract

An exciting biological advancement over the past few years is the use of microarray technologies to measure simultaneously the expression levels of thousands of genes. The bottleneck now is how to extract useful information from the resulting large amounts of data. An important and common task in analyzing microarray data is to identify genes whose expression differs between two experimental conditions. We propose a nonparametric statistical approach, called the mixture model method (MMM), to handle the problem when only a small number of replicates is available under each experimental condition. Specifically, we propose estimating the distributions of a t-type test statistic and its null statistic using finite normal mixture models. A comparison of these two distributions by means of a likelihood ratio test, or simply using the tail distribution of the null statistic, can identify genes with significantly changed expression. Several methods are proposed to control the number of false positives. The methodology is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle ear infection.
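
The idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the null statistic is built here by balanced random sign flips within each condition (which needs an even number of replicates), the number of mixture components is fixed by hand, the data are simulated, and all function names (`fit_normal_mixture`, `t_type_stat`, `null_stat`, `likelihood_ratio`, `simulate`) are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, params):
    """Density of a finite normal mixture; params = [(weight, mean, variance), ...]."""
    return sum(w * normal_pdf(x, mu, var) for w, mu, var in params)


def fit_normal_mixture(xs, k, iters=100, min_var=1e-3):
    """Fit a k-component univariate normal mixture with a plain EM loop."""
    xs = sorted(xs)
    n = len(xs)
    centre = sum(xs) / n
    spread = max(sum((x - centre) ** 2 for x in xs) / n, min_var)
    # Start from equal weights, means at spaced quantiles, a common variance.
    params = [(1.0 / k, xs[int((j + 0.5) * n / k)], spread) for j in range(k)]
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, mu, var) for w, mu, var in params]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: weighted updates of weight, mean and variance.
        params = []
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            mu = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nj
            params.append((nj / n, mu, max(var, min_var)))
    return params


def _std_err(y1, y2):
    def svar(y):
        m = sum(y) / len(y)
        return sum((v - m) ** 2 for v in y) / (len(y) - 1)
    return math.sqrt(svar(y1) / len(y1) + svar(y2) / len(y2))


def t_type_stat(y1, y2):
    """Two-sample t-type statistic Z for one gene."""
    return (sum(y1) / len(y1) - sum(y2) / len(y2)) / _std_err(y1, y2)


def null_stat(y1, y2, rng):
    """Null statistic z via balanced sign flips within each condition.

    Assumed construction; it expects an even number of replicates.
    """
    def flipped_mean(y):
        signs = [1] * (len(y) // 2) + [-1] * (len(y) - len(y) // 2)
        rng.shuffle(signs)
        return sum(s * v for s, v in zip(signs, y)) / len(y)
    return (flipped_mean(y1) - flipped_mean(y2)) / _std_err(y1, y2)


def likelihood_ratio(z_obs, f0, f):
    """LR(Z) = f0(Z) / f(Z); small values suggest differential expression."""
    return mixture_pdf(z_obs, f0) / mixture_pdf(z_obs, f)


def simulate(n_genes=600, n_rep=4, frac_changed=0.1, shift=3.0, seed=0):
    """Simulated data: the first 10% of genes have a shifted mean in condition 1."""
    rng = random.Random(seed)
    big_z, small_z = [], []
    for g in range(n_genes):
        delta = shift if g < frac_changed * n_genes else 0.0
        y1 = [rng.gauss(delta, 1.0) for _ in range(n_rep)]
        y2 = [rng.gauss(0.0, 1.0) for _ in range(n_rep)]
        big_z.append(t_type_stat(y1, y2))
        small_z.append(null_stat(y1, y2, rng))
    return big_z, small_z


Z, z = simulate()
f0 = fit_normal_mixture(z, k=2)   # estimated density of the null statistic
f = fit_normal_mixture(Z, k=3)    # estimated density of the test statistic
print(likelihood_ratio(0.0, f0, f), likelihood_ratio(6.0, f0, f))
```

In this sketch a gene would be flagged when its likelihood ratio falls below a chosen cutoff. How that cutoff is set so that false positives are controlled, and how the number of mixture components is selected, are left open here.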

Acknowledgements

W.P. was supported by an NIH grant (R01-HL65462) and a Minnesota Medical Foundation grant. The authors are grateful to two referees for many helpful comments and suggestions.

Author information

Correspondence to Wei Pan.

About this article

Cite this article

Pan, W., Lin, J. & Le, C.T. A mixture model approach to detecting differentially expressed genes with microarray data. Funct Integr Genomics 3, 117–124 (2003). https://doi.org/10.1007/s10142-003-0085-7
