A Bayesian mixture model for metaanalysis of microarray studies

  • Original Paper
  • Functional & Integrative Genomics

Abstract

The increased availability of microarray data has been calling for statistical methods to integrate findings across studies. A common goal of microarray analysis is to determine differentially expressed genes between two conditions, such as treatment vs control. A recent Bayesian metaanalysis model used a prior distribution for the mean log-expression ratios that was a mixture of two normal distributions. This model centered the prior distribution of differential expression at zero, and separated genes into two groups only: expressed and nonexpressed. Here, we introduce a Bayesian three-component truncated normal mixture prior model that more flexibly assigns prior distributions to the differentially expressed genes and produces three groups of genes: up and downregulated, and nonexpressed. We found in simulations of two and five studies that the three-component model outperformed the two-component model using three comparison measures. When analyzing biological data of Bacillus subtilis, we found that the three-component model discovered more genes and omitted fewer genes for the same levels of posterior probability of differential expression than the two-component model, and discovered more genes for fixed thresholds of Bayesian false discovery. We assumed that the data sets were produced from the same microarray platform and were prescaled.

References

  • Baldi P, Long AD (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17:509–519

  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300

  • Bhowmick D, Davison AC, Goldstein DR, Ruffieux Y (2006) A Laplace mixture model for identification of differential expression in microarray experiments. Biostatistics 7:630–641

  • Broët P, Richardson S, Radvanyi F (2002) Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J Comput Biol 9:671–683

  • Choi JK, Yu U, Kim S, Yoo OJ (2003) Combining multiple microarray studies and modeling inter-study variation. Bioinformatics Suppl 19:i84–i90

  • Conlon EM, Eichenberger P, Liu JS (2004) Determining and analyzing differentially expressed genes from cDNA microarray experiments with complementary designs. J Multivar Anal 90:1–18

  • Conlon EM, Song JJ, Liu JS (2006) Bayesian models for pooling microarray studies with multiple sources of replications. BMC Bioinformatics 7:247

  • Conlon EM, Song JJ, Liu A (2007) Bayesian meta-analysis models for microarray data: a comparative study. BMC Bioinformatics 8:80

  • Do KA, Müller P, Tang F (2005) A Bayesian mixture model for differential gene expression. J R Stat Soc C 54:627–644

  • Dudoit S, Yang YH, Callow MJ, Speed TP (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sin 12:111–139

  • Efron B, Tibshirani R, Storey JD, Tusher VG (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96:1151–1160

  • Eichenberger P, Jensen ST, Conlon EM, van Ooij C, Silvaggi J, Gonzalez-Pastor JE, Fujita M, Ben-Yehuda S, Stragier P, Liu JS, Losick R (2003) The sigmaE regulon and the identification of additional sporulation genes in Bacillus subtilis. J Mol Biol 327:945–972

  • Genovese C, Wasserman L (2002) Operating characteristics and extensions of the false discovery rate procedure. J R Stat Soc B 64:499–518

  • Genovese C, Wasserman L (2003) Bayesian and frequentist multiple testing. In: Bernardo JM, Bayarri JM, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M (eds) Bayesian statistics 7. Oxford University Press, Oxford, pp 145–162

  • Ghosh D, Barette TR, Rhodes D, Chinnaiyan AM (2003) Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer. Funct Integr Genomics 3:180–188

  • Gottardo R, Pannucci JA, Kuske CR, Brettin T (2003) Statistical analysis of microarray data: a Bayesian approach. Biostatistics 4:597–620

  • Hu P, Greenwood CMT, Beyene J (2005a) Integrative analysis of multiple gene expression profiles with quality-adjusted effect size models. BMC Bioinformatics 6:128

  • Hu J, Zou F, Wright FA (2005b) Practical FDR-based sample size calculations in microarray experiments. Bioinformatics 21:3264–3272

  • Ibrahim JG, Chen M-H, Gray RJ (2002) Bayesian models for gene expression with DNA microarray data. J Am Stat Assoc 97:88–99

  • Ishwaran H, Rao JS (2003) Detecting differentially expressed genes in microarrays using Bayesian model selection. J Am Stat Assoc 98:438–455

  • Ishwaran H, Rao JS (2005) Spike and slab gene selection for multigroup microarray data. J Am Stat Assoc 100:764–780

  • Jarvinen AK, Hautaniemi S, Edgren H, Auvinen P, Saarela J, Kallioniemi OP, Monni O (2004) Are data from different gene expression microarray platforms comparable? Genomics 83:1164–1168

  • Jiang H, Deng Y, Chen H, Tao L, Sha Q, Chen J, Tsai C, Zhang S (2004) Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics 5:81

  • Jung YY, Oh MS, Shin DW, Kang SH, Oh HS (2006) Identifying differentially expressed genes in meta-analysis via Bayesian model-based clustering. Biom J 48:435–450

  • Kendziorski CM, Newton MA, Lan H, Gould MN (2003) On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med 22:3899–3914

  • Kuo WP, Jenssen TK, Butte AJ, Ohno-Machado L, Kohane IS (2002) Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18:405–412

  • Liu JS (2001) Monte Carlo strategies in scientific computing. Springer, New York

  • Lönnstedt I, Britton T (2005) Hierarchical Bayes models for cDNA microarray gene expression. Biostatistics 6:279–291

  • Lönnstedt I, Speed TP (2002) Replicated microarray data. Stat Sin 12:31–46

  • Lunn DJ (2003) WBDevDJLTruncatedNormal documentation. Imperial College School of Medicine, London. http://www.winbugs-development.org.uk. Cited 13 Sep 2003

  • Mah N, Thelin A, Lu T, Nikolaus S, Kuhbacher T, Gurbuz Y, Eickhoff H, Kloppel G, Lehrach H, Mellgard B, Costello CM, Schreiber S (2004) A comparison of oligonucleotide and cDNA-based microarray systems. Physiol Genomics 16:361–370

  • Morris JS, Yin G, Baggerly KA, Wu C, Zhang L (2005) Pooling information across different studies and oligonucleotide microarray chip types to identify prognostic genes for lung cancer. In: Shoemaker JS, Lin SM (eds) Methods of microarray data analysis IV. Springer, New York, pp 51–66

  • Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 8:37–52

  • Newton MA, Noueiry A, Sarkar D, Ahlquist P (2004) Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5:155–176

  • Park T, Yi SG, Shin YK, Lee S (2006) Combining multiple microarrays in the presence of controlling variables. Bioinformatics 22:1682–1689

  • Parmigiani G, Garrett ES, Anbazhagan R, Gabrielson E (2002) A statistical framework for expression-based molecular classification in cancer. J R Stat Soc B 64:717–736

  • Parmigiani G, Garrett-Mayer ES, Anbazhagan R, Gabrielson E (2004) A cross-study comparison of gene expression studies for the molecular classification of lung cancer. Clin Cancer Res 10:2922–2927

  • Rhodes DR, Barrette TR, Rubin MA, Ghosh D, Chinnaiyan AM (2002) Meta-analysis of microarrays: inter-study validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res 62:4427–4433

  • Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM (2004) Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA 101:9309–9314

  • Schadt EE, Li C, Ellis B, Wong WH (2001) Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J Cell Biochem Suppl 37:120–125

  • Shen R, Ghosh D, Chinnaiyan AM (2004) Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC Genomics 5:94

  • Spiegelhalter DJ, Thomas A, Best NG (2003) WinBUGS Version 1.4, User Manual. MRC Biostatistics Unit, Cambridge, and Imperial College School of Medicine, London. http://www.mrc-bsu.cam.ac.uk/bugs. Cited 1 Jan 2003

  • Stangl DK, Berry DA (2000) Meta-analysis: past and present challenges. In: Stangl DK, Berry DA (eds) Meta-analysis in medicine and health policy. Dekker, New York, pp 1–28

  • Stevens JR, Doerge RW (2005) Combining Affymetrix microarray results. BMC Bioinformatics 6:57

  • Storey JD (2002) A direct approach to false discovery rates. J R Stat Soc B 64:479–498

  • Storey JD, Tibshirani R (2003) SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In: Parmigiani G, Garrett ES, Irizarry RA, Zeger SL (eds) The analysis of gene expression data: methods and software. Springer, New York, pp 272–290

  • Townsend JP, Hartl DL (2002) Bayesian analysis of gene expression levels: statistical quantification of relative mRNA level across multiple treatments or samples. Genome Biol 3:research0071.1–71.16

  • Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH (2001) Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res 29:2549–2557

  • Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98:5116–5121

  • Tweedie RL, Scott DJ, Biggerstaff BJ, Mengersen KL (1996) Bayesian meta-analysis, with application to studies of ETS and lung cancer. Lung Cancer 14 Suppl 1:S171–S194

  • Wang J, Coombes KR, Highsmith WE, Keating MJ, Abruzzo LV (2004) Differences in gene expression between B-cell chronic lymphocytic leukemia and normal B cells: a meta-analysis of three microarray studies. Bioinformatics 20:3166–3178

  • Warnat P, Eils R, Brors B (2005) Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics 6:265

  • Xu L, Tan AC, Naiman DQ, Geman D, Winslow RL (2005) Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data. Bioinformatics 21:3905–3911

Acknowledgements

The author thanks Mahlet Tadesse, Gheorghe Doros and Joon Jin Song for useful discussion, and Patrick Eichenberger and the laboratory of Richard Losick for providing the B. subtilis microarray data as well as helpful advice. The author also thanks the anonymous reviewers of this manuscript for their comments. EMC was supported by a University of Massachusetts Healey Endowment Grant.

Author information

Corresponding author

Correspondence to Erin M. Conlon.

Appendices

Appendix A

Two-study simulation data

We produced synthetic data for two studies, in a format similar to the B. subtilis mutant and induction biological data (see Appendix A: Biological data), with 3,000 genes in each study. We simulated three separate data sets, with the simulated percentage of differentially expressed genes \( p_{s} = 5, 10, 25\% \), and equal proportions of overexpressed and underexpressed genes. Study 1 contained five replicate slides within three replicate experiments, and Study 2 contained four replicate slides within three replicate experiments. We simulated data from Model (1), with model parameters specified to resemble the biological data. We set \( \eta^{2}_{jg0} = 0.015 \) and \( c_{j} = 80 \), j = 1, 2; the interstudy variability was assumed to be normally distributed with mean 0.3 and variance 0.1 for the differentially expressed genes, and mean 0.03 and variance 0.004 for the nonexpressed genes, similar to the biological data. For Study 1, we set the variance across slides to 0.074 and across experiments to 0.026; for Study 2, we set the slide variance to 0.023 and the experiment variance to 0.022. Each slide was standardized to have mean zero and unit standard deviation (see also Shen et al. 2004; Conlon et al. 2007). Although gene expression is expected to be correlated among genes, this correlation has been shown to be difficult to model in simulations. We therefore simulated gene expression independently for each gene, as have other authors (e.g. Lönnstedt and Speed 2002; Gottardo et al. 2003; Bhowmick et al. 2006).
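The nested replicate structure and the per-slide standardization described above can be illustrated with a short sketch. It is not the simulation code used for this paper: the function names, the (3, 1, 1) split of five slides over three experiments, the value `de_sd=1.0`, and the way gene-level means are drawn are all assumptions made for illustration, and the interstudy variability is left out. Only the slide and experiment variances (0.074 and 0.026 for Study 1) are taken from the text.

```python
import random
import statistics


def simulate_study(n_genes, slides_per_experiment, slide_var, exp_var,
                   frac_de, de_sd, seed=0):
    """Simulate log-ratios data[g][e][s] for one study with a nested design."""
    rng = random.Random(seed)
    n_de = round(frac_de * n_genes)
    data, is_de = [], []
    for g in range(n_genes):
        de = g < n_de
        if de:
            # ASSUMPTION: differentially expressed genes alternate between
            # up- and downregulated, with magnitude |N(0, de_sd^2)|.
            sign = 1.0 if g % 2 == 0 else -1.0
            theta = sign * abs(rng.gauss(0.0, de_sd))
        else:
            # ASSUMPTION: unchanged genes have a gene-level mean of zero.
            theta = 0.0
        gene = []
        for n_slides in slides_per_experiment:
            # Experiment-level mean varies around the gene-level mean.
            xi = rng.gauss(theta, exp_var ** 0.5)
            # Slide-level values vary around the experiment-level mean.
            gene.append([rng.gauss(xi, slide_var ** 0.5)
                         for _ in range(n_slides)])
        data.append(gene)
        is_de.append(de)
    return data, is_de


def standardize_slides(data):
    """Scale every slide (across genes) to mean zero and unit standard deviation."""
    n_genes = len(data)
    out = [[list(exp) for exp in gene] for gene in data]
    for e in range(len(data[0])):
        for s in range(len(data[0][e])):
            column = [data[g][e][s] for g in range(n_genes)]
            mean = statistics.fmean(column)
            sd = statistics.pstdev(column)
            for g in range(n_genes):
                out[g][e][s] = (data[g][e][s] - mean) / sd
    return out


# Hypothetical settings loosely patterned on Study 1 with 10% of genes changed.
study1, truth = simulate_study(3000, (3, 1, 1), slide_var=0.074,
                               exp_var=0.026, frac_de=0.10, de_sd=1.0)
study1_std = standardize_slides(study1)
```

Keeping the data as nested lists indexed by gene, experiment, and slide mirrors the subscripts g, e, s used in Appendix B and allows a different number of slides in each experiment.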

Simulation data for five studies

For the five-study simulations, we used the Study 1 and Study 2 data from the previous section and simulated three additional studies, each with 3,000 genes. We again simulated three separate data sets, with percentage of differentially expressed genes \( p_{s} = 5, 10, 25\% \), and equal proportions of overexpressed and underexpressed genes. For Studies 3, 4, and 5, we used a data format similar to that of Study 1, with five replicate slides within three repeated experiments. We again simulated data from Model (1), with model parameters specified to be similar to the biological data. We set \( \eta^{2}_{jg0} = 0.015 \) and \( c_{j} = 80 \), j = 1,...,5; the interstudy variability was assigned a normal distribution with mean 0.3 and variance 0.1 for the differentially expressed genes, and mean 0.03 and variance 0.004 for the nonexpressed genes, similar to the biological data. For the slide and experiment variance parameters for Studies 3, 4, and 5, we assigned values that were either within the range of values for Studies 1 and 2 or somewhat outside that range. For Study 3, we assigned slide variance 0.05 and experiment variance 0.02; for Study 4, slide variance 0.04 and experiment variance 0.022; for Study 5, slide variance 0.06 and experiment variance 0.03. Each slide was standardized to have zero mean and unit standard deviation.

Biological data

B. subtilis is a bacterium that can form spores, which allow the organism to endure extreme environmental conditions. Two reciprocal B. subtilis microarray studies were carried out to identify sporulation genes controlled by a key transcription factor σE (sigma E). The studies had complementary designs; the first was a deletion of σE, and the second was an induction of σE, as described below (see also Eichenberger et al. 2003; Conlon et al. 2004).

Mutant study

In the mutant study, the sample with a knockout of the gene for σE (the mutant sample) was compared to the wild-type sample, using cDNA microarrays. The wild-type/mutant gene expression ratios identified genes in the σE regulon. In total, five microarray slides were created from three separate identical experiments; the first experiment had three replicate slides, and the second and third experiments each produced one slide. The number of genes spotted on each array was somewhat larger than the B. subtilis genome size of 4,106 genes because particular genes were spotted in replicate on the slides. The missing data rate due to low-quality spots ranged from 18.6 to 64.5% for the five arrays; in total, 3,713 genes had measurable expression levels for at least one array. We used log-ratios of expression values (Dudoit et al. 2002), normalized slides by a rank-invariant method (Schadt et al. 2001; Tseng et al. 2001), and standardized each array to have mean zero and unit standard deviation (Shen et al. 2004; Conlon et al. 2007).

Induction study

For the induction study, σE was overexpressed in response to a specific inducer; thus inducer-treated cells were compared to wild-type cells. Induction/wild-type ratios identified genes under the control of σE. In total, four microarray slides were produced from three separate identical experiments. The first two experiments each created one array and the third experiment created two replicate arrays. The percent of missing data due to low quality ranged from 52.6 to 67.0% for the four arrays; 2,552 genes contained measurable expression for at least one array. We again analyzed the post-normalized log-expression ratios, and standardized each slide to have mean zero and unit standard deviation.

Appendix B

Markov chain Monte Carlo implementation

We use a Markov chain Monte Carlo (MCMC) algorithm to simulate the posterior distributions of the parameters. MCMC procedures produce samples from the density p(ω) for the parameter ω [note that p(ω) may be known only up to a proportionality constant] by generating a Markov chain on the state space of ω that has p as its stationary distribution (further details in Liu 2001).
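To make the idea concrete, the following minimal sketch draws samples from a density that is known only up to a constant, using a random-walk Metropolis sampler. This is a generic illustration and not the sampler used in this paper, which relies on WinBUGS; the function name `metropolis`, the step size, and the example target are assumptions chosen for illustration.

```python
import math
import random


def metropolis(log_density, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for an unnormalized log density."""
    rng = random.Random(seed)
    current = start
    current_lp = log_density(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        # The unknown normalizing constant cancels in this ratio.
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


# Example target: a normal density with mean 2 and unit variance,
# written without its normalizing constant.
draws = metropolis(lambda w: -0.5 * (w - 2.0) ** 2, start=0.0, n_samples=20000)
kept = draws[2000:]  # discard early iterations as burn-in
posterior_mean = sum(kept) / len(kept)
```

Because the chain has the target as its stationary distribution, averages over the retained draws approximate expectations under p(ω) once the early iterations are discarded.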

Joint posterior distributions

For Model (1), the joint posterior distribution of the data and parameters is:

$$\begin{array}{c} p\left( y_{jgse}, \xi_{jge}, \phi^{2}_{jg}, \sigma^{2}_{jg}, \theta_{jg}, I_{g}, p, \eta^{2}_{jg0}, c_{j} \right) = \\ \left[ \prod\limits_{j=1}^{J} \prod\limits_{g=1}^{G} \prod\limits_{e=1}^{E} \left\{ \prod\limits_{s=1}^{S_{e}} p\left( y_{jgse} \mid \xi_{jge}, \phi^{2}_{jg} \right) p\left( \xi_{jge} \mid \theta_{jg}, \sigma^{2}_{jg} \right) \right\} \times \right. \\ \left. p\left( \theta_{jg}, I_{g} \mid p, \Omega_{j} \right) p\left( \phi^{2}_{jg} \right) p\left( \sigma^{2}_{jg} \right) p\left( p \right) p\left( \Omega_{j} \right) \right] \end{array}$$

where \( \Omega_{j} = \left\{ \eta^{2}_{jg0}, c_{j} \right\} \), j = study, g = gene, e = experiment, s = slide.

For Model (2), the joint posterior distribution of the data and parameters is:

$$\begin{array}{c} p\left( y_{jgse}, \xi_{jge}, \phi^{2}_{jg}, \sigma^{2}_{jg}, \theta_{jg}, I_{g}, p, \Omega_{jg} \right) = \\ \left[ \prod\limits_{j=1}^{J} \prod\limits_{g=1}^{G} \prod\limits_{e=1}^{E} \left\{ \prod\limits_{s=1}^{S_{e}} p\left( y_{jgse} \mid \xi_{jge}, \phi^{2}_{jg} \right) p\left( \xi_{jge} \mid \theta_{jg}, \sigma^{2}_{jg} \right) \right\} \times \right. \\ \left. p\left( \theta_{jg}, I_{g} \mid p, \Omega_{jg} \right) p\left( \phi^{2}_{jg} \right) p\left( \sigma^{2}_{jg} \right) p\left( p \right) p\left( \Omega_{jg} \right) \right] \end{array}$$

where \( p = \left\{ p_{0}, p_{1}, p_{2} \right\} \), \( \Omega_{jg} = \left\{ \mu_{jg1}, \delta^{2}_{jg1}, \tau^{2}_{jg1}, \mu_{jg2}, \delta^{2}_{jg2}, \tau^{2}_{jg2} \right\} \), j = study, g = gene, e = experiment, s = slide.

Prior distributions for parameters common to Models (1) and (2)

We assigned prior distributions to the parameters of Models (1) and (2) that were as uninformative as possible while still allowing the models to converge. The prior distributions of the slide effect and experiment effect variance parameters, \( \phi^{2}_{jg} \) and \( \sigma^{2}_{jg} \), respectively, required some information from the data. We assigned the following inverse chi-squared prior distributions for these parameters (see also Tseng et al. 2001; Lönnstedt and Speed 2002; Gottardo et al. 2003; Conlon et al. 2006, 2007):

$$\begin{array}{c} \phi^{2}_{jg} \sim k \widetilde{\phi}^{2}_{j} / \chi^{2}_{k} \\ \sigma^{2}_{jg} \sim h \widetilde{\sigma}^{2}_{j} / \chi^{2}_{h} \end{array}$$

Here, \( \widetilde{\phi }^{2}_{j} \) and \( \widetilde{\sigma }^{2}_{j} \) are the scale parameters for the inverse chi-squared distributions determined from the data. \( \widetilde{\phi }^{2}_{j} \) is derived as follows:

$$ \widetilde{\phi}^{2}_{j} = \frac{1}{G\left( \sum_{e=1}^{E} S_{e} - 1 \right)} \sum\limits_{g=1}^{G} \sum\limits_{e=1}^{E} \sum\limits_{s=1}^{S_{e}} \left( y_{jgse} - y_{jg \cdot e} \right)^{2}, $$

where \( y_{jg \cdot e} \) is the mean log-ratio of expression over the slides within experiment:

$$ y_{jg \cdot e} = \frac{1}{S_{e}} \sum\limits_{s=1}^{S_{e}} y_{jgse}. $$

The scale parameter for \( \sigma ^{2}_{{jg}} \) is obtained similarly, as follows:

$$ \widetilde{\sigma}^{2}_{j} = \frac{1}{G\left( E - 1 \right)} \sum\limits_{g=1}^{G} \sum\limits_{e=1}^{E} \left( y_{jg \cdot e} - y_{jg \cdot \cdot} \right)^{2}, $$

where \( y_{jg \cdot \cdot} \) is the mean log-ratio of expression over both slides and experiments. We specified three degrees of freedom for each study for both \( \phi^{2}_{jg} \) and \( \sigma^{2}_{jg} \), i.e. h = k = 3, which corresponds to a distribution that is as uninformative as possible (for details, see Conlon et al. 2007).
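The two scale parameters can be computed directly from one study's data, as in the sketch below. It assumes complete data (no missing spots), stores the log-ratios as nested lists indexed by gene, experiment, and slide, and takes the overall gene mean as the average of all slide values for that gene; the function name `prior_scales` and this data layout are illustrative choices, not part of the original analysis.

```python
def prior_scales(data):
    """Return (phi2_tilde, sigma2_tilde) for one study.

    data[g][e] is the list of slide log-ratios for gene g in experiment e.
    """
    n_genes = len(data)
    n_exp = len(data[0])
    n_slides = sum(len(exp) for exp in data[0])  # total slides, sum of S_e
    within = 0.0   # squared deviations of slides from their experiment mean
    between = 0.0  # squared deviations of experiment means from the gene mean
    for gene in data:
        exp_means = [sum(exp) / len(exp) for exp in gene]
        all_values = [y for exp in gene for y in exp]
        # ASSUMPTION: the overall mean is taken over all slide values.
        gene_mean = sum(all_values) / len(all_values)
        within += sum((y - m) ** 2
                      for exp, m in zip(gene, exp_means) for y in exp)
        between += sum((m - gene_mean) ** 2 for m in exp_means)
    phi2 = within / (n_genes * (n_slides - 1))
    sigma2 = between / (n_genes * (n_exp - 1))
    return phi2, sigma2


# Toy example: one gene, two experiments with two slides and one slide.
phi2_tilde, sigma2_tilde = prior_scales([[[1.0, 3.0], [5.0]]])
```

In the toy example the experiment means are 2 and 5 and the overall mean is 3, so the within-experiment sum of squares is 2 and the between-experiment sum of squares is 5, giving scale values of 1.0 and 5.0.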

Prior distributions for parameters specific to Model (1) only

For the parameter p, we assigned a noninformative Uniform(0,1) distribution. The remaining parameters were assigned the following prior distributions, which were as uninformative as possible while still allowing the models to converge.

$$\begin{array}{c} \eta^{2}_{jg0} \sim a s^{2}_{1} / \chi^{2}_{a} \\ c_{j} \sim b s^{2}_{2} / \chi^{2}_{b} \end{array}$$

The degrees of freedom and scale parameters a, \( s^{2}_{1} \), respectively, were assigned so that the prior mean of \( \eta^{2}_{jg0} \) was 1 with variance 0.1, and b, \( s^{2}_{2} \) were assigned so that the prior mean of \( c_{j} \) was 100 with variance 10,000 (further details in Conlon et al. 2007).
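As a worked check, the hyperparameters implied by these moments can be recovered from the standard mean and variance of a scaled inverse chi-squared variable. The values below are derived under the parameterization \( X \sim \nu s^{2} / \chi^{2}_{\nu} \) shown above and are not quoted from the source, which defers the details to Conlon et al. (2007). For \( \nu > 4 \),

$$ E\left( X \right) = \frac{\nu s^{2}}{\nu - 2}, \qquad \mathrm{Var}\left( X \right) = \frac{2 \nu^{2} s^{4}}{\left( \nu - 2 \right)^{2} \left( \nu - 4 \right)} = \frac{2 \, E\left( X \right)^{2}}{\nu - 4}. $$

Solving for the degrees of freedom and the scale gives

$$ \nu = 4 + \frac{2 \, E\left( X \right)^{2}}{\mathrm{Var}\left( X \right)}, \qquad s^{2} = \frac{\left( \nu - 2 \right) E\left( X \right)}{\nu}. $$

With mean 1 and variance 0.1, this yields \( a = 4 + 2/0.1 = 24 \) and \( s^{2}_{1} = 22/24 \approx 0.917 \). With mean 100 and variance 10,000, it yields \( b = 4 + 2 = 6 \) and \( s^{2}_{2} = 400/6 \approx 66.7 \).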

Prior distributions for parameters specific to Model (2) only

We assigned the following prior distributions to the parameters for Model (2).

$$\begin{array}{c} \mu_{jg1} \sim N\left( m_{1}, \delta^{2}_{jg1} \right) \left[ \text{truncated} \left( 0, \infty \right) \right] \\ \mu_{jg2} \sim N\left( m_{2}, \delta^{2}_{jg2} \right) \left[ \text{truncated} \left( -\infty, 0 \right) \right] \\ \delta^{2}_{jgi} \sim IG\left( u_{i}, v_{i} \right), \; i = 1, 2 \\ \tau^{2}_{jgi} \sim IG\left( u_{i}, v_{i} \right), \; i = 1, 2 \end{array}$$

Here, the hyperparameters \( m_{1} \) and \( m_{2} \) were set equal to the trimmed means of the highest and lowest p percent of data, respectively, based on the average of the individual study analyses (see also Shen et al. 2004; Conlon et al. 2007). The hyperparameters \( u_{1} \) and \( u_{2} \) were set equal to 3, and \( v_{1} \) and \( v_{2} \) were set equal to 2 × (trimmed variance) for the highest and lowest p percent of data, respectively, based on the average of the individual study analyses (see also Shen et al. 2004; Conlon et al. 2007). Based on these specifications, the prior means of \( \delta^{2}_{jgi} \) and \( \tau^{2}_{jgi} \), i = 1, 2, were equal to the variance of the top genes, with small variance, similar to Jung et al. (2006). Note that we used some information from the data but kept the priors as uninformative as possible. For p, we assigned an uninformative Dirichlet(\( \alpha_{0}, \alpha_{1}, \alpha_{2} \)) prior with \( \alpha_{0} = \alpha_{1} = \alpha_{2} = 1 \).
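The statement about the prior means can be verified with the moments of the inverse gamma distribution. The check below assumes that \( IG\left( u, v \right) \) denotes shape u and scale v, with density proportional to \( x^{-\left( u + 1 \right)} e^{-v/x} \); the source does not state which parameterization was used. Writing \( s^{2}_{\text{trim}} \) (a label introduced here) for the trimmed variance, so that \( v_{i} = 2 s^{2}_{\text{trim}} \) and \( u_{i} = 3 \),

$$ E\left( \delta^{2}_{jgi} \right) = \frac{v_{i}}{u_{i} - 1} = \frac{2 s^{2}_{\text{trim}}}{2} = s^{2}_{\text{trim}}, \qquad \mathrm{Var}\left( \delta^{2}_{jgi} \right) = \frac{v_{i}^{2}}{\left( u_{i} - 1 \right)^{2} \left( u_{i} - 2 \right)} = \left( s^{2}_{\text{trim}} \right)^{2}. $$

Under this assumption the prior mean equals the trimmed variance, and the prior variance is its square, which is small whenever the trimmed variance is below 1. The same calculation applies to \( \tau^{2}_{jgi} \).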

Full conditional posterior distributions

Each parameter was sampled from its full conditional posterior distribution by an MCMC algorithm using the WinBUGS software (Spiegelhalter et al. 2003; Lunn 2003). For the two-study simulation and biological data sets, convergence was achieved after approximately 1,000 iterations for both models. For the five-study simulation data sets, Model (1) converged after approximately 5,000 iterations; however, Model (2) required approximately 30,000 iterations to converge. We used 5,000 iterations after convergence for all posterior inference, which was more than sufficient (similar to Tseng et al. 2001; Conlon et al. 2004, 2006, 2007). Further MCMC algorithm details can be found in Conlon et al. (2006, 2007).

About this article

Cite this article

Conlon, E.M. A Bayesian mixture model for metaanalysis of microarray studies. Funct Integr Genomics 8, 43–53 (2008). https://doi.org/10.1007/s10142-007-0058-3

  • DOI: https://doi.org/10.1007/s10142-007-0058-3
