Robust Bayesian FDR Control Using Bayes Factors, with Applications to Multi-tissue eQTL Discovery

Abstract

Motivated by the genomic application of expression quantitative trait loci (eQTL) mapping, we propose a new procedure to perform simultaneous testing of multiple hypotheses using Bayes factors as input test statistics. One of the most significant features of this method is its robustness in controlling the targeted false discovery rate even under misspecifications of parametric alternative models. Moreover, the proposed procedure is highly computationally efficient, which is ideal for treating both complex system and big data in genomic applications. We discuss the theoretical properties of the new procedure and demonstrate its power and computational efficiency in applications of single-tissue and multi-tissue eQTL mapping.

This is a preview of subscription content, access via your institution.

References

  1. 1.

    Barreiro LB, Tailleux L, Pai AA, Gicquel B, Marioni JC, Gilad Y (2012) Deciphering the genetic architecture of variation in the immune response to Mycobacterium tuberculosis infection. Proc Natl Acad Sci USA 109(4):1204–1209

    Article  Google Scholar 

  2. 2.

    Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57(1):289–300

    MathSciNet  MATH  Google Scholar 

  3. 3.

    Cox DR, Hinkley D (1979) Theoretical statistics. Chapman & Hall, London

    Google Scholar 

  4. 4.

    De la Cruz O, Wen X, Ke B, Song M, Nicolae DL (2010) Gene, region and pathway level analyses in whole-genome studies. Genet Epidemiol 34(3):222–231

    Google Scholar 

  5. 5.

    DiCiccio TJ, Kass RE, Raftery A, Wasserman L (1997) Computing Bayes factors by combining simulation and asymptotic approximations. J Am Stat Assoc 92(439):903–915

    MathSciNet  Article  Google Scholar 

  6. 6.

    Dimas AS, Deutsch S, Stranger BE, Montgomery SB et al (2009) Common regulatory variation impacts gene expression in a cell type-dependent manner. Science 325(5945):1246–1250

    Article  Google Scholar 

  7. 7.

    Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical bayes analysis of a microarray experiment. J Am Stat Assoc 96(456):1151–1160

    MathSciNet  Article  Google Scholar 

  8. 8.

    Flutre T, Wen X, Pritchard J, Stephens M (2013) A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genet 9(5):e1003486

    Article  Google Scholar 

  9. 9.

    Genovese C, Wasserman L (2004) A stochastic process approach to false discovery control. Ann Stat 32(3):1035–1061

    MathSciNet  Article  Google Scholar 

  10. 10.

    Good I (1992) The bayes/non-bayes compromise: a brief review. J Am Stat Assoc 87(419):597–606

    MathSciNet  Article  Google Scholar 

  11. 11.

    Ji Y, Lu Y, Mills GB (2008) Bayesian models based on test statistics for multiple hypothesis testing problems. Bioinformatics 24(7):943–949

    Article  Google Scholar 

  12. 12.

    Johnson VE (2005) Bayes factors based on test statistics. J R Stat Soc Ser B 67(5):689–701

    MathSciNet  Article  Google Scholar 

  13. 13.

    Johnson VE (2008) Properties of Bayes factors based on test statistics. Scand J Stat 35(2):354–368

    MathSciNet  Article  Google Scholar 

  14. 14.

    Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90(430):773–795

    MathSciNet  Article  Google Scholar 

  15. 15.

    Liang F, Paulo R, Molina G, Clyde MA, Berger JO (2008) Mixtures of g priors for Bayesian variable selection. J Am Stat Assoc 103(481):410–423

    MathSciNet  Article  Google Scholar 

  16. 16.

    Müller P, Parmigiani G, Robert C, Rousseau J (2004) Optimal sample size for multiple testing: the case of gene expression microarrays. J Am Stat Assoc 99(468):990–1001

    MathSciNet  Article  Google Scholar 

  17. 17.

    Müller P, Parmigiani G, Rice K (2006) FDR and Bayesian multiple comparisons rules. In: Bayesian statistics 8, vol 0. Oxford University Press, p 349–370

  18. 18.

    Newton MA, Noueiry A, Sarkar D, Ahlquist P (2004) Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5(2):155–76

    Article  Google Scholar 

  19. 19.

    Opgen-Rhein R, Strimmer K (2007) Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach. Stat Appl Genet Mol Biol 6

  20. 20.

    Raftery AE (1996) Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83(2):251–266

    MathSciNet  Article  Google Scholar 

  21. 21.

    Saville BR, Herring AH (2009) Testing random effects in the linear mixed model using approximate bayes factors. Biometrics 65(2):369–376

    MathSciNet  Article  Google Scholar 

  22. 22.

    Servin B, Stephens M (2007) Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet 3(7):e114

    Article  Google Scholar 

  23. 23.

    Storey JD (2002) A direct approach to false discovery rates. J R Stat Soc Ser B 64(3):479–498

    MathSciNet  Article  Google Scholar 

  24. 24.

    Storey JD (2003) The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Stat 31(6):2013–2035

    MathSciNet  Article  Google Scholar 

  25. 25.

    Storey JD (2007) The optimal discovery procedure: a new approach to simultaneous significance testing. J R Stat Soc Ser B 69(3):347–368

    MathSciNet  Article  Google Scholar 

  26. 26.

    Storey JD, Tibshirani R (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100(16):9440–9445

    MathSciNet  Article  Google Scholar 

  27. 27.

    Storey JD, Taylor JE, Siegmund D (2004) Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J R Stat Soc Ser B 66(1):187–205

    MathSciNet  Article  Google Scholar 

  28. 28.

    Sun W, Cai TT (2007) Oracle and adaptive compound decision rules for false discovery rate control. J Am Stat Assoc 102(479):901–912

    MathSciNet  Article  Google Scholar 

  29. 29.

    Wakefield J (2009) Bayes factors for genome-wide association studies: comparison with P-values. Genet Epidemiol 33(1):79–86

    Article  Google Scholar 

  30. 30.

    Wen X (2014) Bayesian model selection in complex linear systems, as illustrated in genetic association studies. Biometrics 70(1):73–83

    MathSciNet  Article  Google Scholar 

  31. 31.

    Wen X, Stephens M (2014) Bayesian methods for genetic association analysis with heterogeneous subgroups: from meta-analyses to gene-environment interactions. Ann Appl Stat 8(1):176–203

    MathSciNet  Article  Google Scholar 

  32. 32.

    Whittemore AS (2007) A Bayesian false discovery rate for multiple testing. J Appl Stat 34(1):1–9

    MathSciNet  Article  Google Scholar 

Download references

Acknowledgments

We thank Debashis Ghosh, Matthew Stephens, and Timothee Flutre for their fruitful discussion and feedbacks. We are grateful for the insightful comments from the two anonymous reviewers. This work is supported by the NIH Grant R01 MH101825 (PI: M.Stephens).

Author information

Affiliations

Authors

Corresponding author

Correspondence to Xiaoquan Wen.

Appendices

Appendix 1: Proof of Proposition 1

Proof

Suppose that the true generative distributions of \({{\varvec{Y}}}_i\) under the null and alternative model are given by \(p^{0^*}_i\) and \(p^{1^*}_i\), respectively. As sample size \(n_i\) is sufficiently large for each test, the true Bayes factor, \(\mathrm{BF}_i^*\), has the following properties

$$\begin{aligned} \begin{array}{ll} &{}\mathrm{BF}_i^* \xrightarrow {P} 0,~\text{ if } {{\varvec{Y}}}_i \sim p^{0^*}_i, \\ &{}~~~~~~~~~\text {and} \\ &{} \mathrm{BF}_i^* \xrightarrow {P} \infty ,~\text{ if } {{\varvec{Y}}}_i \sim p^{1^*}_i.\\ \end{array} \end{aligned}$$
(6.1)

Assumption 1, in contrast, only requires that the assumed Bayes factors satisfy

$$\begin{aligned} \mathrm{BF}_i \xrightarrow {P} 0,~\text{ if } {{\varvec{Y}}}_i \sim p^{0^*}_i. \end{aligned}$$
(6.2)

Under the conditions stated in Assumption 2, \(\Pr \left( {\hat{\pi }}_0 \ge \pi _0 \mid \pi _0 \right) \rightarrow 1\) implies that \(\Pr \left( {\hat{\pi }}_0 \ge \pi _0 \mid {{\varvec{\mathcal {Y}}}}\right) \rightarrow 1\) for an arbitrary prior distribution on \(\pi _0\). Let \(\epsilon := \Pr \left( {\hat{\pi }}_0 < \pi _0 \mid {{\varvec{\mathcal {Y}}}}\right) \). Then, it follows that

$$\begin{aligned} \epsilon \xrightarrow {P} 0,~\text{ as } \text{ the } \text{ number } \text{ of } \text{ null } \text{ tests } \text{ is } \text{ sufficiently } \text{ large. } \end{aligned}$$
(6.3)

Importantly, the large number of the null tests is required for ensuring the upper-bound property of \({\hat{\pi }}_0\).

Consequently,

$$\begin{aligned} \begin{aligned} \Pr (Z_i = 0 \mid {{\varvec{\mathcal {Y}}}})&= \int \frac{\pi _0}{\pi _0 + (1-\pi _0) \mathrm{BF}_i^*} \, p(\pi _0 \mid {{\varvec{Y}}}_i)\,d\,\pi _0 \\&= \int _{\pi _0 \le {\hat{\pi }}_0} \frac{\pi _0}{\pi _0 + (1-\pi _0) \mathrm{BF}_i^*} \, p(\pi _0 \mid {{\varvec{Y}}}_i)\,d\,\pi _0\\&\quad \,+ \int _{\pi _0 > {\hat{\pi }}_0} \frac{\pi _0}{\pi _0 + (1-\pi _0) \mathrm{BF}_i^*} \, p(\pi _0 \mid {{\varvec{Y}}}_i)\,d\,\pi _0 \\&\le \frac{{\hat{\pi }}_0}{{\hat{\pi }}_0 + (1- {\hat{\pi }}_0) \mathrm{BF}_i^*} + \epsilon \end{aligned} \end{aligned}$$
(6.4)

By (6.1), as \(n_i \rightarrow \infty \),

$$\begin{aligned} \frac{{\hat{\pi }}_0}{{\hat{\pi }}_0 + (1- {\hat{\pi }}_0) \mathrm{BF}_i^*} \xrightarrow {P} I\{{{\varvec{Y}}}_i \sim p^{0*}_i \} + 0 \cdot I\{{{\varvec{Y}}}_i \sim p^{1*}_i \} \end{aligned}$$
(6.5)

whereas by (6.2),

$$\begin{aligned} \begin{array}{ll} &{}\dfrac{{\hat{\pi }}_0}{{\hat{\pi }}_0 + (1- {\hat{\pi }}_0) \mathrm{BF}_i} \xrightarrow {P} 1, ~ \text{ if } ~ {{\varvec{Y}}}_i \sim p^{0*}_i, \\ &{}\dfrac{{\hat{\pi }}_0}{{\hat{\pi }}_0 + (1- {\hat{\pi }}_0) \mathrm{BF}_i} \ge 0, ~ \text{ if } ~ {{\varvec{Y}}}_i \sim p^{1*}_i. \end{array} \end{aligned}$$
(6.6)

Hence,

$$\begin{aligned} \lim _{n \rightarrow \infty } \Pr \left( \frac{{\hat{\pi }}_0}{{\hat{\pi }}_0 + (1- {\hat{\pi }}_0) \mathrm{BF}_i} \ge \frac{{\hat{\pi }}_0}{{\hat{\pi }}_0 + (1- {\hat{\pi }}_0) \mathrm{BF}_i^*} \right) = 1, \end{aligned}$$
(6.7)

and by (6.3) and (6.4), each individual test satisfies

$$\begin{aligned} \lim _{n_i \rightarrow \infty } \Pr \left( (1 - {\hat{v}}_i) \ge \Pr (Z_i = 0 \mid {{\varvec{Y}}}_i) \right) = 1 \end{aligned}$$
(6.8)

The decision rule stated in the Proposition 1 yields a true Bayesian FDR

$$\begin{aligned} \overline{\text {FDR}} = \frac{\sum _{i=1}^m \delta ^\dagger _i \Pr (Z_i =0 \mid {{\varvec{Y}}}_i)}{D^\dagger \vee 1}. \end{aligned}$$
(6.9)

By (6.8), it is clear that

$$\begin{aligned} \lim _{n \rightarrow \infty } \Pr \left( \overline{\text {FDR}}\le \frac{\sum _{i=1}^m \delta ^\dagger _i (1-{\hat{v}}_i)}{D^\dagger \vee 1} \right) = 1. \end{aligned}$$
(6.10)

Therefore, the Bayesian FDR can be consistently controlled. Furthermore, controlling \(\frac{\sum _{i=1}^m \delta ^\dagger _i (1-{\hat{v}}_i)}{D^\dagger \vee 1} \le \alpha \) ensures that

$$\begin{aligned} \text {FDR}= \mathrm{E}(\overline{\text {FDR}}) \le \mathrm{E}\left( \frac{\sum _{i=1}^m \delta ^\dagger _i (1-{\hat{v}}_i)}{D^\dagger \vee 1} \right) \le \alpha , \end{aligned}$$
(6.11)

because \(\overline{\text {FDR}}\) is obviously bounded. \(\square \)

Appendix 2: Proof of Lemma 2

Proof

Let \((\mathrm{BF}_{(1)},\dots ,\mathrm{BF}_{(m)})\) denote the order statistics from the m Bayes factors that are all generated from the respective null hypotheses. Let \(M_j := \frac{1}{j} \sum _{i=1}^j \mathrm{BF}_{(i)}\) denote the partial sample mean computed by the EBF procedure. Note that the sequence \(M_1, M_2, \dots \) is monotonically non-decreasing. Furthermore, by the law of large numbers and the result of Lemma 1, it follows that

$$\begin{aligned} M_m \xrightarrow {P} 1, \end{aligned}$$
(6.12)

for sufficiently large m.

In the case that \(d_0 < m\), it must be true that

$$\begin{aligned} 1 \le M_{d_0+1} \le M_m \xrightarrow {P} 1. \end{aligned}$$

Because the truncated sample mean from the order statistics \((\mathrm{BF}_{(d_0+1)}, ... , \mathrm{BF}_{(m)})\) converges to a quantity that is strictly >1, their contribution to the overall sample mean, \(M_m\), must be negligible, i.e.,

$$\begin{aligned} \frac{m - d_0}{m} \, \left( \frac{1}{m-d_0}\sum _{j=d_0+1}^m \mathrm{BF}_{(j)} \right) \xrightarrow {P} 0. \end{aligned}$$

Taken together, we have shown that

$$\begin{aligned} {\hat{\pi }}_0 \, = \, \frac{d_0}{m} \, \xrightarrow {P} \, 1. \end{aligned}$$
(6.13)

\(\square \)

Appendix 3: Proof of Proposition 2

Proof

In the general mixture case, let \(\mathcal {S}^0\) denote the subset of Bayes factors whose data are generated from the null models. Based on Lemma 2, applying the EBF procedure on \(\mathcal {S}^0\) results in an estimate

$$\begin{aligned} \frac{d_{\mathcal {S}^0}}{ |\mathcal {S}^0|} \xrightarrow {P} 1 , \end{aligned}$$
(6.14)

where \(|\mathcal {S}^0|\) denotes the cardinality of \(\mathcal {S}^0\). In the mixed samples, (6.14) suggests that

$$\begin{aligned} M_{d_{\mathcal {S}^0}} = \frac{1}{d_{\mathcal {S}^0}} \, \sum _{j=1}^{d_{\mathcal {S}^0}} \mathrm{BF}_{(j)} \le 1. \end{aligned}$$
(6.15)

The LHS should be strictly <1 if there exists small values of Bayes factors from the alternative models.

Because the EBF procedure finds the largest subset whose sample mean is <1, it must hold true that

$$\begin{aligned} \Pr ( d_0 \ge d_{\mathcal {S}^0} \mid \pi _0 ) \rightarrow 1, \end{aligned}$$
(6.16)

and thus we conclude that

$$\begin{aligned} \Pr ({\hat{\pi }}_0 \ge \pi _0 \mid \pi _0 ) \rightarrow 1. \end{aligned}$$
(6.17)

\(\square \)

Appendix 4: Proof of Corollary 1

Proof

Given a predefined FDR level \(\alpha \), the rejection threshold on the estimated false discovery probabilities is given by

$$\begin{aligned} t^\dagger _{\alpha } = \arg \min _t \left( \frac{\sum _{i=1}^m \delta _i^\dagger (t) (1-{\hat{v}}_i )}{D^\dagger (t) \vee 1} \le \alpha \right) . \end{aligned}$$

Equivalently when there is at least one rejection, the above rejection threshold implies that the rejection set \(\Omega := \{i: {\hat{v}}_i > t^\dagger _{\alpha }\}\) is the largest set such that

$$\begin{aligned} \frac{\sum _{ i \in \Omega } \Pr (Z_i =0 \mid {{\varvec{Y}}}_i, {\hat{\pi }}_0)}{|| \Omega || } \le \alpha , \end{aligned}$$

where \(\Pr (Z_i = 0 \mid {{\varvec{Y}}}_i, {\hat{\pi }}_0) = (1-{\hat{v}}_i)\) and \(|| \Omega || = D^\dagger (t_{\alpha }^\dagger )\) denotes the cardinality of the set \(\Omega \). That is, the average estimated false rejection probability in the rejection set should be \(\le \alpha \). Consequently, it implies that if the i-th test yields \(\Pr (Z_i = 0 \mid {{\varvec{Y}}}_i, {\hat{\pi }}_0) \le \alpha \), it must be included in the rejection set (because \(\Omega \) is the largest such set). Therefore, to prove the corollary, it is sufficient to show that \(\Pr (Z_i=0 \mid {{\varvec{Y}}}_i, {\hat{\pi }}_0) \le \alpha \), provided that \(\mathrm{BF}_i \ge \frac{m}{\alpha }\).

Applying the EBF procedure, a single Bayes factor with the value exceeding m leads to

$$\begin{aligned} {\hat{\pi }}_0 \le 1- \frac{1}{m}. \end{aligned}$$
(6.18)

This implies

$$\begin{aligned} \Pr (Z_i= & {} 0 \mid {{\varvec{Y}}}_i, {\hat{\pi }}_0) \le \frac{ 1- \frac{1}{m} }{(1- \frac{1}{m}) + \frac{1}{m} \mathrm{BF}_i} \le \frac{m-1}{m+(m-1) \alpha }\nonumber \\&\cdot \alpha< \frac{m-1}{ m -1 + m \alpha } \cdot \alpha < \alpha \end{aligned}$$
(6.19)

\(\square \)

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Wen, X. Robust Bayesian FDR Control Using Bayes Factors, with Applications to Multi-tissue eQTL Discovery. Stat Biosci 9, 28–49 (2017). https://doi.org/10.1007/s12561-016-9153-0

Download citation

Keywords

  • False Discovery Rate
  • Null Model
  • eQTL Mapping
  • False Discovery Rate Control
  • False Discovery Rate Level