Abstract
Meta-analysis methods are used to synthesize results of multiple studies on the same topic. The most frequently used statistical model in meta-analysis is the random-effects model containing parameters for the overall effect, the between-study variance in the primary studies' true effect sizes, and random effects for the study-specific effects. We propose Bayesian hypothesis testing and estimation methods using the marginalized random-effects meta-analysis (MAREMA) model where the study-specific true effects are regarded as nuisance parameters which are integrated out of the model. We propose using a flat prior distribution on the overall effect size in case of estimation and a proper unit-information prior for the overall effect size in case of hypothesis testing. For the between-study variance (which can attain negative values under the MAREMA model), a proper uniform prior is placed on the proportion of total variance that can be attributed to between-study variability. Bayes factors are used for hypothesis testing that allow testing point and one-sided hypotheses. The proposed methodology has several attractive properties. First, the proposed MAREMA model encompasses models with a zero, negative, and positive between-study variance, which enables testing a zero between-study variance as it is not a boundary problem. Second, the methodology is suitable for default Bayesian meta-analyses as it requires no prior information about the unknown parameters. Third, the proposed Bayes factors can even be used in the extreme case when only two studies are available because Bayes factors are not based on large sample theory. We illustrate the developed methods by applying them to two meta-analyses and introduce easy-to-use software in the R package BFpack to compute the proposed Bayes factors.
Introduction
The rapidly expanding scientific literature calls for methods to synthesize research such as a systematic review. Meta-analysis is an important part of a systematic review and refers to applying statistical methods to combine findings of different studies on the same topic. The first step of a meta-analysis is to obtain a standardized effect size (e.g., standardized mean difference or correlation coefficient) for each included study and these effect sizes are subsequently combined by means of meta-analysis methods to summarize the included studies. Meta-analysis is nowadays seen as the gold standard for research synthesis (Aguinis et al., 2011; Head et al., 2015) and is often used for policy making as its results are seen as best available evidence (Thompson & Sharp, 1999; Cordray & Morphy, 2009; Hedges & Olkin, 1985).
The random-effects model is the most commonly used statistical model in meta-analysis (Borenstein et al., 2010; 2009). In the random-effects model, each study is assumed to have an unknown true underlying effect size. The main parameter of interest in this model is the overall effect size of the studies included in the meta-analysis. However, estimation and statistical inference for the between-study variance in the random-effects model are just as important (Higgins et al., 2009) because both parameters focus on distinct and relevant aspects of a meta-analysis. For example, the overall effect size in a meta-analysis on the efficacy of a psychological treatment refers to the average efficacy of the treatment across studies, and the between-study variance quantifies the heterogeneity across studies.
The vast majority of meta-analyses use frequentist analysis techniques for estimation and drawing statistical inferences. However, Bayesian meta-analysis methods have been proposed as well and have gained in popularity (Xu et al., 2008). Bayesian methods are especially well suited for analyzing meta-analytic data (Smith et al., 1995; Sutton & Abrams, 2001; Lunn et al., 2013; Turner & Higgins, 2019) because the multilevel structure of a random-effects meta-analysis can be straightforwardly taken into account. Moreover, estimation of the between-study variance is imprecise using frequentist meta-analysis methods in the common situation of a meta-analysis containing a small number of studies (Chung, Rabe-Hesketh, & Choi, 2013; Sidik & Jonkman, 2007; Kontopantelis, Springate, & Reeves, 2013). Bayesian meta-analysis methods are advantageous in this situation because (i) externally available information about the between-study variance can be incorporated in the prior distribution if available, and (ii) the methodology does not directly rely on large sample theory.
Meta-analysts generally want to estimate and conduct hypothesis tests for the parameters in the random-effects model. The vast majority of the literature on Bayesian meta-analysis methods has focused on parameter estimation using either empirical Bayes or fully Bayesian estimation (e.g., Turner et al., 2015; Rhodes et al., 2015; Normand, 1999; Lambert et al., 2002). However, hypothesis testing using Bayes factors, which quantify the evidence for one model relative to another model (Kass & Raftery, 1995), has also been proposed. Berry (1998) proposed a Bayes factor for testing the null hypothesis of no between-study variance in a meta-analysis of studies using 2 × 2 contingency tables. Rouder and Morey (2011) developed a Bayes factor to test the null hypothesis of no effect in a meta-analysis of studies using a two-independent-groups design. Scheibehenne et al. (2017) and Gronau et al. (2017) proposed a Bayesian model averaging approach to compute an average Bayes factor for testing the null hypothesis of no effect. In this approach, an average Bayes factor is computed by averaging over posterior model probabilities obtained with the random-effects model and the equal-effect model (a.k.a. fixed-effect or common-effect model) where the studies' true effect sizes are assumed to be homogeneous.
The first contribution of our paper is that we propose, in contrast to existing Bayesian meta-analysis methods, to use a marginalized random-effects meta-analysis (MAREMA) model rather than the random-effects model. In this MAREMA model, the study-specific true effects are regarded as nuisance parameters and integrated out of the probability density function. The elimination of nuisance parameters via integration is common in integrated likelihood methods (e.g., Berger et al., 1999), and has recently been extended to marginal random intercept models (Mulder & Fox, 2013, 2019) and marginal item response theory models (Fox et al., 2017).
The proposed MAREMA model encompasses three important meta-analysis models. First, the equal-effect model is included if the between-study variance is equal to zero. Second, the random-effects model is included when the between-study variance is positive, which implies that the differences between studies' effect sizes cannot be fully explained by sampling error. Third, the random-effects model in case of a negative between-study variance is also included. A negative between-study variance indicates that the differences between studies' effect sizes are smaller than expected based on sampling error. A negative between-study variance may yield relevant insights for meta-analysts because it may indicate that the assumption of independence of primary studies in a meta-analysis is violated or that the computation of the effect sizes of the primary studies is incorrect (Ioannidis et al., 2006).
Our proposed methodology is also distinctive from other Bayesian meta-analysis methods because we place a prior distribution on the proportion of total variance that can be attributed to between-study variance rather than directly on the between-study variance parameter. This proportion is known as I^{2} (Higgins & Thompson, 2002; Higgins et al., 2003) in the meta-analysis literature and is frequently used to quantify the relative heterogeneity around the true effect size (Borenstein et al., 2017). An advantage of placing a prior on I^{2} is that it is a bounded parameter, which enables us to place a proper (noninformative) uniform prior to compute Bayes factors (note that Bayes factors generally cannot be computed using improper priors, while at the same time arbitrarily vague proper priors should also be avoided due to Bartlett's phenomenon; Jeffreys, 1961; Lindley, 1957; Bartlett, 1957). Due to the uniformity of the prior, the proposed Bayes factors can be used for a default Bayesian meta-analysis without requiring external prior knowledge about the model parameters.
The proposed Bayes factors enable testing point and one-sided hypotheses. Examples of a point and a one-sided hypothesis are testing whether the overall effect size in a meta-analysis equals zero (i.e., H : μ = 0) or is larger than zero (i.e., H : μ > 0). Moreover, the proposed Bayes factors also enable testing multiple hypotheses simultaneously (e.g., H : μ = 0 vs. H : μ > 0 vs. H : μ < 0). Another attractive property is that the quantification of relative evidence between hypotheses is exact even in the extreme case of only two studies in a meta-analysis, as Bayes factors do not rely on large sample theory.
The outline of this paper is as follows. We continue by further introducing the MAREMA model and illustrate Bayesian estimation under the MAREMA model. Subsequently, we introduce Bayes factor testing under the MAREMA model and elaborate on the specification of the prior distributions. In Section “Bayesian hypothesis testing in examples”, the MAREMA model is used to compute Bayes factors for a meta-analysis on the efficacy of two treatments for posttraumatic stress disorder (PTSD) and a meta-analysis on the effect of a smartphone application and cessation support on long-term smoking abstinence. We conclude the paper with a discussion section.
Marginalized random-effects meta-analysis model
The conventional random-effects meta-analysis model (Borenstein et al., 2009; Konstantopoulos & Hedges, 2019) assumes that i = 1, 2, ..., k independent effect sizes are extracted from studies on the same topic. The statistical model of the random-effects model can be written as

$$ y_i = \theta_i + \varepsilon_i, \quad \theta_i \sim N(\mu, \tau^2), \quad \varepsilon_i \sim N(0, \sigma_i^2), \tag{1} $$
where y_{i} is the observed effect size of the i th study, 𝜃_{i} is the study-specific (unknown) true effect size, μ is the overall true effect size, τ^{2} is the between-study variance in true effect size, and \({\sigma _{i}^{2}}\) is the known sampling variance of the i th study, which is conventionally estimated in practice and then assumed to be known in the analysis. The random-effects model simplifies to an equal-effect model if τ^{2} = 0 because all studies then share a common true effect size. Note that other distributions for the random effects than the normal distribution have been proposed (e.g., Baker & Jackson, 2008; 2016; Lee & Thompson, 2008), but a normal distribution for the random effects remains the default choice in almost any random-effects meta-analysis.
The study-specific true effects 𝜃_{i} are generally treated as nuisance parameters in the random-effects model. We integrate these out of the random-effects model in Eq. 1 to obtain the MAREMA model, which is given by (see also Raudenbush & Bryk, 1985, for instance)

$$ y_i \sim N(\mu, \; \tau^2 + \sigma_i^2). \tag{2} $$
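To make the marginal model in Eq. 2 concrete, its log-likelihood can be evaluated directly: after integrating out the study-specific effects, only μ and τ² remain as free parameters. The sketch below is our own illustration (the toy data and all variable names are assumptions, not taken from the paper):

```python
import math

def marema_loglik(y, v, mu, tau2):
    """Marginal log-likelihood of the MAREMA model: after integrating out the
    study-specific true effects, y[i] ~ N(mu, tau2 + v[i]) independently,
    where v[i] is the (assumed known) sampling variance of study i."""
    ll = 0.0
    for yi, vi in zip(y, v):
        s2 = tau2 + vi  # total marginal variance of study i
        ll += -0.5 * (math.log(2.0 * math.pi * s2) + (yi - mu) ** 2 / s2)
    return ll

# Hypothetical toy data: three effect sizes with their sampling variances.
y = [0.30, 0.10, 0.25]
v = [0.04, 0.05, 0.03]
print(marema_loglik(y, v, mu=0.2, tau2=0.01))
```

Because the random effects are marginalized out, the likelihood is a simple product of univariate normal densities, which is what makes estimation and marginal-likelihood computations under the MAREMA model tractable.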
Multiple estimators have been proposed for estimating the between-study variance τ^{2} (Veroniki et al., 2016; Langan et al., 2016). Estimates of τ^{2} cannot be compared across meta-analyses if these meta-analyses used different effect size measures. That is, a τ^{2} estimate in a meta-analysis of standardized mean differences cannot be compared to one of correlation coefficients, and this was one of the reasons to develop I^{2} that will be described next (Higgins & Thompson, 2002).
Quantifying heterogeneity using I^{2}
A commonly used way to quantify the relative heterogeneity in a meta-analysis is using I^{2} (Higgins & Thompson, 2002; Higgins et al., 2003),

$$ I^2 = \frac{\tau^2}{\tau^2 + \tilde{\sigma}^2}, \tag{3} $$
where \(\tilde {\sigma }^{2}\) is the typical within-study sampling variance that is computed with

$$ \tilde{\sigma}^2 = \frac{(k-1) \sum_{i=1}^{k} w_i}{\left( \sum_{i=1}^{k} w_i \right)^2 - \sum_{i=1}^{k} w_i^2}, \quad w_i = 1/\sigma_i^2. \tag{4} $$
I^{2} has an intuitive interpretation because it is the proportion of total variance that can be attributed to between-study variance in true effect size. Note that I^{2} resembles the intraclass correlation coefficient (ICC) that is routinely reported in multilevel analysis. This ICC indicates the proportion of total variance that can be attributed to taking into account the dependence of observations within the level 2 units (e.g., Hox et al., 2018). However, a major difference between I^{2} and the ICC is that the total variance in the computation of I^{2} (i.e., \(\tau ^{2}+\tilde {\sigma }^{2}\)) is a function of the studies' sample sizes whereas the ICC is not a function of the sample size of the level 2 units. Hence, I^{2} artificially increases if the sample sizes of the primary studies increase while τ^{2} remains constant (Rücker et al., 2008).
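As a worked illustration, the sketch below computes the Higgins–Thompson typical within-study sampling variance and the resulting I² for a set of hypothetical sampling variances (the data and variable names are our own, purely illustrative choices):

```python
def typical_variance(v):
    """Higgins & Thompson (2002) 'typical' within-study sampling variance:
    sigma2_tilde = (k - 1) * sum(w) / (sum(w)^2 - sum(w^2)), with w_i = 1 / v_i."""
    w = [1.0 / vi for vi in v]
    k = len(v)
    sw = sum(w)
    sw2 = sum(wi ** 2 for wi in w)
    return (k - 1) * sw / (sw ** 2 - sw2)

def i_squared(tau2, v):
    """I^2: proportion of total variance attributable to between-study variance."""
    s2_tilde = typical_variance(v)
    return tau2 / (tau2 + s2_tilde)

# Hypothetical sampling variances of three studies.
v = [0.04, 0.05, 0.03]
print(typical_variance(v))      # lies between min(v) and max(v) here
print(i_squared(0.04, v))
```

A useful sanity check on the formula: when all sampling variances are equal, the typical variance reduces to that common value.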
Next, we reparameterize the MAREMA model in Eq. 2 using I^{2}. We replace I^{2} with the Greek letter ρ to make explicit that it is an unknown parameter that can attain negative values. The MAREMA model can then equivalently be written as

$$ y_i \sim N\left( \mu, \; \sigma_i^2 + \tilde{\sigma}^2 \frac{\rho}{1-\rho} \right), \tag{5} $$
where \(\tilde {\sigma }^{2} \rho /(1-\rho )\) transforms ρ to its corresponding τ^{2}. In Eq. 5, ρ must lie in the interval (ρ_{min},1) with

$$ \rho_{min} = \frac{-\sigma^2_{min}}{\tilde{\sigma}^2 - \sigma^2_{min}}, \tag{6} $$
where \(\sigma ^{2}_{min}\) is the smallest sampling variance of the studies included in the meta-analysis. ρ_{min} is the smallest possible value of the parameter ρ given the observed data and is always negative. Note that the special case ρ = 0 (i.e., the equal-effect model) does not lie on the boundary of the parameter space, which enables testing the hypothesis of homogeneous true effect size.
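The reparameterization and its lower bound can be sketched in code. One way to read the bound is that it enforces a positive marginal variance for every study, i.e., τ² + σ²_min > 0; under that reading, solving for ρ gives the expression used below. The numeric values and names are illustrative assumptions:

```python
def rho_to_tau2(rho, s2_tilde):
    """Map relative heterogeneity rho to tau^2: tau2 = s2_tilde * rho / (1 - rho)."""
    return s2_tilde * rho / (1.0 - rho)

def tau2_to_rho(tau2, s2_tilde):
    """Inverse map: rho = tau2 / (tau2 + s2_tilde)."""
    return tau2 / (tau2 + s2_tilde)

def rho_min(v, s2_tilde):
    """Smallest admissible rho: requiring tau2 + min(v) > 0 and solving
    s2_tilde * rho / (1 - rho) = -min(v) gives rho_min = -min(v) / (s2_tilde - min(v))."""
    v_min = min(v)
    return -v_min / (s2_tilde - v_min)

# Hypothetical sampling variances and their Higgins-Thompson typical variance.
v = [0.04, 0.05, 0.03]
s2_tilde = 0.0391667
r = rho_min(v, s2_tilde)
print(r)                                  # negative, as stated in the text
print(rho_to_tau2(r, s2_tilde) + min(v))  # ~0: at the bound the smallest marginal variance vanishes
```

Note that the sketch assumes unequal sampling variances; with identical variances the bound degenerates (the constraint never binds).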
Bayesian estimation under the MAREMA model
The MAREMA model can be estimated using flat priors if prior information is absent,

$$ \pi(\mu) = U(-\infty, \infty), \quad \pi(\rho) = U(\rho_{min}, 1), \tag{7} $$
where U refers to the uniform distribution. Flat priors are used for estimation to minimize the impact of the priors on the results. We illustrate Bayesian estimation under the MAREMA model by analyzing data of two meta-analyses. The first meta-analysis is by Ho and Lee (2012) on the efficacy of eye movement desensitization and reprocessing (EMDR) therapy versus exposure-based cognitive behavior therapy (CBT) to treat PTSD. This meta-analysis consists of ten standardized mean differences (i.e., Hedges' g) and a positive effect size indicates that EMDR therapy is more efficacious than CBT. The second meta-analysis is by Whittaker et al. (2019) on the difference between using a smartphone app for smoking cessation support and lower intensity support on long-term abstinence. Three studies are included in this meta-analysis and the log risk ratio is the effect size measure of interest. A positive log risk ratio indicates that using a smartphone app for smoking cessation support yields more long-term abstinence than lower intensity support. These examples were selected to illustrate that the proposed methodology can be used for different effect size measures and for meta-analyses with only a small number of primary studies, which is especially common in medical research (Rhodes et al., 2015; Turner et al., 2015).
A Gibbs sampler is proposed for Bayesian estimation under the MAREMA model (see Appendix A). We also analyzed the data of these meta-analyses with a frequentist random-effects meta-analysis using the restricted maximum likelihood estimator (Raudenbush, 2009) for estimating τ^{2} as implemented in the R package metafor (Viechtbauer, 2010). R code of these analyses is available at https://osf.io/jcge7/.
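The authors' Gibbs sampler is derived in their Appendix A and is not reproduced here. For intuition only, a generic random-walk Metropolis sampler targeting the same flat-prior MAREMA posterior can be sketched as follows; the toy data, tuning constants, and the choice of a Metropolis (rather than Gibbs) scheme are our own illustrative assumptions, and the sketch assumes unequal sampling variances so that the lower bound on ρ is finite:

```python
import math, random

def marema_log_post(mu, rho, y, v, s2_tilde, r_min):
    """Log-posterior under flat priors: proportional to the MAREMA likelihood
    on the admissible region r_min < rho < 1."""
    if not (r_min < rho < 1.0):
        return -math.inf
    tau2 = s2_tilde * rho / (1.0 - rho)
    ll = 0.0
    for yi, vi in zip(y, v):
        s2 = tau2 + vi
        if s2 <= 0.0:
            return -math.inf
        ll += -0.5 * (math.log(2.0 * math.pi * s2) + (yi - mu) ** 2 / s2)
    return ll

def sample_marema(y, v, n_iter=5000, step=0.2, seed=1):
    """Random-walk Metropolis draws of (mu, rho) under the flat-prior MAREMA model."""
    random.seed(seed)
    w = [1.0 / vi for vi in v]
    k = len(v)
    s2_tilde = (k - 1) * sum(w) / (sum(w) ** 2 - sum(wi ** 2 for wi in w))
    r_min = -min(v) / (s2_tilde - min(v))
    mu, rho = 0.0, 0.0
    lp = marema_log_post(mu, rho, y, v, s2_tilde, r_min)
    draws = []
    for _ in range(n_iter):
        mu_new = mu + random.gauss(0.0, step)
        rho_new = rho + random.gauss(0.0, step)
        lp_new = marema_log_post(mu_new, rho_new, y, v, s2_tilde, r_min)
        # Metropolis acceptance step.
        if lp_new >= lp or random.random() < math.exp(lp_new - lp):
            mu, rho, lp = mu_new, rho_new, lp_new
        draws.append((mu, rho))
    return draws

# Hypothetical toy data, not the PTSD or smoking-cessation data.
draws = sample_marema([0.30, 0.10, 0.25], [0.04, 0.05, 0.03])
print(sum(m for m, _ in draws) / len(draws))  # posterior mean of mu
```

A random-walk sampler like this mixes more slowly than a tailored Gibbs scheme but requires no derivation of full conditional distributions, which makes it a convenient check on results from more specialized implementations.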
The posterior distributions of μ and ρ when fitting the MAREMA model to the meta-analyses by Ho and Lee (2012) (solid lines) and Whittaker et al. (2019) (dashed lines) are presented in Fig. 1. Remarkably, the posterior distributions of ρ (right panel of Fig. 1) are very wide for both meta-analyses. There is also considerable posterior support for negative ρ values. This could suggest that the random-effects model, which is employed for the frequentist analysis, may not be appropriate for these data. We will reflect on the causes and implications of a negative value for ρ in the discussion section.
Parameter estimates obtained with the MAREMA model and the frequentist random-effects model are presented in Table 1. The first row of Table 1 shows the results of the MAREMA model and the second row those of the frequentist random-effects model. The first three columns show the results of estimating μ and the last three columns those of estimating ρ. The metafor package was used for the frequentist meta-analysis, which does not report a standard error for \(\hat {\rho }\).
Parameter estimates of the MAREMA model and the frequentist meta-analysis of the meta-analysis by Ho and Lee (2012) were comparable. However, the estimates for μ were slightly larger under the MAREMA model relative to the frequentist meta-analysis estimate. Furthermore, as expected, the 95% credible interval for ρ under the MAREMA model was considerably wider than the 95% confidence interval of the frequentist meta-analysis, because the random-effects model does not allow negative values for ρ and there is therefore less “room” for ρ to vary. To conclude, EMDR therapy was more efficacious than CBT for treating PTSD, and the heterogeneity was imprecisely estimated close to zero (indicating homogeneity).
For the meta-analysis by Whittaker et al. (2019), parameter estimates for μ under the MAREMA model were approximately zero whereas the frequentist meta-analytic estimate for μ was slightly larger. Furthermore, due to the skewness of the posterior of ρ under the MAREMA model, there is a considerable difference between the posterior mean and posterior mode, where the latter is close to the estimate under the frequentist random-effects model. To conclude, estimates based on the MAREMA model and frequentist meta-analysis differed slightly for the meta-analysis of Whittaker et al. (2019). These differences were probably caused by the meta-analysis containing only three studies.
Given the uncertainty in the unconstrained estimates, it is particularly useful to test precise null hypotheses on the overall effect μ and the relative heterogeneity ρ. Hence, Bayesian hypothesis testing under the MAREMA model is discussed next.
Bayesian hypothesis testing under the MAREMA model
Testing hypotheses plays a fundamental role in scientific research in general and psychological science in particular. In this section, we propose multiple hypothesis tests for the mean μ and the relative between-study heterogeneity ρ separately.
We propose testing hypotheses for both μ and ρ under the MAREMA model. The following hypotheses are tested for μ:

$$ H_0: \mu = 0 \quad \text{vs.} \quad H_1: \mu < 0 \quad \text{vs.} \quad H_2: \mu > 0, \tag{8} $$
where support for H_{0}, H_{1}, or H_{2} indicates that the overall effect μ is equal to zero, is negative, or is positive, respectively. For ρ we test the following hypotheses:

$$ H_0: \rho = 0 \quad \text{vs.} \quad H_1: \rho < 0 \quad \text{vs.} \quad H_2: \rho > 0, \tag{9} $$
where support for H_{0}, H_{2}, or H_{1} indicates a good fit of an equal-effect model, a random-effects model, or a model which assumes less variance than expected based on sampling error (and thus a misfit of the equal-effect or random-effects model), respectively.
Bayes factors are used for testing the proposed hypotheses (Jeffreys, 1961; Kass & Raftery, 1995). A Bayes factor quantifies the evidence in the data for one hypothesis relative to a contrasting hypothesis via the ratio of marginal likelihoods,

$$ B_{12} = \frac{m_1(\mathbf{y})}{m_2(\mathbf{y})}, \tag{10} $$
where y is the vector of observed effect sizes y_{i}, and m_{1} and m_{2} are the marginal likelihoods under H_{1} and H_{2}, respectively. For example, B_{12} = 1 indicates that both hypotheses are equally supported by the data whereas B_{12} = 20 indicates that there is 20 times more support in the data for H_{1} than for H_{2}.
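When several hypotheses are compared against a common reference (e.g., the unconstrained model), posterior model probabilities under equal prior model probabilities follow from the Bayes factors by simple normalization. A minimal sketch with hypothetical Bayes factor values (none of these numbers come from the paper):

```python
def posterior_model_probs(bfs_vs_ref):
    """Posterior model probabilities from Bayes factors of each hypothesis
    against one common reference hypothesis, assuming equal prior probabilities."""
    total = sum(bfs_vs_ref)
    return [b / total for b in bfs_vs_ref]

# Hypothetical values: B_0u = 1.0, B_1u = 0.2, B_2u = 3.0.
probs = posterior_model_probs([1.0, 0.2, 3.0])
print(probs)
print(probs[2] / probs[0])  # recovers B_20 = B_2u / B_0u = 3.0
```

The ratio of any two posterior probabilities reproduces the pairwise Bayes factor, which is why tables of Bayes factors and tables of posterior model probabilities carry the same information under equal priors.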
Prior specification
In Bayesian hypothesis testing for the overall effect size, it is not recommended to use an arbitrarily vague prior, to avoid Bartlett's paradox (Jeffreys, 1961; Lindley, 1957; Bartlett, 1957). Instead, following Mulder and Fox (2019), we propose a unit-information prior for μ conditional on ρ in combination with a proper uniform prior for ρ. Under the unconstrained MAREMA model this boils down to

$$ \pi(\mu \mid \rho) = N\left( 0, \; k \, (\mathbf{1}' \mathbf{\Sigma}_\rho^{-1} \mathbf{1})^{-1} \right), \quad \pi(\rho) = U(\rho_{min}, 1). \tag{11} $$
The unit-information prior π(μ | ρ) contains the amount of information of a single study (Zellner, 1986). This is visible in the variance of π(μ | ρ) because the number of studies in a meta-analysis k is multiplied by the variance of \(\hat {\mu }\), which is \((\mathbf {1}^{\prime } \mathbf {\Sigma }_{\rho }^{-1} \mathbf {1})^{-1}\) where 1 is a column vector of ones and \(\mathbf {\Sigma }_{\rho } = \text {diag}\left ({{\sigma }_{1}^{2}}+\tau ^{2},...,{{\sigma }_{k}^{2}}+\tau ^{2}\right )\). Unit-information priors are commonly used for computing Bayes factors in model selection and hypothesis testing problems. For example, the well-known Bayesian information criterion (BIC; Schwarz, 1978) is based on an approximation of the marginal likelihood using a unit-information prior (Raftery, 1995; Kass & Wasserman, 1995). This class of priors is also employed in many other Bayesian testing scenarios (e.g., Liang et al., 2008; Rouder & Morey, 2012; Mulder et al., in press). The usefulness of unit-information priors lies in the fact that they cover a reasonable range of possible values for the model parameters. As the prior contains the information of a single study, these priors are neither too informative nor too vague (to avoid Bartlett's paradox). The prior π(μ | ρ) depends on ρ because τ^{2} is included in the variance-covariance matrix Σ_{ρ}. Note that the prior π(μ | ρ) is different from the prior used for Bayesian estimation under the MAREMA model in Section “Bayesian estimation under the MAREMA model”, as we used an improper prior for estimation whereas the prior π(μ | ρ) used for testing is a proper prior. The proper prior π(ρ) is the same prior distribution as we proposed for estimation under the MAREMA model.
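Because Σ_ρ is diagonal, the conditional prior variance k(1′Σ_ρ⁻¹1)⁻¹ reduces to a sum of precisions and is easy to compute directly. The sketch below uses illustrative names and data; the final check shows in what sense the prior carries roughly one study's worth of information:

```python
def unit_information_variance(v, tau2):
    """Prior variance of the unit-information prior pi(mu | rho):
    k * (1' Sigma^{-1} 1)^{-1} = k / sum_i 1 / (v_i + tau2),
    where v[i] is the sampling variance of study i."""
    precision = sum(1.0 / (vi + tau2) for vi in v)
    return len(v) / precision

# With k identical studies and no heterogeneity, the prior variance equals the
# sampling variance of a single study -- i.e., one study's worth of information.
print(unit_information_variance([0.05] * 6, 0.0))  # approximately 0.05
```

This also makes visible why the prior depends on ρ: changing τ² (and hence ρ) changes every term in the sum of precisions.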
The unconstrained prior in Eq. 11 is used as a building block for the priors under the hypotheses of interest. In particular, for the one-sided hypotheses, truncated priors are considered while the precise null hypotheses receive a point mass at the null value. Figure 2 illustrates the proposed prior distributions for testing unconstrained, one-sided, and point hypotheses under the MAREMA model. The left panel shows prior distributions for testing hypotheses regarding μ where ρ is left unconstrained. The dashed line refers to the prior distribution of the unconstrained hypothesis in Eq. 11. The asterisk at the top of the figure illustrates the point mass for testing the hypothesis H_{0} : μ = 0. The solid and dotted lines refer to the prior distributions for the one-sided hypotheses H_{1} : μ < 0 and H_{2} : μ > 0, respectively. Their heights are twice as large as the unconstrained prior to ensure that the distributions integrate to 1. The right panel shows prior distributions for testing hypotheses regarding ρ. The dashed, solid, and dotted lines refer to the prior distributions of the unconstrained (H_{u}), left-sided (H_{1} : ρ < 0), and right-sided (H_{2} : ρ > 0) hypotheses whereas the asterisk refers to the point mass for the hypothesis of no between-study variance (H_{0} : ρ = 0).
Marginal likelihood
The marginal likelihoods of the different hypotheses differ with respect to the prior distributions. Hence, the marginal likelihoods of the different hypotheses can be computed by using different prior distributions in combination with adjusting the limits of integration.
For example, the marginal likelihood for the one-sided hypothesis H_{1} : μ < 0 with ρ unconstrained can be written as a function of the marginal likelihood under the unconstrained model,

$$ m_1(\mathbf{y}) = \frac{P(\mu < 0 \mid \mathbf{y}, H_u)}{P(\mu < 0 \mid H_u)} \, m_u(\mathbf{y}), \tag{12} $$
where f(y | μ,ρ) is the likelihood function of the MAREMA model in Eq. 5, m_{u}(y) is the marginal likelihood under the unconstrained model, P(μ < 0 | y, H_{u}) and P(μ < 0 | H_{u}) are the posterior and prior probabilities of μ < 0 under the hypothesis H_{u}, and the prior under H_{1} can be written as a truncation of the unconstrained prior,

$$ \pi_1(\mu, \rho) = \pi_u(\mu, \rho) \, I(\mu < 0) \, / \, P(\mu < 0 \mid H_u), \tag{13} $$
where I(⋅) is the indicator function. Note here that \(P(\mu < 0 \mid H_{u})=\frac {1}{2}\) because the unconstrained prior for μ is centered around 0. The posterior probability in Eq. 12 can be computed as the proportion of unconstrained posterior draws satisfying μ < 0. Interested readers are referred to Appendix B for further computational details.
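In code, the posterior probability in Eq. 12 reduces to counting posterior draws, so the one-sided Bayes factor against the unconstrained model is a one-liner. A minimal sketch with hypothetical posterior draws of μ (the draws are made up for illustration):

```python
def one_sided_bf(mu_draws, prior_prob=0.5):
    """Bayes factor of H1: mu < 0 against the unconstrained hypothesis H_u:
    the ratio of posterior to prior probability of the order constraint.
    prior_prob = 0.5 because the unconstrained prior on mu is centered at 0."""
    posterior_prob = sum(1 for m in mu_draws if m < 0.0) / len(mu_draws)
    return posterior_prob / prior_prob

# Hypothetical unconstrained posterior draws of mu.
draws = [-0.10, 0.20, -0.30, 0.05, -0.02, 0.40, -0.15, 0.10]
print(one_sided_bf(draws))  # 4 of 8 draws are negative: (4/8) / (1/2) = 1.0
```

Values above 1 indicate that the data have shifted posterior mass toward μ < 0 relative to the prior; values below 1 indicate the opposite.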
Software: BFpack
The R package BFpack (Mulder et al., in press) contains functions for computing Bayes factors for a large set of statistical models (e.g., multivariate regression, generalized linear models, and correlation analysis). The main function “BF” takes as its argument a fitted model object, which determines the model under which Bayes factors are computed. To execute the Bayes factor test under the MAREMA model using the “BF” function, the function needs as its argument an object returned by fitting a random-effects meta-analysis model with the metafor package. The metafor package is popular software for conducting meta-analyses that can be used for any effect size measure and requires the meta-analyst to supply the observed effect sizes of the primary studies and the corresponding sampling variances (or standard errors). Hence, researchers familiar with the metafor package can readily compute the Bayes factors that we propose using the function “BF”. The “BF” function also returns unconstrained estimates of μ and ρ based on a Gibbs sampler.
Bayesian hypothesis testing in examples
We compute Bayes factors using the MAREMA model for the two examples introduced in Section “Bayesian estimation under the MAREMA model”. The MAREMA model was fitted to the two meta-analyses using the proposed unit-information prior on μ and uniform prior on ρ. We tested the three hypotheses for μ and ρ listed in Eqs. 8 and 9, respectively. We analyzed the data using R (R Core Team, 2020) and, in particular, the R packages metafor (Viechtbauer, 2010) and BFpack (Mulder et al., in press). R code illustrating how to compute Bayes factors and posterior probabilities for the hypotheses for the two examples is available at https://osf.io/ejfsv/.
The Bayes factors and posterior model probabilities for the hypotheses on μ are presented in Table 2. For the meta-analysis by Ho and Lee (2012), the Bayes factor comparing H_{2} with H_{1} is the largest of the tested hypotheses, which implies that the hypothesis H_{2} : μ > 0 is 15.810 times more likely than the hypothesis H_{1} : μ < 0. Moreover, the Bayes factor comparing H_{2} with H_{0} : μ = 0 equaled 3.779, implying that μ in this meta-analysis is likely larger than zero. The posterior probabilities (last row in Table 2; assuming equal prior probabilities) also suggested that a positive effect (H_{2}) was most likely after observing the data, with a posterior probability of 0.753. The two-tailed frequentist hypothesis test of H_{0} : μ = 0 was not statistically significant using a significance level of 0.05 (z = 1.936, two-tailed p value = 0.053).
The Bayes factors and posterior model probabilities for the hypotheses on ρ are shown in the first columns of Table 3 for the study by Ho and Lee (2012). The Bayes factors comparing the hypothesis H_{0} : ρ = 0 with H_{1} : ρ < 0 and H_{2} : ρ > 0 indicated that H_{0} is approximately four and five times more likely, respectively. These results were also corroborated by the posterior probability, which was the largest for H_{0} (0.689). This indicates that there was most evidence for an equal-effect model. The commonly used Q test for testing whether the studies are homogeneous in a frequentist meta-analysis was not statistically significant (Q(9) = 9.417, p = 0.4). Note here that a nonsignificant result does not imply evidence for the null, as p values cannot be used for quantifying evidence for the null (because p values are by definition uniformly distributed if H_{0} were true). To conclude, the EMDR treatment was observed to be on average more efficacious than the CBT treatment and the effects are homogeneous across studies. However, due to the uncertainty in the posterior probabilities and the parameter estimates (see Section “Bayesian estimation under the MAREMA model”), more studies are required in order to draw more definite conclusions.
Bayes factors for testing hypotheses on μ for the meta-analysis by Whittaker et al. (2019) are shown in the last columns of Table 2. Hypothesis H_{0} : μ = 0 received more support than hypotheses H_{1} : μ < 0 and H_{2} : μ > 0, but there was no strong evidence for any of the hypotheses (the largest Bayes factor equaled 2.558). This absence of strong evidence was likely also caused by this meta-analysis consisting of only three studies. The posterior model probabilities (last row of Table 2) also showed that H_{0} was most likely (probability of 0.537). Application of a two-tailed frequentist hypothesis test resulted in a nonsignificant result, and thus H_{0} could not be rejected (z = 0.349, two-tailed p value = 0.727).
The hypothesis H_{0} : ρ = 0 received 10.958 times more evidence than H_{1} : ρ < 0 and 2.901 times more evidence than H_{2} : ρ > 0. Moreover, the hypothesis H_{2} : ρ > 0 was 3.778 times more likely than H_{1} : ρ < 0. This corroborated the posterior model probabilities, indicating that the effect sizes were most likely to be either homogeneous (with a probability of 0.696) or heterogeneous (with a probability of 0.240). Interestingly, the frequentist Q test was statistically significant (Q(2) = 6.24, p = 0.044). This may be due to the fact that the significance test relies on large sample theory, which may not be realistic here given there are only three studies. Using the posterior probabilities to obtain conditional error probabilities, there is a probability of 0.210 + 0.254 = 0.464 that we would be wrong when concluding that the hypothesis of no effect is true, and a probability of 0.064 + 0.240 = 0.304 that we would be wrong when concluding that the hypothesis of homogeneous effects is true, given the observed data. Thus (as expected) more data would be needed in order to obtain more pronounced evidence about which hypothesis is likely to be true.
Discussion
The main goals of meta-analyses are estimating the overall effect and the heterogeneity in effect size as well as drawing inferences for these parameters. This paper proposes novel Bayesian estimation and hypothesis testing methods to achieve these goals. Our approach is novel compared to alternative Bayesian meta-analysis methods because the framework builds on the MAREMA model where the (nuisance) study-specific effects are integrated out. This MAREMA model encompasses both an equal-effect and a random-effects meta-analytic model, and also encompasses a model which assumes that there is less variance than under the equal-effect and random-effects models.
Another major contribution is that we place a uniform prior distribution on the proportion of the variance that can be explained by between-study heterogeneity in true effect size rather than on the between-study variance or standard deviation in true effect size directly, as is usually done in Bayesian meta-analyses (Berry, 1998; Scheibehenne et al., 2017; Gronau et al., 2017). This relative variance has a clear standardized scale (Higgins et al., 2003) which facilitates its interpretation. Furthermore, the bounded parameter allows specifying a proper noninformative uniform prior in case prior information is absent or when a default (reference) test is preferred. Other advantages are that this prior does not depend on the effect size measure used in the meta-analysis and avoids the need to elicit a prior scale, to which the Bayes factor can be highly sensitive.
We illustrated the proposed Bayesian hypothesis testing and estimation in two illustrative examples. Both the estimation and hypothesis testing results revealed large posterior uncertainty regarding the heterogeneity in true effect size. The uncertainty can be explained by the relatively small number of studies, which is common in meta-analyses (Rhodes et al., 2015; Turner et al., 2015), and therefore the obtained quantifications of posterior uncertainty seemed reasonable. More convincing evidence for one of the tested hypotheses will be obtained if more studies become available and can be included in the meta-analysis. Note here that, because Bayes factors are consistent, the evidence in favor of the true hypothesis would go to infinity if the number of studies in a meta-analysis tends to infinity. The frequentist test for between-study heterogeneity, on the other hand, was statistically significant even though the example only included three studies. This may be another example that p values tend to overestimate the evidence against a null hypothesis (e.g., Berger & Delampady, 1987; Sellke et al., 2001; Benjamin & Berger, 2019), and that Bayes factors and posterior probabilities may better capture the evidence in favor of or against statistical hypotheses in case of small samples.
In both applications, evidence was found that the heterogeneity (ρ) may be negative. This implies that an equal-effect meta-analysis model (where ρ = 0) or a random-effects model (where ρ > 0) may not be appropriate and might bias the results. A negative ρ may also indicate a violation of the assumptions of the meta-analysis model, which is “corrected” by allowing ρ to be negative. For example, the reported within-study sampling variances \({\sigma _{i}^{2}}\) may be overestimated, the assumption of independent primary study effect sizes may be violated, or the effect sizes of the primary studies may have been incorrectly computed (Ioannidis et al., 2006). More research is needed to explore the possible causes of a negative ρ and how it affects the results. A good starting point is Nielsen et al. (2021), who explored the effect of negative intraclass correlations in a multilevel model.
We have only tested point hypotheses where a parameter was constrained to zero, and one-sided hypotheses comparing whether a parameter was smaller or larger than zero. Other hypotheses that can be tested are combined hypotheses (e.g., H: μ > 0 & ρ > 0) or hypotheses that test a parameter against a hypothesized value other than zero. For example, meta-analysts may want to test point hypotheses using the rules of thumb for a small, medium, and large effect size as defined for many common effect size measures (for rules of thumb for multiple effect size measures, see Cohen, 1988). Point hypotheses on the heterogeneity may be constrained to ρ = 0.25, 0.5, and 0.75 to resemble low, moderate, and high heterogeneity according to the thresholds proposed by Higgins et al. (2003). It is important to note that any hypothesis on the proportion of total variance that can be explained by heterogeneity in true effect size can be transformed directly into a hypothesis on the between-study variance in true effect size. For example, testing the hypothesis H: ρ = 0 is equivalent to testing H: τ² = 0. If a hypothesis on ρ is tested against a value other than 0, the equivalent hypothesis on the between-study variance in true effect size can be obtained by transforming ρ to the between-study variance.
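The transformation between ρ and the between-study variance can be sketched as follows. This is a hypothetical illustration: it assumes the relation τ² = σ̃²ρ/(1 − ρ) implied by the marginal variances σ²_i + σ̃²ρ/(1 − ρ) that appear in the appendices, with the typical within-study sampling variance σ̃² set to an arbitrary value here.

```python
# Sketch: map a hypothesized proportion of variance rho to the corresponding
# between-study variance tau2, assuming tau2 = sigma_tilde2 * rho / (1 - rho)
# as implied by the marginal variances in Appendix A. The value of
# sigma_tilde2 (the "typical" within-study sampling variance) is arbitrary.

def rho_to_tau2(rho, sigma_tilde2):
    """Between-study variance implied by proportion rho (rho < 1)."""
    if rho >= 1:
        raise ValueError("rho must be smaller than 1")
    return sigma_tilde2 * rho / (1.0 - rho)

def tau2_to_rho(tau2, sigma_tilde2):
    """Inverse transformation: proportion implied by tau2."""
    return tau2 / (tau2 + sigma_tilde2)

sigma_tilde2 = 0.04  # hypothetical typical sampling variance
for rho in (0.25, 0.5, 0.75):  # low/moderate/high per Higgins et al. (2003)
    print(rho, rho_to_tau2(rho, sigma_tilde2))
```

Note that a negative ρ maps to a negative implied τ², which is exactly why the MAREMA model can encompass negative between-study variability.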
We proposed a framework for Bayesian hypothesis testing and estimation using minimally informative default prior distributions, but our methodology can readily be extended to other prior distributions. Informative priors can be specified to incorporate external information about the heterogeneity in true effect size or the overall effect from research in comparable fields. This may yield better estimates and statistical inferences, especially if the number of primary studies in the meta-analysis is small. For example, prior information about the heterogeneity parameter ρ could be translated into an informative stretched beta prior distribution (following Mulder & Fox, 2019, for example), while prior information about the effect size could be translated into an informative normal prior. By considering different prior distributions, one can assess the robustness of the quantification of the relative evidence in the data between statistical hypotheses under the MAREMA model.
Meta-analysts are generally interested not only in estimating and drawing inferences about the overall effect size and the between-study variance, but also in studying whether systematic differences between primary studies can explain the between-study variance. In case of a large between-study variance, examining whether systematic differences between studies exist might even be more insightful than focusing on the overall effect size (Rouder et al., 2019). The between-study variance in a meta-analysis can be explained by including moderator variables in the meta-analysis model in a so-called meta-regression (Thompson & Sharp, 1999; Van Houwelingen et al., 2002). Future research may focus on extending our MAREMA model such that Bayesian estimation and Bayes factor testing can be conducted when moderators are included.
Future research may also focus on extending our methods to more advanced meta-analysis models such as multilevel meta-analysis (van den Noortgate & Onghena, 2003; Konstantopoulos, 2011) and multivariate meta-analysis (Jackson et al., 2011; Hedges, 2019). These models relax the strong assumption of the conventional random-effects model that effect sizes in a meta-analysis are independent. Another avenue for future research is studying whether relaxing the assumptions of the random-effects model benefits Bayesian meta-analysis under the MAREMA model in particular and Bayesian meta-analysis in general. For example, the random-effects model assumes that the within-study sampling variances are known (van Aert & Jackson, 2019; Jackson, 2009; Konstantopoulos & Hedges, 2019), and we adopted this assumption in the MAREMA model. This assumption is not tenable in practice, which can be problematic if the sample sizes of the primary studies are small. Estimates of the within-study sampling variances are then imprecise, potentially leading to biased parameter estimates and inaccurate statistical inferences. This strong assumption can be avoided in a Bayesian meta-analysis by taking the uncertainty in these variances into account by means of a prior distribution instead of using a plug-in estimate. A logical choice for a prior distribution on the within-study sampling variances is the inverse-gamma distribution.
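A minimal sketch of this idea follows. It is not part of the proposed methodology: the shape parameter and the moment-matching choice below are our own illustrative assumptions for centering an inverse-gamma prior on a reported variance estimate.

```python
# Sketch: treat a reported within-study sampling variance as uncertain by
# drawing it from an inverse-gamma prior whose mean equals the plug-in
# estimate, instead of conditioning on the estimate itself. The shape
# parameter a is an arbitrary illustrative choice controlling prior spread.
import random

def draw_sampling_variance(sigma2_hat, a=5.0, rng=random):
    """Draw sigma2 ~ Inverse-Gamma(a, b) with prior mean equal to sigma2_hat.

    The mean of IG(a, b) is b / (a - 1) for a > 1, so b = sigma2_hat * (a - 1).
    """
    b = sigma2_hat * (a - 1.0)
    # If X ~ Gamma(shape=a, rate=b) then 1/X ~ Inverse-Gamma(shape=a, scale=b);
    # random.gammavariate takes (shape, scale), so scale = 1/b gives rate b.
    x = rng.gammavariate(a, 1.0 / b)
    return 1.0 / x

rng = random.Random(1)
draws = [draw_sampling_variance(0.04, a=5.0, rng=rng) for _ in range(20000)]
print(sum(draws) / len(draws))  # averages close to the plug-in value 0.04
```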
Another topic for future research is studying to what extent the proposed estimation and Bayes factor test under the MAREMA model are affected by publication bias. Publication bias implies that studies with statistically significant findings are more likely to be published than studies with nonsignificant findings. Consequently, studies with nonsignificant findings are more difficult to locate and are less likely to be included in a meta-analysis. Due to publication bias, effect sizes of primary studies and, in turn, the overall effect size of a meta-analysis are most likely positively biased (Kraemer et al., 1998; van Assen et al., 2015; Lane & Dunlap, 1978). We expect estimation and inferences based on the MAREMA model to be inaccurate if publication bias is severe, and recommend that researchers also apply and report methods that correct for publication bias in this case.
To conclude, we have proposed Bayesian hypothesis testing and estimation using the MAREMA model. The proposed Bayes factors allow testing point and one-sided hypotheses for both the overall effect and the heterogeneity in true effect size. We hope that our methods, together with the easy-to-use software included in the R package BFpack, enable researchers to routinely use Bayesian methods in their meta-analyses.
Code Availability
All R code is available at https://osf.io/jcge7/ and https://osf.io/ejfsv/.
Notes
The random walk procedures were based on 100,000 iterations (burn-in period of 5,000 iterations), starting values ρ_{0} equal to the estimate of a frequentist random-effects meta-analysis with restricted maximum likelihood as the estimator of the between-study variance, and \(s^{2} = \sqrt{0.1}\).
References
Aguinis, H., Gottfredson, R. K., & Wright, T. A. (2011). Best-practice recommendations for estimating interaction effects using meta-analysis. Journal of Organizational Behavior, 32(8), 1033–1043. https://doi.org/10.1002/job.719
Baker, R., & Jackson, D. (2008). A new approach to outliers in meta-analysis. Health Care Management Science, 11(2), 121–131. https://doi.org/10.1007/s10729-007-9041-8
Baker, R., & Jackson, D. (2016). New models for describing outliers in meta-analysis. Research Synthesis Methods, 7(3), 314–328. https://onlinelibrary.wiley.com/doi/abs/10.1002/jrsm.1191
Bartlett, M. S. (1957). A comment on D. V. Lindley’s statistical paradox. Biometrika, 44(3–4), 533–534.
Benjamin, D. J., & Berger, J. O. (2019). Three recommendations for improving the use of p values. The American Statistician, 73(sup1), 186–191. https://doi.org/10.1080/00031305.2018.1543135
Berger, J. O., & Delampady, M. (1987). Testing precise hypotheses. Statistical Science, 2(3), 317–335. http://www.jstor.org/stable/2245772
Berger, J. O., Liseo, B., & Wolpert, R. L. (1999). Integrated likelihood methods for eliminating nuisance parameters. Statistical Science, 14(1), 1–28. https://doi.org/10.1214/ss/1009211804
Berry, S. M. (1998). Understanding and testing for heterogeneity across 2×2 tables: Application to meta-analysis. Statistics in Medicine, 17(20), 2353–2369. https://doi.org/10.1002/(SICI)1097-0258(19981030)17:20<2353::AID-SIM923>3.0.CO;2-Y
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. Chichester: John Wiley & Sons, Ltd. ISBN 9780470057247
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2010). A basic introduction to fixed-effect and random-effects models for meta-analysis. Research Synthesis Methods, 1(2), 97–111. https://doi.org/10.1002/jrsm.12
Borenstein, M., Higgins, J. P. T., Hedges, L. V., & Rothstein, H. R. (2017). Basics of meta-analysis: I² is not an absolute measure of heterogeneity. Research Synthesis Methods, 8(1), 5–18. https://onlinelibrary.wiley.com/doi/abs/10.1002/jrsm.1230
Kooperberg, C. (2020). logspline: Routines for logspline density estimation. https://cran.r-project.org/package=logspline
Chung, Y., Rabe-Hesketh, S., & Choi, I. H. (2013). Avoiding zero between-study variance estimates in random-effects meta-analysis. Statistics in Medicine, 32(23), 4071–4089.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd edition). Hillsdale: Lawrence Erlbaum Associates. ISBN 0805802835
Cordray, D. S., & Morphy, P. (2009). Research synthesis and public policy. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.) The handbook of research synthesis and meta-analysis (pp. 473–493). New York: Russell Sage Foundation.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423. https://doi.org/10.1198/016214507000001337
Fox, J.-P., Mulder, J., & Sinharay, S. (2017). Bayes factor covariance testing in item response models. Psychometrika, 82(4), 979–1006. https://doi.org/10.1007/s11336-017-9577-6
Gronau, Q. F., van Erp, S., Heck, D. W., Cesario, J., Jonas, K. J., & Wagenmakers, E.-J. (2017). A Bayesian model-averaged meta-analysis of the power pose effect with informed and default priors: The case of felt power. Comprehensive Results in Social Psychology, 2(1), 123–138. https://doi.org/10.1080/23743603.2017.1326760
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13(3), e1002106. https://doi.org/10.1371/journal.pbio.1002106
Hedges, L. V. (2019). Stochastically dependent effect sizes. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.) The handbook of research synthesis and meta-analysis (3rd edition) (pp. 281–297). New York: Russell Sage Foundation.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando: Academic Press. ISBN 9780123363800
Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21(11), 1539–1558. https://doi.org/10.1002/sim.1186
Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. British Medical Journal, 327(7414), 557–560. https://doi.org/10.1136/bmj.327.7414.557
Higgins, J. P. T., Thompson, S. G., & Spiegelhalter, D. J. (2009). A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society: Series A, 172(1), 137–159.
Ho, M. S. K., & Lee, C. W. (2012). Cognitive behaviour therapy versus eye movement desensitization and reprocessing for post-traumatic disorder: Is it all in the homework then? Revue Européenne de Psychologie Appliquée/European Review of Applied Psychology, 62(4), 253–260.
Hox, J. J., Moerbeek, M., & Van de Schoot, R. (2018). Multilevel analysis: Techniques and applications. New York: Routledge. ISBN 9781138121409
Ioannidis, J. P., Trikalinos, T. A., & Zintzaras, E. (2006). Extreme between-study homogeneity in meta-analyses could offer useful insights. Journal of Clinical Epidemiology, 59(10), 1023–1032. https://doi.org/10.1016/j.jclinepi.2006.02.013
Jackson, D. (2009). The significance level of the standard test for a treatment effect in meta-analysis. Statistics in Biopharmaceutical Research, 1(1), 92–100. https://doi.org/10.1198/sbr.2009.0009
Jackson, D., Riley, R., & White, I. R. (2011). Multivariate meta-analysis: Potential and promise. Statistics in Medicine, 30(20), 2481–2498. https://doi.org/10.1002/sim.4172
Jeffreys, H. (1961) Theory of probability, (3rd edition). Oxford: Clarendon Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.
Kass, R. E., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431), 928–934. https://doi.org/10.1080/01621459.1995.10476592
Konstantopoulos, S. (2011). Fixed effects and variance components estimation in three-level meta-analysis. Research Synthesis Methods, 2(1), 61–76. https://onlinelibrary.wiley.com/doi/abs/10.1002/jrsm.35
Konstantopoulos, S., & Hedges, L. V. (2019). Statistically analyzing effect sizes: Fixed- and random-effects models. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.) The handbook of research synthesis and meta-analysis (3rd edition) (pp. 245–280). New York: Russell Sage Foundation.
Kontopantelis, E., Springate, D. A., & Reeves, D. (2013). A re-analysis of the Cochrane Library data: The dangers of unobserved heterogeneity in meta-analyses. PLOS ONE, 8(7).
Kraemer, H. C., Gardner, C., Brooks, J., & Yesavage, J. A. (1998). Advantages of excluding underpowered studies in meta-analysis: Inclusionist versus exclusionist viewpoints. Psychological Methods, 3(1), 23–31. https://doi.org/10.1037/1082-989X.3.1.23
Lambert, P. C., Sutton, A. J., Abrams, K. R., & Jones, D. R. (2002). A comparison of summary patient-level covariates in meta-regression with individual patient data meta-analysis. Journal of Clinical Epidemiology, 55(1), 86–94. http://www.sciencedirect.com/science/article/pii/S0895435601004140
Lane, D. M., & Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical & Statistical Psychology, 31, 107–112.
Langan, D., Higgins, J. P. T., & Simmonds, M. (2016). Comparative performance of heterogeneity variance estimators in meta-analysis: A review of simulation studies. Research Synthesis Methods, 8(2), 181–198. https://doi.org/10.1002/jrsm.1198
Lee, K. J., & Thompson, S. G. (2008). Flexible parametric models for random-effects distributions. Statistics in Medicine, 27(3), 418–434. https://doi.org/10.1002/sim.2897
Lindley, D. V. (1957). A statistical paradox. Biometrika, 44(1–2), 187–192. https://doi.org/10.1093/biomet/44.1-2.187
Lunn, D., Barrett, J., Sweeting, M., & Thompson, S. G. (2013). Fully Bayesian hierarchical modelling in two stages, with application to meta-analysis. Journal of the Royal Statistical Society: Series C, 62(4), 551–572. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssc.12007
Mulder, J., & Fox, J.-P. (2013). Bayesian tests on components of the compound symmetry covariance matrix. Statistics and Computing, 23(1), 109–122. https://doi.org/10.1007/s11222-011-9295-3
Mulder, J., & Fox, J.-P. (2019). Bayes factor testing of multiple intraclass correlations. Bayesian Analysis, 14(2), 521–552. https://projecteuclid.org:443/euclid.ba/1533866668
Mulder, J., Williams, D. R., Gu, X., Tomarken, A., Böing-Messing, F., Olsson-Collentine, A., ..., van Lissa, C. (in press). BFpack: Flexible Bayes factor testing of scientific theories in R. Journal of Statistical Software.
Nielsen, N. M., Smink, W. A. C., & Fox, J.-P. (2021). Small and negative correlations among clustered observations: Limitations of the linear mixed effects model. Behaviormetrika, 48(1), 51–77. https://doi.org/10.1007/s41237-020-00130-8
Normand, S. L. T. (1999). Meta-analysis: Formulating, evaluating, combining, and reporting. Statistics in Medicine, 18(3), 321–359.
R Core Team (2020). R: A language and environment for statistical computing. http://www.r-project.org/
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163. https://doi.org/10.2307/271063
Raudenbush, S. W. (2009). Analyzing effect sizes: Random-effects models. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.) The handbook of research synthesis and meta-analysis (pp. 295–315). New York: Russell Sage Foundation.
Raudenbush, S. W., & Bryk, A. S. (1985). Empirical Bayes meta-analysis. Journal of Educational Statistics, 10(2), 75–98.
Rhodes, K. M., Turner, R. M., & Higgins, J. P. (2015). Predictive distributions were developed for the extent of heterogeneity in meta-analyses of continuous outcome data. Journal of Clinical Epidemiology, 68(1), 52–60.
Rouder, J. N., & Morey, R. D. (2012). Default Bayes factors for model selection in regression. Multivariate Behavioral Research, 47(6), 877–903. https://doi.org/10.1080/00273171.2012.734737
Rouder, J. N., Haaf, J. M., Davis-Stober, C. P., & Hilgard, J. (2019). Beyond overall effects: A Bayesian approach to finding constraints in meta-analysis. Psychological Methods, 24(5), 606–621. https://doi.org/10.1037/met0000216
Rouder, J. N., & Morey, R. D. (2011). A Bayes factor meta-analysis of Bem’s ESP claim. Psychonomic Bulletin & Review, 18(4), 682–689.
Rücker, G., Schwarzer, G., Carpenter, J. R., & Schumacher, M. (2008). Undue reliance on I² in assessing heterogeneity may mislead. BMC Medical Research Methodology, 8(1), 79. https://doi.org/10.1186/1471-2288-8-79
Scheibehenne, B., Gronau, Q. F., Jamil, T., & Wagenmakers, E.-J. (2017). Fixed or random? A resolution through model averaging: Reply to Carlsson, Schimmack, Williams, and Bürkner (2017). Psychological Science, 28(11), 1698–1701. https://journals.sagepub.com/doi/abs/10.1177/0956797617724426
Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2), 461–464.
Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62–71. http://www.jstor.org/stable/2685531
Sidik, K., & Jonkman, J. N. (2007). A comparison of heterogeneity variance estimators in combining results of studies. Statistics in Medicine, 26(9), 1964–1981.
Smith, T. C., Spiegelhalter, D. J., & Thomas, A. (1995). Bayesian approaches to random-effects meta-analysis: A comparative study. Statistics in Medicine, 14(24), 2685–2699.
Sutton, A. J., & Abrams, K. R. (2001). Bayesian methods in meta-analysis and evidence synthesis. Statistical Methods in Medical Research, 10(4), 277–303. https://journals.sagepub.com/doi/abs/10.1177/096228020101000404
Thompson, S. G., & Sharp, S. J. (1999). Explaining heterogeneity in meta-analysis: A comparison of methods. Statistics in Medicine, 18(20), 2693–2708. https://doi.org/10.1002/(sici)1097-0258(19991030)18:20<2693::aid-sim235>3.0.co;2-v
Turner, R. M., & Higgins, J. P. T. (2019). Bayesian meta-analysis. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.) The handbook of research synthesis and meta-analysis (3rd edition) (pp. 299–314). New York: Russell Sage Foundation.
Turner, R. M., Jackson, D., Wei, Y., Thompson, S. G., & Higgins, J. P. T. (2015). Predictive distributions for between-study heterogeneity and simple methods for their application in Bayesian meta-analysis. Statistics in Medicine, 34(6), 984–998.
van Aert, R. C. M., & Jackson, D. (2019). A new justification of the Hartung-Knapp method for random-effects meta-analysis based on weighted least squares regression. Research Synthesis Methods, 10(4), 515–527. https://onlinelibrary.wiley.com/doi/abs/10.1002/jrsm.1356
van Assen, M. A. L. M., van Aert, R. C. M., & Wicherts, J. M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20(3), 293–309. https://doi.org/10.1037/met0000025
van den Noortgate, W., & Onghena, P. (2003). Multilevel meta-analysis: A comparison with traditional meta-analytical procedures. Educational and Psychological Measurement, 63(5), 765–790.
Van Houwelingen, H. C., Arends, L. R., & Stijnen, T. (2002). Advanced methods in meta-analysis: Multivariate approach and meta-regression. Statistics in Medicine, 21(4), 589–624. https://doi.org/10.1002/sim.1040
Veroniki, A. A., Jackson, D., Viechtbauer, W., Bender, R., Bowden, J., Knapp, G., ..., Salanti, G. (2016). Methods to estimate the between-study variance and its uncertainty in meta-analysis. Research Synthesis Methods, 7(1), 55–79. https://doi.org/10.1002/jrsm.1164
Viechtbauer, W. (2007). Confidence intervals for the amount of heterogeneity in meta-analysis. Statistics in Medicine, 26(1), 37–52. https://doi.org/10.1002/sim.2514
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://doi.org/10.18637/jss.v036.i03
Whittaker, R., McRobbie, H., Bullen, C., Rodgers, A., Gu, Y., & Dobson, R. (2019). Mobile phone text messaging and app-based interventions for smoking cessation. Cochrane Database of Systematic Reviews, (10). https://doi.org/10.1002/14651858.CD006611.pub5
Xu, H., Platt, R. W., Luo, Z. C., Wei, S., & Fraser, W. D. (2008). Exploring heterogeneity in meta-analyses: Needs, resources and challenges. Paediatric and Perinatal Epidemiology, 22(Suppl 1), 18–28. https://doi.org/10.1111/j.1365-3016.2007.00908.x
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In P. K. Goel, & A. Zellner (Eds.) Bayesian inference and decision techniques: Essays in honor of Bruno de Finetti. Amsterdam: Elsevier.
Acknowledgements
We would like to thank JeanPaul Fox for insightful discussions about marginalized random effects models.
Funding
Robbie C.M. van Aert is supported by the European Research Council. Grant Number: 726361(IMPROVE).
Ethics declarations
Conflict of Interests
There are no conflicts of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Gibbs sampler for Bayesian estimation under the MAREMA model
We show in this appendix how the MAREMA model can be estimated by means of a Gibbs sampler. The likelihood of the MAREMA model given the prior distributions in Eq. 7 is
We first derive the conditional distribution of μ, i.e.,
If we let \(c = \sum 1/\left({\sigma^{2}_{i}}+\tilde{\sigma}^{2}\rho/(1-\rho)\right)\) and \(v = \sum 2y_{i}/({\sigma^{2}_{i}}+\tilde{\sigma}^{2}\rho/(1-\rho))\) and complete the square for μ, we get
Using the conditional distribution of μ and a random walk procedure to approximate the conditional distribution of ρ, the Gibbs sampler can be written as

1.
Set the initial value ρ_{0} and the variance of the proposal distribution s^{2}.

2.
Repeat the following steps to obtain j = 1, 2, ..., J draws. After every 100 draws, verify whether the acceptance rate a in the random walk procedure is between 0.15 and 0.5, and adjust s^{2} to \(\left[\sqrt{s^{2}}\left(\frac{a-0.5}{1-0.15}+1\right)\right]^{2}\) if a > 0.5 and to \(\left[\frac{\sqrt{s^{2}}}{2-a/0.15}\right]^{2}\) if a < 0.15.

(a)
Draw a candidate sample ρ^{∗} from the proposal distribution q(ρ^{∗} | ρ_{j−1}, s^{2}), which is a truncated normal distribution with mean ρ_{j−1}, variance s^{2}, lower bound of truncation ρ_{min}, and upper bound of truncation 1.

(b)
Draw a μ_{j} from the conditional distribution of μ in Eq. 15 with ρ = ρ^{∗}.

(c)
Compute the ratio
$$ R = \frac{ h(\mathbf{y} \mid \mu_{j}, \rho^{*}) / q(\rho^{*} \mid \rho_{j-1}, s^{2})} { h(\mathbf{y} \mid \mu_{j}, \rho_{j-1}) / q(\rho_{j-1} \mid \rho^{*}, s^{2})}. $$
(d)
Draw a number from a uniform distribution ranging from 0 to 1 and denote this by b.

(e)
If b < R, set ρ_{j} = ρ^{∗}; otherwise ρ_{j} = ρ_{j−1}.


3.
Exclude a burn-in period from the draws.
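The steps above can be sketched in code as follows. This is a minimal illustration rather than the authors' implementation: it assumes a flat prior on μ, takes σ̃² to be the harmonic mean of the reported sampling variances (an assumption; the paper defines σ̃² precisely), uses a lower bound ρ_min that keeps all marginal variances positive, ignores the slight asymmetry of the truncated-normal proposal in the acceptance ratio, and omits the adaptive tuning of s².

```python
# Sketch of the random-walk-within-Gibbs sampler of Appendix A.
# Marginal MAREMA variances are sigma2_i + sigma_tilde2 * rho / (1 - rho).
import math
import random

def log_marema_lik(y, sigma2, sigma_tilde2, mu, rho):
    """Log of h(y | mu, rho): independent normals with MAREMA variances."""
    total = 0.0
    for yi, s2 in zip(y, sigma2):
        v = s2 + sigma_tilde2 * rho / (1.0 - rho)
        total += -0.5 * math.log(2.0 * math.pi * v) - (yi - mu) ** 2 / (2.0 * v)
    return total

def draw_mu(y, sigma2, sigma_tilde2, rho, rng):
    """Conditional of mu (flat prior): N(weighted mean, 1 / sum of weights)."""
    w = [s2 + sigma_tilde2 * rho / (1.0 - rho) for s2 in sigma2]
    c = sum(1.0 / wi for wi in w)
    mean = sum(yi / wi for yi, wi in zip(y, w)) / c
    return rng.gauss(mean, math.sqrt(1.0 / c))

def gibbs(y, sigma2, n_iter=4000, burn=500, s=0.2, seed=1):
    rng = random.Random(seed)
    # Assumed "typical" variance: harmonic mean (requires unequal sigma2
    # for the rho_min formula below to be well defined).
    sigma_tilde2 = len(sigma2) / sum(1.0 / s2 for s2 in sigma2)
    m = min(sigma2)
    rho_min = -m / (sigma_tilde2 - m) + 1e-6  # keeps every variance positive
    rho, mus, rhos = 0.0, [], []
    for _ in range(n_iter):
        # Truncated-normal proposal on (rho_min, 1) via simple rejection.
        while True:
            rho_star = rng.gauss(rho, s)
            if rho_min < rho_star < 1.0:
                break
        mu = draw_mu(y, sigma2, sigma_tilde2, rho_star, rng)
        # Acceptance ratio on the log scale; proposal densities treated as
        # symmetric (truncation asymmetry ignored for brevity).
        log_r = (log_marema_lik(y, sigma2, sigma_tilde2, mu, rho_star)
                 - log_marema_lik(y, sigma2, sigma_tilde2, mu, rho))
        if math.log(rng.random()) < log_r:
            rho = rho_star
        mus.append(mu)
        rhos.append(rho)
    return mus[burn:], rhos[burn:]

mus, rhos = gibbs([0.2, 0.3, 0.25, 0.35], [0.03, 0.05, 0.04, 0.04])
```

With these fairly homogeneous hypothetical effect sizes, the posterior draws of μ concentrate around the weighted mean of the observations.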
Appendix B: Computation of marginal likelihoods
In this appendix, we describe how the marginal likelihoods of the hypotheses can be computed. Note that point and one-sided hypotheses are always specified for one parameter, while the other parameter, which is not included in the hypothesis, is left unconstrained.
B.1: Computing the marginal likelihood of hypotheses with μ and ρ unconstrained (i.e., H _{u})
The marginal likelihood of the hypothesis where both μ and ρ are unconstrained is computed by first integrating out μ in f(y | μ, ρ)π(μ, ρ) (see Appendix C for the derivation),
where g(y | ρ) is the likelihood with μ integrated out and q refers to the particular hypothesis being tested. The marginal likelihood in Eq. 16 can then be approximated using importance sampling,
where w(ρ) is a proposal density. This proposal density is a stretched beta distribution and was proposed by Mulder and Fox (2019) as a prior distribution for the ICC in hierarchical models. This stretched beta distribution ranges from ρ_{min} to 1. The shape parameters of this stretched beta distribution are computed by moment matching as \(\alpha = (\bar{\rho}-\rho_{min})\left[(\bar{\rho}-\rho_{min})(1-\bar{\rho})-\lambda\right]/\left[(1-\rho_{min})\lambda\right]\) and \(\beta = \alpha(1-\bar{\rho})/(\bar{\rho}-\rho_{min})\), where \(\bar{\rho}\) and λ are the mean and variance of draws from the posterior distribution of ρ. Both shape parameters are multiplied by 0.6 to ensure that the proposal density has heavier tails than the posterior distribution. These heavier tails yield a more accurate approximation of the marginal likelihood.
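The moment matching for the stretched beta proposal can be sketched as follows (a minimal illustration using the standard beta mean/variance inversion on the rescaled interval; the 0.6 tail-fattening factor follows the text, while the numerical values are arbitrary):

```python
# Sketch: moment-match a beta density on (rho_min, 1) to a posterior mean
# rho_bar and variance lam, then fatten the tails by scaling both shapes.
import random

def stretched_beta_shapes(rho_bar, lam, rho_min, fatten=0.6):
    """Shape parameters of a beta density on (rho_min, 1) with mean rho_bar
    and variance lam; both shapes multiplied by `fatten` for heavier tails."""
    m = (rho_bar - rho_min) / (1.0 - rho_min)  # mean on the (0, 1) scale
    v = lam / (1.0 - rho_min) ** 2             # variance on the (0, 1) scale
    alpha = m * (m * (1.0 - m) / v - 1.0)      # standard beta inversion
    beta = alpha * (1.0 - m) / m
    return fatten * alpha, fatten * beta

def draw_stretched_beta(alpha, beta, rho_min, rng=random):
    """Draw from the proposal: a beta draw rescaled to (rho_min, 1)."""
    return rho_min + (1.0 - rho_min) * rng.betavariate(alpha, beta)

a, b = stretched_beta_shapes(rho_bar=0.3, lam=0.02, rho_min=-0.5)
rng = random.Random(7)
draws = [draw_stretched_beta(a, b, -0.5, rng) for _ in range(20000)]
```

Because both shapes are scaled by the same factor, the proposal keeps the posterior mean while inflating the spread, which is what makes it a safe importance-sampling density.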
B.2: Computation of the marginal likelihood of H _{0} : μ = 0
The marginal likelihood of the point hypothesis H_{0} : μ = 0 has to be approximated because ρ cannot be integrated out of f(y | μ, ρ)π(μ, ρ). Importance sampling is used to approximate this marginal likelihood,
The proposal density w(ρ) is the stretched beta distribution that is also used for computing the marginal likelihood of hypothesis H_{u} (see Appendix B.1).
B.3: Computation of the marginal likelihood of H _{1} : μ < 0
The computation of the marginal likelihood for the hypothesis H_{1} : μ < 0 is described in Section “Marginal likelihood”. The posterior model probability P(μ < 0 | y, H_{u}) is approximated using a random walk procedure that closely resembles the one in the Gibbs sampler in Appendix A and can be described as

1.
Set the initial value ρ_{0} and the variance of the proposal distribution s^{2}.

2.
Repeat the following steps to obtain j = 1, 2, ..., J draws. After every 100 draws, verify whether the acceptance rate a is between 0.15 and 0.5, and adjust s^{2} to \(\left[\sqrt{s^{2}}\left(\frac{a-0.5}{1-0.15}+1\right)\right]^{2}\) if a > 0.5 and to \(\left[\frac{\sqrt{s^{2}}}{2-a/0.15}\right]^{2}\) if a < 0.15.

(a)
Draw a candidate sample ρ^{∗} from the proposal distribution q(ρ^{∗} | ρ_{j−1}, s^{2}), which is a truncated normal distribution with mean ρ_{j−1}, variance s^{2}, lower bound of truncation ρ_{min}, and upper bound of truncation 1.

(b)
Compute the ratio
$$ R = \frac{ g(\mathbf{y} \mid \rho^{*}) / q(\rho^{*} \mid \rho_{j-1}, s^{2})} { g(\mathbf{y} \mid \rho_{j-1}) / q(\rho_{j-1} \mid \rho^{*}, s^{2})}. $$
(c)
Draw a number from a uniform distribution ranging from 0 to 1 and denote this by b.

(d)
If b < R, set ρ_{j} = ρ^{∗}; otherwise ρ_{j} = ρ_{j−1}.


3.
Draw samples from the conditional posterior of μ (denoted by μ_{j}) given the sampled ρ_{j}. The conditional posterior of μ is
$$ N \left( \frac{\sum y_{i} / ({\sigma}^{2}_{i}+\tilde{\sigma}^{2} \rho_{j}/(1-\rho_{j})) + \mu_{0}/{\sigma}^{2}_{j}}{\sum 1 / ({\sigma}^{2}_{i}+\tilde{\sigma}^{2} \rho_{j}/(1-\rho_{j})) + 1/{\sigma}^{2}_{j}}, \frac{1}{\sum 1/({\sigma}^{2}_{i}+\tilde{\sigma}^{2} \rho_{j}/(1-\rho_{j})) + 1/{\sigma}^{2}_{j}} \right) $$ (19)
where μ_{0} = 0 and \({\sigma^{2}_{j}} = \frac{k}{\sum 1/({\sigma^{2}_{i}}+\tilde{\sigma}^{2} \rho_{j}/(1-\rho_{j}))}\).

4.
Exclude a burn-in period from the draws.

5.
Compute how many draws of μ_{j} are smaller than 0 to approximate P(μ < 0 | y, H_{u}).
B.4: Computation of the marginal likelihood of H _{2} : μ > 0
Computation of the marginal likelihood of the hypothesis H_{2} : μ > 0 is highly similar to that of H_{1} : μ < 0. That is, the marginal likelihood can be approximated with
where P(μ > 0 | y, H_{u}) and P(μ > 0 | H_{u}) are the posterior and prior model probabilities for μ > 0 under the hypothesis H_{u} where both μ and ρ are unconstrained. The probability P(μ > 0 | H_{u}) is computed using the prior distribution π(μ | ρ). The posterior model probability P(μ > 0 | y, H_{u}) is approximated by means of the random walk procedure described in Appendix B.3, where in the final step the number of draws larger (instead of smaller) than 0 is computed.
B.5: Computation of the marginal likelihood of H _{0} : ρ = 0
The marginal likelihood of the hypothesis H_{0} : ρ = 0 can be computed with the likelihood function g(y | ρ) in Eq. 16 where μ is integrated out,
B.6: Computation of the marginal likelihood of H _{1} : ρ < 0
The marginal likelihood of the hypothesis H_{1} : ρ < 0 is computed in a similar way as the marginal likelihood for onesided hypotheses on μ. That is,
where P(ρ < 0 | y, H_{u}) and P(ρ < 0 | H_{u}) are the posterior and prior model probabilities for ρ < 0 under the hypothesis H_{u} where μ and ρ are unconstrained. The probability P(ρ < 0 | H_{u}) is computed using the prior π(ρ). The probability P(ρ < 0 | y, H_{u}) is approximated by means of the random walk procedure described in Appendix B.3 by computing how many draws of ρ_{j} are smaller than 0.
B.7: Computation of the marginal likelihood of H _{2} : ρ > 0
Computation of the marginal likelihood of the hypothesis H_{2} : ρ > 0 is analogous to that of H_{1} : ρ < 0,
with the exception that P(ρ > 0 | y, H_{u}) and P(ρ > 0 | H_{u}) are now the posterior and prior model probabilities of ρ > 0. The posterior model probability P(ρ > 0 | y, H_{u}) is approximated by computing how many draws of ρ_{j} are larger than 0 in the random walk procedure described in Appendix B.3.
Appendix C: Derivation of marginal likelihood of MAREMA model where μ is integrated out
We show in this appendix how the marginal likelihood in Eq. 16 can be obtained by integrating out μ in f(y | μ, ρ)π(μ, ρ). The marginal likelihood of the MAREMA model in Eq. 16 equals
Combining powers and replacing some expressions with their equivalents in matrix algebra yields
Rewriting Eq. 25 to facilitate integrating out μ gives
Integrating out μ of Eq. 26 yields the marginal likelihood in Eq. 16.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
van Aert, R. C. M., & Mulder, J. Bayesian hypothesis testing and estimation under the marginalized random-effects meta-analysis model. Psychon Bull Rev 29, 55–69 (2022). https://doi.org/10.3758/s13423-021-01918-9
DOI: https://doi.org/10.3758/s13423-021-01918-9