
Bayesian Comparison of Latent Variable Models: Conditional Versus Marginal Likelihoods


Abstract

Typical Bayesian methods for models with latent variables (or random effects) involve directly sampling the latent variables along with the model parameters. In high-level software code for model definitions (using, e.g., BUGS, JAGS, Stan), the likelihood is therefore specified as conditional on the latent variables. This can lead researchers to perform model comparisons via conditional likelihoods, where the latent variables are considered model parameters. In other settings, however, typical model comparisons involve marginal likelihoods where the latent variables are integrated out. This distinction is often overlooked despite the fact that it can have a large impact on the comparisons of interest. In this paper, we clarify and illustrate these issues, focusing on the comparison of conditional and marginal Deviance Information Criteria (DICs) and Watanabe–Akaike Information Criteria (WAICs) in psychometric modeling. The conditional/marginal distinction corresponds to whether the model should be predictive for the clusters that are in the data or for new clusters (where “clusters” typically correspond to higher-level units like people or schools). Correspondingly, we show that marginal WAIC corresponds to leave-one-cluster out cross-validation, whereas conditional WAIC corresponds to leave-one-unit out. These results lead to recommendations on the general application of the criteria to models with latent variables.


References

• Celeux, G., Forbes, F., Robert, C. P., & Titterington, D. M. (2006). Deviance information criteria for missing data models. Bayesian Analysis, 1(4), 651–673.
• daSilva, M. A., Bazán, J. L., & Huggins-Manley, A. C. (2019). Sensitivity analysis and choosing between alternative polytomous IRT models using Bayesian model comparison criteria. Communications in Statistics-Simulation and Computation, 48(2), 601–620.
• De Boeck, P. (2008). Random item IRT models. Psychometrika, 73, 533–559.
• Denwood, M. J. (2016). runjags: An R package providing interface utilities, model templates, parallel computing methods and additional distributions for MCMC models in JAGS. Journal of Statistical Software, 71(9), 1–25. https://doi.org/10.18637/jss.v071.i09.
• Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461–470.
• Fox, J. P. (2010). Bayesian item response modeling: Theory and applications. New York, NY: Springer.
• Furr, D. C. (2017). Bayesian and frequentist cross-validation methods for explanatory item response models (Unpublished doctoral dissertation). University of California, Berkeley, CA.
• Gelfand, A. E., Sahu, S. K., & Carlin, B. P. (1995). Efficient parametrisations for normal linear mixed models. Biometrika, 82, 479–488.
• Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). New York, NY: Chapman & Hall/CRC.
• Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24, 997–1016.
• Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2, 1360–1383.
• Gelman, A., Meng, X. L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–807.
• Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457–511.
• Gronau, Q. F., & Wagenmakers, E. J. (2018). Limitations of Bayesian leave-one-out cross-validation for model selection. Computational Brain & Behavior, 2(1), 1–11.
• Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14, 382–417.
• Kang, T., Cohen, A. S., & Sung, H. J. (2009). Model selection indices for polytomous items. Applied Psychological Measurement, 33, 499–518.
• Kaplan, D. (2014). Bayesian statistics for the social sciences. New York, NY: The Guilford Press.
• Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics, 95, 391–413.
• Levy, R., & Mislevy, R. J. (2016). Bayesian psychometric modeling. Boca Raton, FL: Chapman & Hall.
• Li, F., Cohen, A. S., Kim, S. H., & Cho, S. J. (2009). Model selection methods for mixture dichotomous IRT models. Applied Psychological Measurement, 33, 353–373.
• Li, L., Qiu, S., & Feng, C. X. (2016). Approximating cross-validatory predictive evaluation in Bayesian latent variable models with integrated IS and WAIC. Statistics and Computing, 26, 881–897.
• Lu, Z. H., Chow, S. M., & Loken, E. (2017). A comparison of Bayesian and frequentist model selection methods for factor analysis models. Psychological Methods, 22(2), 361–381.
• Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The BUGS book: A practical introduction to Bayesian analysis. New York, NY: Chapman & Hall/CRC.
• Lunn, D., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS—a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
• Luo, Y., & Al-Harbi, K. (2017). Performances of LOO and WAIC as IRT model selection methods. Psychological Test and Assessment Modeling, 59, 183–205.
• Marshall, E. C., & Spiegelhalter, D. J. (2007). Identifying outliers in Bayesian hierarchical models: A simulation-based approach. Bayesian Analysis, 2(2), 409–444.
• McElreath, R. (2015). Statistical rethinking: A Bayesian course with examples in R and Stan. New York, NY: Chapman & Hall/CRC.
• Merkle, E. C., & Rosseel, Y. (2018). blavaan: Bayesian structural equation models via parameter expansion. Journal of Statistical Software, 85(4), 1–30.
• Millar, R. B. (2009). Comparison of hierarchical Bayesian models for overdispersed count data using DIC and Bayes’ factors. Biometrics, 65, 962–969.
• Millar, R. B. (2018). Conditional vs. marginal estimation of predictive loss of hierarchical models using WAIC and cross-validation. Statistics and Computing, 28, 375–385.
• Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.
• Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17, 313–335.
• Navarro, D. (2018). Between the devil and the deep blue sea: Tensions between scientific judgement and statistical model selection. Computational Brain & Behavior, 2(1), 28–34.
• Naylor, J. C., & Smith, A. F. (1982). Applications of a method for the efficient computation of posterior distributions. Journal of the Royal Statistical Society C (Applied Statistics), 31, 214–225.
• Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1–32.
• O’Hagan, A. (1976). On posterior joint and marginal modes. Biometrika, 63, 329–333.
• Piironen, J., & Vehtari, A. (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27, 711–735.
• Pinheiro, J. C., & Bates, D. M. (1995). Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of Computational and Graphical Statistics, 4, 12–35.
• Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd international workshop on distributed statistical computing.
• Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 9(3), 523–539.
• Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics, 128(2), 301–323.
• Raftery, A. E., & Lewis, S. M. (1995). The number of iterations, convergence diagnostics, and generic Metropolis algorithms. London: Chapman and Hall.
• Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
• Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
• Song, X. Y., & Lee, S. Y. (2012). Basic and advanced Bayesian structural equation modeling: With applications in the medical and behavioral sciences. Chichester, UK: Wiley.
• Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64, 583–639.
• Spielberger, C. (1988). State-trait anger expression inventory research edition [Computer software manual]. Odessa, FL.
• Stan Development Team. (2014). Stan modeling language users guide and reference manual, version 2.5.0 [Computer software manual]. http://mc-stan.org/.
• Trevisani, M., & Gelfand, A. E. (2003). Inequalities between expected marginal log-likelihoods, with implications for likelihood-based model complexity and comparison measures. The Canadian Journal of Statistics, 31, 239–250.
• Vansteelandt, K. (2000). Formal models for contextualized personality psychology (Unpublished doctoral dissertation). University of Leuven, Leuven, Belgium.
• Vehtari, A., Gelman, A., & Gabry, J. (2016). loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models. R package version 0.1.6. https://github.com/stan-dev/loo.
• Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432.
• Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T., & Winther, O. (2016). Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. Journal of Machine Learning Research, 17, 1–38.
• Vehtari, A., Simpson, D. P., Yao, Y., & Gelman, A. (2018). Limitations of “Limitations of Bayesian leave-one-out cross-validation for model selection”. Computational Brain & Behavior, 2(1), 22–27.
• Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594.
• White, I. R. (2010). simsum: Analyses of simulation studies including Monte Carlo error. The Stata Journal, 10, 369–385.
• Wicherts, J. M., Dolan, C. V., & Hessen, D. J. (2005). Stereotype threat and group differences in test performance: A question of measurement invariance. Journal of Personality and Social Psychology, 89(5), 696–716.
• Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis, 13, 917–1007. https://doi.org/10.1214/17-BA1091.
• Zhang, X., Tao, J., Wang, C., & Shi, N. Z. (2019). Bayesian model selection methods for multilevel IRT models: A comparison of five DIC-based indices. Journal of Educational Measurement, 56, 3–27.
• Zhao, Z., & Severini, T. A. (2017). Integrated likelihood computation methods. Computational Statistics, 32, 281–313.
• Zhu, X., & Stone, C. A. (2012). Bayesian comparison of alternative graded response models for performance assessment applications. Educational and Psychological Measurement, 72(5), 774–799.


Author information

Correspondence to Edgar C. Merkle.

Additional information


The research reported here was supported by NSF Grant 1460719 and by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D140059. The authors thank Frédéric Gosselin and three anonymous reviewers for comments that improved the paper. Code to replicate the results from this paper can be found at http://semtools.r-forge.r-project.org/.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 7 KB)

Appendices

A. Posterior Expectations of Marginal and Conditional Likelihoods

Following Trevisani and Gelfand (2003), who studied DIC in the context of linear mixed models, we can use Jensen’s inequality to show that the posterior expected value of the marginal log-likelihood is no greater than the posterior expected value of the conditional log-likelihood.

First, consider the function \(h(x) = x\log (x)\). It is convex, so Jensen’s inequality states that

$$\begin{aligned} h(\mathrm {E}(x)) \le \mathrm {E}(h(x)). \end{aligned}$$
(13)

Setting \(x = f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta })\) and taking expected values with respect to \(\varvec{\zeta }\), we have that

$$\begin{aligned} h(\mathrm {E}(x))&= \displaystyle \log \left[ \displaystyle \int f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) g(\varvec{\zeta } | \varvec{\psi }) \text {d}\varvec{\zeta } \right] \int f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) g(\varvec{\zeta } | \varvec{\psi }) \text {d}\varvec{\zeta } \end{aligned}$$
(14)
$$\begin{aligned} \mathrm {E}(h(x))&= \displaystyle \int \log (f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta })) f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) g(\varvec{\zeta } | \varvec{\psi }) \text {d}\varvec{\zeta }, \end{aligned}$$
(15)

so that

$$\begin{aligned}&\displaystyle \log \left[ \displaystyle \int f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) g(\varvec{\zeta } | \varvec{\psi }) \text {d}\varvec{\zeta } \right] \int f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) g(\varvec{\zeta } | \varvec{\psi }) \text {d}\varvec{\zeta }\nonumber \\&\le \displaystyle \int \log (f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta })) f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) g(\varvec{\zeta } | \varvec{\psi }) \text {d}\varvec{\zeta }. \end{aligned}$$
(16)

We now multiply both sides of this inequality by \(p(\varvec{\omega }, \varvec{\psi })/c\), where \(p(\varvec{\omega }, \varvec{\psi })\) is a prior distribution and

$$\begin{aligned} c = \displaystyle \int _{\varvec{\psi }} \displaystyle \int _{\varvec{\omega }} \displaystyle \int _{\varvec{\zeta }} f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) g(\varvec{\zeta } | \varvec{\psi }) \text {d}\varvec{\zeta } \cdot p(\varvec{\omega }, \varvec{\psi }) \text {d}\varvec{\omega } \text {d}\varvec{\psi } \end{aligned}$$
(17)

is the posterior normalizing constant. Finally, we integrate both sides with respect to \(\varvec{\omega }\) and \(\varvec{\psi }\) to obtain

$$\begin{aligned}&\displaystyle \int _{\varvec{\psi }} \displaystyle \int _{\varvec{\omega }} \log \left( \displaystyle \int _{\varvec{\zeta }} f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) g(\varvec{\zeta } | \varvec{\psi }) \text {d}\varvec{\zeta } \right) \displaystyle \int _{\varvec{\zeta }} f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) g(\varvec{\zeta } | \varvec{\psi }) \text {d}\varvec{\zeta } \cdot \left[ p(\varvec{\omega }, \varvec{\psi })/c \right] \text {d}\varvec{\omega } \text {d}\varvec{\psi } \le \nonumber \\&\quad \displaystyle \int _{\varvec{\psi }} \displaystyle \int _{\varvec{\omega }} \displaystyle \int _{\varvec{\zeta }} \log \left( f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) \right) f_{\text {c}}(\varvec{y} | \varvec{\omega }, \varvec{\zeta }) g(\varvec{\zeta } | \varvec{\psi }) \text {d}\varvec{\zeta } \cdot \left[ p(\varvec{\omega }, \varvec{\psi })/c \right] \text {d}\varvec{\omega } \text {d}\varvec{\psi }. \end{aligned}$$
(18)

We can now recognize both sides of (18) as expected values of log-likelihoods with respect to the model’s posterior distribution, leading to

$$\begin{aligned} \mathrm {E}_{\varvec{\omega }, \varvec{\psi } | \varvec{y}} \left[ \log f_\text {m}(\varvec{y} |\varvec{\omega }, \varvec{\psi } ) \right] \le \mathrm {E}_{\varvec{\omega }, \varvec{\zeta } | \varvec{y}} \left[ \log f_\text {c}(\varvec{y}|\varvec{\omega }, \varvec{\zeta }) \right] . \end{aligned}$$
(19)

Note that the above results do not rely on normality, so they also apply to, e.g., the two-parameter logistic model estimated via marginal likelihood.

B. Effective Number of Parameters for Marginal and Conditional DIC

To consider the effective number of parameters for normal likelihoods, we rely on results from Spiegelhalter et al. (2002). They showed that the effective number of parameters \(p_D\) can be viewed as the fraction of information about model parameters in the likelihood, relative to the total information contained in both the likelihood and prior. Under this view, a specific model parameter gets a value of “1” if all of its information is contained in the likelihood, and it gets a value below “1” if some information is contained in the prior. We sum these values across all parameters to obtain \(p_D\).

Spiegelhalter et al. (2002) relatedly showed that, for normal likelihoods, \(p_D\) can be approximated by

$$\begin{aligned} p_D \approx \text {tr}(\varvec{I}(\hat{\varvec{\theta }}) \varvec{V}), \end{aligned}$$
(20)

where \(\varvec{\theta }\) includes all model parameters (including \(\varvec{\zeta }\) in the conditional model), \(\varvec{I}(\hat{\varvec{\theta }})\) is the observed Fisher information matrix, and \(\varvec{V}\) is the posterior covariance matrix of \(\varvec{\theta }\). When the prior distribution of \(\varvec{\theta }\) is non-informative, then \(\varvec{I}(\hat{\varvec{\theta }}) \approx \varvec{V}^{-1}\). Consequently, matching the discussion in the previous paragraph, the effective number of parameters under non-informative priors will approximate the total number of model parameters.

This result implies that the conditional \(p_D\) will tend to be much larger than the marginal \(p_D\). In particular, in the conditional case, each individual has a unique \(\varvec{\zeta }_j\) vector that is included as part of the total parameter count. The resulting \(p_D\) will not necessarily be close to the total parameter count because the “prior distribution” of \(\varvec{\zeta }_j\) is a hyperdistribution, whereby individuals’ \(\varvec{\zeta }_j\) estimates are shrunk toward the mean. Thus, for these parameters, the “prior” is informative. However, even when the fraction of information in the likelihood is low for these parameters, the fact that we are summing over hundreds or thousands of \(\varvec{\zeta }_j\) vectors implies that the conditional \(p_D\) will be larger than the marginal \(p_D\).
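As a stylized illustration, consider a normal random-intercept model with known variances, \(y_{ij} = \zeta _j + e_{ij}\) with \(e_{ij} \sim N(0, \sigma ^2)\) and \(\zeta _j \sim N(0, \tau ^2)\). The conditional likelihood contains information \(n_j/\sigma ^2\) about each \(\zeta _j\), and the conditional posterior variance of \(\zeta _j\) (given the variance parameters) is \((n_j/\sigma ^2 + 1/\tau ^2)^{-1}\), so the contribution of \(\zeta _j\) to (20) is the familiar shrinkage factor

$$\begin{aligned} \frac{n_j/\sigma ^2}{n_j/\sigma ^2 + 1/\tau ^2} = \frac{n_j \tau ^2}{n_j \tau ^2 + \sigma ^2} < 1. \end{aligned}$$

Each term is below 1 because the hyperdistribution carries some of the information, but summing over J clusters still yields a conditional \(p_D\) of order J, whereas the marginal model contributes only its few free parameters.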

C. Adaptive Gaussian Quadrature for Marginal Likelihoods

We modify the adaptive quadrature method proposed by Rabe-Hesketh et al. (2005) for generalized linear mixed models to a form designed to exploit MCMC draws from the joint posterior of all latent variables and model parameters. Here, we describe one-dimensional integration, but the method is straightforward to generalize to multidimensional integration as in Rabe-Hesketh et al. (2005). We assume that \(\zeta _j\) represents a disturbance with zero mean and variance \(\tau ^2\) so that there is only one hyperparameter \(\psi =\tau \).

In a non-Bayesian setting, standard (non-adaptive) Gauss–Hermite quadrature can be viewed as approximating the conditional prior density \(g(\zeta _j|\tau )\) by a discrete distribution with masses \(w_m\), \(m=1,\ldots , M\) at locations \(a_m\tau \) so that the integrals in (2) are approximated by sums of M terms, where M is the number of quadrature points,

$$\begin{aligned} f_\text {m}(\varvec{y}| \varvec{\omega }, {\tau })\ \approx \ \prod _{j=1}^J \sum _{m=1}^M w_m f_\text {c}(\varvec{y}_{j} | \varvec{\omega }, \zeta _j=a_m\tau ) . \end{aligned}$$
(21)

To obtain information criteria, this method can easily be applied to the conditional likelihood function for each draw \(\varvec{\omega }^s\) and \(\tau ^s\) (\(s=1,\ldots ,S\)) of the model parameters from MCMC output. This approximation can be thought of as a deterministic version of Monte Carlo integration. Adaptive quadrature is then a deterministic version of Monte Carlo integration via importance sampling.

If applied to each MCMC draw \(\varvec{\omega }^s\) and \(\tau ^s\), the importance density is a normal approximation to the conditional posterior density of \(\zeta _j\), given the current draws of the model parameters. Rabe-Hesketh et al. (2005) used a normal density whose mean and variance equal the mean \(\text {E}(\zeta _j|\varvec{y}_j, \varvec{\omega }^s, \tau ^s)\) and variance \(\mathrm {var}(\zeta _j|\varvec{y}_j, \varvec{\omega }^s, \tau ^s)\) of the conditional posterior density of \(\zeta _j\), whereas Pinheiro and Bates (1995) and some software use a normal density with mean equal to the mode of the conditional posterior and variance equal to minus the reciprocal of the second derivative of the conditional log posterior. Here, we modify the method of Rabe-Hesketh et al. (2005) for the Bayesian setting by using a normal approximation to the unconditional posterior density of \(\zeta _j\) as the importance density. Specifically, we use a normal density with mean

$$\begin{aligned} {\tilde{\mu }}_j = \widetilde{E}(\zeta _j | \varvec{y}) = \frac{1}{S} \sum _{s=1}^S \zeta _j^s , \end{aligned}$$
(22)

and standard deviation

$$\begin{aligned} {\tilde{\phi }}_j = \sqrt{{\widetilde{\mathrm {var}}}(\zeta _j | \varvec{y}) } = \sqrt{\frac{1}{S-1} \sum _{s=1}^S (\zeta _j^s- {\tilde{\mu }}_j)^2}, \end{aligned}$$
(23)

where \(\zeta _j^s\) is the draw of \(\zeta _j\) from its unconditional posterior in the sth MCMC iteration. The tildes indicate that these quantities are subject to Monte Carlo error.

Note that this version of adaptive quadrature is computationally more efficient than the one based on the mean and standard deviation of the conditional posterior distributions because the latter would have to be evaluated for each MCMC draw and would require numerical integration, necessitating a procedure that iterates between updating the quadrature locations and weights and updating the conditional posterior means and standard deviations. A disadvantage of our approach is that the quadrature locations and weights (and hence importance density) are not as targeted, but the computational efficiency gained also makes it more feasible to increase M.

The adaptive quadrature approximation to the marginal likelihood at posterior draw s becomes

$$\begin{aligned} f_\text {m}(\varvec{y}| \varvec{\omega }, {\tau })\ \approx \ \prod _{j=1}^J \sum _{m=1}^M w_{jm}^s f_\text {c}(\varvec{y}_{j} | \varvec{\omega }, \zeta _j=a_{jm}) , \end{aligned}$$
(24)

where the adapted locations are

$$\begin{aligned} a_{jm} = {\tilde{\mu }}_j + {\tilde{\phi }}_j \times a_m, \end{aligned}$$
(25)

and the corresponding weights or masses are

$$\begin{aligned} w_{jm}^s = \sqrt{2\pi } \times {\tilde{\phi }}_j \times \exp \left( \frac{a_{m}^2}{2} \right) \times g \left( a_{jm}; 0, \tau ^{2,s} \right) \times w_m, \end{aligned}$$
(26)

where \( g \left( a_{jm}; 0, \tau ^{2,s} \right) \) is the normal density function with mean zero and variance \(\tau ^{2,s}\), evaluated at \(a_{jm}\).

The number of integration points M required to obtain a sufficiently accurate approximation is determined by evaluating the approximation of the target quantity (DIC, WAIC) with increasing values of M (7, 11, 17, etc.) and choosing the value of M for which the target quantity changes by less than 0.01 from the previous value. Here, the candidate values for M are chosen to increase approximately by 50% while remaining odd so that one of the quadrature locations is at the posterior mean. Furr (2017) finds this approach to be accurate in simulations for linear mixed models where the adaptive quadrature approximation can be compared with the closed-form integrals.
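To make these computations concrete, the following minimal R sketch implements (21)–(26) for a unidimensional random-intercept logistic (Rasch-type) model. The conditional likelihood, the item-difficulty vector `beta_s`, and the draw objects are illustrative placeholders rather than the authors' replication code; the Gauss–Hermite rule is obtained via the standard Golub–Welsch eigenvalue construction.

```r
## Minimal sketch of (21)-(26), assuming a unidimensional random-intercept
## logistic (Rasch-type) conditional likelihood. The objects below (response
## matrix `y`, item difficulties `beta_s`, draws `zeta_draws`, `tau_s`) are
## illustrative placeholders, not the authors' replication code.

## Gauss-Hermite rule rescaled to the standard normal (Golub-Welsch):
## locations a_m and masses w_m with sum(w_m) = 1.
gh_normal <- function(M) {
  k <- seq_len(M - 1)
  jac <- diag(0, M)
  jac[cbind(k, k + 1)] <- sqrt(k / 2)     # Jacobi matrix for the Hermite weight
  jac <- jac + t(jac)
  e <- eigen(jac, symmetric = TRUE)
  list(a = sqrt(2) * e$values,            # standard-normal locations a_m
       w = e$vectors[1, ]^2)              # masses w_m (sum to 1)
}

## Conditional log-likelihood log f_c(y_j | omega, zeta_j) for cluster j.
loglik_cond <- function(y_j, beta_s, zeta) {
  sum(dbinom(y_j, size = 1, prob = plogis(zeta - beta_s), log = TRUE))
}

## Adaptive quadrature approximation to log f_m(y | omega^s, tau^s), Eq. (24).
## `zeta_draws` is the S x J matrix of posterior draws of the latent variables,
## used only for the adaptation constants in (22) and (23).
log_fm_draw <- function(y, beta_s, tau_s, zeta_draws, M = 11) {
  gh <- gh_normal(M)
  mu_tilde  <- colMeans(zeta_draws)                     # (22)
  phi_tilde <- apply(zeta_draws, 2, sd)                 # (23)
  logfm <- 0
  for (j in seq_len(nrow(y))) {
    a_jm <- mu_tilde[j] + phi_tilde[j] * gh$a           # (25)
    w_jm <- sqrt(2 * pi) * phi_tilde[j] * exp(gh$a^2 / 2) *
            dnorm(a_jm, mean = 0, sd = tau_s) * gh$w    # (26)
    f_c  <- vapply(a_jm, function(z) exp(loglik_cond(y[j, ], beta_s, z)),
                   numeric(1))
    logfm <- logfm + log(sum(w_jm * f_c))               # (24)
  }
  logfm
}
```

Evaluating a function like `log_fm_draw()` once per posterior draw s then yields the marginal log-likelihood values that enter the marginal DIC and WAIC formulas.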

D. Monte Carlo Error for the DIC and WAIC Effective Number of Parameters

For the DIC effective number of parameters, \(p_\text {D}\), we can make use of the well-known method for estimating the Monte Carlo error for the mean of a quantity across MCMC iterations. Let a quantity computed in MCMC iteration s (\(s=1,\ldots ,S\)) be denoted \(\gamma _s\), so the point estimate of the expectation of \(\gamma \) is

$$\begin{aligned} \overline{\gamma } \ = \ \frac{1}{S}\sum _{s=1}^S \gamma _s. \end{aligned}$$

Then the squared Monte Carlo error (or Monte Carlo error variance) is estimated as

$$\begin{aligned} \mathrm{MCerr}^2(\overline{\gamma })\ =\ \frac{1}{S_\mathrm{eff}}\left[ \frac{1}{S-1} \sum _{s=1}^S (\gamma _s-\overline{\gamma })^2\right] , \end{aligned}$$
(27)

where \(S_\mathrm{eff}\) is the effective sample size.

For the approximation to the effective number of parameters in (6), proposed by Plummer (2008), we can obtain the Monte Carlo error variance by substituting

$$\begin{aligned} \gamma _s\ =\ \frac{1}{2} \log \left\{ \frac{f(\varvec{y}_s^{\text {r}1} | \varvec{\theta }_s^1)}{f(\varvec{y}_s^{\text {r}1} | \varvec{\theta }_s^2)} \right\} + \frac{1}{2} \log \left\{ \frac{f(\varvec{y}_s^{\text {r}2} | \varvec{\theta }_s^2)}{f(\varvec{y}_s^{\text {r}2} | \varvec{\theta }_s^1)} \right\} \end{aligned}$$

in (27).

For the approximation to the effective number of parameters in (4), proposed by Spiegelhalter et al. (2002), we assume that the variation due to \(\text {E}_{\varvec{\theta }|\varvec{y}}[-2\log f(\varvec{y}|{\varvec{\theta }})]\) dominates and use

$$\begin{aligned} \gamma _s = -2\log f(\varvec{y}|{\varvec{\theta }_s}). \end{aligned}$$
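As a minimal sketch (assuming a length-S vector `loglik_s` of log-likelihood evaluations, one per MCMC draw, and using `coda::effectiveSize()` as one way to estimate \(S_\mathrm{eff}\)), the Monte Carlo error in (27) under this substitution can be computed as follows.

```r
## Sketch of (27) under the Spiegelhalter et al. (2002) substitution
## gamma_s = -2 log f(y | theta_s). `loglik_s` is assumed to be the length-S
## vector of (marginal or conditional) log-likelihood evaluations, one per
## MCMC draw.
mcerr_pD <- function(loglik_s) {
  gamma <- -2 * loglik_s
  S_eff <- coda::effectiveSize(gamma)   # effective sample size S_eff
  sqrt(var(gamma) / S_eff)              # Monte Carlo error of mean(gamma), (27)
}
```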

For the WAIC effective number of parameters, \(p_\text {W}\), we use expressions for the Monte Carlo error of sample variances (see, e.g., White, 2010). Let the variance of \(\gamma _s\) over MCMC iterations be denoted \(v(\gamma )\),

$$\begin{aligned} v(\gamma )\ = \ \frac{1}{S-1} \sum _{s=1}^S (\gamma _s-\overline{\gamma })^2 = \frac{1}{S} \sum _{s=1}^S T_s,\quad \quad T_s = \frac{S}{S-1}(\gamma _s-\overline{\gamma })^2. \end{aligned}$$

Then the Monte Carlo error variance is estimated as

$$\begin{aligned} \mathrm{MCerr}^2(v(\gamma ))\ =\ \frac{1}{S_\mathrm{eff}\times S} \sum _{s=1}^S (T_s-v(\gamma ))^2. \end{aligned}$$
(28)

The conditional version of the effective number of parameters is given by the sum, over all units, of the posterior variances of the pointwise conditional log-likelihoods,

$$\begin{aligned} p_{\text {W}\text {c}} \ = \ \sum _{j=1}^J\sum _{i=1}^{n_j} \text {Var}_{\varvec{\omega },\varvec{\zeta }|\varvec{y}} \left[ \log f_\text {c}({y}_{ij}|\varvec{\omega },\varvec{\zeta }_j)\right] . \end{aligned}$$

The posterior variance \(\text {Var}_{\varvec{\omega },\varvec{\zeta }|\varvec{y}} \left[ \log f_\text {c}({y}_{ij}|\varvec{\omega },\varvec{\zeta }_j)\right] \) for a given unit is estimated by \(v(\gamma _{ij})\) with

$$\begin{aligned} \gamma _{ijs}\ =\ \log f_\text {c}({y}_{ij}|\varvec{\omega }_s,\varvec{\zeta }_{js}), \end{aligned}$$

where we have added subscripts ij to identify the unit; the corresponding Monte Carlo error variance \(\mathrm{MCerr}^2(v(\gamma _{ij}))\) is given in (28). The variance of the sum of the independent contributions \(v(\gamma _{ij})\) to \(\hat{p}_{\text {W}\text {c}}\) is the sum of the variances of these contributions,

$$\begin{aligned} \mathrm{MCerr}^2(\hat{p}_{\text {W}\text {c}})\ = \ \sum _{j=1}^J\sum _{i=1}^{n_j} \mathrm{MCerr}^2(v(\gamma _{ij})). \end{aligned}$$

For the marginal version of the effective number of parameters, \(p_{\text {W}\text {m}}\), we define

$$\begin{aligned} \gamma _{js}\ =\ \log f_\text {m}(\varvec{y}_{j}|\varvec{\omega }_s,\varvec{\psi }_s) \end{aligned}$$

and

$$\begin{aligned} \mathrm{MCerr}^2(\hat{p}_{\text {W}\text {m}})\ = \ \sum _{j=1}^J\mathrm{MCerr}^2(v(\gamma _{j})). \end{aligned}$$

E. Additional Results

This section contains additional results from the CFA example that were not included in the main text.

Figure 7 shows Spiegelhalter DIC values for models that use the uninformative priors described in the “Prior sensitivity” subsection. Of note here is that Models 2 and 2a sometimes failed to converge, resulting in fewer than ten points in the graphs. Because we used the automatic convergence procedure described in the main text, “failure to converge” here means that the chains did not achieve Gelman–Rubin statistics below 1.05 within the five minutes allotted. When we removed this five-minute limit, we encountered situations where chains ran for days without converging. In our experience, such convergence issues are common for CFA models in JAGS with flat priors: chains sometimes get stuck in extreme regions of the parameter space and cannot recover.

Fig. 7

Marginal and conditional DICs (Spiegelhalter et al. definitions) under uninformative prior distributions for nine models from Wicherts et al. (2005)

Figure 8 shows Plummer DIC values for models that use the informative priors described in the “Prior sensitivity” subsection. The figure also contains error bars (\(\pm 2\) SDs) from a single replication, similarly to Figure 3 in the main text. These error bars appear to continue to track Monte Carlo error in DIC. Comparing Figure 8 to Figure 4, we observe a different pattern in the conditional DICs across the Plummer and Spiegelhalter definitions. The Spiegelhalter conditional DICs (Figure 4) consistently prefer Model 2a, whereas the Plummer conditional DICs (Figure 8) generally decrease across models and become lowest for the final models, labeled 5b and 6 (though Models 4 and 5a are also similar). On the other hand, the marginal DICs are similar across the Spiegelhalter and Plummer definitions.

Fig. 8

Marginal and conditional DICs (Plummer definitions) under informative prior distributions for nine models from Wicherts et al. (2005)


Cite this article

Merkle, E.C., Furr, D. & Rabe-Hesketh, S. Bayesian Comparison of Latent Variable Models: Conditional Versus Marginal Likelihoods. Psychometrika 84, 802–829 (2019). https://doi.org/10.1007/s11336-019-09679-0
