Abstract
Bayesian inference often poses difficult computational problems. Even when off-the-shelf Markov chain Monte Carlo (MCMC) methods are applicable to the problem at hand, mixing issues might compromise the quality of the results. We introduce a framework for situations where the model space can be naturally divided into two components: (i) a baseline black-box probability distribution for the observed variables and (ii) constraints enforced on functionals of this probability distribution. Inference is performed by sampling from the posterior implied by the first component, and finding projections on the space defined by the second component. We discuss the implications of this separation in terms of priors, model selection, and MCMC mixing in latent variable models. Case studies include probabilistic principal component analysis, models of marginal independence, and an interpretable class of structured ordinal probit models.
Notes
In fact, this issue is endemic across the literature that considers Bayesian adaptations of frequentist estimators which depend on solving constrained optimization problems (Gribonval and Machart 2013).
Please notice that Wang (2013) correctly indicates that the \({\mathcal {G}}\)-inverse Wishart prior with a \(\delta \) independent of \({\mathcal {G}}\) may concentrate mass around a diagonal matrix as the dimensionality \(p\) of the problem increases. However, the empirical problems reported by Wang (2013), where the algorithm in Silva (2013) basically returns empty graphs in problems of 150 variables and small sample sizes, were unfortunately caused by a bug in our code: once this was corrected, the standard \({\mathcal {G}}\)-inverse Wishart prior had no issues in such problems. The point raised by Wang (2013) is still valid, and our procedure from Silva (2013) uses a hyperparameter \(\delta \) that depends on \({\mathcal {G}}\): given a baseline hyperparameter \(\delta \), we change \(\delta \) according to \({\mathcal {G}}\) by subtracting from it the minimum number of non-adjacent nodes among all nodes in \({\mathcal {G}}\). However, in the experiments described in this paper, this made little difference. Moreover, we make a small modification to the Gibbs sampler of Silva (2013) that renders it more scalable than the original version: unlike Silva (2013), which marginalizes a whole row/column of \(\varSigma \) every time each edge \(Y_i \leftrightarrow Y_j\) is sampled, we only marginalize \(\sigma _{ij}\).
Because of the positive definite projection, \({\mathbf {d}}\) is not necessarily monotonically decreasing in its entries, although in practice it will be approximately so.
Other straightforward criteria can be added to this scheme, such as requiring that \(d_k\) falls below a minimum acceptable error level. Although this selection rule is loosely inspired by the posterior predictive checks of Gelman et al. (1996), notice that here we apply this check to each sample of the distribution of \(\varSigma \) instead of samples from the data space.
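As a minimal sketch of such a selection rule (the function and variable names are hypothetical, and the error level is an arbitrary illustration), one pass over the discrepancies \(d_k\) of a single posterior sample might look like:

```python
import numpy as np

def select_rank(d, min_error=0.05):
    """Pick the smallest rank k whose discrepancy d[k] falls below
    an acceptable error level; fall back to the best-scoring rank
    otherwise.  Because of the positive definite projection, d need
    not be monotonically decreasing, so all entries are scanned."""
    d = np.asarray(d)
    below = np.flatnonzero(d < min_error)
    if below.size > 0:
        return int(below[0]) + 1   # ranks are 1-indexed
    return int(np.argmin(d)) + 1   # no rank meets the error level

# Example: discrepancies for ranks 1..5 from one posterior sample
select_rank([0.40, 0.12, 0.04, 0.03, 0.03])  # rank 3 is the first below 0.05
```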
First, a synthetic graph \({\mathcal {G}}\) is generated by adding each edge independently with probability \(0.05\). Observed variables \({\mathbf {Y}}\) are generated according to the model \({\mathbf {Y}} = B {\mathbf {X}} + {\mathbf {e}}\), where \({\mathbf {X}}\) is a set of independent standard Gaussian latent variables: for each pair \(\{Y_i, Y_j\}\) linked by a bi-directed edge, we create a latent variable \(X_k\), sample the sign of \((B)_{ik}\) uniformly, and sample the magnitude of \((B)_{ik}\) from a Gaussian truncated to the positive axis with location parameter \(0.25\) and variance parameter 1. The same applies to \((B)_{jk}\). The entries of \(B\) not corresponding to this process are set to zero. The error vector \({\mathbf {e}}\) is jointly Gaussian with zero mean, the off-diagonal entries of its covariance given by \(BB^T / 10\) and the diagonal entries set to \(1\). The corresponding covariance matrix is then rescaled into a correlation matrix.
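A rough sketch of this generative scheme (function names are hypothetical; rejection sampling stands in for whatever truncated-Gaussian sampler was actually used):

```python
import numpy as np

rng = np.random.default_rng(0)

def trunc_pos_normal(loc, scale=1.0):
    """Rejection sampling from a Gaussian truncated to the positive axis."""
    while True:
        x = rng.normal(loc, scale)
        if x > 0.0:
            return x

def simulate(p=20, edge_prob=0.05, loc=0.25):
    """Sketch of the synthetic scheme above: a random bi-directed graph,
    one latent X_k per edge, truncated-Gaussian loadings with random
    signs, and an implied covariance rescaled to a correlation matrix."""
    # each pair {Y_i, Y_j} is linked independently with probability 0.05
    adj = np.triu(rng.random((p, p)) < edge_prob, k=1)
    edges = np.argwhere(adj)
    B = np.zeros((p, len(edges)))
    for k, (i, j) in enumerate(edges):
        for row in (i, j):
            B[row, k] = rng.choice([-1.0, 1.0]) * trunc_pos_normal(loc)
    # error covariance: off-diagonal entries B B^T / 10, unit diagonal
    V = B @ B.T / 10.0
    np.fill_diagonal(V, 1.0)
    # implied covariance of Y = B X + e, rescaled into a correlation matrix
    sigma = B @ B.T + V
    s = np.sqrt(np.diag(sigma))
    return B, sigma / np.outer(s, s)
```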
Coefficients \(\lambda _i\) were generated by sampling their signs uniformly and their magnitudes from a Gaussian truncated to the positive axis with location parameter \(0.25\) and variance parameter 1. The correlation matrix \(\varSigma _Z\) was sampled by rescaling an inverse Wishart \((10, 10{\mathbf {I}})\) draw. \(\varSigma _\epsilon \) and \({\mathcal {G}}\) were sampled using the same scheme as in Sect. 4.2. The vector \(\{\lambda _i\}\) and \(\varSigma _\epsilon \) are rescaled such that \(\lambda _i^2 + (\varSigma _\epsilon )_{ii} = 1\) for all \(i\). Marginal probabilities for each \(Y_i\) are generated by drawing 5 uniform \((0, 1)\) variables, adding \(0.01\) to each, and renormalizing. Thresholds \(\{\tau ^i_k\}\) are then set accordingly.
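The threshold construction in this note can be sketched as follows (hypothetical function name; since the latent index underlying each \(Y_i\) has unit variance by the rescaling above, the thresholds are standard Gaussian quantiles of the cumulative category probabilities):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def marginal_thresholds(n_levels=5):
    """Draw uniform(0,1) category probabilities, add 0.01 to each,
    renormalize, and set the probit thresholds tau_k so that the
    cumulative category probabilities match a standard Gaussian
    marginal for the underlying latent variable."""
    p = rng.random(n_levels) + 0.01
    p = p / p.sum()
    cum = np.cumsum(p)[:-1]          # interior cut points in (0, 1)
    return [NormalDist().inv_cdf(float(c)) for c in cum]
```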
Matching is performed by creating a bipartite graph between latent variables \(\{Z_i^{(m)}\}\) in the candidate sample and the ground truth \(\{Z_i\}\), where the edge \(Z_i^{(m)} - Z_j\) is given a weight equal to the number of observed variables assigned both to \(Z_i^{(m)}\) in the candidate sample and to \(Z_j\) in the true model. The resulting matching is found by the Hungarian algorithm.
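A minimal sketch of this matching step, assuming cluster assignments are represented as sets of observed-variable indices (a hypothetical format), using SciPy's implementation of the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_latents(candidate, truth):
    """Match candidate latent variables to ground-truth ones via
    maximum-weight bipartite matching.  `candidate` and `truth` are
    lists of sets of observed-variable indices assigned to each
    latent variable; edge weights count the shared variables."""
    W = np.array([[len(c & t) for t in truth] for c in candidate])
    # linear_sum_assignment minimizes cost, so negate for maximum weight
    rows, cols = linear_sum_assignment(-W)
    return list(zip(rows, cols))

# Example: two candidate clusters vs. two true clusters
cand = [{0, 1, 2}, {3, 4}]
true = [{3, 4, 5}, {0, 1}]
match_latents(cand, true)  # matches cand[0] -> true[1], cand[1] -> true[0]
```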
We ignore here a non-ordinal count of products students recycle.
The criteria were as follows: questions should be either binary or ordinal, with no “I don’t know” items; questions should be aimed at all employees and should not lead to follow-up questions that only a subset of staff is asked to answer; and questions should not have more than \(50~\%\) missing data.
References
Andrieu, C., Roberts, G.: The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. 37, 697–725 (2009)
Barnard, J., McCulloch, R., Meng, X.: Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Stat. Sin. 10, 1281–1311 (2000)
Bartholomew, D., Steele, F., Moustaki, I., Galbraith, J.: Analysis of Multivariate Social Science Data, 2nd edn. Chapman & Hall, London (2008)
Beaumont, M., Zhang, W., Balding, D.: Approximate Bayesian computation in population genetics. Genetics 162, 2025–2035 (2002)
Bickel, P., Levina, E.: Covariance regularization by thresholding. Ann. Stat. 36, 2577–2604 (2008)
Bissiri, P., Holmes, C., Walker, S.: A general framework for updating belief distributions. arXiv:1306.6430 (2013)
Candès, E., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 11 (2011)
Care Quality Commission and Aston University: Aston Business School, National Health Service National Staff Survey, 2009 [computer file]. Colchester, Essex: UK Data Archive [distributor], October 2010. Available at https://www.esds.ac.uk, SN: 6570 (2010)
Drovandi, C.C., Pettitt, A.N., Lee, A.: Bayesian indirect inference using a parametric auxiliary model. Stat. Sci. (in press)
Drton, M., Richardson, T.: A new algorithm for maximum likelihood estimation in Gaussian models for marginal independence. In: Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., (2003)
Gallant, A.R., McCulloch, R.E.: On the determination of general scientific models with application to asset pricing. J. Am. Stat. Assoc. 104(485), 117–131 (2009)
Gelman, A., Meng, X., Stern, H.: Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sin. 6, 733–807 (1996)
Gribonval, R., Machart, P.: Reconciling “priors” & “priors” without prejudice? Adv. Neural Inf. Process. Syst. 26, 2193–2201 (2013)
Grzebyk, M., Wild, P., Chouaniere, D.: On identification of multi-factor models with correlated residuals. Biometrika 91, 141–151 (2004)
Jerrum, M., Sinclair, A.: The Markov chain Monte Carlo method: an approach to approximate counting and integration. In: Hochbaum, D.S. (ed.) Approximation Algorithms for NP-hard Problems, pp. 482–520. PWS Publishing Company, Pacific Grove (1996)
Neal, R.: Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto (1993)
Palla, K., Knowles, D.A., Ghahramani, Z.: A nonparametric variable clustering model. Adv. Neural Inf. Process. Syst. 25, 2987–2995 (2012)
Reeves, R., Pettitt, A.: A theoretical framework for approximate Bayesian computation. In: 20th International Workshop on Statistical Modelling, pp. 393–396 (2005)
Richardson, T., Spirtes, P.: Ancestral graph Markov models. Ann. Stat. 30, 962–1030 (2002)
Silva, R.: A MCMC approach for learning the structure of Gaussian acyclic directed mixed graphs. In: Giudici, P., Ingrassia, S., Vichi, M. (eds.) Stat. Models Data Anal., pp. 343–352. Springer, New York (2013)
Silva, R., Ghahramani, Z.: The hidden life of latent variables: Bayesian learning with mixed graph models. J. Mach. Learn. Res. 10, 1187–1238 (2009)
Tipping, M., Bishop, C.: Probabilistic principal component analysis. J. R. Stat. Soc. 61(3), 611–622 (1999)
Wang, H.: Scaling it up: Stochastic search structure learning in graphical models. Bayesian Anal. 10, 351–377 (2015)
Wright, J., Ganesh, A., Rao, S., Peng, Y., Ma, Y.: Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization. Adv. Neural Inf. Process. Syst. 22, 2080–2088 (2009)
Yin, G.: Bayesian generalized method of moments. Bayesian Anal. 4, 191–207 (2009)
Acknowledgments
We thank Irini Moustaki for the green consumer data. This work was partially funded by an EPSRC Grant EP/J013293/1.
Appendix: Algorithm for clustering variables in a partition-and-patch model
The algorithm below has three main stages. The first main stage adjusts the cluster assignment and parameters by changing one cluster assignment at a time. The second stage merges clusters with a single element with some other cluster, keeping records of past assignments so that the algorithm does not get stuck in an infinite loop. The third stage splits large clusters in two, again keeping track of which splits happened before.
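The three-stage loop above can be sketched as follows. This is a toy, self-contained skeleton: the distance-based reassignment score and the data format are stand-ins for illustration, not the model's actual objective, and all names are hypothetical.

```python
import numpy as np

def refine_clusters(X, assign, max_size=10, n_iter=5):
    """Toy skeleton of the three-stage scheme.
    Stage 1: adjust one cluster assignment at a time.
    Stage 2: merge singleton clusters into another cluster, recording
             past merges so the algorithm cannot cycle forever.
    Stage 3: split oversized clusters in two, likewise with a tabu set."""
    assign = np.asarray(assign).copy()
    merged, split = set(), set()
    for _ in range(n_iter):
        centers = {c: X[assign == c].mean(axis=0) for c in np.unique(assign)}
        # Stage 1: one-at-a-time reassignment to the closest centre
        for i in range(len(X)):
            d = {c: np.linalg.norm(X[i] - m) for c, m in centers.items()}
            assign[i] = min(d, key=d.get)
        # Stage 2: merge singletons (skipping merges tried before)
        for c in np.unique(assign):
            members = np.flatnonzero(assign == c)
            if len(members) == 1:
                others = [o for o in np.unique(assign)
                          if o != c and (c, o) not in merged]
                if others:
                    target = min(others, key=lambda o: np.linalg.norm(
                        X[members[0]] - X[assign == o].mean(axis=0)))
                    merged.add((c, target))
                    assign[members] = target
        # Stage 3: split clusters that grew too large (once each)
        for c in np.unique(assign):
            members = np.flatnonzero(assign == c)
            if len(members) > max_size and c not in split:
                split.add(c)
                assign[members[len(members) // 2:]] = assign.max() + 1
    return assign
```

The tabu sets `merged` and `split` play the role of the "records of past assignments" mentioned above: a merge or split that has been attempted is never retried, which guarantees termination of the outer loop.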
Line 1 of the algorithm corresponds to setting \(|\lambda _i| = \sqrt{(A)_{ii}}\) and setting the signs of each coefficient according to the identifiability conditions discussed in Sect. 5.2. Line 6 of the algorithm can be efficiently solved in closed form by varying \(C_i \in \{1, 2, \ldots , p\}\) and taking the derivative with respect to \(\lambda _i\). In this algorithm, each optimization should be interpreted as keeping all other arguments fixed, optimizing only with respect to the variables on the left-hand side. Entries of \(\varSigma _Z\) and \(\{\lambda _i\}\) are constrained to the \([-1, 1]\) interval, with no enforcement of a global positive definiteness constraint.
Cite this article
Silva, R., Kalaitzis, A. Bayesian inference via projections. Stat Comput 25, 739–753 (2015). https://doi.org/10.1007/s11222-015-9557-6