Abstract
Bayesian inference often poses difficult computational problems. Even when off-the-shelf Markov chain Monte Carlo (MCMC) methods are applicable to the problem at hand, mixing issues might compromise the quality of the results. We introduce a framework for situations where the model space can be naturally divided into two components: (i) a baseline black-box probability distribution for the observed variables and (ii) constraints enforced on functionals of this probability distribution. Inference is performed by sampling from the posterior implied by the first component, and finding projections on the space defined by the second component. We discuss the implications of this separation in terms of priors, model selection, and MCMC mixing in latent variable models. Case studies include probabilistic principal component analysis, models of marginal independence, and an interpretable class of structured ordinal probit models.
Notes
In fact, this issue is endemic across the literature that considers Bayesian adaptations of frequentist estimators which depend on solving constrained optimization problems (Gribonval and Machart 2013).
Please notice that Wang (2013) correctly indicates that the \({\mathcal {G}}\)-inverse Wishart prior with a \(\delta \) independent of \({\mathcal {G}}\) may concentrate mass around a diagonal matrix as the dimensionality \(p\) of the problem increases. However, the empirical problems reported by Wang (2013), where the algorithm in Silva (2013) basically returns empty graphs in problems of 150 variables and small sample sizes, were unfortunately caused by a bug in our code: once this was corrected, the standard \({\mathcal {G}}\)-inverse Wishart prior had no issues in such problems. The point raised by Wang (2013) is still valid, and our procedure from Silva (2013) uses a hyperparameter \(\delta \) that depends on \({\mathcal {G}}\): given a baseline hyperparameter \(\delta \), we change \(\delta \) according to \({\mathcal {G}}\) by subtracting from it the minimum number of non-adjacent nodes among all nodes in \({\mathcal {G}}\). However, in the experiments described in this paper, this made little difference. Moreover, we make a small modification to the Gibbs sampler of Silva (2013) that renders it more scalable than the original version: unlike Silva (2013), which marginalizes a whole row/column of \(\varSigma \) every time each edge \(Y_i \leftrightarrow Y_j\) is sampled, we only marginalize \(\sigma _{ij}\).
Because of the positive definite projection, \({\mathbf {d}}\) is not necessarily monotonically decreasing in its entries, although in practice it will be approximately so.
Other straightforward criteria can be added to this scheme, such as requiring that \(d_k\) falls below a minimum acceptable error level. Although this selection rule is loosely inspired by the posterior predictive checks of Gelman et al. (1996), notice that here we apply this check to each sample of the distribution of \(\varSigma \) instead of samples from the data space.
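As a minimal sketch of such a selection rule (the function and variable names are hypothetical, and the error level is an arbitrary illustration), one pass over the discrepancies \(d_k\) of a single posterior sample might look like:

```python
import numpy as np

def select_rank(d, min_error=0.05):
    """Pick the smallest rank k whose discrepancy d[k] falls below
    an acceptable error level; fall back to the best-scoring rank
    otherwise.  Because of the positive definite projection, d need
    not be monotonically decreasing, so all entries are scanned."""
    d = np.asarray(d)
    below = np.flatnonzero(d < min_error)
    if below.size > 0:
        return int(below[0]) + 1   # ranks are 1-indexed
    return int(np.argmin(d)) + 1   # no rank meets the error level

# Example: discrepancies for ranks 1..5 from one posterior sample
select_rank([0.40, 0.12, 0.04, 0.03, 0.03])  # rank 3 is the first below 0.05
```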
First, a synthetic graph \({\mathcal {G}}\) is generated by adding each edge independently with probability \(0.05\). Observed variables \({\mathbf {Y}}\) are generated according to the model \({\mathbf {Y}} = B {\mathbf {X}} + {\mathbf {e}}\), where \({\mathbf {X}}\) is a set of independent standard Gaussian latent variables: for each pair \(\{Y_i, Y_j\}\) linked by a bi-directed edge, we create a latent variable \(X_k\), sample the sign of \((B)_{ik}\) uniformly, and sample the magnitude of \((B)_{ik}\) from a Gaussian truncated to the positive axis with location parameter \(0.25\) and variance parameter 1. The same applies to \((B)_{jk}\). The entries of \(B\) not corresponding to this process are set to zero. The error vector \({\mathbf {e}}\) is jointly Gaussian with zero mean, the off-diagonal entries of its covariance given by \(BB^T / 10\) and the diagonal entries set to \(1\). The corresponding covariance matrix is then rescaled into a correlation matrix.
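A rough sketch of this generative scheme (function names are hypothetical; rejection sampling stands in for whatever truncated-Gaussian sampler was actually used):

```python
import numpy as np

rng = np.random.default_rng(0)

def trunc_pos_normal(loc, scale=1.0):
    """Rejection sampling from a Gaussian truncated to the positive axis."""
    while True:
        x = rng.normal(loc, scale)
        if x > 0.0:
            return x

def simulate(p=20, edge_prob=0.05, loc=0.25):
    """Sketch of the synthetic scheme above: a random bi-directed graph,
    one latent X_k per edge, truncated-Gaussian loadings with random
    signs, and an implied covariance rescaled to a correlation matrix."""
    # each pair {Y_i, Y_j} is linked independently with probability 0.05
    adj = np.triu(rng.random((p, p)) < edge_prob, k=1)
    edges = np.argwhere(adj)
    B = np.zeros((p, len(edges)))
    for k, (i, j) in enumerate(edges):
        for row in (i, j):
            B[row, k] = rng.choice([-1.0, 1.0]) * trunc_pos_normal(loc)
    # error covariance: off-diagonal entries B B^T / 10, unit diagonal
    V = B @ B.T / 10.0
    np.fill_diagonal(V, 1.0)
    # implied covariance of Y = B X + e, rescaled into a correlation matrix
    sigma = B @ B.T + V
    s = np.sqrt(np.diag(sigma))
    return B, sigma / np.outer(s, s)
```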
Coefficients \(\lambda _i\) were generated by sampling their signs uniformly and their magnitudes from a Gaussian truncated to the positive axis with location parameter \(0.25\) and variance parameter 1. The correlation matrix \(\varSigma _Z\) was sampled by rescaling an inverse Wishart \((10, 10{\mathbf {I}})\) draw. \(\varSigma _\epsilon \) and \({\mathcal {G}}\) were sampled using the same scheme as in Sect. 4.2. The vector \(\{\lambda _i\}\) and \(\varSigma _\epsilon \) are rescaled such that \(\lambda _i^2 + (\varSigma _\epsilon )_{ii} = 1\) for all \(i\). Marginal probabilities for each \(Y_i\) are generated by drawing 5 uniform \((0, 1)\) variables, adding \(0.01\) to each, and renormalizing. Thresholds \(\{\tau ^i_k\}\) are then set accordingly.
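The threshold construction in this note can be sketched as follows (hypothetical function name; since the latent index underlying each \(Y_i\) has unit variance by the rescaling above, the thresholds are standard Gaussian quantiles of the cumulative category probabilities):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def marginal_thresholds(n_levels=5):
    """Draw uniform(0,1) category probabilities, add 0.01 to each,
    renormalize, and set the probit thresholds tau_k so that the
    cumulative category probabilities match a standard Gaussian
    marginal for the underlying latent variable."""
    p = rng.random(n_levels) + 0.01
    p = p / p.sum()
    cum = np.cumsum(p)[:-1]          # interior cut points in (0, 1)
    return [NormalDist().inv_cdf(float(c)) for c in cum]
```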
Matching is performed by creating a bipartite graph between latent variables \(\{Z_i^{(m)}\}\) in the candidate sample and the ground truth \(\{Z_i\}\), where the edge \(Z_i^{(m)} - Z_j\) is given a weight equal to the number of observed variables assigned both to \(Z_i^{(m)}\) in the candidate sample and to \(Z_j\) in the true model. The resulting matching is found by the Hungarian algorithm.
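A minimal sketch of this matching step, assuming cluster assignments are represented as sets of observed-variable indices (a hypothetical format), using SciPy's implementation of the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_latents(candidate, truth):
    """Match candidate latent variables to ground-truth ones via
    maximum-weight bipartite matching.  `candidate` and `truth` are
    lists of sets of observed-variable indices assigned to each
    latent variable; edge weights count the shared variables."""
    W = np.array([[len(c & t) for t in truth] for c in candidate])
    # linear_sum_assignment minimizes cost, so negate for maximum weight
    rows, cols = linear_sum_assignment(-W)
    return list(zip(rows, cols))

# Example: two candidate clusters vs. two true clusters
cand = [{0, 1, 2}, {3, 4}]
true = [{3, 4, 5}, {0, 1}]
match_latents(cand, true)  # matches cand[0] -> true[1], cand[1] -> true[0]
```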
We ignore here a non-ordinal count of products students recycle.
The criteria were as follows: questions should be either binary or ordinal, with no “I don’t know” items; questions should be aimed at all employees and should not lead to follow-up questions that only a subset of staff is asked to answer; and questions should not have more than \(50~\%\) missing data.
References
Andrieu, C., Roberts, G.: The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. 37, 697–725 (2009)
Barnard, J., McCulloch, R., Meng, X.: Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Stat. Sin. 10, 1281–1311 (2000)
Bartholomew, D., Steele, F., Moustaki, I., Galbraith, J.: Analysis of Multivariate Social Science Data, 2nd edn. Chapman & Hall, London (2008)
Beaumont, M., Zhang, W., Balding, D.: Approximate Bayesian computation in population genetics. Genetics 162, 2025–2035 (2002)
Bickel, P., Levina, E.: Covariance regularization by thresholding. Ann. Stat. 36, 2577–2604 (2008)
Bissiri, P., Holmes, C., Walker, S.: A general framework for updating belief distributions. arXiv:1306.6430 (2013)
Candès, E., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 11 (2011)
Care Quality Commission and Aston University: Aston Business School, National Health Service National Staff Survey, 2009 [computer file]. Colchester, Essex: UK Data Archive [distributor], October 2010. Available at https://www.esds.ac.uk, SN: 6570 (2010)
Drovandi, C.C., Pettitt, A.N., Lee, A.: Bayesian indirect inference using a parametric auxiliary model. Stat. Sci. (in press)
Drton, M., Richardson, T.: A new algorithm for maximum likelihood estimation in Gaussian models for marginal independence. In: Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., (2003)
Gallant, A.R., McCulloch, R.E.: On the determination of general scientific models with application to asset pricing. J. Am. Stat. Assoc. 104(485), 117–131 (2009)
Gelman, A., Meng, X., Stern, H.: Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sin. 6, 733–807 (1996)
Gribonval, R., Machart, P.: Reconciling “priors” & “priors” without prejudice? Adv. Neural Inf. Process. Syst. 26, 2193–2201 (2013)
Grzebyk, M., Wild, P., Chouaniere, D.: On identification of multi-factor models with correlated residuals. Biometrika 91, 141–151 (2004)
Jerrum, M., Sinclair, A.: The Markov chain Monte Carlo method: an approach to approximate counting and integration. In: Hochbaum, D.S. (ed.) Approximation Algorithms for NP-hard Problems, pp. 482–520. PWS Publishing Company, Pacific Grove (1996)
Neal, R.: Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto (1993)
Palla, K., Knowles, D.A., Ghahramani, Z.: A nonparametric variable clustering model. Adv. Neural Inf. Process. Syst. 25, 2987–2995 (2012)
Reeves, R., Pettitt, A.: A theoretical framework for approximate Bayesian computation. In: 20th International Workshop on Statistical Modelling, pp. 393–396 (2005)
Richardson, T., Spirtes, P.: Ancestral graph Markov models. Ann. Stat. 30, 962–1030 (2002)
Silva, R.: A MCMC approach for learning the structure of Gaussian acyclic directed mixed graphs. In: Giudici, P., Ingrassia, S., Vichi, M. (eds.) Stat. Models Data Anal., pp. 343–352. Springer, New York (2013)
Silva, R., Ghahramani, Z.: The hidden life of latent variables: Bayesian learning with mixed graph models. J. Mach. Learn. Res. 10, 1187–1238 (2009)
Tipping, M., Bishop, C.: Probabilistic principal component analysis. J. R. Stat. Soc. 61(3), 611–622 (1999)
Wang, H.: Scaling it up: Stochastic search structure learning in graphical models. Bayesian Anal. 10, 351–377 (2015)
Wright, J., Ganesh, A., Rao, S., Peng, Y., Ma, Y.: Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization. Adv. Neural Inf. Process. Syst. 22, 2080–2088 (2009)
Yin, G.: Bayesian generalized method of moments. Bayesian Anal. 4, 191–207 (2009)
Acknowledgments
We thank Irini Moustaki for the green consumer data. This work was partially funded by an EPSRC Grant EP/J013293/1.
Appendix: Algorithm for clustering variables in a partition-and-patch model
The algorithm below has three main stages. The first main stage adjusts the cluster assignment and parameters by changing one cluster assignment at a time. The second stage merges clusters with a single element with some other cluster, keeping records of past assignments so that the algorithm does not get stuck in an infinite loop. The third stage splits large clusters in two, again keeping track of which splits happened before.
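The three-stage loop above can be sketched as follows. This is a toy, self-contained skeleton: the distance-based reassignment score and the data format are stand-ins for illustration, not the model's actual objective, and all names are hypothetical.

```python
import numpy as np

def refine_clusters(X, assign, max_size=10, n_iter=5):
    """Toy skeleton of the three-stage scheme.
    Stage 1: adjust one cluster assignment at a time.
    Stage 2: merge singleton clusters into another cluster, recording
             past merges so the algorithm cannot cycle forever.
    Stage 3: split oversized clusters in two, likewise with a tabu set."""
    assign = np.asarray(assign).copy()
    merged, split = set(), set()
    for _ in range(n_iter):
        centers = {c: X[assign == c].mean(axis=0) for c in np.unique(assign)}
        # Stage 1: one-at-a-time reassignment to the closest centre
        for i in range(len(X)):
            d = {c: np.linalg.norm(X[i] - m) for c, m in centers.items()}
            assign[i] = min(d, key=d.get)
        # Stage 2: merge singletons (skipping merges tried before)
        for c in np.unique(assign):
            members = np.flatnonzero(assign == c)
            if len(members) == 1:
                others = [o for o in np.unique(assign)
                          if o != c and (c, o) not in merged]
                if others:
                    target = min(others, key=lambda o: np.linalg.norm(
                        X[members[0]] - X[assign == o].mean(axis=0)))
                    merged.add((c, target))
                    assign[members] = target
        # Stage 3: split clusters that grew too large (once each)
        for c in np.unique(assign):
            members = np.flatnonzero(assign == c)
            if len(members) > max_size and c not in split:
                split.add(c)
                assign[members[len(members) // 2:]] = assign.max() + 1
    return assign
```

The tabu sets `merged` and `split` play the role of the "records of past assignments" mentioned above: a merge or split that has been attempted is never retried, which guarantees termination of the outer loop.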
Line 1 of the algorithm corresponds to setting \(|\lambda _i| = \sqrt{(A)_{ii}}\) and setting the signs of each coefficient according to the identifiability conditions discussed in Sect. 5.2. Line 6 of the algorithm can be efficiently solved in closed form by varying \(C_i \in \{1, 2, \ldots , p\}\) and taking the derivative with respect to \(\lambda _i\). In this algorithm, each optimization should be interpreted as keeping all other arguments fixed, optimizing only with respect to the variables on the left-hand side. Entries of \(\varSigma _Z\) and \(\{\lambda _i\}\) are constrained to the \([-1, 1]\) interval, with no enforcement of a global positive definiteness constraint.
Cite this article
Silva, R., Kalaitzis, A. Bayesian inference via projections. Stat Comput 25, 739–753 (2015). https://doi.org/10.1007/s11222-015-9557-6