Abstract
We discuss Bayesian methods for model averaging and model selection among Bayesian-network models with hidden variables. In particular, we examine large-sample approximations for the marginal likelihood of naive-Bayes models in which the root node is hidden. Such models are useful for clustering or unsupervised learning. We consider a Laplace approximation and the less accurate but more computationally efficient approximation known as the Bayesian Information Criterion (BIC), which is equivalent to Rissanen's (1987) Minimum Description Length (MDL). Also, we consider approximations that ignore some off-diagonal elements of the observed information matrix and an approximation proposed by Cheeseman and Stutz (1995). We evaluate the accuracy of these approximations using a Monte-Carlo gold standard. In experiments with artificial and real examples, we find that (1) none of the approximations are accurate when used for model averaging, (2) all of the approximations, with the exception of BIC/MDL, are accurate for model selection, (3) among the accurate approximations, the Cheeseman–Stutz and Diagonal approximations are the most computationally efficient, (4) all of the approximations, with the exception of BIC/MDL, can be sensitive to the prior distribution over model parameters, and (5) the Cheeseman–Stutz approximation can be more accurate than the other approximations, including the Laplace approximation, in situations where the parameters in the maximum a posteriori configuration are near a boundary.
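For orientation, the standard large-sample forms of these approximations can be stated compactly; the notation below is assumed for this summary rather than drawn from the paper itself. Let $D$ denote the data, $m$ the model, $d$ the number of free parameters, $N$ the sample size, $\hat{\theta}$ the maximum-likelihood parameter configuration, and $\tilde{\theta}$ the maximum a posteriori (MAP) configuration. The BIC/MDL approximation is

$$\log p(D \mid m) \;\approx\; \log p(D \mid \hat{\theta}, m) - \frac{d}{2} \log N,$$

and the Laplace approximation is

$$p(D \mid m) \;\approx\; p(D \mid \tilde{\theta}, m)\, p(\tilde{\theta} \mid m)\, (2\pi)^{d/2}\, \lvert A \rvert^{-1/2},$$

where $A$ is the negative Hessian of $\log p(D, \theta \mid m)$ evaluated at $\tilde{\theta}$; the Diagonal approximation drops some off-diagonal elements of $A$ so that its determinant is cheap to compute. The Cheeseman–Stutz approximation is

$$\log p(D \mid m) \;\approx\; \log p(D' \mid m) + \log p(D \mid \tilde{\theta}, m) - \log p(D' \mid \tilde{\theta}, m),$$

where $D'$ is a completed data set obtained from the expected sufficient statistics at $\tilde{\theta}$, so that $p(D' \mid m)$ has the closed form available for complete data. Consult the paper for the exact conventions used in the experiments.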
References
Azevedo-Filho, A. & Shachter, R. (1994). Laplace's method approximations for probabilistic inference in belief networks with continuous variables. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 28–36). San Mateo, CA: Morgan Kaufmann.
Bareiss, E. & Porter, B. (1987). Protos: An exemplar-based learning apprentice. In Proceedings of the Fourth International Workshop on Machine Learning (pp. 12–23). San Mateo, CA: Morgan Kaufmann.
Becker, S. & LeCun, Y. (1989). Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School (pp. 29–37). San Mateo, CA: Morgan Kaufmann.
Berger, J. (1985). Statistical decision theory and Bayesian analysis. Berlin: Springer.
Bernardo, J. & Smith, A. (1994). Bayesian theory. New York: John Wiley and Sons.
Buntine, W. (1994a). Computing second derivatives in feed-forward networks: A review. IEEE Transactions on Neural Networks, 5, 480–488.
Buntine, W. (1994b). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159–225.
Buntine, W. (1996). A guide to the literature on learning graphical models. IEEE Transactions on Knowledge and Data Engineering, 8, 195–210.
Cheeseman, P. & Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.), Advances in knowledge discovery and data mining (pp. 153–180). Menlo Park, CA: AAAI Press.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313–1321.
Chickering, D. & Heckerman, D. (1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (pp. 158–168). San Mateo, CA: Morgan Kaufmann.
Clogg, C. (1995). Latent class models. In Arminger, G., Clogg, C., & Sobel, M. (Eds.), Handbook of statistical modeling for the social and behavioral sciences. New York: Plenum Press.
Cooper, G. & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Draper, D. (1995). Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society, Series B, 57, 45–97.
Geiger, D. & Heckerman, D. (1994). Learning Gaussian networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 235–243). San Mateo, CA: Morgan Kaufmann.
Geiger, D., Heckerman, D., & Meek, C. (1996). Asymptotic model selection for directed networks with hidden variables. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (pp. 283–290). San Mateo, CA: Morgan Kaufmann.
Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Gilks, W., Richardson, S., & Spiegelhalter, D. (1996). Markov chain Monte Carlo in practice. New York: Chapman and Hall.
Good, I. (1965). The estimation of probabilities. Cambridge, MA: MIT Press.
Gull, S. & Skilling, J. (1991). Quantified maximum entropy. MemSys5 user's manual. Tech. rep., M.E.D.C., 33 North End, Royston, SG8 6NR, England.
Haughton, D. (1988). On the choice of a model to fit data from an exponential family. Annals of Statistics, 16, 342–355.
Heckerman, D. (1995). A tutorial on learning Bayesian networks. Tech. rep. MSR-TR-95-06, Microsoft Research, Redmond, WA. Revised January, 1996.
Heckerman, D. & Geiger, D. (1995). Likelihoods and priors for Bayesian networks. Tech. rep. MSR-TR-95-54, Microsoft Research, Redmond, WA.
Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.
Hong, Z. & Yang, J. (1991). Optimal discriminant plane for a small number of samples and design method of classifier on the plane. Pattern Recognition, 24, 317–324.
Jeffreys, H. (1939). Theory of probability. Oxford University Press.
Jensen, F., Lauritzen, S., & Olesen, K. (1990). Bayesian updating in recursive graphical models by local computations. Computational Statistics Quarterly, 4, 269–282.
Kass, R. & Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Kass, R., Tierney, L., & Kadane, J. (1988). Asymptotics in Bayesian computation. In Bernardo, J., DeGroot, M., Lindley, D., & Smith, A. (Eds.), Bayesian statistics 3 (pp. 261–278). Oxford University Press.
Kass, R. & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90, 928–934.
MacKay, D. (1992a). Bayesian interpolation. Neural Computation, 4, 415–447.
MacKay, D. (1992b). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
MacKay, D. (1996). Choice of basis for the Laplace approximation. Tech. rep., Cavendish Laboratory, Cambridge, UK.
Madigan, D. & York, J. (1995). Bayesian graphical models for discrete data. International Statistical Review, 63, 215–232.
Meng, X. & Rubin, D. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86, 899–909.
Merz, C. & Murphy, P. (1996). UCI repository of machine learning databases, www.ics.uci.edu/~mlearn/mlrepository.html. Tech. rep., University of California, Irvine.
Michalski, R. & Chilausky, R. (1980). Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4.
Neal, R. (1991). Bayesian mixture modeling by Monte Carlo simulation. Tech. rep. CRG-TR-91-2, Department of Computer Science, University of Toronto.
Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Tech. rep. CRG-TR-93-1, Department of Computer Science, University of Toronto.
Raftery, A. (1994). Approximate Bayes factors and accounting for model uncertainty in generalized linear models. Tech. rep. 255, Department of Statistics, University of Washington.
Raftery, A. (1995). Bayesian model selection in social research. In Marsden, P. (Ed.), Sociological methodology. Cambridge, MA: Blackwells.
Raftery, A. (1996). Hypothesis testing and model selection (chap. 10). New York: Chapman and Hall.
Rissanen, J. (1987). Stochastic complexity (with discussion). Journal of the Royal Statistical Society, Series B, 49, 223–239 and 253–265.
Rubin, D. (1976). Inference and missing data. Biometrika, 63, 581–592.
Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 1146–1152). San Mateo, CA: Morgan Kaufmann.
Saul, L., Jaakkola, T., & Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Spiegelhalter, D., Dawid, A., Lauritzen, S., & Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8, 219–282.
Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction, and search. New York: Springer-Verlag.
Thiesson, B. (1997). Score and information for recursive exponential models with incomplete data. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. San Mateo, CA: Morgan Kaufmann.
Cite this article
Chickering, D.M. & Heckerman, D. (1997). Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29, 181–212. https://doi.org/10.1023/A:1007469629108