Machine Learning, Volume 29, Issue 2–3, pp 181–212

Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables

  • David Maxwell Chickering
  • David Heckerman

Abstract

We discuss Bayesian methods for model averaging and model selection among Bayesian-network models with hidden variables. In particular, we examine large-sample approximations for the marginal likelihood of naive-Bayes models in which the root node is hidden. Such models are useful for clustering or unsupervised learning. We consider a Laplace approximation and the less accurate but more computationally efficient approximation known as the Bayesian Information Criterion (BIC), which is equivalent to Rissanen's (1987) Minimum Description Length (MDL). Also, we consider approximations that ignore some off-diagonal elements of the observed information matrix and an approximation proposed by Cheeseman and Stutz (1995). We evaluate the accuracy of these approximations using a Monte-Carlo gold standard. In experiments with artificial and real examples, we find that (1) none of the approximations are accurate when used for model averaging, (2) all of the approximations, with the exception of BIC/MDL, are accurate for model selection, (3) among the accurate approximations, the Cheeseman–Stutz and Diagonal approximations are the most computationally efficient, (4) all of the approximations, with the exception of BIC/MDL, can be sensitive to the prior distribution over model parameters, and (5) the Cheeseman–Stutz approximation can be more accurate than the other approximations, including the Laplace approximation, in situations where the parameters in the maximum a posteriori configuration are near a boundary.

Keywords: Bayesian model averaging, model selection, multinomial mixtures, clustering, unsupervised learning, Laplace approximation
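
For orientation, the approximations named in the abstract have standard large-sample forms. The following is a sketch in conventional notation, not a verbatim reproduction of the paper's equations: $D$ denotes the data, $m$ the model, $\tilde{\theta}$ the maximum a posteriori (MAP) parameter configuration, $\hat{\theta}$ the maximum-likelihood configuration, $d$ the number of free parameters, $N$ the sample size, and $A$ the negative Hessian of $\log p(D, \theta \mid m)$ evaluated at $\tilde{\theta}$ (the observed information matrix).

Laplace approximation (a second-order expansion of $\log p(D, \theta \mid m)$ about $\tilde{\theta}$, followed by a Gaussian integral):

$$\log p(D \mid m) \;\approx\; \log p(D \mid \tilde{\theta}, m) + \log p(\tilde{\theta} \mid m) + \frac{d}{2} \log 2\pi - \frac{1}{2} \log |A|$$

BIC/MDL (retain only the terms of the Laplace approximation that grow with $N$):

$$\log p(D \mid m) \;\approx\; \log p(D \mid \hat{\theta}, m) - \frac{d}{2} \log N$$

The Diagonal approximation keeps the Laplace form but approximates $\log |A|$ by $\sum_i \log A_{ii}$, avoiding the cost of computing and factoring the full observed information matrix. The Cheeseman–Stutz approximation scores a completed data set $D'$, obtained by filling in expected sufficient statistics at $\tilde{\theta}$, and corrects with the ratio of the incomplete-data to complete-data likelihoods:

$$\log p(D \mid m) \;\approx\; \log p(D' \mid m) + \log p(D \mid \tilde{\theta}, m) - \log p(D' \mid \tilde{\theta}, m)$$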

References

  1. Azevedo-Filho, A. & Shachter, R. (1994). Laplace's method approximations for probabilistic inference in belief networks with continuous variables. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 28–36). San Mateo, CA: Morgan Kaufmann.
  2. Bareiss, E. & Porter, B. (1987). Protos: An exemplar-based learning apprentice. In Proceedings of the Fourth International Workshop on Machine Learning (pp. 12–23). San Mateo, CA: Morgan Kaufmann.
  3. Becker, S. & LeCun, Y. (1989). Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School (pp. 29–37). San Mateo, CA: Morgan Kaufmann.
  4. Berger, J. (1985). Statistical decision theory and Bayesian analysis. Berlin: Springer.
  5. Bernardo, J. & Smith, A. (1994). Bayesian theory. New York: John Wiley and Sons.
  6. Buntine, W. (1994a). Computing second derivatives in feed-forward networks: A review. IEEE Transactions on Neural Networks, 5, 480–488.
  7. Buntine, W. (1994b). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159–225.
  8. Buntine, W. (1996). A guide to the literature on learning graphical models. IEEE Transactions on Knowledge and Data Engineering, 8, 195–210.
  9. Cheeseman, P. & Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.), Advances in knowledge discovery and data mining (pp. 153–180). Menlo Park, CA: AAAI Press.
  10. Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313–1321.
  11. Chickering, D. & Heckerman, D. (1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (pp. 158–168). San Mateo, CA: Morgan Kaufmann.
  12. Clogg, C. (1995). Latent class models. In Arminger, G., Clogg, C., & Sobel, M. (Eds.), Handbook of statistical modeling for the social and behavioral sciences. New York: Plenum Press.
  13. Cooper, G. & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
  14. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
  15. Draper, D. (1995). Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society, Series B, 57, 45–97.
  16. Geiger, D. & Heckerman, D. (1994). Learning Gaussian networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 235–243). San Mateo, CA: Morgan Kaufmann.
  17. Geiger, D., Heckerman, D., & Meek, C. (1996). Asymptotic model selection for directed networks with hidden variables. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (pp. 283–290). San Mateo, CA: Morgan Kaufmann.
  18. Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–742.
  19. Gilks, W., Richardson, S., & Spiegelhalter, D. (1996). Markov chain Monte Carlo in practice. New York: Chapman and Hall.
  20. Good, I. (1965). The estimation of probabilities. Cambridge, MA: MIT Press.
  21. Gull, S. & Skilling, J. (1991). Quantified maximum entropy: MemSys5 user's manual. Tech. rep., M.E.D.C., 33 North End, Royston, SG8 6NR, England.
  22. Haughton, D. (1988). On the choice of a model to fit data from an exponential family. Annals of Statistics, 16, 342–355.
  23. Heckerman, D. (1995). A tutorial on learning Bayesian networks. Tech. rep. MSR-TR-95-06, Microsoft Research, Redmond, WA. Revised January, 1996.
  24. Heckerman, D. & Geiger, D. (1995). Likelihoods and priors for Bayesian networks. Tech. rep. MSR-TR-95-54, Microsoft Research, Redmond, WA.
  25. Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.
  26. Hong, Z. & Yang, J. (1994). Optimal discriminant plane for a small number of samples and design method of classifier on the plane. Pattern Recognition, 24, 317–324.
  27. Jeffreys, H. (1939). Theory of probability. Oxford University Press.
  28. Jensen, F., Lauritzen, S., & Olesen, K. (1990). Bayesian updating in recursive graphical models by local computations. Computational Statistics Quarterly, 4, 269–282.
  29. Kass, R. & Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
  30. Kass, R., Tierney, L., & Kadane, J. (1988). Asymptotics in Bayesian computation. In Bernardo, J., DeGroot, M., Lindley, D., & Smith, A. (Eds.), Bayesian statistics 3 (pp. 261–278). Oxford University Press.
  31. Kass, R. & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90, 928–934.
  32. MacKay, D. (1992a). Bayesian interpolation. Neural Computation, 4, 415–447.
  33. MacKay, D. (1992b). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
  34. MacKay, D. (1996). Choice of basis for the Laplace approximation. Tech. rep., Cavendish Laboratory, Cambridge, UK.
  35. Madigan, D. & York, J. (1995). Bayesian graphical models for discrete data. International Statistical Review, 63, 215–232.
  36. Meng, X. & Rubin, D. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86, 899–909.
  37. Merz, C. & Murphy, P. (1996). UCI repository of machine learning databases, www.ics.uci.edu/~mlearn/mlrepository.html. Tech. rep., University of California, Irvine.
  38. Michalski, R. & Chilausky, R. (1980). Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4.
  39. Neal, R. (1991). Bayesian mixture modeling by Monte Carlo simulation. Tech. rep. CRG-TR-91-2, Department of Computer Science, University of Toronto.
  40. Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Tech. rep. CRG-TR-93-1, Department of Computer Science, University of Toronto.
  41. Raftery, A. (1994). Approximate Bayes factors and accounting for model uncertainty in generalized linear models. Tech. rep. 255, Department of Statistics, University of Washington.
  42. Raftery, A. (1995). Bayesian model selection in social research. In Marsden, P. (Ed.), Sociological methodology. Cambridge, MA: Blackwell.
  43. Raftery, A. (1996). Hypothesis testing and model selection (chap. 10). Chapman and Hall.
  44. Rissanen, J. (1987). Stochastic complexity (with discussion). Journal of the Royal Statistical Society, Series B, 49, 223–239 and 253–265.
  45. Rubin, D. (1976). Inference and missing data. Biometrika, 63, 581–592.
  46. Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 1146–1152). San Mateo, CA: Morgan Kaufmann.
  47. Saul, L., Jaakkola, T., & Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
  48. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
  49. Spiegelhalter, D., Dawid, A., Lauritzen, S., & Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8, 219–282.
  50. Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction, and search. New York: Springer-Verlag.
  51. Thiesson, B. (1997). Score and information for recursive exponential models with incomplete data. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. San Mateo, CA: Morgan Kaufmann.

Copyright information

© Kluwer Academic Publishers 1997

Authors and Affiliations

  • David Maxwell Chickering (1)
  • David Heckerman (1)

  1. Microsoft Research, Redmond, WA
