Abstract
We discuss Bayesian methods for model averaging and model selection among Bayesian-network models with hidden variables. In particular, we examine large-sample approximations for the marginal likelihood of naive-Bayes models in which the root node is hidden. Such models are useful for clustering or unsupervised learning. We consider a Laplace approximation and the less accurate but more computationally efficient approximation known as the Bayesian Information Criterion (BIC), which is equivalent to Rissanen's (1987) Minimum Description Length (MDL). Also, we consider approximations that ignore some off-diagonal elements of the observed information matrix and an approximation proposed by Cheeseman and Stutz (1995). We evaluate the accuracy of these approximations using a Monte-Carlo gold standard. In experiments with artificial and real examples, we find that (1) none of the approximations are accurate when used for model averaging, (2) all of the approximations, with the exception of BIC/MDL, are accurate for model selection, (3) among the accurate approximations, the Cheeseman–Stutz and Diagonal approximations are the most computationally efficient, (4) all of the approximations, with the exception of BIC/MDL, can be sensitive to the prior distribution over model parameters, and (5) the Cheeseman–Stutz approximation can be more accurate than the other approximations, including the Laplace approximation, in situations where the parameters in the maximum a posteriori configuration are near a boundary.
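For orientation, the standard large-sample forms of these approximations can be stated compactly; the notation below is assumed for this summary rather than drawn from the paper itself. Let $D$ denote the data, $m$ the model, $d$ the number of free parameters, $N$ the sample size, $\hat{\theta}$ the maximum-likelihood parameter configuration, and $\tilde{\theta}$ the maximum a posteriori (MAP) configuration. The BIC/MDL approximation is

$$\log p(D \mid m) \;\approx\; \log p(D \mid \hat{\theta}, m) - \frac{d}{2} \log N,$$

and the Laplace approximation is

$$p(D \mid m) \;\approx\; p(D \mid \tilde{\theta}, m)\, p(\tilde{\theta} \mid m)\, (2\pi)^{d/2}\, \lvert A \rvert^{-1/2},$$

where $A$ is the negative Hessian of $\log p(D, \theta \mid m)$ evaluated at $\tilde{\theta}$; the Diagonal approximation drops some off-diagonal elements of $A$ so that its determinant is cheap to compute. The Cheeseman–Stutz approximation is

$$\log p(D \mid m) \;\approx\; \log p(D' \mid m) + \log p(D \mid \tilde{\theta}, m) - \log p(D' \mid \tilde{\theta}, m),$$

where $D'$ is a completed data set obtained from the expected sufficient statistics at $\tilde{\theta}$, so that $p(D' \mid m)$ has the closed form available for complete data. Consult the paper for the exact conventions used in the experiments.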
References
Azevedo-Filho, A. & Shachter, R. (1994). Laplace's method approximations for probabilistic inference in belief networks with continuous variables. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 28–36). San Mateo, CA: Morgan Kaufmann.
Bareiss, E. & Porter, B. (1987). Protos: An exemplar-based learning apprentice. In Proceedings of the Fourth International Workshop on Machine Learning (pp. 12–23). San Mateo, CA: Morgan Kaufmann.
Becker, S. & LeCun, Y. (1989). Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School (pp. 29–37). San Mateo, CA: Morgan Kaufmann.
Berger, J. (1985). Statistical decision theory and Bayesian analysis. Berlin: Springer.
Bernardo, J. & Smith, A. (1994). Bayesian theory. New York: John Wiley and Sons.
Buntine, W. (1994a). Computing second derivatives in feed-forward networks: A review. IEEE Transactions on Neural Networks, 5, 480–488.
Buntine, W. (1994b). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159–225.
Buntine, W. (1996). A guide to the literature on learning graphical models. IEEE Transactions on Knowledge and Data Engineering, 8, 195–210.
Cheeseman, P. & Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.), Advances in knowledge discovery and data mining (pp. 153–180). Menlo Park, CA: AAAI Press.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313–1321.
Chickering, D. & Heckerman, D. (1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (pp. 158–168). San Mateo, CA: Morgan Kaufmann.
Clogg, C. (1995). Latent class models. In Arminger, G., Clogg, C., & Sobel, M. (Eds.), Handbook of statistical modeling for the social and behavioral sciences. New York: Plenum Press.
Cooper, G. & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Draper, D. (1995). Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society, Series B, 57, 45–97.
Geiger, D. & Heckerman, D. (1994). Learning Gaussian networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 235–243). San Mateo, CA: Morgan Kaufmann.
Geiger, D., Heckerman, D., & Meek, C. (1996). Asymptotic model selection for directed networks with hidden variables. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (pp. 283–290). San Mateo, CA: Morgan Kaufmann.
Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Gilks, W., Richardson, S., & Spiegelhalter, D. (1996). Markov chain Monte Carlo in practice. New York: Chapman and Hall.
Good, I. (1965). The estimation of probabilities. Cambridge, MA: MIT Press.
Gull, S. & Skilling, J. (1991). Quantified maximum entropy. MemSys5 user's manual. Tech. rep., M.E.D.C., 33 North End, Royston, SG8 6NR, England.
Haughton, D. (1988). On the choice of a model to fit data from an exponential family. Annals of Statistics, 16, 342–355.
Heckerman, D. (1995). A tutorial on learning Bayesian networks. Tech. rep. MSR-TR-95-06, Microsoft Research, Redmond, WA. Revised January, 1996.
Heckerman, D. & Geiger, D. (1995). Likelihoods and priors for Bayesian networks. Tech. rep. MSR-TR-95-54, Microsoft Research, Redmond, WA.
Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.
Hong, Z. & Yang, J. (1991). Optimal discriminant plane for a small number of samples and design method of classifier on the plane. Pattern Recognition, 24, 317–324.
Jeffreys, H. (1939). Theory of probability. Oxford University Press.
Jensen, F., Lauritzen, S., & Olesen, K. (1990). Bayesian updating in recursive graphical models by local computations. Computational Statistics Quarterly, 4, 269–282.
Kass, R. & Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Kass, R., Tierney, L., & Kadane, J. (1988). Asymptotics in Bayesian computation. In Bernardo, J., DeGroot, M., Lindley, D., & Smith, A. (Eds.), Bayesian statistics 3 (pp. 261–278). Oxford University Press.
Kass, R. & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90, 928–934.
MacKay, D. (1992a). Bayesian interpolation. Neural Computation, 4, 415–447.
MacKay, D. (1992b). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
MacKay, D. (1996). Choice of basis for the Laplace approximation. Tech. rep., Cavendish Laboratory, Cambridge, UK.
Madigan, D. & York, J. (1995). Bayesian graphical models for discrete data. International Statistical Review, 63, 215–232.
Meng, X. & Rubin, D. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86, 899–909.
Merz, C. & Murphy, P. (1996). UCI repository of machine learning databases, www.ics.uci.edu/~mlearn/mlrepository.html. Tech. rep., University of California, Irvine.
Michalski, R. & Chilausky, R. (1980). Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4.
Neal, R. (1991). Bayesian mixture modeling by Monte Carlo simulation. Tech. rep. CRG-TR-91-2, Department of Computer Science, University of Toronto.
Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Tech. rep. CRG-TR-93-1, Department of Computer Science, University of Toronto.
Raftery, A. (1994). Approximate Bayes factors and accounting for model uncertainty in generalized linear models. Tech. rep. 255, Department of Statistics, University of Washington.
Raftery, A. (1995). Bayesian model selection in social research. In Marsden, P. (Ed.), Sociological methodology. Cambridge, MA: Blackwells.
Raftery, A. (1996). Hypothesis testing and model selection (chap. 10). New York: Chapman and Hall.
Rissanen, J. (1987). Stochastic complexity (with discussion). Journal of the Royal Statistical Society, Series B, 49, 223–239 and 253–265.
Rubin, D. (1976). Inference and missing data. Biometrika, 63, 581–592.
Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 1146–1152). San Mateo, CA: Morgan Kaufmann.
Saul, L., Jaakkola, T., & Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Spiegelhalter, D., Dawid, A., Lauritzen, S., & Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8, 219–282.
Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction, and search. New York: Springer-Verlag.
Thiesson, B. (1997). Score and information for recursive exponential models with incomplete data. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. San Mateo, CA: Morgan Kaufmann.
Cite this article
Chickering, D.M. & Heckerman, D. (1997). Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29, 181–212. https://doi.org/10.1023/A:1007469629108