An Introduction to Variational Methods for Graphical Models

Jordan, Michael I.; Ghahramani, Zoubin; Jaakkola, Tommi S.; Saul, Lawrence K.

doi:10.1023/A:1007665907178

Published: November 1999

Machine Learning, Volume 37, pages 183–233 (1999)

Abstract

This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov random fields). We present a number of examples of graphical models, including the QMR-DT database, the sigmoid belief network, the Boltzmann machine, and several variants of hidden Markov models, in which it is infeasible to run exact inference algorithms. We then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. Inference in the simplified model provides bounds on probabilities of interest in the original model. We describe a general framework for generating variational transformations based on convex duality. Finally, we return to the examples and demonstrate how variational algorithms can be formulated in each case.
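
To make the idea of a variational transformation based on convex duality concrete, the minimal Python sketch below uses the logarithm as an illustration. Because ln(x) is concave, it can be written as the minimum over lam > 0 of (lam * x - ln(lam) - 1), so any fixed value of the variational parameter lam gives a linear upper bound on ln(x) that becomes exact at lam = 1/x. The function name `log_upper_bound` and the sample values are assumptions chosen for illustration and do not come from the abstract.

```python
import math


def log_upper_bound(x: float, lam: float) -> float:
    """Linear-in-x upper bound on ln(x) obtained from convex duality.

    For any variational parameter lam > 0: ln(x) <= lam * x - ln(lam) - 1.
    The name and signature are hypothetical, chosen for this sketch only.
    """
    if x <= 0 or lam <= 0:
        raise ValueError("x and lam must be positive")
    return lam * x - math.log(lam) - 1.0


if __name__ == "__main__":
    x = 2.0
    # Every lam gives a valid upper bound; lam = 1/x = 0.5 makes it tight.
    for lam in (0.2, 0.5, 1.0):
        bound = log_upper_bound(x, lam)
        print(f"lam={lam:.2f}  bound={bound:.4f}  ln(x)={math.log(x):.4f}")
```

Minimizing over lam recovers ln(x) exactly, while holding lam fixed trades accuracy for a simpler (here linear) expression. This is one way to read the statement that inference in the simplified model provides bounds on quantities in the original model.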

Author information

Authors and Affiliations

Department of Electrical Engineering and Computer Sciences and Department of Statistics, University of California, Berkeley, CA, 94720, USA
Michael I. Jordan
Gatsby Computational Neuroscience Unit, University College, London, WC1N 3AR, UK
Zoubin Ghahramani
Artificial Intelligence Laboratory, MIT, Cambridge, MA, 02139, USA
Tommi S. Jaakkola
AT&T Labs–Research, Florham Park, NJ, 07932, USA
Lawrence K. Saul

About this article

Cite this article

Jordan, M.I., Ghahramani, Z., Jaakkola, T.S. et al. An Introduction to Variational Methods for Graphical Models. Machine Learning 37, 183–233 (1999). https://doi.org/10.1023/A:1007665907178

Issue Date: November 1999
DOI: https://doi.org/10.1023/A:1007665907178
