Machine Learning, Volume 29, Issue 2–3, pp. 245–273

Factorial Hidden Markov Models

  • Zoubin Ghahramani
  • Michael I. Jordan

Abstract

Hidden Markov models (HMMs) have proven to be one of the most widely used tools for learning probabilistic models of time series data. In an HMM, information about the past is conveyed through a single discrete variable—the hidden state. We discuss a generalization of HMMs in which this state is factored into multiple state variables and is therefore represented in a distributed manner. We describe an exact algorithm for inferring the posterior probabilities of the hidden state variables given the observations, and relate it to the forward–backward algorithm for HMMs and to algorithms for more general graphical models. Due to the combinatorial nature of the hidden state representation, this exact algorithm is intractable. As in other intractable systems, approximate inference can be carried out using Gibbs sampling or variational methods. Within the variational framework, we present a structured approximation in which the state variables are decoupled, yielding a tractable algorithm for learning the parameters of the model. Empirical comparisons suggest that these approximations are efficient and provide accurate alternatives to the exact methods. Finally, we use the structured approximation to model Bach's chorales and show that factorial HMMs can capture statistical structure in this data set that an unconstrained HMM cannot.

Keywords: Hidden Markov models, time series, EM algorithm, graphical models, Bayesian networks, mean field theory
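The distributed state representation described in the abstract, and the combinatorial blow-up that makes exact inference intractable, can be sketched in a few lines of NumPy. This is a minimal illustrative simulation, not code from the paper: the number of chains, the transition matrices, the output weights, and the noise level are all invented for the example. It uses the additive linear-Gaussian output model in which the observation is a sum of contributions from each chain.

```python
import numpy as np

rng = np.random.default_rng(0)

# A factorial HMM with M independent chains, each with K discrete states.
M, K, T = 3, 2, 10          # M chains, K states per chain, T time steps

# Each chain m evolves under its own K x K transition matrix (rows sum to 1).
A = np.stack([np.array([[0.9, 0.1],
                        [0.2, 0.8]]) for _ in range(M)])

# Additive output model: y_t = sum over chains of W[m, s_t^m], plus noise.
W = rng.normal(size=(M, K))

# Sample the hidden state trajectories; the chains are a priori independent.
states = np.zeros((T, M), dtype=int)
for t in range(1, T):
    for m in range(M):
        states[t, m] = rng.choice(K, p=A[m, states[t - 1, m]])

# Observations couple the chains: each y_t depends on all M state variables.
y = W[np.arange(M), states].sum(axis=1) + 0.1 * rng.normal(size=T)

# Exact inference must track the joint state, which takes K**M values, so a
# forward-backward pass over the equivalent flat HMM scales exponentially in M.
print(K ** M)  # prints 8
```

Although the chains are independent a priori, conditioning on the observations couples them (an instance of explaining away), which is why the exact posterior does not factorize and why the paper turns to Gibbs sampling and structured variational approximations.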


Copyright information

© Kluwer Academic Publishers 1997

Authors and Affiliations

  • Zoubin Ghahramani (1)
  • Michael I. Jordan (2)

  1. Department of Computer Science, University of Toronto, Toronto, Canada
  2. Department of Brain & Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, USA
