Machine Learning, Volume 32, Issue 1, pp 41–62

The Hierarchical Hidden Markov Model: Analysis and Applications

  • Shai Fine
  • Yoram Singer
  • Naftali Tishby


We introduce, analyze and demonstrate a recursive hierarchical generalization of the widely used hidden Markov models, which we name Hierarchical Hidden Markov Models (HHMM). Our model is motivated by the complex multi-scale structure that appears in many natural sequences, particularly in language, handwriting and speech. We seek a systematic unsupervised approach to the modeling of such structures. By extending the standard Baum-Welch (forward-backward) algorithm, we derive an efficient procedure for estimating the model parameters from unlabeled data. We then use the trained model for automatic hierarchical parsing of observation sequences. We describe two applications of our model and its parameter estimation procedure. In the first application we show how to construct hierarchical models of natural English text. In these models, different levels of the hierarchy correspond to structures on different length scales in the text. In the second application we demonstrate how HHMMs can be used to automatically identify repeated strokes that represent combinations of letters in cursive handwriting.
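The recursive generative process described above — internal states that activate a sub-model over their children, production states (leaves) that emit symbols, and control returning to the parent when a sub-model terminates — can be illustrated with a minimal toy sampler. All state names, symbols and probabilities below are hypothetical illustrations, not taken from the paper:

```python
import random

# Toy generative sketch of an HHMM. Internal states own a sub-chain
# over child states; production states emit a single symbol. A level
# runs its child chain until the active child "ends" (returns control
# to the parent), mirroring the recursive activation in the abstract.

class ProductionState:
    def __init__(self, symbol):
        self.symbol = symbol

    def generate(self, rng):
        return [self.symbol]

class InternalState:
    def __init__(self, children, start, trans, end):
        # children: child states (internal or production)
        # start[i]:    probability of entering child i first
        # trans[i][j]: probability of moving from child i to child j
        # end[i]:      probability of returning to the parent after child i
        self.children = children
        self.start = start
        self.trans = trans
        self.end = end

    def generate(self, rng):
        out = []
        i = rng.choices(range(len(self.children)), weights=self.start)[0]
        while True:
            out.extend(self.children[i].generate(rng))
            if rng.random() < self.end[i]:
                return out
            i = rng.choices(range(len(self.children)),
                            weights=self.trans[i])[0]

# Two-level toy model: the root alternates between two "letter" states,
# each of which is itself a small chain over "stroke" symbols.
rng = random.Random(0)
letter_a = InternalState(
    [ProductionState("a1"), ProductionState("a2")],
    start=[1.0, 0.0],
    trans=[[0.0, 1.0], [1.0, 0.0]],
    end=[0.0, 1.0],            # always emits a1 then a2, then returns
)
letter_b = InternalState(
    [ProductionState("b1")],
    start=[1.0], trans=[[1.0]], end=[1.0],
)
root = InternalState(
    [letter_a, letter_b],
    start=[0.5, 0.5],
    trans=[[0.5, 0.5], [0.5, 0.5]],
    end=[0.3, 0.3],
)
seq = root.generate(rng)
print(seq)
```

Note how the observer sees only the flat symbol sequence; the hierarchy (which letter produced which strokes) is hidden, which is exactly what the extended Baum-Welch procedure in the paper is designed to recover from unlabeled data.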

Keywords: statistical models, temporal pattern recognition, hidden variable models, cursive handwriting



Copyright information

© Kluwer Academic Publishers 1998

Authors and Affiliations

  • Shai Fine (1)
  • Yoram Singer (2)
  • Naftali Tishby (1)

  1. Institute of Computer Science, Hebrew University, Jerusalem, Israel
  2. AT&T Labs
