Skip to main content

The Hierarchical Hidden Markov Model: Analysis and Applications


We introduce, analyze and demonstrate a recursive hierarchical generalization of the widely used hidden Markov models, which we name Hierarchical Hidden Markov Models (HHMM). Our model is motivated by the complex multi-scale structure which appears in many natural sequences, particularly in language, handwriting and speech. We seek a systematic unsupervised approach to the modeling of such structures. By extending the standard Baum-Welch (forward-backward) algorithm, we derive an efficient procedure for estimating the model parameters from unlabeled data. We then use the trained model for automatic hierarchical parsing of observation sequences. We describe two applications of our model and its parameter estimation procedure. In the first application we show how to construct hierarchical models of natural English text. In these models different levels of the hierarchy correspond to structures on different length scales in the text. In the second application we demonstrate how HHMMs can be used to automatically identify repeated strokes that represent combination of letters in cursive handwriting.


  • Abe, N. & Warmuth, M. (1992). On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9:205-260.

    Google Scholar 

  • Baldi, P., Chauvin, Y., Hunkapiller, T. & McClure, M. (1994). Hidden Markov models of biological primary sequence information. Proc. Nat. Acd. Sci. (USA), 91(3):1059-1063.

    Google Scholar 

  • Baum, L.E. & Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, Vol. 37.

  • Bengio, Y. & Frasconi, P. (1995). An input-output HMM architecture. In G. Tesauro, D.S. Touretzky, and T.K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 427-434. MIT Press, Cambridge, MA.

    Google Scholar 

  • Cover, T. & Thomas, J. (1991). Elements of Information Theory. Wiley.

  • Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of Royal Statistical Society, Series B, 39:1-38.

    Google Scholar 

  • Gat, I., Tishby, N. & Abeles, M. (1997). Hidden Markov modeling of simultaneously recorded cells in the associative cortex of behaving monkeys. Network: Computation in neural systems, 8:3, pages 297-322.

    Google Scholar 

  • Ghahramani Z. & Jordan, M.I. (1997). Factorial hidden Markov models. Machine Learning, 29: 245-273.

    Google Scholar 

  • Gillman, D. & Sipser, M. (1994). Inference and minimization of hidden Markov chains. Proceedings of the Seventh Annual Workshop on Computational Learning Theory, pages 147-158.

  • Jelinek, F. (1985). Robust part-of-speech tagging using a hidden Markov model. IBM T.J. Watson Research Technical Report.

  • Jelinek, F. (1983). Markov source modeling of text generation. Technical report, IBM T.J. Watson Research Center Technical Report.

  • Jelinek, F. (1985). Self-organized language modeling for speech recognition. IBM T.J. Watson Research Center Technical Report.

  • Krogh, A., Mian, S.I. & Haussler, D. (1994). A hidden Markov model that finds genes in E.coli DNA. NAR, 22”4768-4778.

    Google Scholar 

  • Lari, K. & Young, S.J. (1990). The estimation of stochastic context free grammars using the Inside-Outside algorithm. Computers Speech and Language, 4.

  • Nag, R., Wong, K.H. & Fallside, F. (1985). Script recognition using hidden Markov models. Proceedings of International Conference on Acoustics Speech and Signal Processing, pages 2071-2074.

  • Rabiner, L.R. & Juang, B.H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, No. 3.

  • Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.

  • Rissanen, J. (1986). Complexity of strings in the class of Markov sources. IEEE Transactions on Information Theory, 32(4):526-532.

    Google Scholar 

  • Ron, D., Singer, Y. & Tishby, N. (1996). The power of amnesia: learning probabilistic automata with variable memory length. Machine Learning, 25:117-149.

    Google Scholar 

  • Singer, Y. & Tishby, N.. (1994). Dynamical encoding of cursive handwriting. Biological Cybernetics, 71(3):227-237.

    Google Scholar 

  • Singer, Y. & Warmuth, M.K. (1997). Training algorithms for hidden Markov models using entropy based distance functions. In M.C. Mozer, M.I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 641-647. MIT Press, Cambridge, MA.

    Google Scholar 

  • Stolcke A. & Omohundro, S.M. (1994). Best-first model merging for hidden Markov model induction. Technical Report ICSI TR-94-003.

  • Viterbi, A.J. (1967). Error bounds for convulutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, 13:260-269.

    Google Scholar 

Download references

Author information

Authors and Affiliations


Rights and permissions

Reprints and Permissions

About this article

Cite this article

Fine, S., Singer, Y. & Tishby, N. The Hierarchical Hidden Markov Model: Analysis and Applications. Machine Learning 32, 41–62 (1998).

Download citation

  • Issue Date:

  • DOI:

  • statistical models
  • temporal pattern recognition
  • hidden variable models
  • cursive handwriting