Abstract
In automatic speech recognition (ASR) systems, the speech signal is captured and parameterized at the front end and evaluated at the back end using the statistical framework of the hidden Markov model (HMM). The performance of these systems depends critically on both the type of models used and the methods adopted for signal analysis. Researchers have proposed a variety of modifications and extensions of HMM-based acoustic models to overcome their limitations. In this review, we summarize most of the research on HMM-based ASR carried out over the last three decades. We present these approaches under three categories: conventional methods, refinements, and advancements of HMM. The review is presented in two parts (papers): (i) an overview of conventional methods for acoustic-phonetic modeling, and (ii) refinements and advancements of acoustic models. Part I explores the architecture and working of the standard HMM, along with its limitations; it also covers different modeling units, language models and decoders. Part II reviews the advances and refinements of conventional HMM techniques, together with the current challenges and performance issues in ASR.
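The back-end evaluation the abstract refers to amounts, in the simplest discrete-HMM case, to computing the likelihood of an observation sequence with the forward algorithm. A minimal sketch follows; the two-state model and its parameters are purely illustrative and not taken from the paper:

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Evaluate P(obs | model) for a discrete HMM via the forward algorithm.

    pi  : (N,) initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P(state j | state i)
    B   : (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    obs : sequence of observation symbol indices
    """
    alpha = pi * B[:, obs[0]]          # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate through transitions, weight by emission
    return alpha.sum()                 # marginalize over all possible end states

# Toy 2-state model over a 3-symbol alphabet (illustrative values only)
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])
print(forward_likelihood(pi, A, B, [0, 1, 2]))  # prints 0.03628
```

In a real recognizer the discrete emissions would be replaced by Gaussian-mixture densities over front-end feature vectors (e.g. MFCCs), and the recursion would run in the log domain to avoid underflow on long utterances.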
Aggarwal, R.K., Dave, M. Acoustic modeling problem for automatic speech recognition system: conventional methods (Part I). Int J Speech Technol 14, 297–308 (2011). https://doi.org/10.1007/s10772-011-9108-2