
Acoustic modeling problem for automatic speech recognition system: conventional methods (Part I)

International Journal of Speech Technology

Abstract

In automatic speech recognition (ASR) systems, the speech signal is captured and parameterized at the front end and evaluated at the back end using the statistical framework of the hidden Markov model (HMM). The performance of these systems depends critically on both the type of models used and the methods adopted for signal analysis. Researchers have proposed a variety of modifications and extensions to HMM-based acoustic models to overcome their limitations. In this review, we summarize most of the research related to HMM-based ASR carried out during the last three decades. We present these approaches under three categories, namely conventional methods, refinements, and advancements of the HMM. The review is presented in two parts (papers): (i) an overview of conventional methods for acoustic-phonetic modeling, and (ii) refinements and advancements of acoustic models. Part I explores the architecture and operation of the standard HMM, along with its limitations. It also covers different modeling units, language models, and decoders. Part II reviews the advances and refinements of conventional HMM techniques, along with the current challenges and performance issues related to ASR.
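The HMM back end described above evaluates a parameterized signal by decoding the most likely hidden-state sequence. A minimal Viterbi sketch illustrates the idea; the two-state model and all probabilities below are toy values chosen for illustration, not parameters from the paper.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return the most likely hidden-state sequence for `obs`.

    pi : (N,)   initial state probabilities
    A  : (N, N) transition probabilities, A[i, j] = P(state j | state i)
    B  : (N, M) emission probabilities,   B[i, k] = P(symbol k | state i)
    """
    N, T = len(pi), len(obs)
    logp = np.full((T, N), -np.inf)        # best log-probability ending in each state
    back = np.zeros((T, N), dtype=int)     # backpointers for path recovery

    logp[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        for j in range(N):
            scores = logp[t - 1] + np.log(A[:, j])
            back[t, j] = int(np.argmax(scores))
            logp[t, j] = scores[back[t, j]] + np.log(B[j, obs[t]])

    # Trace back the best path from the most likely final state.
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state model: state 0 mostly emits symbol 0, state 1 mostly emits symbol 1.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], pi, A, B))  # → [0, 0, 1, 1]
```

In a real ASR decoder the states are context-dependent phone states, the emissions are Gaussian mixture densities over acoustic feature vectors rather than discrete symbols, and the search is pruned with beam techniques, but the dynamic-programming recursion is the same.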



Author information


Correspondence to Rajesh Kumar Aggarwal.


Cite this article

Aggarwal, R.K., Dave, M. Acoustic modeling problem for automatic speech recognition system: conventional methods (Part I). Int J Speech Technol 14, 297–308 (2011). https://doi.org/10.1007/s10772-011-9108-2

