Abstract
In automatic speech recognition (ASR) systems, the speech signal is captured and parameterized at the front end and evaluated at the back end using the statistical framework of the hidden Markov model (HMM). The performance of these systems depends critically on both the type of models used and the methods adopted for signal analysis. Researchers have proposed a variety of modifications and extensions of HMM-based acoustic models to overcome their limitations. In this review, we summarize most of the research on HMM-based ASR carried out over the last three decades. We present these approaches under three categories: conventional methods, refinements, and advancements of HMM. The review is presented in two parts (papers): (i) an overview of conventional methods for acoustic-phonetic modeling, and (ii) refinements and advancements of acoustic models. Part I explores the architecture and working of the standard HMM, along with its limitations; it also covers different modeling units, language models and decoders. Part II reviews the advances and refinements of conventional HMM techniques, together with the current challenges and performance issues in ASR.
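The back-end evaluation the abstract refers to amounts, in the simplest discrete-HMM case, to computing the likelihood of an observation sequence with the forward algorithm. A minimal sketch follows; the two-state model and its parameters are purely illustrative and not taken from the paper:

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Evaluate P(obs | model) for a discrete HMM via the forward algorithm.

    pi  : (N,) initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P(state j | state i)
    B   : (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    obs : sequence of observation symbol indices
    """
    alpha = pi * B[:, obs[0]]          # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate through transitions, weight by emission
    return alpha.sum()                 # marginalize over all possible end states

# Toy 2-state model over a 3-symbol alphabet (illustrative values only)
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])
print(forward_likelihood(pi, A, B, [0, 1, 2]))  # prints 0.03628
```

In a real recognizer the discrete emissions would be replaced by Gaussian-mixture densities over front-end feature vectors (e.g. MFCCs), and the recursion would run in the log domain to avoid underflow on long utterances.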
Aggarwal, R.K., Dave, M. Acoustic modeling problem for automatic speech recognition system: conventional methods (Part I). Int J Speech Technol 14, 297–308 (2011). https://doi.org/10.1007/s10772-011-9108-2