Building accurate and robust HMM models for practical ASR systems

Kačur, Juraj; Rozinaj, Gregor

doi:10.1007/s11235-011-9660-8

Published: 06 October 2011

Volume 52, pages 1683–1696 (2013)

Telecommunication Systems


Abstract

This article discusses and adjusts the training aspects relevant to building robust and accurate HMM models for a large-vocabulary recognition system, namely speech features, training steps, and the tying options for context-dependent (CD) phonemes. The well-known MASPER training scheme is taken as the basis for building the HMM models. First, the incorporation of voicing information and its effect on classical extraction methods such as MFCC and PLP is shown, together with the derivative features; the relative error reductions reach up to 50%. Next, an enhancement of the standard training procedure that introduces garbled speech models is presented and tested on real data; it yields a drop of more than 5% in the error rate. Finally, the options for tying the states of CD phonemes using decision trees and phoneme classification are adjusted, tested, and explained.

Author information

Authors and Affiliations

Faculty of Electrical Engineering and Information Technology, Slovak University of Technology, Bratislava, Slovakia
Juraj Kačur & Gregor Rozinaj

Corresponding author

Correspondence to Juraj Kačur.

About this article

Cite this article

Kačur, J., Rozinaj, G. Building accurate and robust HMM models for practical ASR systems. Telecommun Syst 52, 1683–1696 (2013). https://doi.org/10.1007/s11235-011-9660-8

Published: 06 October 2011
Issue Date: March 2013
DOI: https://doi.org/10.1007/s11235-011-9660-8
