Building accurate and robust HMM models for practical ASR systems

Telecommunication Systems

Abstract

This article discusses and adjusts the training aspects relevant to building robust and accurate HMM models for a large vocabulary recognition system, namely speech features, training steps, and the tying options for context-dependent (CD) phonemes. The well-known MASPER training scheme is taken as the basis for building the HMM models. First, the incorporation of voicing information and its effect on classical extraction methods such as MFCC and PLP is shown, together with the derivative features; the relative error reductions reach up to 50%. Next, an enhancement of the standard training procedure that introduces garbled speech models is presented and tested on real data; it brings a drop of more than 5% in the error rate. Finally, the options for tying the states of CD phonemes using decision trees and phoneme classification are adjusted, tested, and explained.
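
The abstract reports its gains as relative error reductions, which are easy to confuse with absolute percentage-point drops. As a minimal sketch of the arithmetic, assuming the usual definition (baseline error minus improved error, divided by baseline error), the function name and the numbers below are hypothetical and are not results taken from the article:

```python
def relative_error_reduction(baseline_error: float, improved_error: float) -> float:
    """Return the error reduction as a fraction of the baseline error rate."""
    if baseline_error <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_error - improved_error) / baseline_error


# Hypothetical error rates in percent, chosen only to illustrate the formula.
baseline = 8.0
improved = 4.0

# An absolute drop of 4 percentage points is a relative reduction of 0.5 (50%).
print(relative_error_reduction(baseline, improved))
```

Under this definition, a 50% relative reduction halves the baseline error rate, whatever that baseline happens to be.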



Author information

Correspondence to Juraj Kačur.

About this article

Cite this article

Kačur, J., Rozinaj, G. Building accurate and robust HMM models for practical ASR systems. Telecommun Syst 52, 1683–1696 (2013). https://doi.org/10.1007/s11235-011-9660-8

  • DOI: https://doi.org/10.1007/s11235-011-9660-8
