Language Model Optimization for a Deep Neural Network Based Speech Recognition System for Serbian

  • Conference paper
Speech and Computer (SPECOM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)

Abstract

This paper presents results obtained with several variants of trigram language models in a large vocabulary continuous speech recognition (LVCSR) system for the Serbian language, based on the deep neural network (DNN) framework implemented in the Kaldi speech recognition toolkit. The DNN training approach allows parallelization using several threads on either multiple GPUs or multiple CPUs, and provides a natural-gradient modification of the stochastic gradient descent (SGD) optimization method. Acoustic models are trained for a fixed number of epochs, with parameter averaging at the end. The paper discusses recognition with different language models trained using Kneser-Ney or Good-Turing smoothing, as well as several pruning parameter values. Results on a test set containing more than 120,000 words and different utterance types are examined and compared with reference results obtained with GMM-HMM speaker-adapted models on the same speech database. Online and offline recognition results are also compared. Finally, the effect of additional discriminative training using a language model prior to the DNN stage is explored.
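As a rough illustration of one of the smoothing methods named in the abstract, the sketch below computes textbook Good-Turing adjusted counts, c* = (c + 1) · N_{c+1} / N_c, where N_c is the number of distinct n-grams seen exactly c times. It is a simplified example and not the configuration used in the paper: the function names and the toy trigram counts are hypothetical, and practical toolkits typically add refinements (for example, discounting only low counts) that are omitted here.

```python
from collections import Counter


def good_turing_adjusted_counts(ngram_counts):
    """Return basic Good-Turing adjusted counts for each n-gram.

    Uses c* = (c + 1) * N_{c+1} / N_c. When no n-gram has count c + 1
    (N_{c+1} == 0), the raw count is kept unchanged; this fallback is a
    simplification chosen for this sketch.
    """
    # N_c: how many distinct n-grams occur exactly c times
    freq_of_freq = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c = freq_of_freq[c]
        n_c_plus_1 = freq_of_freq.get(c + 1, 0)
        if n_c_plus_1 > 0:
            adjusted[ngram] = (c + 1) * n_c_plus_1 / n_c
        else:
            adjusted[ngram] = float(c)
    return adjusted


def unseen_probability_mass(ngram_counts):
    """Good-Turing estimate of the total probability of unseen n-grams: N_1 / N."""
    total = sum(ngram_counts.values())
    singletons = sum(1 for c in ngram_counts.values() if c == 1)
    return singletons / total


# Hypothetical toy trigram counts, for illustration only
toy_counts = {
    "a b c": 1,
    "b c d": 1,
    "c d e": 1,
    "d e f": 2,
    "e f g": 3,
}
adjusted = good_turing_adjusted_counts(toy_counts)
unseen = unseen_probability_mass(toy_counts)
```

In this toy example the singleton trigrams are discounted below their raw count of 1, and the freed probability mass is what a back-off model would redistribute to trigrams never seen in training.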

Acknowledgments

The work described in this paper was supported in part by the Ministry of Education, Science and Technological Development of the Republic of Serbia, within the project “Development of Dialogue Systems for Serbian and Other South Slavic Languages”, EUREKA project DANSPLAT, “A Platform for the Applications of Speech Technologies on Smartphones for the Languages of the Danube Region”, ID E! 9944, and the Provincial Secretariat for Higher Education and Scientific Research, within the project “Central Audio-Library of the University of Novi Sad”, No. 114-451-2570/2016-02.

Author information

Corresponding author

Correspondence to Edvin Pakoci.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Pakoci, E., Popović, B., Pekar, D. (2017). Language Model Optimization for a Deep Neural Network Based Speech Recognition System for Serbian. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_48

  • DOI: https://doi.org/10.1007/978-3-319-66429-3_48

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66428-6

  • Online ISBN: 978-3-319-66429-3

  • eBook Packages: Computer Science, Computer Science (R0)
