Language Model Optimization for a Deep Neural Network Based Speech Recognition System for Serbian

Pakoci, Edvin; Popović, Branislav; Pekar, Darko

doi:10.1007/978-3-319-66429-3_48

Edvin Pakoci¹⁶,
Branislav Popović¹⁶ &
Darko Pekar¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10458))

Included in the following conference series:

International Conference on Speech and Computer

2186 Accesses
5 Citations

Abstract

This paper presents the results obtained using several variants of trigram language models in a large vocabulary continuous speech recognition (LVCSR) system for the Serbian language, based on the deep neural network (DNN) framework implemented within the Kaldi speech recognition toolkit. This training approach allows parallelization using several threads on either multiple GPUs or multiple CPUs, and provides a natural-gradient modification to the stochastic gradient descent (SGD) optimization method. Acoustic models are trained over a fixed number of training epochs with parameter averaging in the end. This paper discusses recognition using different language models trained with Kneser-Ney or Good-Turing smoothing methods, as well as several pruning parameter values. The results on a test set containing more than 120000 words and different utterance types are explored and compared to the referent results with GMM-HMM speaker-adapted models for the same speech database. Online and offline recognition results are compared to each other as well. Finally, the effect of additional discriminative training using a language model prior to the DNN stage is explored.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Popović, B., Pakoci, E., Ostrogonac, S., Pekar, D.: Large vocabulary continuous speech recognition for Serbian using the Kaldi toolkit. In: Proceedings of the 10th Conference on Digital Speech and Image Processing, DOGS, Novi Sad, pp. 31–34 (2014)
Google Scholar
Popović, B., Ostrogonac, S., Pakoci, E., Jakovljević, N., Delić, V.: Deep neural network based continuous speech recognition for Serbian using the Kaldi toolkit. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015, LNCS, vol. 9319, pp. 186–192. Springer, Cham (2015)
Google Scholar
Povey, D., Kuo, H.-K.J., Soltau, H.: Fast speaker adaptive training for speech recognition. In: Proceedings of the 9th Annual Conference of the International Speech Communication Association, INTERSPEECH, Brisbane, pp. 1245–1248 (2008)
Google Scholar
Povey, D.: Discriminative training for large vocabulary speech recognition. Ph.D. thesis, University, Engineering Department, Cambridge (2003)
Google Scholar
Povey, D., Woodland, P.C.: Minimum phone error and I-smoothing for improved discriminative training. In: Proceedings of the 27th International Conference on Acoustics, Speech and Signal Processing, ICASSP, Orlando, pp. I-105–I-108 (2002)
Google Scholar
Suzić, S., Ostrogonac, S., Pakoci, E., Bojanić, M.: Building a speech repository for a Serbian LVCSR system. Telfor J. 6(2), 109–114 (2014). Paunović, Đ., Milić, L. (Eds.) Telecommunications Society, Belgrade
Article Google Scholar
Stolcke, A., Zheng, J., Wang, W., Abrash, V.: SRILM at sixteen: update and outlook. In: IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU, Waikoloa, p. 5 (2011)
Google Scholar
Kneser, R., Ney, H.: Improved backing-off for M-gram language modeling. In: Proceedings of the 20th International Conference on Acoustics, Speech and Signal Processing, ICASSP, Detroit, pp. 181–184 (1995)
Google Scholar
Gale, W.A., Sampson, G.: Good-Turing smoothing without tears. J. Quant. Linguist. 2(3), 217–237 (1995). Köhler, R. (Ed.) Swets & Zeitlinger, Lisse
Article Google Scholar
Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., Visweswariah, K.: Boosted MMI for model and feature-space discriminative training. In: Proceedings of the 33rd International Conference on Acoustics, Speech and Signal Processing, ICASSP, Las Vegas, pp. 4057–4060 (2008)
Google Scholar
Povey, D., Zhang, X., Khudanpur, S.: Parallel training of DNNs with natural gradient and parameter averaging. In: Proceedings of the 3rd International Conference on Learning Representations Workshop, ICLR, San Diego (2015)
Google Scholar
Peddinti, V., Povey, D., Khudanpur, S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: Proceedings of the 14th Annual Conference of the International Speech Communication Association, INTERSPEECH, Dresden, pp. 2–6 (2015)
Google Scholar
Bhanuprasad, K., Svenson, D.: Errgrams – a way to improving ASR for highly inflective Dravidian languages. In: Proceedings of the 3rd International Joint Conference on Natural Language Processing, IJCNLP, Hyderabad, pp. 805–810 (2008)
Google Scholar

Download references

Acknowledgments

. The work described in this paper was supported in part by the Ministry of Education, Science and Technological Development of the Republic of Serbia, within the project “Development of Dialogue Systems for Serbian and Other South Slavic Languages”, EUREKA project DANSPLAT, “A Platform for the Applications of Speech Technologies on Smartphones for the Languages of the Danube Region”, ID E! 9944, and the Provincial Secretariat for Higher Education and Scientific Research, within the project “Central Audio-Library of the University of Novi Sad”, No. 114-451-2570/2016-02.

Author information

Authors and Affiliations

Faculty of Technical Sciences, Department for Power, Electronic and Telecommunication Engineering, University of Novi Sad, Trg Dositeja Obradovića 6, Novi Sad, Serbia
Edvin Pakoci & Branislav Popović
AlfaNum Speech Technologies, Bulevar Vojvode Stepe 40, Novi Sad, Serbia
Darko Pekar

Authors

Edvin Pakoci
View author publications
You can also search for this author in PubMed Google Scholar
Branislav Popović
View author publications
You can also search for this author in PubMed Google Scholar
Darko Pekar
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Edvin Pakoci .

Editor information

Editors and Affiliations

SPIIRAS, Saint Petersburg, Russia
Alexey Karpov
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova
University of Hertfordshire, Hatfield, United Kingdom
Iosif Mporas

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Pakoci, E., Popović, B., Pekar, D. (2017). Language Model Optimization for a Deep Neural Network Based Speech Recognition System for Serbian. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science(), vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_48

Download citation

DOI: https://doi.org/10.1007/978-3-319-66429-3_48
Published: 13 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics