
Improving Russian LVCSR Using Deep Neural Networks for Acoustic and Language Modeling

  • Conference paper

Speech and Computer (SPECOM 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11096)

Abstract

In this paper, we present our very large vocabulary continuous Russian speech recognition system based on neural networks, employed at both the acoustic and language modeling stages. For training hybrid acoustic models, we experimented with several types of neural networks: a feedforward deep neural network, a time-delay neural network, Long Short-Term Memory, and bidirectional Long Short-Term Memory, varying the number of hidden layers and the number of units per hidden layer. Language modeling was performed with a recurrent neural network. First, speech recognition experiments were carried out using the hybrid acoustic models and a 3-gram language model; the 500-best list was then rescored with the recurrent neural network language model. The lowest word error rate, 15.13%, was achieved using the time-delay neural network for acoustic modeling and the recurrent neural network language model interpolated with the 3-gram model for 500-best list rescoring.
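The rescoring step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hypothesis fields (`acoustic_logprob`, `p_rnnlm`, `p_ngram`), the interpolation weight `lam`, and the language model scale `lm_weight` are assumed names and values; in practice both weights are tuned on a development set.

```python
import math

def rescore_nbest(nbest, lam=0.5, lm_weight=10.0):
    """Pick the best hypothesis from an N-best list by linearly
    interpolating RNN LM and 3-gram LM probabilities, then combining
    the scaled log LM score with the acoustic log-likelihood."""
    best_hyp, best_score = None, float("-inf")
    for hyp in nbest:
        # Linear interpolation of the two LM probabilities per hypothesis.
        p_interp = lam * hyp["p_rnnlm"] + (1.0 - lam) * hyp["p_ngram"]
        total = hyp["acoustic_logprob"] + lm_weight * math.log(p_interp)
        if total > best_score:
            best_hyp, best_score = hyp["text"], total
    return best_hyp, best_score
```

With `lam = 0`, this reduces to plain 3-gram rescoring; with `lam = 1`, to pure RNN LM rescoring. The interpolated model typically outperforms either component alone, which matches the result reported in the abstract.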


Acknowledgements

This research is supported by the Russian Foundation for Basic Research (projects No. 18-07-01216 and 18-07-01407), by the Council for Grants of the President of the Russian Federation (project No. MK-1000.2017.8), as well as by state research project No. 0073-2018-0002.

Author information


Corresponding author

Correspondence to Irina Kipyatkova.



Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Kipyatkova, I. (2018). Improving Russian LVCSR Using Deep Neural Networks for Acoustic and Language Modeling. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_31


  • DOI: https://doi.org/10.1007/978-3-319-99579-3_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99578-6

  • Online ISBN: 978-3-319-99579-3

  • eBook Packages: Computer Science, Computer Science (R0)
