
Exploration of End-to-End ASR for OpenSTT – Russian Open Speech-to-Text Dataset

  • Conference paper
Speech and Computer (SPECOM 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12335)


Abstract

This paper presents an exploration of end-to-end automatic speech recognition (ASR) systems for OpenSTT, the largest open-source Russian-language dataset. We evaluate existing end-to-end approaches, namely joint CTC/Attention, RNN-Transducer, and Transformer, and compare all of them against a strong hybrid ASR system based on an LF-MMI-trained TDNN-F acoustic model.
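
Of these approaches, the joint CTC/Attention model is trained with a multi-task objective that interpolates the CTC loss with the attention decoder's cross-entropy: L = λ·L_CTC + (1 − λ)·L_att. The PyTorch sketch below illustrates only this interpolation; the tensor shapes and the weight λ = 0.3 are illustrative assumptions, not the configuration used in the paper.

    import torch
    import torch.nn.functional as F

    # Minimal sketch of the joint CTC/attention multi-task objective:
    #   loss = ctc_weight * L_ctc + (1 - ctc_weight) * L_att
    # Shapes and ctc_weight are illustrative assumptions, not the paper's setup.
    T, B, V, U = 50, 4, 32, 12      # frames, batch, vocab size, target length
    ctc_weight = 0.3

    enc_logits = torch.randn(T, B, V)   # encoder outputs -> CTC branch
    dec_logits = torch.randn(B, U, V)   # decoder outputs -> attention branch
    targets = torch.randint(1, V, (B, U))            # label 0 is the CTC blank
    input_lengths = torch.full((B,), T, dtype=torch.long)
    target_lengths = torch.full((B,), U, dtype=torch.long)

    # CTC branch marginalizes over all frame-level alignments of the targets.
    l_ctc = F.ctc_loss(enc_logits.log_softmax(-1), targets,
                       input_lengths, target_lengths, blank=0)
    # Attention branch is ordinary per-token cross-entropy over decoder steps.
    l_att = F.cross_entropy(dec_logits.reshape(-1, V), targets.reshape(-1))

    loss = ctc_weight * l_ctc + (1 - ctc_weight) * l_att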

On the three available validation sets (phone calls, YouTube, and books), our best end-to-end model achieves word error rates (WER) of 34.8%, 19.1%, and 18.1%, respectively. Under the same conditions, the hybrid ASR system yields 33.5%, 20.9%, and 18.6% WER.
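
For reference, WER is the word-level Levenshtein distance between the recognized hypothesis and the reference transcript, normalized by the reference length. A minimal self-contained sketch, independent of any toolkit used in the paper:

    def wer(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + deletions + insertions) / #reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                      # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                      # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + sub)    # match/substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # One substituted word out of five reference words -> 20.0% WER.
    print(f"{wer('end to end speech recognition', 'end to end speach recognition'):.1%}")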

A. Andrusenko and A. Laptev contributed equally to this work.


Notes

  1. https://github.com/snakers4/open_stt/releases/download/v0.5-beta/public_exclude_file_v5.tar.gz

  2. http://www.dev.voxforge.org/projects/Russian/export/2500/Trunk/AcousticModels/etc/msu_ru_nsh.dic

  3. https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/local/rnnlm/tuning/run_lstm_tdnn_1a.sh


Acknowledgments

This work was financially supported in part by the Government of the Russian Federation (Grant 08-08).

Author information


Corresponding author

Correspondence to Aleksandr Laptev.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Andrusenko, A., Laptev, A., Medennikov, I. (2020). Exploration of End-to-End ASR for OpenSTT – Russian Open Speech-to-Text Dataset. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_4


  • DOI: https://doi.org/10.1007/978-3-030-60276-5_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60275-8

  • Online ISBN: 978-3-030-60276-5

  • eBook Packages: Computer Science, Computer Science (R0)
