Abstract
This paper presents an exploration of end-to-end automatic speech recognition (ASR) systems for the largest open-source Russian-language dataset, OpenSTT. We evaluate several existing end-to-end approaches: joint CTC/Attention, RNN-Transducer, and Transformer. All of them are compared with a strong hybrid ASR system based on an LF-MMI TDNN-F acoustic model.
For the three available validation sets (phone calls, YouTube, and books), our best end-to-end model achieves word error rates (WER) of 34.8%, 19.1%, and 18.1%, respectively. Under the same conditions, the hybrid ASR system demonstrates 33.5%, 20.9%, and 18.6% WER.
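The WER figures above follow the standard definition: the word-level Levenshtein distance between reference and hypothesis, divided by the number of reference words. A minimal sketch of that computation (the `wer` function here is an illustration, not code from the paper or its toolkits):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the edit distance between word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

In practice, evaluation toolkits such as Kaldi and ESPnet compute this metric internally; the sketch is only to make the reported percentages concrete (e.g., one substitution and one deletion against a four-word reference yields 50% WER).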
A. Andrusenko and A. Laptev contributed equally.
Acknowledgments
This work was partially financially supported by the Government of the Russian Federation (Grant 08-08).
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Andrusenko, A., Laptev, A., Medennikov, I. (2020). Exploration of End-to-End ASR for OpenSTT – Russian Open Speech-to-Text Dataset. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science(), vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5