Abstract
This paper presents an exploration of end-to-end automatic speech recognition (ASR) systems for the largest open-source Russian-language dataset, OpenSTT. We evaluate several existing end-to-end approaches: joint CTC/Attention, RNN-Transducer, and Transformer. All of them are compared with a strong hybrid ASR system based on an LF-MMI TDNN-F acoustic model.
For the three available validation sets (phone calls, YouTube, and books), our best end-to-end model achieves word error rates (WER) of 34.8%, 19.1%, and 18.1%, respectively. Under the same conditions, the hybrid ASR system demonstrates 33.5%, 20.9%, and 18.6% WER.
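The WER figures above follow the standard definition: the word-level Levenshtein distance between reference and hypothesis, divided by the number of reference words. A minimal sketch of that computation (the `wer` function here is an illustration, not code from the paper or its toolkits):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the edit distance between word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

In practice, evaluation toolkits such as Kaldi and ESPnet compute this metric internally; the sketch is only to make the reported percentages concrete (e.g., one substitution and one deletion against a four-word reference yields 50% WER).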
A. Andrusenko and A. Laptev contributed equally.
Acknowledgments
This work was partially financially supported by the Government of the Russian Federation (Grant 08-08).
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Andrusenko, A., Laptev, A., Medennikov, I. (2020). Exploration of End-to-End ASR for OpenSTT – Russian Open Speech-to-Text Dataset. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science(), vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5