End-to-End Speech Recognition in Agglutinative Languages

  • Orken Mamyrbayev
  • Keylan Alimhan
  • Bagashar Zhumazhanov
  • Tolganay Turdalykyzy
  • Farida Gusmanova
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12034)


This paper considers end-to-end speech recognition systems based on deep neural networks (DNNs). The study used several types of neural networks, a CTC model, and attention-based encoder-decoder models. The results show that the CTC model can be applied directly to agglutinative languages without a language model, but the best result was obtained with ResNet, which reached a CER of 11.52% and a WER of 19.57% when using the language model. An experiment with a BLSTM neural network using the attention-based encoder-decoder model yielded a CER of 8.01% and a WER of 17.91%. The experiments show that good results can be achieved without integrating language models. The best result was shown by ResNet.
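The abstract reports results as character error rate (CER) and word error rate (WER). As a minimal sketch of how these two metrics are commonly computed, assuming the standard edit-distance definition, the example below may help. It is not the authors' scoring code; the function names, the decision to count spaces as characters in CER, and the transliterated sample sentence are all illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces are counted as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical transliterated example: one wrong suffix letter in one word.
ref = "men mektepke bardym"
hyp = "men mektepte bardym"
print(f"WER = {wer(ref, hyp):.3f}, CER = {cer(ref, hyp):.3f}")
```

In this example a single wrong character makes one of three words incorrect, so the WER is much higher than the CER. This is one reason both metrics are usually reported together for agglutinative languages, where words are long and carry several suffixes.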


Keywords: Speech recognition, Agglutinative languages, End-to-End models, Deep learning, CTC



This work was supported by the Ministry of Education and Science of the Republic of Kazakhstan under grant IRN AP05131207, "Development of technologies for multilingual automatic speech recognition using deep neural networks".



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Institute of Information and Computational Technologies, Almaty, Kazakhstan
  2. Tokyo Denki University, Tokyo, Japan
  3. Al-Farabi Kazakh National University, Almaty, Kazakhstan
