Experimenting with Hybrid TDNN/HMM Acoustic Models for Russian Speech Recognition

  • Irina Kipyatkova
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)


In this paper, we study the application of time delay neural networks (TDNNs) to acoustic modeling for large vocabulary continuous Russian speech recognition. We built TDNNs with p-norm nonlinearity, varying the number of hidden layers and the number of units per hidden layer. The acoustic models were trained on our own Russian speech corpus of phonetically balanced phrases, which contains more than 30 h of speech. The TDNN-based acoustic models were tested on a very large vocabulary continuous Russian speech recognition task. The experiments showed that the TDNN models outperformed the baseline deep neural network models in terms of word error rate.
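The two technical notions named in the abstract, the p-norm nonlinearity and the word error rate, can be illustrated with a minimal sketch. It assumes the commonly used definitions: a p-norm unit outputs (sum |x_i|^p)^(1/p) over a group of inputs, and the word error rate is the word-level edit distance divided by the number of reference words. The function names, the group size, and the value p = 2 are illustrative assumptions; the abstract does not state the settings used in the paper.

```python
def pnorm(values, group_size, p=2.0):
    """Apply a p-norm nonlinearity to consecutive groups of inputs.

    Each group of `group_size` inputs is reduced to one output:
    y = (sum_i |x_i|^p)^(1/p). Group size and p are assumptions here,
    not values taken from the paper.
    """
    if group_size <= 0 or len(values) % group_size != 0:
        raise ValueError("input length must be a multiple of group_size")
    outputs = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        outputs.append(sum(abs(v) ** p for v in group) ** (1.0 / p))
    return outputs


def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(pnorm([3.0, 4.0, 0.0, 0.0], group_size=2))
    print(word_error_rate("a b c d", "a x c"))
```

A p-norm layer produces fewer outputs than inputs by a factor of the group size, which is why the number of units in such networks is usually given as an input/output dimension pair.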


Time delay neural networks, Acoustic models, Automatic speech recognition, Russian speech



This research is partially supported by the Council for Grants of the President of the Russian Federation (project No. MK-1000.2017.8) and by the Russian Foundation for Basic Research (project No. 15–07–04322).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), St. Petersburg, Russia
  2. St. Petersburg State University of Aerospace Instrumentation (SUAI), St. Petersburg, Russia
