Advances in STC Russian Spontaneous Speech Recognition System

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9811)

Abstract

In this paper we present the latest improvements to the Russian spontaneous speech recognition system developed at Speech Technology Center (STC). A significant word error rate (WER) reduction was obtained by rescoring hypotheses with sophisticated language models: a Recurrent Neural Network language model and a regularized Long Short-Term Memory language model. For acoustic modeling we used a deep neural network (DNN) trained on speaker-dependent bottleneck features, as in our previous system. This DNN was combined with a deep Bidirectional Long Short-Term Memory acoustic model via score fusion. The resulting system achieves a WER of 16.4 %, an absolute reduction of 8.7 % (34.7 % relative) over our previous system's result on this test set.
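The second-pass hypothesis rescoring described above can be illustrated with a short sketch. The Python code below is a minimal illustration, not the authors' implementation: the `Hypothesis` class, the `neural_lm_logprob` scorer, and the weight values are all assumptions introduced for the example. It interpolates the first-pass n-gram LM score with a neural LM score and re-ranks an N-best list accordingly.

```python
# A minimal sketch (not the authors' code) of second-pass N-best rescoring
# with a neural language model. Names such as Hypothesis, neural_lm_logprob
# and the weight values below are illustrative assumptions.

import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    words: List[str]
    acoustic_score: float   # acoustic log-likelihood from the first pass
    ngram_lm_score: float   # log-probability under the first-pass n-gram LM


def log_interp(log_a: float, log_b: float, alpha: float) -> float:
    """Stable log(alpha * exp(log_a) + (1 - alpha) * exp(log_b))."""
    m = max(log_a, log_b)
    return m + math.log(alpha * math.exp(log_a - m)
                        + (1.0 - alpha) * math.exp(log_b - m))


def rescore_nbest(
    hypotheses: List[Hypothesis],
    neural_lm_logprob: Callable[[List[str]], float],  # hypothetical scorer
    lm_weight: float = 12.0,   # LM scale; tuned on a dev set in practice
    alpha: float = 0.5,        # n-gram vs. neural LM interpolation weight
) -> Hypothesis:
    """Pick the hypothesis with the best combined acoustic + LM score."""
    def combined(hyp: Hypothesis) -> float:
        lm_logprob = log_interp(hyp.ngram_lm_score,
                                neural_lm_logprob(hyp.words), alpha)
        return hyp.acoustic_score + lm_weight * lm_logprob

    return max(hypotheses, key=combined)
```

The same log-linear combination idea underlies the score fusion of acoustic models mentioned in the abstract, where frame-level scores from the DNN and BLSTM models are combined with tuned weights.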

Keywords

Spontaneous speech recognition · Bottleneck features · Deep neural networks · Recurrent neural networks


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. STC-innovations Ltd, St. Petersburg, Russia
  2. ITMO University, St. Petersburg, Russia
  3. Speech Technology Center Ltd, St. Petersburg, Russia