Acoustic Modeling in the STC Keyword Search System for OpenKWS 2016 Evaluation

  • Ivan Medennikov
  • Aleksei Romanenko
  • Alexey Prudnikov
  • Valentin Mendelev
  • Yuri Khokhlov
  • Maxim Korenevsky
  • Natalia Tomashenko
  • Alexander Zatvornitskiy
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)


This paper describes in detail the acoustic modeling part of the keyword search system developed at the Speech Technology Center (STC) for the OpenKWS 2016 evaluation. The key idea was to exploit diversity in both sound representations and acoustic model architectures. For the former, we extended the speaker-dependent bottleneck (SDBN) approach to the multilingual case, which is the main contribution of the paper. Two types of multilingual SDBN features were applied in addition to conventional spectral and cepstral features. The acoustic model architectures employed in the final system are based on deep feedforward and recurrent neural networks. We also applied speaker adaptation of acoustic models using multilingual i-vectors, data augmentation based on speed perturbation, and semi-supervised training. The final STC system comprised 9 acoustic models, which allowed it to achieve strong performance and to rank among the top three systems in the evaluation.


Keywords: Acoustic models; Low-resource speech recognition; Multilingual speaker-dependent bottleneck features; OpenKWS 2016



This work was financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0121 (ID RFMEFI57915X0121).

This effort uses the IARPA Babel Program language collection releases IARPA-babel{101b-v0.4c, 102b-v0.5a, 103b-v0.4b, 201b-v0.2b, 203b-v3.1a, 205b-v1.0a, 206b-v0.1e, 207b-v1.0e, 301b-v2.0b, 302b-v1.0a, 303b-v1.0a, 304b-v1.0b, 305b-v1.0c, 306b-v2.0c, 307b-v1.0b, 401b-v2.0b, 402b-v1.0b, 403b-v1.0b, 404b-v1.0a}, the set of training transcriptions, and the BBN part of the clean web data for the Georgian language.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Ivan Medennikov (1, 2)
  • Aleksei Romanenko (1, 2)
  • Alexey Prudnikov (1)
  • Valentin Mendelev (1, 2)
  • Yuri Khokhlov (1)
  • Maxim Korenevsky (1, 2)
  • Natalia Tomashenko (1, 2)
  • Alexander Zatvornitskiy (1, 2)

  1. STC-innovations Ltd., St. Petersburg, Russia
  2. ITMO University, St. Petersburg, Russia
