International Conference on Speech and Computer

SPECOM 2015: Speech and Computer, pp. 234–242

Improving Acoustic Models for Russian Spontaneous Speech Recognition

  • Alexey Prudnikov
  • Ivan Medennikov
  • Valentin Mendelev
  • Maxim Korenevsky
  • Yuri Khokhlov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9319)

Abstract

This paper investigates ways to improve acoustic models for Russian spontaneous speech recognition. We applied the main steps of the Kaldi Switchboard recipe to a Russian dataset but obtained low accuracy compared to the results reported for English spontaneous telephone speech. We found two methods to be especially useful for Russian spontaneous speech: i-vector based deep neural network adaptation and speaker-dependent bottleneck features, which provide 8.6% and 11.9% relative word error rate reduction over the baseline system, respectively.
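
The first of these methods, i-vector based DNN adaptation, is commonly realised by appending a fixed-length speaker i-vector to the acoustic features of every frame, so the network can learn to normalise away speaker variability. Below is a minimal sketch of that input construction; the dimensions (40-dim filterbanks spliced over 11 frames, a 100-dim i-vector) and the function name are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def append_speaker_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Concatenate one speaker-level i-vector to every acoustic frame.

    frames  : (num_frames, feat_dim) spliced acoustic features
    ivector : (ivec_dim,) i-vector estimated from the speaker's audio
    returns : (num_frames, feat_dim + ivec_dim) DNN input matrix
    """
    # The same i-vector is repeated for every frame of this speaker.
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.hstack([frames, tiled])

# Illustrative shapes only: 440-dim spliced features + 100-dim i-vector
# give a 540-dim network input.
frames = np.random.randn(300, 40 * 11).astype(np.float32)
ivector = np.random.randn(100).astype(np.float32)
dnn_input = append_speaker_ivector(frames, ivector)
assert dnn_input.shape == (300, 40 * 11 + 100)
```

Because the i-vector is constant within a speaker, the first network layer can effectively learn a speaker-dependent bias from it, which is what makes the scheme an adaptation method rather than merely a larger feature vector.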

Keywords

Speech recognition · Russian spontaneous speech · Deep neural networks · Speaker adaptation · I-vectors · Bottleneck features

Acknowledgements

This work was financially supported in part by the Government of the Russian Federation (Grant 074-U01) and by the Ministry of Education and Science of the Russian Federation (contract 14.579.21.0057, ID RFMEFI57914X0057).

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Alexey Prudnikov (1, 2)
  • Ivan Medennikov (2, 3)
  • Valentin Mendelev (1)
  • Maxim Korenevsky (1, 2)
  • Yuri Khokhlov (3)

  1. Speech Technology Center Ltd, St. Petersburg, Russia
  2. ITMO University, St. Petersburg, Russia
  3. STC-innovations Ltd, St. Petersburg, Russia
