LSTM-Based Language Models for Spontaneous Speech Recognition

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9811)

Abstract

The language models (LMs) used in speech recognition to predict the next word given its context often rely on a context that is too short, which leads to recognition errors. In theory, recurrent neural networks (RNNs) should solve this problem, but in practice they do not fully exploit the potential of the long context. RNN-based language models with long short-term memory (LSTM) units make better use of the long context and achieve good perplexity on many datasets. We used LSTM LMs trained with regularization to rescore recognition word lattices and obtained a substantially lower word error rate (WER) than with n-gram and conventional RNN-based LMs for both Russian and English.
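To make the approach concrete, below is a minimal sketch (not the authors' code) of an LSTM language model with dropout regularization in the style of Zaremba et al. (2014), the kind of model the abstract describes for lattice rescoring. All hyper-parameters (vocabulary size, embedding and hidden dimensions, dropout rate) are illustrative assumptions, not values from the paper.

```python
# Illustrative LSTM language model with dropout regularization (PyTorch).
# Hyper-parameters below are assumed for the example, not taken from the paper.
import torch
import torch.nn as nn


class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 num_layers=2, dropout=0.5):
        super().__init__()
        self.drop = nn.Dropout(dropout)               # dropout on non-recurrent connections
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            dropout=dropout, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)  # next-word logits

    def forward(self, tokens, hidden=None):
        # tokens: (batch, seq_len) word indices
        emb = self.drop(self.embed(tokens))
        out, hidden = self.lstm(emb, hidden)
        logits = self.proj(self.drop(out))            # (batch, seq_len, vocab)
        return logits, hidden


if __name__ == "__main__":
    # Toy usage: score a random token sequence with cross-entropy,
    # the quantity from which perplexity is computed (ppl = exp(mean NLL)).
    vocab_size = 1000
    model = LSTMLanguageModel(vocab_size)
    tokens = torch.randint(0, vocab_size, (4, 20))
    logits, _ = model(tokens[:, :-1])
    nll = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    print("perplexity:", torch.exp(nll).item())
```

In a rescoring setup, a model like this would assign log-probabilities to the word sequences in the recognition lattice (or an n-best list derived from it), and these scores would be interpolated with the acoustic and n-gram LM scores before selecting the best hypothesis.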

Keywords

Recurrent neural networks · Long short-term memory · Language models · Automatic speech recognition


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. STC-innovations Ltd, St. Petersburg, Russia
  2. ITMO University, St. Petersburg, Russia
