Progress in Neural Network Based Statistical Language Modeling

  • Anup Shrikant Kunte
  • Vahida Z. Attar
Part of the Studies in Computational Intelligence book series (SCI, volume 866)


Statistical Language Modeling (LM) is a central step in many Natural Language Processing (NLP) tasks, including Automatic Speech Recognition (ASR), Statistical Machine Translation (SMT), sentence completion, and automatic text generation, to name a few. A good-quality language model has been one of the key success factors for many commercial NLP applications. Over the past three decades, diverse research communities, including psychology, neuroscience, data compression, machine translation, speech recognition, and linguistics, have advanced research in the field of language modeling. We first cover the mathematical background of the LM problem. We then review various Neural Network based LM techniques in the order they were developed, including recent developments in Recurrent Neural Network (RNN) based language models. Early LM research in ASR gave rise to a commercially successful class of LMs called N-gram LMs. These models are purely statistical and do not exploit the linguistic information present in the text itself. With advances in computing power and the availability of large, rich sources of textual data, Neural Network based LMs paved their way into the arena. These techniques proved significant because they map word tokens into a continuous space rather than treating them as discrete symbols. As NNLM performance proved comparable to the then state-of-the-art N-gram LMs, researchers also successfully applied Deep Neural Networks to LM. Researchers soon realised that the inherent sequential nature of textual input makes LM a good candidate for the Recurrent Neural Network (RNN) architecture, and today RNNs are the neural architecture of choice for LM among most practitioners. This chapter sheds light on variants of Neural Network based LMs.
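As a toy illustration of the N-gram approach mentioned above (a minimal sketch, not code from the chapter), a bigram model estimates P(w_i | w_{i-1}) by maximum likelihood from raw counts:

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Estimate bigram probabilities P(w | prev) by maximum likelihood.

    corpus: list of sentences (plain strings). Sentence boundaries are
    marked with the conventional <s> / </s> pseudo-tokens.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Count each token that can serve as a history (all but the last).
        unigram_counts.update(tokens[:-1])
        # Count adjacent token pairs.
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    # P(w | prev) = count(prev, w) / count(prev)
    return {pair: count / unigram_counts[pair[0]]
            for pair, count in bigram_counts.items()}

corpus = ["the cat sat", "the dog sat"]
lm = train_bigram_lm(corpus)
# P(cat | the) = 0.5, since "the" is followed by "cat" once and "dog" once.
```

Unsmoothed counts like these assign zero probability to unseen pairs, which is precisely the sparsity problem that smoothing techniques and, later, continuous-space neural LMs were developed to address.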


Statistical language modeling · Natural language processing · Artificial intelligence · Machine learning · Deep learning



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. College of Engineering Pune, Pune, India
