Long-Distance Continuous Space Language Modeling for Speech Recognition

  • Mohamed Talaat
  • Sherif Abdou
  • Mahmoud Shoman
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9042)


N-gram language models have long been the most frequently used language models because they are easy to build and require minimal effort to integrate into different NLP applications. Despite their popularity, n-gram models suffer from several drawbacks: a limited ability to generalize to words unseen in the training data, poor adaptability to new domains, and a focus on short-distance word relations only. Continuous parameter space LMs were introduced to overcome these problems. In these models, words are treated as vectors of real numbers rather than as discrete entities. As a result, semantic relationships between words can be quantified and integrated into the model, and infrequent words are modeled using more frequent words that are semantically similar. In this paper we present a long-distance continuous language model based on latent semantic analysis (LSA). In the LSA framework, the word-document co-occurrence matrix is commonly used to record how many times a word occurs in a given document, and many previous studies have also used the word-word co-occurrence matrix. In this research we introduce a different representation of the text corpus by proposing long-distance word co-occurrence matrices, which represent the long-range co-occurrences between words at different distances in the corpus. Applying LSA to these matrices maps the words in the vocabulary into a continuous vector space. Each word is represented by a continuous vector that preserves word order and position in the sentences. We use tied-mixture HMM modeling (TM-HMM) to robustly estimate the LM parameters and word probabilities. Experiments on the Arabic Gigaword corpus show improvements in perplexity and in speech recognition results compared to the conventional n-gram model.
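The representation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the per-distance matrices are weighted or combined, or which rank is kept, so the choices here are assumptions made only for illustration: the function names `distance_cooccurrence` and `lsa_word_vectors` are hypothetical, the matrices are concatenated side by side, and a plain truncated SVD stands in for the LSA step. The TM-HMM estimation stage is not shown.

```python
import numpy as np

def distance_cooccurrence(sentences, vocab, d):
    """Count how often word j occurs exactly d positions after word i."""
    index = {w: k for k, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for pos in range(len(sent) - d):
            counts[index[sent[pos]], index[sent[pos + d]]] += 1
    return counts

def lsa_word_vectors(sentences, vocab, max_distance=2, rank=2):
    """Map each word to a dense vector via SVD of the stacked distance matrices."""
    # Assumption: the per-distance matrices are concatenated column-wise so
    # that each word's row keeps separate counts for every distance.
    mats = [distance_cooccurrence(sentences, vocab, d)
            for d in range(1, max_distance + 1)]
    stacked = np.hstack(mats)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    # Keep the leading `rank` dimensions, scaled by their singular values.
    return u[:, :rank] * s[:rank]

# Toy corpus (made up for illustration, not taken from the paper).
corpus = [["a", "b", "c"], ["a", "b", "a"]]
vocab = ["a", "b", "c"]
d1 = distance_cooccurrence(corpus, vocab, 1)
d2 = distance_cooccurrence(corpus, vocab, 2)
vectors = lsa_word_vectors(corpus, vocab, max_distance=2, rank=2)
```

Keeping one matrix per distance, rather than summing counts over a window, is what lets the resulting vectors retain information about word order and position.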


Language model · n-gram · Continuous space · Latent semantic analysis · Word co-occurrence matrix · Long distance · Tied-mixture model





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Faculty of Computers and Information, Cairo University, Giza, Egypt
