Long Distance Dependency in Language Modeling: An Empirical Study

  • Jianfeng Gao
  • Hisami Suzuki
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3248)


This paper presents an extensive empirical study of two language modeling techniques, linguistically-motivated word skipping and predictive clustering, both of which capture long distance word dependencies that are beyond the scope of a word trigram model. We compare the techniques to others previously proposed for the same purpose, and evaluate the resulting models on the task of Japanese Kana-Kanji conversion. We show that the two techniques, while simple, outperform the existing methods studied in this paper and yield language models that perform significantly better than a word trigram model. We also investigate how factors such as training corpus size and genre affect the performance of the models.
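
As a rough illustration of predictive clustering, the sketch below uses the decomposition commonly given in the clustering literature, P(w | h) = P(c(w) | h) × P(w | h, c(w)), where c(w) is the cluster of word w. It is not the authors' implementation: the one-word history, the unsmoothed count ratios, the hand-made cluster map, and all function names are assumptions made for brevity, whereas the models in the paper are built on trigram histories.

```python
from collections import Counter

def train_counts(sentences, cluster_of):
    """Collect the counts needed for a one-word-history predictive cluster model."""
    hist = Counter()          # count(h)
    hist_cluster = Counter()  # count(h, c(w))
    hist_word = Counter()     # count(h, w)
    for sent in sentences:
        tokens = ["<s>"] + sent
        for prev, word in zip(tokens, tokens[1:]):
            hist[prev] += 1
            hist_cluster[(prev, cluster_of[word])] += 1
            hist_word[(prev, word)] += 1
    return hist, hist_cluster, hist_word

def predictive_cluster_prob(word, prev, cluster_of, counts):
    """Unsmoothed P(w | h) = P(c(w) | h) * P(w | h, c(w))."""
    hist, hist_cluster, hist_word = counts
    c = cluster_of[word]
    if hist_cluster[(prev, c)] == 0:
        return 0.0  # unseen history/cluster pair; a real model would back off
    p_cluster = hist_cluster[(prev, c)] / hist[prev]
    p_word = hist_word[(prev, word)] / hist_cluster[(prev, c)]
    return p_cluster * p_word

# Toy data; the cluster assignments are hand-made for illustration only.
cluster_of = {"on": "PREP", "in": "PREP", "monday": "DAY",
              "tuesday": "DAY", "meet": "VERB"}
sentences = [["meet", "on", "monday"],
             ["meet", "on", "tuesday"],
             ["meet", "in", "monday"]]
counts = train_counts(sentences, cluster_of)
p = predictive_cluster_prob("monday", "on", cluster_of, counts)
```

Without smoothing, the product reduces to the plain relative frequency count(h, w) / count(h), so the decomposition changes nothing by itself. Any benefit comes from smoothing the two factors separately, since the cluster-level factor is estimated from less sparse counts.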


Language Modeling, Function Word, Distance Dependency, Statistical Language Modeling, Predictive Cluster
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Jianfeng Gao (1)
  • Hisami Suzuki (2)
  1. Microsoft Research, Asia, Beijing
  2. Microsoft Research, Redmond, USA
