Making Language Model as Small as Possible in Statistical Machine Translation

  • Yang Liu
  • Jiajun Zhang
  • Jie Hao
  • Dakun Zhang
Part of the Communications in Computer and Information Science book series (CCIS, volume 493)


As one of the key components of statistical machine translation, the n-gram language model is used in virtually every system. Typically, a higher-order language model leads to better translation performance. However, a higher-order n-gram model requires much more monolingual training data to avoid data sparseness, and the model size increases exponentially as the n-gram order grows. In this paper, we investigate language model pruning techniques that aim to make the model as small as possible while preserving translation quality. Based on this investigation, we further propose replacing the higher-order n-grams with a low-order cluster-based language model. Extensive experiments show that our method is very effective.


Keywords: language model pruning; frequent n-gram clustering; statistical machine translation





Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Yang Liu¹
  • Jiajun Zhang¹
  • Jie Hao²
  • Dakun Zhang²
  1. NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. Toshiba (China) R&D Center, Beijing, China
