Abstract
This paper addresses data selection for language model (LM) adaptation in statistical machine translation (SMT), and aims to find the LM training sentences that are topically similar to the translation task. Although traditional methods achieve significant performance gains, they ignore topic information and the distribution of words when calculating sentence similarity. In this paper, the authors propose a topic model to discover the latent topics in the content of sentences, and combine the latent-topic-based similarity with TF-IDF in a unified framework for data selection. Furthermore, the authors combine a cross-lingual projection method with the topic model, so that data selection depends directly on the source input. Large-scale experimental results demonstrate that the proposed approach significantly outperforms the traditional approaches on both LM perplexity and SMT performance.
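The abstract does not state how the latent-topic similarity and TF-IDF are combined, so the following is only a minimal sketch of the general idea under stated assumptions: both similarities are taken to be cosine similarities, they are merged by linear interpolation with a hypothetical weight `lam`, and the per-sentence topic distributions are assumed to have been inferred beforehand by some topic model. The function names and the toy data are illustrative and do not come from the paper; the cross-lingual projection step is not covered.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (word -> weight) for each tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = max(len(doc), 1)  # guard against empty documents
        vectors.append({
            # Smoothed IDF (an assumption, not taken from the paper).
            word: (count / total) * (math.log((1 + n) / (1 + df[word])) + 1.0)
            for word, count in tf.items()
        })
    return vectors


def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def dense_cosine(p, q):
    """Cosine similarity between two equal-length topic distributions."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = (math.sqrt(sum(x * x for x in p))
            * math.sqrt(sum(y * y for y in q)))
    return dot / norm if norm else 0.0


def select_top_k(query, query_topics, corpus, corpus_topics, k, lam=0.5):
    """Return indices of the k corpus sentences most similar to the query.

    The score is lam * topic similarity + (1 - lam) * TF-IDF similarity;
    this interpolation is an assumed stand-in for the paper's framework.
    """
    vectors = tfidf_vectors(corpus + [query])
    query_vec = vectors[-1]
    scored = []
    for i, (vec, topics) in enumerate(zip(vectors[:-1], corpus_topics)):
        score = (lam * dense_cosine(query_topics, topics)
                 + (1 - lam) * sparse_cosine(query_vec, vec))
        scored.append((score, i))
    # Highest score first; ties broken by original position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in scored[:k]]


# Toy data: tokenized sentences with made-up two-topic distributions.
corpus = [
    ["stock", "market", "falls"],
    ["team", "wins", "match"],
    ["market", "prices", "rise"],
]
corpus_topics = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
selected = select_top_k(["market", "falls", "today"], [0.85, 0.15],
                        corpus, corpus_topics, k=2)
```

Setting `lam` to 0 reduces the score to plain TF-IDF retrieval, which makes it easy to compare against a topic-free baseline on the same candidate pool.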
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lu, S., Wei, W., Fu, X., Fan, L., Xu, B. (2012). Learning Latent Topic Information for Language Model Adaptation. In: Zhou, M., Zhou, G., Zhao, D., Liu, Q., Zou, L. (eds) Natural Language Processing and Chinese Computing. NLPCC 2012. Communications in Computer and Information Science, vol 333. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34456-5_14
DOI: https://doi.org/10.1007/978-3-642-34456-5_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34455-8
Online ISBN: 978-3-642-34456-5
eBook Packages: Computer Science, Computer Science (R0)