Learning Latent Topic Information for Language Model Adaptation

Lu, Shixiang; Wei, Wei; Fu, Xiaoyin; Fan, Lichun; Xu, Bo

doi:10.1007/978-3-642-34456-5_14

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 333))

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

This paper addresses data selection for language model (LM) adaptation in statistical machine translation (SMT): the goal is to find LM training sentences that are topically similar to the translation task. Although traditional methods have achieved significant performance gains, they ignore topic information and the distribution of words when calculating sentence similarity. The authors propose a topic model that discovers the latent topics in the content of sentences, and combine the latent-topic-based similarity with TF-IDF in a unified framework for data selection. Furthermore, they combine a cross-lingual projection method with the topic model, so that data selection depends directly on the source input. Large-scale experiments demonstrate that the proposed approach significantly outperforms traditional approaches on both LM perplexity and SMT performance.
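
As a rough illustration of the idea of combining a topic-based similarity with TF-IDF for data selection, the sketch below ranks candidate LM training sentences against a query sentence. It is not the authors' formulation, which the abstract does not give: the linear interpolation weight `lam`, the use of cosine similarity for both components, the smoothed IDF, and the function names are all assumptions made for illustration. The per-sentence topic distributions are hypothetical inputs that would in practice come from a trained topic model.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption) so common terms keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_sentences(query, candidates, query_topics, cand_topics,
                     lam=0.5, top_k=2):
    """Rank candidates by lam * topic_sim + (1 - lam) * tfidf_sim.

    The interpolation is a hypothetical way to unify the two scores;
    topic vectors are assumed to be given (e.g. inferred by a topic model).
    Returns the indices of the top_k candidates, best first.
    """
    vecs = tfidf_vectors([query] + candidates)
    q_vec = vecs[0]
    q_top = dict(enumerate(query_topics))
    scored = []
    for i, cand_vec in enumerate(vecs[1:]):
        tfidf_sim = cosine(q_vec, cand_vec)
        topic_sim = cosine(q_top, dict(enumerate(cand_topics[i])))
        scored.append((lam * topic_sim + (1.0 - lam) * tfidf_sim, i))
    # Highest score first; ties broken by original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_k]]


# Toy data: two hypothetical topics (finance, sport).
query = "the central bank raised interest rates".split()
candidates = [
    "the bank raised its interest rate again".split(),
    "the team won the football match".split(),
    "investors expect rates to rise".split(),
]
query_topics = [0.9, 0.1]
cand_topics = [[0.8, 0.2], [0.1, 0.9], [0.85, 0.15]]

selected = select_sentences(query, candidates, query_topics, cand_topics)
print(selected)  # → [0, 2]
```

In this toy example the football sentence shares a surface word with the query but has a dissimilar topic distribution, so the topic term pushes it below the third sentence, which shares only one word but is close in topic.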

Author information

Authors and Affiliations

Interactive Digital Media Technology Research Center, Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun East Road, Haidian District, Beijing, 100190, China
Shixiang Lu, Wei Wei, Xiaoyin Fu, Lichun Fan & Bo Xu

Editor information

Editors and Affiliations

Microsoft Research Asia, Beijing, China
Ming Zhou
Soochow University, 215006, Suzhou, China
Guodong Zhou
Institute of Computer Science & Technology, Peking University, 100871, Beijing, China
Dongyan Zhao & Lei Zou
Institute of Computing Technology, Chinese Academy of Sciences, No.6 Kexueyuan South Road, Haidian District, 100190, Beijing, China
Qun Liu

About this paper

Cite this paper

Lu, S., Wei, W., Fu, X., Fan, L., Xu, B. (2012). Learning Latent Topic Information for Language Model Adaptation. In: Zhou, M., Zhou, G., Zhao, D., Liu, Q., Zou, L. (eds) Natural Language Processing and Chinese Computing. NLPCC 2012. Communications in Computer and Information Science, vol 333. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34456-5_14

DOI: https://doi.org/10.1007/978-3-642-34456-5_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34455-8
Online ISBN: 978-3-642-34456-5
eBook Packages: Computer Science, Computer Science (R0)
