Exploiting Comparable Corpora

Munteanu, Dragos Stefan; Marcu, Daniel

doi:10.1007/978-3-642-20128-8_11

Abstract

Comparable corpora exhibit various degrees of parallelism. Fung and Cheung [3] describe corpora ranging from noisy parallel, to comparable, and finally to very non-parallel. The last category contains corpora composed of “... disparate, very non-parallel bilingual documents that could either be on the same topic (on-topic) or not”. This is the type of corpora that our work attempts to exploit.

References

Cettolo, M., Federico, M., Bertoldi, N.: Mining parallel fragments from comparable texts. In: Proceedings of the 7th International Workshop on Spoken Language Translation, pp. 227–234 (2010)
Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19(1), 61–74 (1993)
Fung, P., Cheung, P.: Mining very non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pp. 57–63 (2004)
Fung, P., Cheung, P.: Mining very non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 57–63 (2004)
Fung, P., Cheung, P.: Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING), pp. 1051–1057 (2004)
Gaussier, E., Renders, J.M., Matveeva, I., Goutte, C., Dejean, H.: A geometric view on bilingual lexicon extraction from comparable corpora. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 527–534 (2004)
Koehn, P.: Statistical significance tests for machine translation evaluation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 388–395 (2004)
Koehn, P., Knight, K.: Learning a translation lexicon from monolingual corpora. In: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pp. 9–16 (2002)
Melamed, I.D.: Models of translational equivalence among words. Comput. Linguist. 26(2), 221–249 (2000)
Moore, R.C.: Improving IBM word-alignment model 1. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 519–526 (2004)
Moore, R.C.: On log-likelihood-ratios and the significance of rare events. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 333–340 (2004)
Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist. 31(4), 477–504 (2005)
Munteanu, D.S., Marcu, D.: Extracting parallel sub-sentential fragments from non-parallel corpora. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp. 81–88 (2006)
Och, F.J., Ney, H.: The alignment template approach to statistical machine translation. Comput. Linguist. 30(4), 417–450 (2004)
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
Quirk, C., Udupa, R.U., Menezes, A.: Generative models of noisy translations with applications to fragment extraction. In: Proceedings of MT Summit XI (2007)
Rapp, R.: Identifying word translation in non-parallel texts. In: Proceedings of the Conference of the Association for Computational Linguistics, pp. 320–322 (1995)
Rapp, R.: Automatic identification of word translations from unrelated English and German corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 519–526 (1999)
Resnik, P., Oard, D., Levow, G.: Improved cross-language retrieval using backoff translation. In: Proceedings of the 1st International Conference on Human Language Technology Research (2001)
Resnik, P., Smith, N.A.: The web as a parallel corpus. Comput. Linguist. 29(3), 349–380 (2003)
Uszkoreit, J., Ponte, J., Popat, A., Dubiner, M.: Large scale parallel document mining for machine translation. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pp. 1101–1109 (2010)
Utiyama, M., Isahara, H.: Reliable measures for aligning Japanese-English news articles and sentences. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 72–79 (2003)
Wu, D., Fung, P.: Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In: Proceedings of 2nd International Joint Conference on Natural Language Processing (IJCNLP), pp. 257–268 (2005)
Zhao, B., Vogel, S.: Adaptive parallel sentences mining from web bilingual news collection. In: 2002 IEEE International Conference on Data Mining, pp. 745–748 (2002)
Zhao, B., Vogel, S.: Full-text story alignment models for Chinese-English bilingual news corpora. In: Proceedings of the International Conference on Spoken Language Processing (2002)

Author information

Authors and Affiliations

SDL Language Weaver, Los Angeles, USA
Dragos Stefan Munteanu & Daniel Marcu

Corresponding author

Correspondence to Dragos Stefan Munteanu.

Editor information

Editors and Affiliations

Centre for Translation Studies, University of Leeds, Leeds, United Kingdom
Serge Sharoff
University of Mainz, Mainz, Germany
Reinhard Rapp
Université de Paris-Sud LIMSI-CNRS, Orsay, France
Pierre Zweigenbaum
Electronic & Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Pascale Fung

About this chapter

Cite this chapter

Munteanu, D.S., Marcu, D. (2013). Exploiting Comparable Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_11

DOI: https://doi.org/10.1007/978-3-642-20128-8_11
Published: 14 December 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)