Exploiting Comparable Corpora

Building and Using Comparable Corpora

Abstract

Comparable corpora exhibit various degrees of parallelism. Fung and Cheung [3] describe corpora ranging from noisy parallel, to comparable, and finally to very non-parallel. The last category contains corpora composed of “... disparate, very non-parallel bilingual documents that could either be on the same topic (on-topic) or not”. This is the type of corpus that our work attempts to exploit.

Notes

  1. http://www.statmt.org/wpt05

  2. http://www.ldc.upenn.edu

  3. http://www.nist.gov/speech/tests/mt

References

  1. Cettolo, M., Federico, M., Bertoldi, N.: Mining parallel fragments from comparable texts. In: Proceedings of the 7th International Workshop on Spoken Language Translation, pp. 227–234 (2010)

  2. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19(1), 61–74 (1993)

  3. Fung, P., Cheung, P.: Mining very non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pp. 57–63 (2004)

  4. Fung, P., Cheung, P.: Mining very non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 57–63 (2004)

  5. Fung, P., Cheung, P.: Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING), pp. 1051–1057 (2004)

  6. Gaussier, E., Renders, J.M., Matveeva, I., Goutte, C., Dejean, H.: A geometric view on bilingual lexicon extraction from comparable corpora. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 527–534 (2004)

  7. Koehn, P.: Statistical significance tests for machine translation evaluation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 388–395 (2004)

  8. Koehn, P., Knight, K.: Learning a translation lexicon from monolingual corpora. In: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pp. 9–16 (2002)

  9. Melamed, I.D.: Models of translational equivalence among words. Comput. Linguist. 26(2), 221–249 (2000)

  10. Moore, R.C.: Improving IBM word-alignment model 1. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 519–526 (2004)

  11. Moore, R.C.: On log-likelihood-ratios and the significance of rare events. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 333–340 (2004)

  12. Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist. 31(4), 477–504 (2005)

  13. Munteanu, D.S., Marcu, D.: Extracting parallel sub-sentential fragments from non-parallel corpora. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp. 81–88 (2006)

  14. Och, F.J., Ney, H.: The alignment template approach to statistical machine translation. Comput. Linguist. 30(4), 417–450 (2004)

  15. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)

  16. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)

  17. Quirk, C., Udupa, R.U., Menezes, A.: Generative models of noisy translations with applications to parallel fragment extraction. In: Proceedings of MT Summit XI (2007)

  18. Rapp, R.: Identifying word translations in non-parallel texts. In: Proceedings of the Conference of the Association for Computational Linguistics, pp. 320–322 (1995)

  19. Rapp, R.: Automatic identification of word translations from unrelated English and German corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 519–526 (1999)

  20. Resnik, P., Oard, D., Levow, G.: Improved cross-language retrieval using backoff translation. In: Proceedings of the 1st International Conference on Human Language Technology Research (2001)

  21. Resnik, P., Smith, N.A.: The web as a parallel corpus. Comput. Linguist. 29(3), 349–380 (2003)

  22. Uszkoreit, J., Ponte, J., Popat, A., Dubiner, M.: Large scale parallel document mining for machine translation. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pp. 1101–1109 (2010)

  23. Utiyama, M., Isahara, H.: Reliable measures for aligning Japanese-English news articles and sentences. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 72–79 (2003)

  24. Wu, D., Fung, P.: Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In: Proceedings of 2nd International Joint Conference on Natural Language Processing (IJCNLP), pp. 257–268 (2005)

  25. Zhao, B., Vogel, S.: Adaptive parallel sentences mining from web bilingual news collection. In: 2002 IEEE International Conference on Data Mining, pp. 745–748 (2002)

  26. Zhao, B., Vogel, S.: Full-text story alignment models for Chinese-English bilingual news corpora. In: Proceedings of the International Conference on Spoken Language Processing (2002)

Author information

Corresponding author

Correspondence to Dragos Stefan Munteanu.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Munteanu, D.S., Marcu, D. (2013). Exploiting Comparable Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_11

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science (R0)
