Exploiting Comparable Corpora for Lexicon Extraction: Measuring and Improving Corpus Quality

  • Bo Li
  • Eric Gaussier
Chapter

Abstract

We study in this chapter the problem of measuring the degree of comparability of bilingual corpora, with applications to bilingual lexicon extraction. We first develop a measure which can capture different comparability levels. This measure correlates very well with gold-standard comparability levels and is relatively robust to dictionary coverage. We then propose a well-founded algorithm to improve the quality, in terms of comparability scores, of existing comparable corpora, prior to showing that the bilingual lexicons extracted from corpora enhanced in this way are of better quality. All the experiments in this chapter are performed on French-English comparable corpora.
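To make the idea of a dictionary-based comparability score concrete, here is a minimal sketch. It is not necessarily the chapter's definition: it assumes the score is the share of dictionary-covered vocabulary, counted over both sides, that has at least one translation present in the other corpus. The function name `comparability`, the toy corpora, and the toy dictionaries are all hypothetical and serve only as illustration.

```python
def comparability(source_docs, target_docs, src_to_tgt, tgt_to_src):
    """Share of dictionary-covered vocabulary with a translation in the other corpus.

    Assumed, illustrative definition; not taken from the chapter.
    Each corpus is a list of tokenized documents; each dictionary maps
    a word to a set of candidate translations.
    """
    src_vocab = {w for doc in source_docs for w in doc}
    tgt_vocab = {w for doc in target_docs for w in doc}

    # Only words covered by the bilingual dictionary can be checked.
    src_covered = src_vocab & set(src_to_tgt)
    tgt_covered = tgt_vocab & set(tgt_to_src)

    # A word counts as a hit if any of its translations occurs on the other side.
    src_hits = sum(1 for w in src_covered if src_to_tgt[w] & tgt_vocab)
    tgt_hits = sum(1 for w in tgt_covered if tgt_to_src[w] & src_vocab)

    total = len(src_covered) + len(tgt_covered)
    return (src_hits + tgt_hits) / total if total else 0.0


# Toy French-English data (hypothetical).
french_docs = [["chat", "noir", "maison"]]
english_docs = [["cat", "house", "garden"]]
fr_en = {"chat": {"cat"}, "noir": {"black"}, "maison": {"house", "home"}}
en_fr = {"cat": {"chat"}, "house": {"maison"}, "garden": {"jardin"}}

score = comparability(french_docs, english_docs, fr_en, en_fr)
print(score)
```

Under this assumed definition, a pair of corpora whose covered vocabulary is fully translatable into the other side scores 1, and unrelated corpora score close to 0. Restricting the count to dictionary-covered words is one way such a score could remain fairly stable when dictionary coverage changes.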

Keywords

Comparability Measure, Comparability Score, Parallel Corpus, Test Corpus, Correct Translation

Notes

Acknowledgments

This work was supported by the French National Research Agency grant ANR-08-CORD-009.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Laboratoire d’Informatique de Grenoble (LIG), Université Joseph Fourier Grenoble 1 / CNRS, Grenoble, France
