Exploiting Comparable Corpora for Lexicon Extraction: Measuring and Improving Corpus Quality

Building and Using Comparable Corpora

Abstract

We study in this chapter the problem of measuring the degree of comparability of bilingual corpora, with applications to bilingual lexicon extraction. We first develop a measure which can capture different comparability levels. This measure correlates very well with gold-standard comparability levels and is relatively robust to dictionary coverage. We then propose a well-founded algorithm to improve the quality, in terms of comparability scores, of existing comparable corpora, prior to showing that the bilingual lexicons extracted from corpora enhanced in this way are of better quality. All the experiments in this chapter are performed on French-English comparable corpora.
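The abstract does not define the comparability measure, so the following is only a minimal sketch of one plausible dictionary-based score: the share of dictionary-covered vocabulary words, on both sides, that have at least one translation in the other corpus. The function name `comparability`, its arguments, and the toy vocabulary are assumptions made for illustration and are not taken from the chapter.

```python
def comparability(source_vocab, target_vocab, dictionary):
    """Illustrative score in [0, 1]; NOT the chapter's definition (assumption).

    source_vocab, target_vocab: sets of words seen in each corpus.
    dictionary: maps a source word to an iterable of target translations.
    """
    # Normalise dictionary values to sets so set intersection works below.
    forward = {src: set(tgts) for src, tgts in dictionary.items()}

    # Invert the dictionary so target words can be looked up as well.
    reverse = {}
    for src, tgts in forward.items():
        for tgt in tgts:
            reverse.setdefault(tgt, set()).add(src)

    # Only words covered by the dictionary can be checked for a translation.
    src_covered = [w for w in source_vocab if w in forward]
    tgt_covered = [w for w in target_vocab if w in reverse]

    # A "hit" is a word with at least one translation in the other corpus.
    src_hits = sum(1 for w in src_covered if forward[w] & target_vocab)
    tgt_hits = sum(1 for w in tgt_covered if reverse[w] & source_vocab)

    total = len(src_covered) + len(tgt_covered)
    return (src_hits + tgt_hits) / total if total else 0.0


# Hypothetical French-English toy data.
fr_vocab = {"chat", "chien", "maison"}
en_vocab = {"cat", "house", "car"}
fr_en = {
    "chat": {"cat"},
    "chien": {"dog"},
    "maison": {"house", "home"},
    "voiture": {"car"},
}
score = comparability(fr_vocab, en_vocab, fr_en)
```

Restricting the count to dictionary-covered words is one simple way to keep such a score from dropping merely because the dictionary is small, which is the kind of robustness to dictionary coverage the abstract mentions.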

Notes

  1. http://www.statmt.org/europarl/

  2. http://trec.nist.gov/

  3. http://www.clef-campaign.org/

  4. The Wikipedia dump files can be downloaded at http://download.wikimedia.org.

  5. This is also true for all the comparability levels of comparable corpora, although we only plot 4 levels in the figure.

  6. As document pairs with low comparability scores are filtered out to reduce the computational complexity, the clustering process usually yields several dendrograms instead of a single, fully connected one.

  7. Since \(\mathcal{R}_1\) and \(\mathcal{R}_2\) are clusters, their respective source and target language parts are comparable by construction.

  8. Because of space constraints, we do not show here that the new measure we introduce is indeed a lower bound of \(M\).

Acknowledgments

This work was supported by the French National Research Agency grant ANR-08-CORD-009.

Corresponding author

Correspondence to Bo Li.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Li, B., Gaussier, E. (2013). Exploiting Comparable Corpora for Lexicon Extraction: Measuring and Improving Corpus Quality. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_7

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science (R0)
