Exploiting Comparable Corpora for Lexicon Extraction: Measuring and Improving Corpus Quality

Li, Bo; Gaussier, Eric

doi:10.1007/978-3-642-20128-8_7

Abstract

In this chapter, we study the problem of measuring the degree of comparability of bilingual corpora, with applications to bilingual lexicon extraction. We first develop a measure that captures different comparability levels. This measure correlates very well with gold-standard comparability levels and is relatively robust to dictionary coverage. We then propose a well-founded algorithm to improve the quality of existing comparable corpora, in terms of comparability scores, and show that the bilingual lexicons extracted from corpora enhanced in this way are of better quality. All experiments in this chapter are performed on French-English comparable corpora.
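The abstract does not spell out the measure itself. As a rough illustration only, one dictionary-based way to score comparability is the share of dictionary-covered words on each side whose translation occurs on the other side. The sketch below assumes that definition; the function name `comparability`, its parameters, and the toy French-English data are hypothetical and not taken from the chapter.

```python
def comparability(src_vocab, tgt_vocab, src_to_tgt, tgt_to_src):
    """Illustrative dictionary-based comparability score in [0, 1].

    Assumption (not from the chapter): the score is the fraction of
    dictionary-covered words, counted over both sides, that have at
    least one translation present in the other side's vocabulary.
    """
    # Only words the bilingual dictionary knows about are counted.
    src_covered = [w for w in src_vocab if w in src_to_tgt]
    tgt_covered = [w for w in tgt_vocab if w in tgt_to_src]

    total = len(src_covered) + len(tgt_covered)
    if total == 0:
        return 0.0  # no dictionary coverage: no evidence of comparability

    # A word is a "hit" if any of its translations occurs on the other side.
    src_hits = sum(1 for w in src_covered if src_to_tgt[w] & tgt_vocab)
    tgt_hits = sum(1 for w in tgt_covered if tgt_to_src[w] & src_vocab)
    return (src_hits + tgt_hits) / total


# Toy vocabularies and dictionary entries (hypothetical data).
fr_vocab = {"chat", "chien", "maison", "voiture"}
en_vocab = {"cat", "house", "tree"}
fr_en = {"chat": {"cat"}, "chien": {"dog"}, "maison": {"house", "home"}}
en_fr = {"cat": {"chat"}, "house": {"maison"}, "tree": {"arbre"}}

score = comparability(fr_vocab, en_vocab, fr_en, en_fr)
print(f"comparability = {score:.3f}")
```

Normalising by the number of dictionary-covered words, rather than by the full vocabulary size, is one simple way such a score could stay fairly stable when the dictionary covers only part of the corpus.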

Notes

1. http://www.statmt.org/europarl/
2. http://trec.nist.gov/
3. http://www.clef-campaign.org/
4. The Wikipedia dump files can be downloaded at http://download.wikimedia.org.
5. This also holds for all the comparability levels of comparable corpora, although only four levels are plotted in the figure.
6. Because document pairs with low comparability scores are filtered out to reduce computational complexity, the clustering process usually yields several dendrograms rather than a single fully connected one.
7. As \(\mathcal{R}_1\) and \(\mathcal{R}_2\) are clusters, their respective source and target language parts are comparable by construction.
8. Due to space constraints, we do not show here that the new measure we introduce is indeed a lower bound of \(M\).

Acknowledgments

This work was supported by the French National Research Agency grant ANR-08-CORD-009.

Author information

Authors and Affiliations

Laboratoire d’Informatique de Grenoble (LIG), Université Joseph Fourier Grenoble 1 / CNRS, Grenoble, France
Bo Li & Eric Gaussier

Corresponding author

Correspondence to Bo Li.

Editor information

Editors and Affiliations

Centre for Translation Studies, University of Leeds, Leeds, United Kingdom
Serge Sharoff
University of Mainz, Mainz, Germany
Reinhard Rapp
Université de Paris-Sud LIMSI-CNRS, Orsay, France
Pierre Zweigenbaum
Electronic & Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Pascale Fung

About this chapter

Cite this chapter

Li, B., Gaussier, E. (2013). Exploiting Comparable Corpora for Lexicon Extraction: Measuring and Improving Corpus Quality. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_7

DOI: https://doi.org/10.1007/978-3-642-20128-8_7
Published: 14 December 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)