Abstract
In this chapter, we study the problem of measuring the degree of comparability of bilingual corpora, with applications to bilingual lexicon extraction. We first develop a measure that captures different comparability levels. This measure correlates very well with gold-standard comparability levels and is relatively robust to dictionary coverage. We then propose a well-founded algorithm to improve the quality of existing comparable corpora, in terms of comparability scores, and show that the bilingual lexicons extracted from corpora enhanced in this way are of better quality. All experiments in this chapter are performed on French-English comparable corpora.
Notes
- 1.
- 2.
- 3.
- 4. The Wikipedia dump files can be downloaded at http://download.wikimedia.org.
- 5. This also holds for all comparability levels of the comparable corpora, although we plot only four levels in the figure.
- 6. As document pairs with low comparability scores are filtered out to reduce the computational complexity, the clustering process usually yields several dendrograms instead of a single fully connected one.
- 7. As \(\mathcal {R}_1\) and \(\mathcal {R}_2\) are clusters, their respective source and target language parts are comparable by construction.
- 8. Due to space constraints, we do not show here that the new measure we introduce is indeed a lower bound of \(M\).
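The effect described in note 6 can be illustrated with a minimal sketch. It is not the chapter's implementation: the function name `components_after_filtering`, the score dictionary and the threshold value are all hypothetical, chosen only to show why dropping low-scoring document pairs tends to split the documents into several disconnected groups, each of which would then produce its own dendrogram.

```python
from collections import defaultdict


def components_after_filtering(scores, threshold):
    """Group documents into connected components after dropping weak pairs.

    scores: dict mapping (doc_a, doc_b) -> comparability score
            (hypothetical values, not taken from the chapter).
    threshold: pairs scoring below this value are discarded
               (hypothetical cut-off).
    Returns a sorted list of sorted document-id lists, one per component.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)  # registers both documents
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b  # keep the link: merge the two groups

    groups = defaultdict(set)
    for doc in parent:
        groups[find(doc)].add(doc)
    return sorted(sorted(group) for group in groups.values())


# Toy pairwise scores (assumed for illustration only).
toy_scores = {
    ("d1", "d2"): 0.9,
    ("d2", "d3"): 0.8,
    ("d3", "d4"): 0.1,  # filtered out: the only link between the two groups
    ("d4", "d5"): 0.7,
}
print(components_after_filtering(toy_scores, threshold=0.5))
```

With the toy scores above, the single weak pair is the only link between the two groups, so removing it leaves two components. An agglomerative clustering restricted to the retained pairs could never merge them, which is why several dendrograms are obtained.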
Acknowledgments
This work was supported by the French National Research Agency grant ANR-08-CORD-009.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Li, B., Gaussier, E. (2013). Exploiting Comparable Corpora for Lexicon Extraction: Measuring and Improving Corpus Quality. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_7
DOI: https://doi.org/10.1007/978-3-642-20128-8_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)