QAlign: A New Method for Bilingual Lexicon Extraction from Comparable Corpora

Hazem, Amir; Morin, Emmanuel

doi:10.1007/978-3-642-28601-8_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7182)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

In this paper, we present a new way of looking at the problem of bilingual lexicon extraction from comparable corpora, inspired mainly by the information retrieval (IR) domain and, more specifically, by question-answering systems (QAS). By analogy with QAS, we treat a word to be translated as part of a question extracted from the source language, and we try to find its correct translation on the assumption that it is contained in the correct answer to that question, extracted from the target language. The methods traditionally used for bilingual lexicon extraction from comparable corpora tend to represent all the contexts of a word in a single vector and thus give a general representation of those contexts. We believe that a local representation of the contexts of a word, given by a window that corresponds to the query, is more appropriate, because it gives more weight to local information that could be swamped by the volume of data when everything is merged into a single context vector. We show that the empirical results obtained are competitive with the standard approach traditionally dedicated to this task.
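
The contrast between a single global context vector and per-occurrence windows can be illustrated with a minimal Python sketch. This is not the QAlign algorithm itself, whose details are not given in the abstract; the function names, the window size, and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter


def context_windows(tokens, target, size=2):
    """Return one bag of context words per occurrence of `target`.

    Each bag plays the role of a local "query" window (illustrative only).
    """
    windows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - size):i]
            right = tokens[i + 1:i + 1 + size]
            windows.append(Counter(left + right))
    return windows


def global_context_vector(tokens, target, size=2):
    """Merge every window into one vector, as a single whole-context vector would."""
    total = Counter()
    for window in context_windows(tokens, target, size):
        total.update(window)
    return total


# Hypothetical toy corpus: "bank" occurs in two different local contexts.
tokens = "the bank approved the loan while the river bank flooded the road".split()

local = context_windows(tokens, "bank", size=1)
merged = global_context_vector(tokens, "bank", size=1)

for window in local:
    print(dict(window))
print(dict(merged))
```

In the local view the two occurrences of "bank" remain separate windows, whereas the merged vector mixes their neighbours together, which is the kind of dilution of local information the abstract describes.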

Author information

Authors and Affiliations

Laboratoire d’Informatique de Nantes-Atlantique (LINA), Université de Nantes, 44322, Nantes Cedex 3, France
Amir Hazem & Emmanuel Morin

Editor information

Editors and Affiliations

Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Hazem, A., Morin, E. (2012). QAlign: A New Method for Bilingual Lexicon Extraction from Comparable Corpora. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28601-8_8

DOI: https://doi.org/10.1007/978-3-642-28601-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28600-1
Online ISBN: 978-3-642-28601-8
eBook Packages: Computer Science, Computer Science (R0)
