Bilingual Lexicon Extraction from Comparable Corpora Based on Closed Concepts Mining

Chebel, Mohamed; Latiri, Chiraz; Gaussier, Eric

doi:10.1007/978-3-319-57454-7_46


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10234)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining


Abstract

In this paper, we propose to complement the context vectors used in bilingual lexicon extraction from comparable corpora with concept vectors, which aim to capture all the words related to the concepts associated with a given word. This yields a representation that is less sparse, especially in specialized domains where the use of a general bilingual lexicon leaves many words untranslated. The concept vectors we consider are based on closed concepts mining, as developed in Formal Concept Analysis (FCA). Results obtained on two different comparable corpora show that enriching context vectors with concept vectors leads to lexicons of higher quality, especially in specialized domains.
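The abstract does not spell out how closed concepts or concept vectors are computed, so the following is only a minimal sketch of the general FCA idea, not the authors' method. It assumes a toy context in which documents are objects and words are attributes; the names `extent`, `intent`, `closure`, `closed_concepts`, `concept_vector` and the `min_support` parameter are hypothetical, and the brute-force enumeration is only suitable for tiny examples (dedicated algorithms such as CHARM scale far better).

```python
from itertools import combinations

def extent(docs, words):
    """Indices of the documents that contain every word in `words`."""
    return frozenset(i for i, d in enumerate(docs) if words <= d)

def intent(docs, doc_ids):
    """Words shared by all documents in `doc_ids` (empty if no documents)."""
    sets = [docs[i] for i in doc_ids]
    return frozenset.intersection(*sets) if sets else frozenset()

def closure(docs, words):
    """FCA closure: the largest word set shared by all documents containing `words`."""
    return intent(docs, extent(docs, words))

def closed_concepts(docs, min_support=2):
    """Brute-force enumeration: every closed word set is the intersection of
    some group of documents. Returns {closed word set: extent}."""
    concepts = {}
    for r in range(1, len(docs) + 1):
        for ids in combinations(range(len(docs)), r):
            words = intent(docs, ids)
            if not words:
                continue
            ext = extent(docs, words)
            if len(ext) >= min_support:
                concepts[words] = ext
    return concepts

def concept_vector(word, concepts):
    """Assumed weighting (illustrative only): count how many closed concepts
    a word shares with `word`."""
    vec = {}
    for words in concepts:
        if word in words:
            for other in words - {word}:
                vec[other] = vec.get(other, 0) + 1
    return vec

# Toy monolingual "corpus": each document is a set of words.
docs = [
    frozenset({"tumor", "cell", "therapy"}),
    frozenset({"tumor", "cell", "growth"}),
    frozenset({"cell", "growth"}),
    frozenset({"tumor", "therapy"}),
]
concepts = closed_concepts(docs, min_support=2)
tumor_vector = concept_vector("tumor", concepts)
```

In this toy context, "growth" only ever appears together with "cell", so the closure of {"growth"} is {"cell", "growth"}. A vector built this way could then be combined with a window-based context vector; how the two are actually merged and weighted in the paper is not stated in the abstract.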


Notes

1. A parallel corpus is a collection of texts that are translations of one another.
2. A comparable corpus is a collection of multilingual documents dealing with the same topics and generally produced at the same time. The documents are not necessarily translations of each other.
3. https://code.google.com/p/word2vec/.
4. One can also translate each element of the source context vectors into the target language.
5. In this paper, we denote by |X| the cardinality of the set X.
6. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
7. http://www.clef-campaign.org/.
8. www.elsevier.com.


Author information

Authors and Affiliations

Research Laboratory LIPAH, Faculty of Sciences of Tunis, University Tunis El Manar, Tunis, Tunisia
Mohamed Chebel & Chiraz Latiri
Research Laboratory LIG, AMA Group, University Joseph Fourier, Grenoble I, France
Eric Gaussier


Corresponding author

Correspondence to Mohamed Chebel.

Editor information

Editors and Affiliations

Kangwon National University, Chuncheon, Korea (Republic of)
Jinho Kim
Seoul National University, Seoul, Korea (Republic of)
Kyuseok Shim
University of Technology Sydney, Sydney, New South Wales, Australia
Longbing Cao
KAIST, Daejeon, Korea (Republic of)
Jae-Gil Lee
University of New South Wales, Sydney, New South Wales, Australia
Xuemin Lin
Kangwon National University, Chuncheon, Korea (Republic of)
Yang-Sae Moon


About this paper

Cite this paper

Chebel, M., Latiri, C., Gaussier, E. (2017). Bilingual Lexicon Extraction from Comparable Corpora Based on Closed Concepts Mining. In: Kim, J., Shim, K., Cao, L., Lee, JG., Lin, X., Moon, YS. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science (LNAI), vol 10234. Springer, Cham. https://doi.org/10.1007/978-3-319-57454-7_46


DOI: https://doi.org/10.1007/978-3-319-57454-7_46
Published: 23 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57453-0
Online ISBN: 978-3-319-57454-7
eBook Packages: Computer Science, Computer Science (R0)
