Bilingual Lexicon Extraction from Comparable Corpora Based on Closed Concepts Mining

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2017)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10234))

Abstract

In this paper, we propose to complement the context vectors used in bilingual lexicon extraction from comparable corpora with concept vectors, which aim to capture all the words related to the concepts associated with a given word. This yields a less sparse representation, especially in specialized domains where a general bilingual lexicon leaves many words untranslated. The concept vectors we consider are based on closed concepts mining, as developed in Formal Concept Analysis (FCA). Results on two different comparable corpora show that enriching context vectors with concept vectors leads to lexicons of higher quality, especially in specialized domains.
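To make the idea of a closed concept concrete, here is a minimal Python sketch on a toy document-word context. It is not the authors' implementation: the documents, the brute-force enumeration, and the way `concept_vector` counts co-occurrences inside closed concepts are all assumptions made for illustration. A concept is treated here as a pair (set of documents, set of words) in which the words are exactly those shared by all the documents, and the documents are exactly those containing all the words.

```python
from collections import Counter
from itertools import combinations

# Toy formal context (invented for illustration): document -> words it contains.
DOCS = {
    "d1": {"gene", "protein", "cell"},
    "d2": {"gene", "protein"},
    "d3": {"protein", "cell", "membrane"},
    "d4": {"gene", "cell"},
}
VOCAB = frozenset().union(*DOCS.values())


def extent(words):
    """Documents that contain every word in `words`."""
    return frozenset(d for d, ws in DOCS.items() if words <= ws)


def intent(docs):
    """Words shared by every document in `docs` (whole vocabulary if empty)."""
    sets = [DOCS[d] for d in docs]
    return frozenset(set.intersection(*sets)) if sets else VOCAB


def closed_concepts(min_support=1):
    """Enumerate closed concepts by brute force (fine only for tiny contexts)."""
    concepts = set()
    for size in range(1, len(VOCAB) + 1):
        for combo in combinations(sorted(VOCAB), size):
            ext = extent(frozenset(combo))
            if len(ext) >= min_support:
                # Closing the word set: intent(extent(words)) is always closed.
                concepts.add((ext, intent(ext)))
    return concepts


def concept_vector(word, concepts):
    """Hypothetical concept vector: for each other word, the number of
    closed concepts whose word set contains both it and `word`."""
    counts = Counter()
    for _, words in concepts:
        if word in words:
            counts.update(w for w in words if w != word)
    return dict(counts)


concepts = closed_concepts()
print(concept_vector("membrane", concepts))
```

Even though "membrane" occurs in a single document, its concept vector links it to the other words of the closed concept it belongs to, which is the kind of densification the abstract describes. Dedicated closed-itemset miners avoid the exponential enumeration used here.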

Notes

  1. A parallel corpus is a collection of texts that are translations of one another.

  2. A comparable corpus is a collection of multilingual documents dealing with the same topics and generally produced at the same time. They are not necessarily translations of each other.

  3. https://code.google.com/p/word2vec/.

  4. One can also translate each element of the source context vectors into the target language.

  5. In this paper, we denote by |X| the cardinality of the set X.

  6. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.

  7. http://www.clef-campaign.org/.

  8. www.elsevier.com.
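Note 4 mentions translating each element of a source context vector into the target language. The sketch below shows one common way this could be done, assuming a small seed dictionary and cosine similarity for comparison; the dictionary entries, the weights, and the choice to split a weight evenly across multiple translations are illustrative assumptions, not details taken from the paper.

```python
import math

# Hypothetical seed dictionary (French -> English), invented for illustration.
SEED_DICT = {
    "cellule": ["cell"],
    "protéine": ["protein"],
    "gène": ["gene"],
}


def translate_vector(vec, dictionary):
    """Project a source-language context vector into the target language.

    Words missing from the dictionary are dropped, which is the source of
    sparsity when a general lexicon is used on a specialized domain.
    """
    out = {}
    for word, weight in vec.items():
        translations = dictionary.get(word, [])
        for t in translations:
            # Assumption: split the weight evenly among candidate translations.
            out[t] = out.get(t, 0.0) + weight / len(translations)
    return out


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# "inconnu" has no dictionary entry, so its weight is lost in translation.
source_vec = {"cellule": 2.0, "protéine": 1.0, "inconnu": 5.0}
target_vec = {"cell": 2.0, "protein": 1.0}
translated = translate_vector(source_vec, SEED_DICT)
print(translated, cosine(translated, target_vec))
```

The dropped entry for "inconnu" shows how untranslated words thin out the projected vector.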

Corresponding author

Correspondence to Mohamed Chebel.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Chebel, M., Latiri, C., Gaussier, E. (2017). Bilingual Lexicon Extraction from Comparable Corpora Based on Closed Concepts Mining. In: Kim, J., Shim, K., Cao, L., Lee, JG., Lin, X., Moon, YS. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science, vol 10234. Springer, Cham. https://doi.org/10.1007/978-3-319-57454-7_46

  • DOI: https://doi.org/10.1007/978-3-319-57454-7_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-57453-0

  • Online ISBN: 978-3-319-57454-7

  • eBook Packages: Computer Science, Computer Science (R0)
