Cross-lingual document similarity estimation and dictionary generation with comparable corpora

  • Tadej Štajner
  • Dunja Mladenić
Short Paper


This paper proposes an approach to bilingual dictionary generation that works even when trained on widely available comparable bilingual corpora. We also show that it provides cross-lingual similarity estimates that correlate well with human judgments. The approach uses a nonlinear bilingual translation model trained on comparable corpora: we combine word embeddings with kernel approximation to train scalable nonlinear transformations. We demonstrate that this novel method works better on a majority of the evaluated language pairs.
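
As a rough illustration of the idea in the abstract, the sketch below maps source-language word vectors through an approximate kernel feature map and a linear regression into the target-language space, then reads off dictionary entries by nearest-neighbour search. It is not the authors' implementation. The choice of random Fourier features for the kernel approximation, the closed-form ridge solver, the synthetic data, the averaging of word vectors for document similarity, and all names and parameter values (`make_rff`, `fit_ridge`, `gamma`, `reg`) are assumptions made for this example.

```python
import numpy as np

def make_rff(dim_in, n_features, gamma, rng):
    """Random Fourier feature map approximating exp(-gamma * ||x - y||^2)."""
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(dim_in, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return lambda X: np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def fit_ridge(Phi, Y, reg):
    """Closed-form ridge regression from features Phi to target vectors Y."""
    gram = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
    return np.linalg.solve(gram, Phi.T @ Y)

def cosine_matrix(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

# Synthetic stand-in for real embeddings (hypothetical data): 500 "word
# pairs" in 5 dimensions, related by a smooth nonlinear map plus noise.
rng = np.random.default_rng(0)
src = rng.normal(size=(500, 5))
tgt = np.tanh(src @ rng.normal(size=(5, 5))) + 0.01 * rng.normal(size=(500, 5))

# Seed dictionary: the first 400 pairs; the remaining 100 are held out.
phi = make_rff(dim_in=5, n_features=300, gamma=0.1, rng=rng)
A = fit_ridge(phi(src[:400]), tgt[:400], reg=1e-2)

# Dictionary generation: project held-out source words into the target
# space and pick the most similar target word among all 500 candidates.
projected = phi(src[400:]) @ A
pred = cosine_matrix(projected, tgt).argmax(axis=1)
accuracy = float(np.mean(pred == np.arange(400, 500)))

# Document similarity (assumed scheme): compare mean word vectors of a
# projected source "document" and a target "document".
doc_src = projected[:10].mean(axis=0, keepdims=True)
doc_tgt = tgt[400:410].mean(axis=0, keepdims=True)
doc_similarity = float(cosine_matrix(doc_src, doc_tgt)[0, 0])

print(f"held-out precision@1: {accuracy:.2f}, doc similarity: {doc_similarity:.2f}")
```

Because the nonlinearity lives in a fixed random feature map, training reduces to a linear problem whose cost grows with the number of features rather than with the square of the number of training pairs, which is what makes this kind of transformation scalable. With real data, `src` and `tgt` would be replaced by embeddings of word pairs taken from a seed lexicon.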


Cross-lingual text analysis; Vector space machine translation; Representation learning; Comparable corpora; Similarity learning; Dictionary generation



This work was supported by the Slovenian Research Agency and the IST Programme of the EC under XLike (ICT-STREP-288342), LT-Web (ICT-287815-CSA) and RENDER (ICT-257790-STREP).



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  1. Jožef Stefan Institute, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
