Cross-Lingual Document Retrieval Using Regularized Wasserstein Distance

  • Georgios Balikas
  • Charlotte Laclau
  • Ievgen Redko
  • Massih-Reza Amini
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10772)

Abstract

Many information retrieval algorithms rely on the notion of a good distance that makes it possible to efficiently compare objects of different natures. Recently, a promising new metric called Word Mover’s Distance was proposed to measure the divergence between text passages. In this paper, we demonstrate that this metric can be extended to incorporate term-weighting schemes and, through entropic regularization, to provide more accurate and computationally efficient matching between documents. We evaluate the benefits of both extensions in the task of cross-lingual document retrieval (CLDR). Our experimental results on eight CLDR problems suggest that the proposed methods achieve remarkable improvements in terms of Mean Reciprocal Rank compared to several baselines.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Georgios Balikas (1), corresponding author
  • Charlotte Laclau (1)
  • Ievgen Redko (2)
  • Massih-Reza Amini (1)

  1. Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
  2. Univ. Lyon, INSA-Lyon, Univ. Claude Bernard Lyon 1, UJM-Saint Etienne, CNRS, Inserm, CREATIS UMR 5220, U1206, F69XXX, Lyon, France
