Evaluating cross-lingual textual similarity on dictionary alignment problem


Bilingual or even polylingual word embeddings have created many possibilities for tasks involving multiple languages. Some tasks, such as cross-lingual information retrieval, aim to satisfy users’ multilingual information needs, while others enable transferring valuable information from resource-rich languages to resource-poor ones. In either case, it is important to build and evaluate methods that operate in a cross-lingual setting. In this paper, Wordnet definitions in 7 different languages are used to create a semantic textual similarity testbed for evaluating cross-lingual textual semantic similarity methods. A document alignment task is defined between the Wordnet glosses of synsets in these languages. Unsupervised textual similarity methods (Wasserstein distance, Sinkhorn distance and cosine similarity) are compared with a supervised Siamese deep learning model. The task is modeled both as a retrieval task and as an alignment task to investigate the hubness of the semantic similarity functions. Our findings indicate that considering the problem as a retrieval and alignment problem has a detrimental effect on the results. Furthermore, we show that cross-lingual textual semantic similarity can be used as an automated Wordnet construction method.
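To make the difference between the retrieval and alignment framings concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each gloss has already been reduced to a single vector in a shared cross-lingual space. The toy vectors, the function names, and the brute-force search over permutations (a stand-in for a proper assignment solver) are all illustrative assumptions. In retrieval, each source gloss independently picks its most similar target, so one target can act as a hub and be chosen several times. In alignment, the matching is forced to be one-to-one.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(src, tgt):
    """sim[i][j] = cosine similarity of source gloss i and target gloss j."""
    return [[cosine(s, t) for t in tgt] for s in src]


def retrieve(sim):
    """Retrieval framing: every source picks its best target independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def align(sim):
    """Alignment framing: one-to-one matching with maximal total similarity.

    Exhaustive search over permutations; only feasible for toy sizes.
    """
    n = len(sim)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n)),
    )
    return list(best)


# Hypothetical gloss vectors (not data from the paper).
source_glosses = [(1.0, 0.0), (0.9, 0.3), (0.0, 1.0)]
target_glosses = [(1.0, 0.1), (0.8, 0.6), (0.0, 1.0)]

sim = similarity_matrix(source_glosses, target_glosses)
retrieved = retrieve(sim)  # target 0 is picked by two different sources
aligned = align(sim)       # each target is used exactly once
```

With these toy vectors, retrieval maps the first two source glosses to the same target, while the one-to-one constraint in alignment spreads them over distinct targets. For realistic collection sizes the permutation search would have to be replaced by a polynomial-time assignment algorithm.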

This study was supported in part by The Scientific and Technological Research Council of Turkey (TUBITAK), with award no. 114E776.

Sever, Y., Ercan, G. Evaluating cross-lingual textual similarity on dictionary alignment problem. Lang Resources & Evaluation 54, 1059–1078 (2020).

  • Cross-lingual textual semantic similarity
  • Word embeddings
  • Wasserstein distance
  • Sinkhorn distance
  • Siamese neural network