Evaluating cross-lingual textual similarity on dictionary alignment problem

  • Original Paper
  • Language Resources and Evaluation

Abstract

Bilingual and even polylingual word embeddings have opened up many possibilities for tasks involving multiple languages. Some tasks, such as cross-lingual information retrieval, aim to satisfy users’ multilingual information needs, while others transfer valuable information from resource-rich languages to resource-poor ones. In either case, it is important to build and evaluate methods that operate in a cross-lingual setting. In this paper, Wordnet definitions in 7 different languages are used to create a semantic textual similarity testbed for evaluating cross-lingual textual semantic similarity methods. A document alignment task is defined over the Wordnet glosses of synsets in these 7 languages. Unsupervised textual similarity methods (Wasserstein distance, Sinkhorn distance and cosine similarity) are compared with a supervised Siamese deep learning model. The task is modeled both as a retrieval task and as an alignment task to investigate the hubness of the semantic similarity functions. Our findings indicate that considering the problem as a retrieval and alignment problem has a detrimental effect on the results. Furthermore, we show that cross-lingual textual semantic similarity can be used as an automated Wordnet construction method.
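The difference between the retrieval and alignment formulations can be illustrated with a minimal sketch. It assumes each gloss has already been reduced to a single vector in a shared cross-lingual space. The toy vectors, the function names, and the brute-force search over permutations are illustrative assumptions, not the authors' implementation; at realistic scale the one-to-one step would use a linear assignment solver such as the lapjv package listed in the Notes.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(src, tgt):
    """sim[i][j] = cosine similarity of source gloss i and target gloss j."""
    return [[cosine(s, t) for t in tgt] for s in src]


def retrieve(sim):
    """Retrieval: every source gloss independently picks its best target.

    Several sources may pick the same target (a "hub").
    """
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def align(sim):
    """Alignment: one-to-one matching that maximises total similarity.

    Brute force over permutations, so only usable for toy sizes
    (hypothetical stand-in for a real linear assignment solver).
    """
    n = len(sim)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n)),
    )
    return list(best)


# Hypothetical 2-d gloss vectors; target 0 is constructed to act as a hub.
source_glosses = [[1.0, 0.0], [1.0, 0.5], [0.0, 1.0]]
target_glosses = [[1.0, 0.2], [1.0, 1.0], [0.0, 1.0]]

sim = similarity_matrix(source_glosses, target_glosses)
retrieved = retrieve(sim)  # [0, 0, 2]: sources 0 and 1 both pick target 0
aligned = align(sim)       # [0, 1, 2]: each target is used exactly once

print("retrieval:", retrieved)
print("alignment:", aligned)
```

In this toy case retrieval maps two source glosses onto the same target, while the alignment constraint forces the second source onto its next-best candidate. This is the kind of hubness effect the two task formulations are meant to expose.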

Notes

  1. http://compling.hss.ntu.edu.sg/omw/.

  2. https://yigitsever.github.io/Evaluating-Dictionary-Alignment/.

  3. https://fasttext.cc/.

  4. https://github.com/artetxem/vecmap.

  5. http://opus.nlpl.eu/index.php.

  6. https://github.com/balikasg/WassersteinRetrieval.

  7. https://github.com/src-d/lapjv.

Acknowledgements

This study was supported in part by The Scientific and Technological Research Council of Turkey (TUBITAK), with award no. 114E776.

Author information

Corresponding author

Correspondence to Gönenç Ercan.

About this article

Cite this article

Sever, Y., Ercan, G. Evaluating cross-lingual textual similarity on dictionary alignment problem. Lang Resources & Evaluation 54, 1059–1078 (2020). https://doi.org/10.1007/s10579-020-09498-1

  • DOI: https://doi.org/10.1007/s10579-020-09498-1
