Abstract
Bilingual or even polylingual word embeddings have created many possibilities for tasks involving multiple languages. Some tasks, such as cross-lingual information retrieval, aim to satisfy users' multilingual information needs, while others enable transferring valuable information from resource-rich languages to resource-poor ones. In either case, it is important to build and evaluate methods that operate in a cross-lingual setting. In this paper, Wordnet definitions in 7 different languages are used to create a semantic textual similarity testbed for evaluating cross-lingual textual semantic similarity methods. A document alignment task is defined between the Wordnet glosses of synsets in these 7 languages. Unsupervised textual similarity methods (Wasserstein distance, Sinkhorn distance and cosine similarity) are compared with a supervised Siamese deep learning model. The task is modeled both as a retrieval task and as an alignment task to investigate the hubness of the semantic similarity functions. Our findings indicate that considering the problem as a retrieval and alignment problem has a detrimental effect on the results. Furthermore, we show that cross-lingual textual semantic similarity can be used as an automated Wordnet construction method.
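To make the difference between the retrieval and alignment formulations concrete, the following is a minimal sketch under stated assumptions. The similarity scores are invented for illustration and are not taken from the paper, the function names `retrieve` and `align` are hypothetical, and the brute-force matcher stands in for whatever assignment procedure is actually used. In the toy matrix, target gloss 0 behaves as a hub: it is the nearest neighbour of every source gloss.

```python
from itertools import permutations


def retrieve(sim):
    """Retrieval view: each source gloss independently picks its most similar target."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def align(sim):
    """Alignment view: one-to-one matching that maximizes total similarity.

    Brute force over all permutations, so only usable for toy sizes.
    """
    n = len(sim)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n)),
    )
    return list(best)


# Hypothetical similarity scores: rows are source glosses, columns are target glosses.
sim = [
    [0.90, 0.20, 0.10],
    [0.80, 0.70, 0.30],
    [0.75, 0.40, 0.60],
]

retrieved = retrieve(sim)  # every source selects the hub target 0
aligned = align(sim)       # the one-to-one constraint spreads sources over targets

print(retrieved)
print(aligned)
```

For realistic dictionary sizes the permutation search is infeasible; a polynomial-time assignment solver such as SciPy's `linear_sum_assignment` would be a natural replacement, though that choice is an assumption and not something stated in the abstract.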
Acknowledgements
This study was supported in part by The Scientific and Technological Research Council of Turkey (TUBITAK), with award no. 114E776.
Cite this article
Sever, Y., Ercan, G. Evaluating cross-lingual textual similarity on dictionary alignment problem. Lang Resources & Evaluation 54, 1059–1078 (2020). https://doi.org/10.1007/s10579-020-09498-1
DOI: https://doi.org/10.1007/s10579-020-09498-1