Evaluating cross-lingual textual similarity on dictionary alignment problem

Sever, Yiğit; Ercan, Gönenç

doi:10.1007/s10579-020-09498-1

Original Paper
Published: 29 June 2020

Volume 54, pages 1059–1078, (2020)

Language Resources and Evaluation

Abstract

Bilingual and even polylingual word embeddings have created many possibilities for tasks involving multiple languages. Some tasks, such as cross-lingual information retrieval, aim to satisfy users’ multilingual information needs, while others enable the transfer of valuable information from resource-rich languages to resource-poor ones. In either case, it is important to build and evaluate methods that operate in a cross-lingual setting. In this paper, Wordnet definitions in seven languages are used to create a testbed for evaluating cross-lingual textual semantic similarity methods. A document alignment task is defined over the Wordnet glosses of synsets in these seven languages. Unsupervised textual similarity methods (Wasserstein distance, Sinkhorn distance, and cosine similarity) are compared with a supervised Siamese deep learning model. The task is modeled both as a retrieval task and as an alignment task in order to investigate the hubness of the semantic similarity functions. Our findings indicate that considering the problem as a retrieval and alignment problem has a detrimental effect on the results. Furthermore, we show that cross-lingual textual semantic similarity can be used as an automated Wordnet construction method.
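
The difference between the retrieval and alignment formulations mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors are hypothetical stand-ins for gloss embeddings, the function names are invented for illustration, and the one-to-one alignment is found by brute force over permutations, which is only feasible for toy sizes. Under retrieval, each source gloss independently picks its most similar target, so one target can act as a hub and be chosen more than once; under alignment, every target is used exactly once.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(source, target):
    """sim[i][j] is the cosine similarity of source gloss i and target gloss j."""
    return [[cosine(s, t) for t in target] for s in source]


def retrieve(sim):
    """Retrieval: each source gloss independently picks its best target."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def align(sim):
    """Alignment: the one-to-one matching with the highest total similarity.

    Brute force over all permutations; an assumption made for brevity,
    suitable only for very small matrices.
    """
    n = len(sim)
    best = max(permutations(range(n)),
               key=lambda p: sum(sim[i][p[i]] for i in range(n)))
    return list(best)


# Hypothetical gloss embeddings; the intended (gold) pairing is i -> i.
source = [[1.0, 0.0], [3.0, 4.0], [1.0, 3.0]]
target = [[10.0, 1.0], [1.0, 2.0], [0.0, 1.0]]

sim = similarity_matrix(source, target)
print(retrieve(sim))  # [0, 1, 1]: target 1 is a hub, chosen by two sources
print(align(sim))     # [0, 1, 2]: the one-to-one constraint recovers the gold pairing
```

In this toy case the hub target attracts two source glosses under retrieval, while the one-to-one constraint resolves the conflict. For realistic sizes, a polynomial-time assignment solver would replace the brute-force search.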

Acknowledgements

This study was supported in part by The Scientific and Technological Research Council of Turkey (TUBITAK), with award no. 114E776.

Author information

Authors and Affiliations

Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
Yiğit Sever
Institute of Informatics, Hacettepe University, Ankara, Turkey
Gönenç Ercan

Corresponding author

Correspondence to Gönenç Ercan.

About this article

Cite this article

Sever, Y., Ercan, G. Evaluating cross-lingual textual similarity on dictionary alignment problem. Lang Resources & Evaluation 54, 1059–1078 (2020). https://doi.org/10.1007/s10579-020-09498-1

Issue Date: December 2020