Abstract
Most bilingual lexicon induction (BLI) models learn a mapping function that transfers a word embedding (WE) space from one language to another. This usually relies on the isomorphism hypothesis, which posits that words in different languages share the same structures and relationships (i.e., the WE spaces are geometrically similar). However, this isomorphism weakens substantially for distant language pairs, resulting in low BLI accuracy. To address this problem, we propose a novel BLI method that incorporates synonym knowledge. The main idea is to stabilize the distances between words so as to optimize the monolingual WE space, yielding higher isomorphism. Specifically, we first induce monolingual synonym pairs from WordNet and construct monolingual synonym lexicons. We then generate pseudo-sentences by substituting words in the training corpus with their synonyms. Finally, the original sentences and pseudo-sentences are jointly used to train monolingual WEs, naturally drawing the word vectors of synonyms closer together. Comprehensive experiments on standard BLI datasets for diverse distant language pairs demonstrate that our method significantly outperforms strong BLI systems on word translation.
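The pseudo-sentence augmentation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synonym lexicon here is a tiny hand-built stand-in for the WordNet-induced lexicons the paper constructs, and the function names are invented for this sketch.

```python
import random

# Stand-in for a monolingual synonym lexicon induced from WordNet
# (the paper builds this automatically; these entries are illustrative).
SYNONYMS = {
    "big": ["large"],
    "quick": ["fast", "rapid"],
    "road": ["street"],
}

def make_pseudo_sentence(tokens, lexicon, rng):
    """Replace each token that has synonyms with one of them at random."""
    return [rng.choice(lexicon[tok]) if tok in lexicon else tok
            for tok in tokens]

def augment_corpus(corpus, lexicon, seed=0):
    """Return original sentences interleaved with one pseudo-sentence each.

    The combined corpus would then be fed to a standard monolingual
    word-embedding trainer (e.g. fastText), so that synonyms appear in
    identical contexts and their vectors are pulled closer together.
    """
    rng = random.Random(seed)
    augmented = []
    for sent in corpus:
        tokens = sent.split()
        augmented.append(tokens)
        augmented.append(make_pseudo_sentence(tokens, lexicon, rng))
    return augmented

corpus = ["the quick fox crossed the big road"]
for sent in augment_corpus(corpus, SYNONYMS):
    print(" ".join(sent))
```

Training on the union of original and pseudo-sentences leaves the monolingual objective unchanged; only the corpus is enriched, which is what makes the resulting spaces more isomorphic across languages.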
Availability of data and materials
The datasets used and analyzed in this study are publicly available at https://github.com/facebookresearch/MUSE and https://dumps.wikimedia.org.
Acknowledgements
The authors would like to thank the anonymous reviewers for their helpful comments.
Funding
This work was supported by the National Natural Science Foundation of China under Grant 62376076 and the Key R&D Program of Yunnan under Grant 202203AA080004.
Author information
Contributions
QD: Methodology. HC: Conceptualization. ZF: Investigation. TZ: Supervision.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ding, Q., Cao, H., Feng, Z. et al. Enhancing isomorphism between word embedding spaces for distant languages bilingual lexicon induction. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09837-1