Enhancing isomorphism between word embedding spaces for distant languages bilingual lexicon induction

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Most bilingual lexicon induction (BLI) models learn a mapping function that transfers word embedding (WE) spaces from one language to another. This usually relies on the isomorphism hypothesis, which posits that words in different languages share the same structures and relationships (i.e., their WE spaces have similar geometric structure). However, the isomorphism of WE spaces weakens substantially for distant language pairs, resulting in low BLI accuracy. To address this problem, we propose a novel BLI method that incorporates synonym knowledge. The main idea is to stabilize the distances between words so as to optimize the monolingual WE space, yielding higher isomorphism. Specifically, we first induce monolingual synonym pairs from WordNet and construct monolingual synonym lexicons. We then generate pseudo-sentences by substituting words in the training corpus with their synonyms. Finally, the original sentences and pseudo-sentences are jointly used to train monolingual WEs, naturally drawing the word vectors of synonyms closer together. Comprehensive experiments on standard BLI datasets for diverse distant language pairs demonstrate that our method significantly outperforms strong BLI systems on word translation.
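
To make the synonym-substitution idea concrete, below is a minimal sketch of the pipeline outlined in the abstract: build a monolingual synonym lexicon from WordNet, create pseudo-sentences by swapping words for synonyms, and train embeddings on the original and pseudo-sentences together. It assumes NLTK's WordNet interface and gensim's Word2Vec as convenient stand-ins for the paper's actual resources and embedding model; the function names (synonym_lexicon, pseudo_sentence, train_embeddings) and the substitution probability p are illustrative, not taken from the paper.

```python
# Sketch of the synonym-substitution idea, not the paper's actual code.
# Requires: pip install nltk gensim ; python -c "import nltk; nltk.download('wordnet')"
import random

from nltk.corpus import wordnet as wn
from gensim.models import Word2Vec


def synonym_lexicon(vocab):
    """Map each word to its single-token WordNet synonyms that also occur in the vocabulary."""
    lexicon = {}
    for word in vocab:
        synonyms = {
            lemma.replace("_", " ")
            for synset in wn.synsets(word)
            for lemma in synset.lemma_names()
        }
        synonyms = {s for s in synonyms if s != word and " " not in s and s in vocab}
        if synonyms:
            lexicon[word] = sorted(synonyms)
    return lexicon


def pseudo_sentence(tokens, lexicon, p=0.3, rng=random):
    """Replace each token with a randomly chosen synonym with probability p."""
    return [
        rng.choice(lexicon[tok]) if tok in lexicon and rng.random() < p else tok
        for tok in tokens
    ]


def train_embeddings(corpus, p=0.3, seed=0):
    """Train monolingual embeddings jointly on original and synonym-substituted sentences."""
    rng = random.Random(seed)
    vocab = {tok for sent in corpus for tok in sent}
    lexicon = synonym_lexicon(vocab)
    pseudo = [pseudo_sentence(sent, lexicon, p, rng) for sent in corpus]
    return Word2Vec(corpus + pseudo, vector_size=100, window=5, min_count=1, sg=1)


if __name__ == "__main__":
    toy_corpus = [
        ["the", "movie", "was", "good"],
        ["the", "film", "was", "excellent"],
    ]
    model = train_embeddings(toy_corpus)
    print(model.wv.most_similar("movie", topn=3))
```

Because each pseudo-sentence keeps the original word order, a word and its synonyms end up sharing co-occurrence contexts during training, which is what pulls their vectors closer together in the monolingual space.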

Availability of data and materials

The datasets used and analyzed in this study are publicly available at https://github.com/facebookresearch/MUSE and https://dumps.wikimedia.org.


Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62376076 and the Key R&D Program of Yunnan under Grant 202203AA080004.

Author information

Contributions

QD: Methodology. HC: Conceptualization. ZF: Investigation. TZ: Supervision.

Corresponding author

Correspondence to Hailong Cao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ding, Q., Cao, H., Feng, Z. et al. Enhancing isomorphism between word embedding spaces for distant languages bilingual lexicon induction. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09837-1
