Abstract
Most bilingual lexicon induction (BLI) models learn a mapping function that transfers a word embedding (WE) space from one language to another. This usually relies on the isomorphism hypothesis, which posits that words in different languages share the same structures and relationships (i.e., the WE spaces are geometrically similar). However, this isomorphism weakens substantially for distant language pairs, resulting in low BLI accuracy. To address this problem, we propose a novel BLI method that incorporates synonym knowledge. The main idea is to stabilize the distances between words so as to optimize the monolingual WE space, yielding higher isomorphism. Specifically, we first induce monolingual synonym pairs from WordNet and construct monolingual synonym lexicons. We then generate pseudo-sentences by substituting words in the training corpus with their synonyms. Finally, the original sentences and pseudo-sentences are jointly used to train monolingual WEs, naturally drawing the word vectors of synonyms closer together. Comprehensive experiments on standard BLI datasets for diverse distant language pairs demonstrate that our method significantly outperforms strong BLI systems on word translation.
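The pseudo-sentence augmentation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synonym lexicon here is a tiny hand-built stand-in for the WordNet-induced lexicons the paper constructs, and the function names are invented for this sketch.

```python
import random

# Stand-in for a monolingual synonym lexicon induced from WordNet
# (the paper builds this automatically; these entries are illustrative).
SYNONYMS = {
    "big": ["large"],
    "quick": ["fast", "rapid"],
    "road": ["street"],
}

def make_pseudo_sentence(tokens, lexicon, rng):
    """Replace each token that has synonyms with one of them at random."""
    return [rng.choice(lexicon[tok]) if tok in lexicon else tok
            for tok in tokens]

def augment_corpus(corpus, lexicon, seed=0):
    """Return original sentences interleaved with one pseudo-sentence each.

    The combined corpus would then be fed to a standard monolingual
    word-embedding trainer (e.g. fastText), so that synonyms appear in
    identical contexts and their vectors are pulled closer together.
    """
    rng = random.Random(seed)
    augmented = []
    for sent in corpus:
        tokens = sent.split()
        augmented.append(tokens)
        augmented.append(make_pseudo_sentence(tokens, lexicon, rng))
    return augmented

corpus = ["the quick fox crossed the big road"]
for sent in augment_corpus(corpus, SYNONYMS):
    print(" ".join(sent))
```

Training on the union of original and pseudo-sentences leaves the monolingual objective unchanged; only the corpus is enriched, which is what makes the resulting spaces more isomorphic across languages.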
Availability of data and materials
The datasets used and analyzed in this study are publicly available at https://github.com/facebookresearch/MUSE and https://dumps.wikimedia.org.
Acknowledgements
The authors would like to thank the anonymous reviewers for their helpful comments.
Funding
This work was supported by the National Natural Science Foundation of China under Grant 62376076 and the Key R&D Program of Yunnan under Grant 202203AA080004.
Author information
Contributions
QD: Methodology. HC: Conceptualization. ZF: Investigation. TZ: Supervision.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ding, Q., Cao, H., Feng, Z. et al. Enhancing isomorphism between word embedding spaces for distant languages bilingual lexicon induction. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09837-1