Abstract

When we start working across languages, we need to determine ways of measuring the similarity of entities (such as a word, a phrase, a sentence or a link between entities) within each language and across languages. Modern approaches usually rely on the Vector Space Model (VSM), which represents any entity by a numerical vector \(x\) of a specified dimensionality \(D\). Two entities are considered similar when their vectors are close to each other in this space.
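
To make this concrete, the sketch below is a minimal illustrative example, not taken from the chapter: it builds toy vectors of dimensionality \(D\) for a few invented words and compares them with cosine similarity, one common way of measuring how close two vectors are. The words, the vector values and the choice of cosine similarity are assumptions made purely for illustration; in practice the vectors would come from corpus counts or from a trained embedding model.

    # Minimal sketch of the Vector Space Model (VSM) idea: every entity (here, a word)
    # is represented by a numerical vector of dimensionality D, and two entities are
    # treated as similar when their vectors are close, e.g. under cosine similarity.
    # The toy words and vector values below are invented purely for illustration.
    import numpy as np

    D = 4  # dimensionality of the vector space
    vectors = {
        "cat": np.array([2.0, 0.1, 1.0, 0.0]),
        "dog": np.array([1.8, 0.0, 1.2, 0.1]),
        "car": np.array([0.0, 3.0, 0.2, 1.0]),
    }
    assert all(v.shape == (D,) for v in vectors.values())

    def cosine(x, y):
        # Cosine similarity: close to 1 for vectors pointing in similar directions,
        # near 0 for unrelated ones.
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    print(cosine(vectors["cat"], vectors["dog"]))  # high value: similar entities
    print(cosine(vectors["cat"], vectors["car"]))  # low value: dissimilar entities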

Notes

  1. The frequencies have been taken from the British National Corpus [3], which was compiled in the early 1990s; this timeframe explains the specific Sun-IBM collocation.

References

  1. Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620

  2. Manning C, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge, MA

  3. Aston G, Burnard L (1998) The BNC handbook: exploring the British national corpus with SARA. Edinburgh University Press, Edinburgh

  4. Church K, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29

  5. Firth JR (1951) Modes of meaning. In: Firth JR (ed) Papers in linguistics 1934–1951. Oxford University Press, Oxford, 1957, pp 190–215

  6. Harris Z (1954) Distributional structure. Word 10(2–3):146–162

  7. Turney PD, Pantel P (2010) From frequency to meaning: vector space models of semantics. J Artif Intell Res 37:141–188

  8. Radovanović M, Nanopoulos A, Ivanović M (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11(Sep):2487–2531

  9. Rapp R (2004) A freely available automatically generated thesaurus of related words. In: Proceedings of the language resources and evaluation conference, (LREC’04), Lisbon, pp 395–398

  10. Sahlgren M, Holst A, Kanerva P (2008) Permutations as a means to encode order in word space. In: Proceedings of the 30th annual meeting of the cognitive science society, Washington DC, July 2008

  11. Goldberg Y (2017) Neural network methods for natural language processing. Synthesis lectures on human language technologies. Morgan & Claypool Publishers

  12. Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! A systematic comparison of context-counting versus context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics, Baltimore, MD, June 2014

  13. Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146

  14. Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605

  15. Rogers A, Kovaleva O, Rumshisky A (2020) A primer in BERTology: what we know about how BERT works. Trans Assoc Comput Linguist 8:842–866

  16. Pontiki M, Galanis D, Papageorgiou H, Androutsopoulos I, Manandhar S, Al-Smadi M, Al-Ayyoub M, Zhao Y, Qin B, De Clercq O, Hoste V, Apidianaki M, Tannier X, Loukachevitch N, Kotelnikov E, Bel N, Jiménez-Zafra SM, Eryiğit G (2016) SemEval-2016 task 5: aspect based sentiment analysis. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), San Diego, CA, June 2016. Association for Computational Linguistics, pp 19–30. https://doi.org/10.18653/v1/S16-1002. https://www.aclweb.org/anthology/S16-1002

  17. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, vol 1 (Long and Short Papers), Minneapolis, MN, June 2019. Association for Computational Linguistics, pp 4171–4186. https://aclanthology.org/N19-1423

  18. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692

  19. Lyu S, Son B, Yang K, Bae J (2020) Revisiting modularized multilingual NMT to meet industrial demands. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), Online, Nov 2020. Association for Computational Linguistics, pp 5905–5918. https://doi.org/10.18653/v1/2020.emnlp-main.476. https://aclanthology.org/2020.emnlp-main.476

  20. Massey G (2017) Translation competence development and process-oriented pedagogy. In: The handbook of translation and cognition. Wiley Online Library, pp 496–518

  21. McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. arXiv:1708.00107

  22. Melamed ID (1999) Bitext maps and alignments via pattern recognition. Comput Linguist 25(1):107–130

  23. Melamed ID (2000) Models of translation equivalence among words. Comput Linguist 26(2):221–250. https://aclanthology.org/J00-2004

  24. Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108

  25. Pires T, Schlinger E, Garrette D (2019) How multilingual is multilingual BERT? arXiv:1906.01502

  26. Conneau A, Wu S, Li H, Zettlemoyer L, Stoyanov V (2020) Emerging cross-lingual structure in pretrained language models. In: Proceedings of the 58th annual meeting of the Association for Computational Linguistics, Online, July 2020. Association for Computational Linguistics, pp 6022–6034. https://www.aclweb.org/anthology/2020.acl-main.536

  27. Wenzek G, Lachaux M-A, Conneau A, Chaudhary V, Guzman F, Joulin A, Grave E (2019) CCNet: extracting high quality monolingual datasets from Web crawl data. arXiv:1911.00359

  28. Wu D (1994) Aligning a parallel English-Chinese corpus statistically with lexical criteria. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics (ACL’94), Stroudsburg, PA, USA. Association for Computational Linguistics, pp 80–87

  29. Wu S, Dredze M (2019) Beto, Bentz, Becas: the surprising cross-lingual effectiveness of BERT. arXiv:1904.09077

  30. Wu S, Dredze M (2020) Are all languages created equal in multilingual BERT? In: Proceedings of the 5th workshop on representation learning for NLP, Online, July 2020. Association for Computational Linguistics, pp 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16. https://aclanthology.org/2020.repl4nlp-1.16

  31. Xiao R (ed) (2020) Using corpora in contrastive and translation studies. Cambridge Scholars Publishing

  32. Xu H, Koehn P (2021) Cross-lingual BERT contextual embedding space mapping with isotropic and isometric conditions. arXiv:2107.09186

Author information

Corresponding author

Correspondence to Serge Sharoff.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Basic Principles of Cross-Lingual Models. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_2
