Abstract
When we start working across languages, we need ways of measuring the similarity of entities (such as words, phrases, sentences, or links between entities) both within each language and across languages. Modern approaches usually rely on the Vector Space Model (VSM), which represents any entity as a numerical vector \(x\) of a specified dimensionality \(D\). Entities are considered similar when their vectors are close to each other in this space.
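As a minimal sketch of this idea (not tied to any particular model from the chapter), the snippet below represents three hypothetical entities as toy \(D\)-dimensional vectors and measures their proximity with cosine similarity, one common way to quantify how far apart two vectors are.

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two D-dimensional vectors:
    close to 1.0 for similar directions, close to 0.0 for unrelated ones."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Toy 4-dimensional vectors for three hypothetical entities (illustrative values only).
cat = np.array([0.9, 0.1, 0.3, 0.0])
dog = np.array([0.8, 0.2, 0.4, 0.1])
car = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(cat, dog))  # high: the two vectors point in similar directions
print(cosine_similarity(cat, car))  # low: the two vectors are far apart
```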
Notes
1. The frequencies have been taken from the British National Corpus [3], which was compiled in the early 1990s. This timeframe explains the specific Sun–IBM collocation.
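A collocation such as Sun–IBM can be scored from corpus frequencies; a standard measure is pointwise mutual information (Church and Hanks, listed in the references below). The sketch uses made-up counts purely for illustration; they are not actual BNC frequencies.

```python
import math

# Illustrative (made-up) counts standing in for corpus frequencies;
# these are NOT the actual British National Corpus values.
N = 100_000_000          # total number of word tokens in the corpus
freq_sun = 20_000        # occurrences of "Sun"
freq_ibm = 15_000        # occurrences of "IBM"
freq_sun_ibm = 300       # co-occurrences of "Sun" and "IBM" within a context window

# Pointwise mutual information (Church and Hanks 1990):
# log2 of how much more often the pair co-occurs than chance would predict.
pmi = math.log2((freq_sun_ibm / N) / ((freq_sun / N) * (freq_ibm / N)))
print(f"PMI(Sun, IBM) = {pmi:.2f}")  # a clearly positive PMI indicates a collocation
```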
References
Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620
Manning C, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge, MA
Aston G, Burnard L (1998) The BNC handbook: exploring the British national corpus with SARA. Edinburgh University Press, Edinburgh
Church K, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29
Firth JR (1951) Modes of meaning. In: Firth JR (ed) Papers in linguistics 1934–1951. Oxford University Press, Oxford, 1957, pp 190–215
Harris Z (1954) Distributional structure. Word 10(23):146–162
Turney PD, Pantel P (2010) From frequency to meaning: vector space models of semantics. J Artif Intell Res 37:141–188
Radovanović M, Nanopoulos A, Ivanović M (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11(Sep):2487–2531
Rapp R (2004) A freely available automatically generated thesaurus of related words. In: Proceedings of the language resources and evaluation conference, (LREC’04), Lisbon, pp 395–398
Sahlgren M, Holst A, Kanerva P (2008) Permutations as a means to encode order in word space. In: Proceedings of the 30th annual meeting of the cognitive science society, Washington DC, July 2008
Goldberg Y (2017) Neural network methods for natural language processing. Synthesis lectures on human language technologies. Morgan & Claypool Publishers
Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! A systematic comparison of context-counting versus context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics, Baltimore, MD, June 2014
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146
Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Rogers A, Kovaleva O, Rumshisky A (2020) A primer in BERTology: what we know about how BERT works. Trans Assoc Comput Linguist 8:842–866
Pontiki M, Galanis D, Papageorgiou H, Androutsopoulos I, Manandhar S, Al-Smadi M, Al-Ayyoub M, Zhao Y, Qin B, De Clercq O, Hoste V, Apidianaki M, Tannier X, Loukachevitch N, Kotelnikov E, Bel N, Jiménez-Zafra SM, Eryiğit G (2016) SemEval-2016 task 5: aspect based sentiment analysis. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), San Diego, CA, June 2016. Association for Computational Linguistics, pp 19–30. https://doi.org/10.18653/v1/S16-1002. https://www.aclweb.org/anthology/S16-1002
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, vol 1 (Long and Short Papers), Minneapolis, MN, June 2019. Association for Computational Linguistics, pp 4171–4186. https://aclanthology.org/N19-1423
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692
Lyu S, Son B, Yang K, Bae J (2020) Revisiting modularized multilingual NMT to meet industrial demands. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), Online, Nov 2020. Association for Computational Linguistics, pp 5905–5918. https://doi.org/10.18653/v1/2020.emnlp-main.476. https://aclanthology.org/2020.emnlp-main.476
Massey G (2017) Translation competence development and process-oriented pedagogy. In: The handbook of translation and cognition. Wiley Online Library, pp 496–518
McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. arXiv:1708.00107
Melamed ID (1999) Bitext maps and alignments via pattern recognition. Comput Linguist 25(1):107–130
Melamed ID (2000) Models of translation equivalence among words. Comput Linguist 26(2):221–250. https://aclanthology.org/J00-2004
Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108
Pires T, Schlinger E, Garrette D (2019) How multilingual is multilingual BERT? arXiv:1906.01502
Conneau A, Wu S, Li H, Zettlemoyer L, Stoyanov V (2020) Emerging cross-lingual structure in pretrained language models. In: Proceedings of the 58th annual meeting of the Association for Computational Linguistics, Online, July 2020. Association for Computational Linguistics, pp 6022–6034. https://www.aclweb.org/anthology/2020.acl-main.536
Wenzek G, Lachaux M-A, Conneau A, Chaudhary V, Guzman F, Joulin A, Grave E (2019) CCNet: extracting high quality monolingual datasets from Web crawl data. arXiv:1911.00359
Wu D (1994) Aligning a parallel English-Chinese corpus statistically with lexical criteria. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics (ACL’94), Stroudsburg, PA, USA. Association for Computational Linguistics, pp 80–87
Wu S, Dredze M (2019) Beto, Bentz, Becas: the surprising cross-lingual effectiveness of BERT. arXiv:1904.09077
Wu S, Dredze M (2020) Are all languages created equal in multilingual BERT? In: Proceedings of the 5th workshop on representation learning for NLP, Online, July 2020. Association for Computational Linguistics, pp 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16. https://aclanthology.org/2020.repl4nlp-1.16
Xiao R (ed) (2020) Using corpora in contrastive and translation studies. Cambridge Scholars Publishing
Xu H, Koehn P (2021) Cross-lingual BERT contextual embedding space mapping with isotropic and isometric conditions. arXiv:2107.09186
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Basic Principles of Cross-Lingual Models. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_2
DOI: https://doi.org/10.1007/978-3-031-31384-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31383-7
Online ISBN: 978-3-031-31384-4