Abstract

When we start working across languages, we need to determine ways of measuring the similarity of entities (such as a word, a phrase, a sentence or a link between entities) within each language and across languages. Modern approaches usually rely on the Vector Space Model (VSM), which represents any entity by a numerical vector \(x\) of a specified dimensionality \(D\). Two entities are considered similar when their vectors are close to each other in this space.
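
To make this concrete, the sketch below is a minimal illustrative example, not taken from the chapter: it builds toy vectors of dimensionality \(D\) for a few invented words and compares them with cosine similarity, one common way of measuring how close two vectors are. The words, the vector values and the choice of cosine similarity are assumptions made purely for illustration; in practice the vectors would come from corpus counts or from a trained embedding model.

    # Minimal sketch of the Vector Space Model (VSM) idea: every entity (here, a word)
    # is represented by a numerical vector of dimensionality D, and two entities are
    # treated as similar when their vectors are close, e.g. under cosine similarity.
    # The toy words and vector values below are invented purely for illustration.
    import numpy as np

    D = 4  # dimensionality of the vector space
    vectors = {
        "cat": np.array([2.0, 0.1, 1.0, 0.0]),
        "dog": np.array([1.8, 0.0, 1.2, 0.1]),
        "car": np.array([0.0, 3.0, 0.2, 1.0]),
    }
    assert all(v.shape == (D,) for v in vectors.values())

    def cosine(x, y):
        # Cosine similarity: close to 1 for vectors pointing in similar directions,
        # near 0 for unrelated ones.
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    print(cosine(vectors["cat"], vectors["dog"]))  # high value: similar entities
    print(cosine(vectors["cat"], vectors["car"]))  # low value: dissimilar entities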

Notes

  1. The frequencies have been taken from the British National Corpus [3], which was compiled in the early 1990s; this timeframe explains the specific Sun-IBM collocation.

References

  1. Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620

  2. Manning C, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge, MA

  3. Aston G, Burnard L (1998) The BNC handbook: exploring the British national corpus with SARA. Edinburgh University Press, Edinburgh

  4. Church K, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29

  5. Firth JR (1951) Modes of meaning. In: Firth JR (ed) Papers in linguistics 1934–1951. Oxford University Press, Oxford, 1957, pp 190–215

  6. Harris Z (1954) Distributional structure. Word 10(2–3):146–162

  7. Turney PD, Pantel P (2010) From frequency to meaning: vector space models of semantics. J Artif Intell Res 37:141–188

  8. Radovanović M, Nanopoulos A, Ivanović M (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11(Sep):2487–2531

  9. Rapp R (2004) A freely available automatically generated thesaurus of related words. In: Proceedings of the language resources and evaluation conference, (LREC’04), Lisbon, pp 395–398

  10. Sahlgren M, Holst A, Kanerva P (2008) Permutations as a means to encode order in word space. In: Proceedings of the 30th annual meeting of the cognitive science society, Washington DC, July 2008

  11. Goldberg Y (2017) Neural network methods for natural language processing. Synthesis lectures on human language technologies. Morgan & Claypool Publishers

  12. Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! A systematic comparison of context-counting versus context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics, Baltimore, MD, June 2014

  13. Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146

  14. Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605

  15. Rogers A, Kovaleva O, Rumshisky A (2020) A primer in BERTology: what we know about how BERT works. Trans Assoc Comput Linguist 8:842–866

  16. Pontiki M, Galanis D, Papageorgiou H, Androutsopoulos I, Manandhar S, Al-Smadi M, Al-Ayyoub M, Zhao Y, Qin B, De Clercq O, Hoste V, Apidianaki M, Tannier X, Loukachevitch N, Kotelnikov E, Bel N, Jiménez-Zafra SM, Eryiğit G (2016) SemEval-2016 task 5: aspect based sentiment analysis. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), San Diego, CA, June 2016. Association for Computational Linguistics, pp 19–30. https://doi.org/10.18653/v1/S16-1002. https://www.aclweb.org/anthology/S16-1002

  17. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, vol 1 (Long and Short Papers), Minneapolis, MN, June 2019. Association for Computational Linguistics, pp 4171–4186. https://aclanthology.org/N19-1423

  18. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692

  19. Lyu S, Son B, Yang K, Bae J (2020) Revisiting modularized multilingual NMT to meet industrial demands. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), Online, Nov 2020. Association for Computational Linguistics, pp 5905–5918. https://doi.org/10.18653/v1/2020.emnlp-main.476. https://aclanthology.org/2020.emnlp-main.476

  20. Massey G (2017) Translation competence development and process-oriented pedagogy. In: The handbook of translation and cognition. Wiley Online Library, pp 496–518

  21. McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. arXiv:1708.00107

  22. Melamed ID (1999) Bitext maps and alignments via pattern recognition. Comput Linguist 25(1):107–130

  23. Melamed ID (2000) Models of translation equivalence among words. Comput Linguist 26(2):221–250. https://aclanthology.org/J00-2004

  24. Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108

  25. Pires T, Schlinger E, Garrette D (2019) How multilingual is multilingual BERT? arXiv:1906.01502

  26. Conneau A, Wu S, Li H, Zettlemoyer L, Stoyanov V (2020) Emerging cross-lingual structure in pretrained language models. In: Proceedings of the 58th annual meeting of the Association for Computational Linguistics, Online, July 2020. Association for Computational Linguistics, pp 6022–6034. https://www.aclweb.org/anthology/2020.acl-main.536

  27. Wenzek G, Lachaux M-A, Conneau A, Chaudhary V, Guzman F, Joulin A, Grave E (2019) CCNet: extracting high quality monolingual datasets from Web crawl data. arXiv:1911.00359

  28. Wu D (1994) Aligning a parallel English-Chinese corpus statistically with lexical criteria. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics (ACL’94), Stroudsburg, PA, USA. Association for Computational Linguistics, pp 80–87

  29. Wu S, Dredze M (2019) Beto, Bentz, Becas: the surprising cross-lingual effectiveness of BERT. arXiv:1904.09077

  30. Wu S, Dredze M (2020) Are all languages created equal in multilingual BERT? In: Proceedings of the 5th workshop on representation learning for NLP, Online, July 2020. Association for Computational Linguistics, pp 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16. https://aclanthology.org/2020.repl4nlp-1.16

  31. Xiao R (ed) (2020) Using corpora in contrastive and translation studies. Cambridge Scholars Publishing

  32. Xu H, Koehn P (2021) Cross-lingual BERT contextual embedding space mapping with isotropic and isometric conditions. arXiv:2107.09186

Author information

Corresponding author

Correspondence to Serge Sharoff.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Basic Principles of Cross-Lingual Models. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_2
