Abstract

At the beginning of the 2000s, the use of comparable corpora was on the margins of NLP research. Existing MT systems were nearly always based on fully parallel corpora, while NLP applications were mostly built separately for each language, without the advantages of cross-lingual transfer.

Author information

Correspondence to Serge Sharoff.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Conclusions and Future Research. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_8
