Abstract
At the beginning of the 2000s, the use of comparable corpora was on the margins of NLP research. Existing MT systems were nearly always based on fully parallel corpora, while NLP applications were mostly built separately for each language, without the advantages of cross-lingual transfer.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Conclusions and Future Research. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_8
DOI: https://doi.org/10.1007/978-3-031-31384-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31383-7
Online ISBN: 978-3-031-31384-4
eBook Packages: Synthesis Collection of Technology (R0), eBColl Synthesis Collection 12