Abstract
The growing capacity of computers to handle large amounts of text, together with the availability of more text in electronic form, led to the rise of data-driven research in computational linguistics. In Machine Translation (MT) and other kinds of multilingual Natural Language Processing (NLP), the first source of large-scale data was collections of translations, starting with the Statistical MT (SMT) approach from IBM [1], which used the Proceedings of the Canadian Parliament in English and French as its data source. This research direction was followed by a proliferation of SMT models that relied on ever larger collections of parallel data, that is, exact translations between a pair of languages or among several languages at once.
References
Brown PF, Cocke J, Della Pietra SA, Della Pietra VJ, Jelinek F, Lafferty JD, Mercer RL, Roossin PS (1990) A statistical approach to machine translation. Comput Linguist 16(2):79–85
Buck C, Koehn P (2016) Findings of the WMT 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: volume 2, shared task papers. Association for Computational Linguistics, Berlin, Germany, pp 554–563. https://doi.org/10.18653/v1/W16-2347, https://aclanthology.org/W16-2347
Cai X, Huang J, Bian Y, Church K (2021) Isotropy in the contextual embedding space: clusters and manifolds. In: International conference on learning representations
Rapp R (1995) Identifying word translations in non-parallel texts. In: Proceedings of ACL, Cambridge, pp 320–322
Fung P (1998) A statistical view on bilingual lexicon extraction: from parallel corpora to non-parallel corpora. In: Machine translation and the information soup. Springer, pp 1–17. http://www.springerlink.com/content/pqkpwpw32f5r74ev/
Sharoff S, Rapp R, Zweigenbaum P (2013) Overviewing important aspects of the last twenty years of research in comparable corpora. In: Sharoff S, Rapp R, Zweigenbaum P, Fung P (eds) BUCC: building and using comparable corpora. Springer, pp 1–17
Sharoff S, Zweigenbaum P, Rapp R (2015) BUCC shared task: cross-language document similarity. In: Proceedings of eighth workshop on building and using comparable corpora. Association for Computational Linguistics, Beijing, China, pp 74–78
Massey G (2017) Translation competence development and process-oriented pedagogy. In: The handbook of translation and cognition. Wiley Online Library, pp 496–518
Baayen RH (2008) Analyzing linguistic data. Cambridge University Press, Cambridge
Baroni M, Bernardini S, Ferraresi A, Zanchetta E (2009) The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Lang Resour Eval 43(3):209–226
Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! A systematic comparison of context-counting versus context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, Baltimore, Maryland
Bateman JA (1997) Enabling technology for multilingual natural language generation: the KPML development environment. Nat Lang Eng 3(1):15–55
Bengio Y (2012) Deep learning of representations for unsupervised and transfer learning. In: Guyon I, Dror G, Lemaire V, Taylor G, Silver D (eds) Proceedings of ICML workshop on unsupervised and transfer learning. Volume 27 of proceedings of machine learning research, Bellevue, Washington, USA. PMLR, pp 17–36. https://proceedings.mlr.press/v27/bengio12a.html. Accessed 02 July 2012
Bernardini S, Ferraresi A (2013) Old needs, new solutions: comparable corpora for language professionals. In: Sharoff S, Rapp R, Zweigenbaum P, Fung P (eds) Building and using comparable corpora. Springer, pp 303–319
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT summit X
Ziemski M, Junczys-Dowmunt M, Pouliquen B (2016) The United Nations parallel corpus v1.0. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16), Portorož, Slovenia, May 2016. European Language Resources Association (ELRA)
Aston G, Burnard L (1998) The BNC handbook: exploring the British national corpus with SARA. Edinburgh University Press, Edinburgh
Koppel M, Ordan N (2011) Translationese and its dialects. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, June 2011. Association for Computational Linguistics, pp 1318–1326. https://www.aclweb.org/anthology/P11-1132
Frankenberg-Garcia A (2009) Are translations longer than source texts. In: Beeby A, Rodriguez P, Sanchez-Gijon P (eds) Corpus use and learning to translate (CULT): an introduction. John Benjamins, pp 47–58
Rubino R, Lapshinova-Koltunski E, van Genabith J (2016) Information density and quality estimation features as translationese indicators for human translation classification. In: Proceedings of the 2016 conference of the North American Chapter of the association for computational linguistics: human language technologies, San Diego, CA, June 2016. Association for Computational Linguistics, pp 960–970. https://doi.org/10.18653/v1/N16-1110. https://www.aclweb.org/anthology/N16-1110
Rabinovich E, Wintner S (2015) Unsupervised identification of translationese. Trans Assoc Comput Linguist 3:419–432
Riley P, Caswell I, Freitag M, Grangier D (2020) Translationese as a language in “multilingual” NMT. In: Proceedings of the 58th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp 7737–7746. https://www.aclweb.org/anthology/2020.acl-main.691
Tiedemann J (2012) Parallel data, tools and interfaces in OPUS. In: Proceedings of the eighth international conference on language resources and evaluation (LREC’12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA), pp 2214–2218. http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf
Sharoff S (2013) Measuring the distance between comparable corpora between languages. In: Sharoff S, Rapp R, Zweigenbaum P, Fung P (eds) BUCC: building and using comparable corpora. Springer, Berlin/New York, pp 113–129
Sharoff S (2006) Open-source corpora: using the net to fish for linguistic data. Int J Corpus Linguist 11(4):435–462
Wenzek G, Lachaux M-A, Conneau A, Chaudhary V, Guzman F, Joulin A, Grave E (2019) CCNET: extracting high quality monolingual datasets from Web crawl data. arXiv:1911.00359
Wu D (1994) Aligning a parallel English-Chinese corpus statistically with lexical criteria. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics (ACL ’94), Stroudsburg, PA, USA, 1994. Association for Computational Linguistics, pp 80–87
Wu S, Dredze M (2019) Beto, Bentz, Becas: the surprising cross-lingual effectiveness of BERT. arXiv:1904.09077
Wu S, Dredze M (2020) Are all languages created equal in multilingual BERT? In: Proceedings of the 5th workshop on representation learning for NLP, Online, July 2020. Association for Computational Linguistics, pp 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16. https://aclanthology.org/2020.repl4nlp-1.16
Xiao R (ed) (2020) Using corpora in contrastive and translation studies. Cambridge Scholars Publishing
Xu H, Koehn P (2021) Cross-lingual BERT contextual embedding space mapping with isotropic and isometric conditions. arXiv:2107.09186
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Introduction. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_1
DOI: https://doi.org/10.1007/978-3-031-31384-4_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31383-7
Online ISBN: 978-3-031-31384-4
eBook Packages: Synthesis Collection of Technology (R0), eBColl Synthesis Collection 12