Abstract

Growing computing capacity for processing large amounts of text, together with the increasing availability of texts in electronic form, led to the rise of data-driven research in computational linguistics. In Machine Translation (MT) and other kinds of multilingual Natural Language Processing (NLP), the first source of large-scale data was collections of translations, beginning with the Statistical MT (SMT) approach from IBM [1], which used the Proceedings of the Canadian Parliament in English and French as its data source. This research direction was followed by a proliferation of SMT models that relied on ever larger collections of parallel data, that is, exact translations between a pair of languages or among several languages at once.

References

  1. Brown PF, Cocke J, Della Pietra SA, Della Pietra VJ, Jelinek F, Lafferty JD, Mercer RL, Roossin PS (1990) A statistical approach to machine translation. Comput Linguist 16(2):79–85

  2. Buck C, Koehn P (2016) Findings of the WMT 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: volume 2, shared task papers. Association for Computational Linguistics, Berlin, Germany, pp 554–563. https://doi.org/10.18653/v1/W16-2347, https://aclanthology.org/W16-2347

  3. Cai X, Huang J, Bian Y, Church K (2021) Isotropy in the contextual embedding space: clusters and manifolds. In: International conference on learning representations

  4. Rapp R (1995) Identifying word translations in non-parallel texts. In: Proceedings of ACL, Cambridge, pp 320–322

  5. Fung P (1998) A statistical view on bilingual lexicon extraction: from parallel corpora to non-parallel corpora. In: Machine translation and the information soup. Springer, pp 1–17. http://www.springerlink.com/content/pqkpwpw32f5r74ev/

  6. Sharoff S, Rapp R, Zweigenbaum P (2013) Overviewing important aspects of the last twenty years of research in comparable corpora. In: Sharoff S, Rapp R, Zweigenbaum P, Fung P (eds) BUCC: building and using comparable corpora. Springer, pp 1–17

  7. Sharoff S, Zweigenbaum P, Rapp R (2015) BUCC shared task: cross-language document similarity. In: Proceedings of eighth workshop on building and using comparable corpora. Association for Computational Linguistics, Beijing, China, pp 74–78

  8. Massey G (2017) Translation competence development and process-oriented pedagogy. In: The handbook of translation and cognition. Wiley Online Library, pp 496–518

  9. Baayen RH (2008) Analyzing linguistic data. Cambridge University Press, Cambridge

  10. Baroni M, Bernardini S, Ferraresi A, Zanchetta E (2009) The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Lang Resour Eval 43(3):209–226

  11. Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! A systematic comparison of context-counting versus context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, Baltimore, Maryland

  12. Bateman JA (1997) Enabling technology for multilingual natural language generation: the KPML development environment. Nat Lang Eng 3(1):15–55

  13. Bengio Y (2012) Deep learning of representations for unsupervised and transfer learning. In: Guyon I, Dror G, Lemaire V, Taylor G, Silver D (eds) Proceedings of ICML workshop on unsupervised and transfer learning. Volume 27 of proceedings of machine learning research, Bellevue, Washington, USA. PMLR, pp 17–36. https://proceedings.mlr.press/v27/bengio12a.html. Accessed 02 July 2012

  14. Bernardini S, Ferraresi A (2013) Old needs, new solutions: comparable corpora for language professionals. In: Sharoff S, Rapp R, Zweigenbaum P, Fung P (eds) Building and using comparable corpora. Springer, pp 303–319

  15. Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT summit X

  16. Ziemski M, Junczys-Dowmunt M, Pouliquen B (2016) The United Nations parallel corpus v1.0. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16), Portorož, Slovenia, May 2016. European Language Resources Association (ELRA)

  17. Aston G, Burnard L (1998) The BNC handbook: exploring the British national corpus with SARA. Edinburgh University Press, Edinburgh

  18. Koppel M, Ordan N (2011) Translationese and its dialects. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, June 2011. Association for Computational Linguistics, pp 1318–1326. https://www.aclweb.org/anthology/P11-1132

  19. Frankenberg-Garcia A (2009) Are translations longer than source texts. In: Beeby A, Rodriguez P, Sanchez-Gijon P (eds) Corpus use and learning to translate (CULT): an introduction. John Benjamins, pp 47–58

  20. Rubino R, Lapshinova-Koltunski E, van Genabith J (2016) Information density and quality estimation features as translationese indicators for human translation classification. In: Proceedings of the 2016 conference of the North American Chapter of the association for computational linguistics: human language technologies, San Diego, CA, June 2016. Association for Computational Linguistics, pp 960–970. https://doi.org/10.18653/v1/N16-1110. https://www.aclweb.org/anthology/N16-1110

  21. Rabinovich E, Wintner S (2015) Unsupervised identification of translationese. Trans Assoc Comput Linguist 3:419–432

  22. Riley P, Caswell I, Freitag M, Grangier D (2020) Translationese as a language in “multilingual” NMT. In: Proceedings of the 58th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp 7737–7746. https://www.aclweb.org/anthology/2020.acl-main.691

  23. Tiedemann J (2012) Parallel data, tools and interfaces in OPUS. In: Proceedings of the eighth international conference on language resources and evaluation (LREC’12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA), pp 2214–2218. http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf

  24. Sharoff S (2013) Measuring the distance between comparable corpora between languages. In: Sharoff S, Rapp R, Zweigenbaum P, Fung P (eds) BUCC: building and using comparable corpora. Springer, Berlin/New York, pp 113–129

  25. Sharoff S (2006) Open-source corpora: using the net to fish for linguistic data. Int J Corpus Linguist 11(4):435–462

  26. Wenzek G, Lachaux M-A, Conneau A, Chaudhary V, Guzman F, Joulin A, Grave E (2019) CCNET: extracting high quality monolingual datasets from Web crawl data. arXiv:1911.00359

  27. Wu D (1994) Aligning a parallel English-Chinese corpus statistically with lexical criteria. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics (ACL ’94), Stroudsburg, PA, USA, 1994. Association for Computational Linguistics, pp 80–87

  28. Wu S, Dredze M (2019) Beto, Bentz, Becas: the surprising cross-lingual effectiveness of BERT. arXiv:1904.09077

  29. Wu S, Dredze M (2020) Are all languages created equal in multilingual BERT? In: Proceedings of the 5th workshop on representation learning for NLP, Online, July 2020. Association for Computational Linguistics, pp 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16. https://aclanthology.org/2020.repl4nlp-1.16

  30. Xiao R (ed) (2020) Using corpora in contrastive and translation studies. Cambridge Scholars Publishing

  31. Xu H, Koehn P (2021) Cross-lingual BERT contextual embedding space mapping with isotropic and isometric conditions. arXiv:2107.09186

Author information

Corresponding author

Correspondence to Serge Sharoff.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Introduction. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_1
