
Abstract

This section concerns applications of comparable corpora beyond pure machine translation. It has been argued [1, 2] that downstream applications such as cross-lingual document classification, information retrieval or natural language inference, apart from proving the practical utility of NLP methods


References

  1. Klementiev A, Titov I, Bhattarai B (2012) Inducing crosslingual distributed representations of words. In: Proceedings of the COLING, Mumbai


  2. Glavaš G, Litschko R, Ruder S, Vulić I (2019) How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, pp 710–721. https://doi.org/10.18653/v1/P19-1070. https://www.aclweb.org/anthology/P19-1070

  3. Goldberg Y (2017) Neural network methods for natural language processing. Synthesis lectures on human language technologies. Morgan & Claypool Publishers


  4. Grégoire F, Langlais P (2018) Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation. In: Proceedings of the 27th international conference on computational linguistics. Association for Computational Linguistics, Santa Fe, pp 1442–1453. https://aclanthology.org/C18-1122

  5. Grover J, Mitra P (2017) Bilingual word embeddings with bucketed CNN for parallel sentence extraction. In: Proceedings of ACL 2017, student research workshop. Association for Computational Linguistics, Vancouver, pp 11–16. https://aclanthology.org/P17-3003

  6. Daumé H III, Kumar A, Saha A (2010) Frustratingly easy semi-supervised domain adaptation. In: Proceedings of the 2010 workshop on domain adaptation for natural language processing, Uppsala, pp 53–59. https://aclanthology.org/W10-2608

  7. Bengio Y (2012) Deep learning of representations for unsupervised and transfer learning. In: Guyon I, Dror G, Lemaire V, Taylor G, Silver D (eds) Proceedings of ICML workshop on unsupervised and transfer learning. Proceedings of machine learning research, vol 27. PMLR, Bellevue, pp 17–36. https://proceedings.mlr.press/v27/bengio12a.html

  8. Wu S, Dredze M (2020) Are all languages created equal in multilingual BERT? In: Proceedings of the 5th workshop on representation learning for NLP. Association for Computational Linguistics, pp 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16. https://aclanthology.org/2020.repl4nlp-1.16

  9. Conneau A, Wu S, Li H, Zettlemoyer L, Stoyanov V (2020) Emerging cross-lingual structure in pretrained language models. In: Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 6022–6034. https://www.aclweb.org/anthology/2020.acl-main.536

  10. Hu J, Ruder S, Siddhant A, Neubig G, Firat O, Johnson M (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In: Daumé H III, Singh A (eds) Proceedings of the 37th international conference on machine learning. Proceedings of machine learning research, vol 119. PMLR, pp 4411–4421. https://proceedings.mlr.press/v119/hu20b.html

  11. Huang E, Socher R, Manning C, Ng A (2012) Improving word representations via global context and multiple word prototypes. In: Proceedings of the 50th annual meeting of the association for computational linguistics (volume 1: long papers). Association for Computational Linguistics, Jeju Island, pp 873–882. https://aclanthology.org/P12-1092

  12. Huang J, Cai X, Church K (2020) Improving bilingual lexicon induction for low frequency words. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, pp 1310–1314. https://doi.org/10.18653/v1/2020.emnlp-main.100. https://aclanthology.org/2020.emnlp-main.100

  13. Irvine A, Callison-Burch C (2016) End-to-end statistical machine translation with zero or small parallel texts. Nat Lang Eng 22(4):517–548. https://doi.org/10.1017/S1351324916000127


  14. Irvine A, Callison-Burch C (2017) A comprehensive analysis of bilingual lexicon induction. Comput Linguist 43(2):273–310. https://doi.org/10.1162/COLI_a_00284. https://aclanthology.org/J17-2001

  15. Jakubina L, Langlais P (2016) BAD LUC@WMT 2016: a bilingual document alignment platform based on lucene. In: Proceedings of the first conference on machine translation: volume 2, shared task papers. Association for Computational Linguistics, Berlin, pp 703–709. https://doi.org/10.18653/v1/W16-2370. https://aclanthology.org/W16-2370

  16. Jantunen J (2002) Comparable corpora in translation studies: strengths and limitations. SKY J Linguist 15(43):105–117


  17. Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128. ISSN 0162-8828. https://doi.org/10.1109/TPAMI.2010.57

  18. Johnson J, Douze M, Jégou H (2021) Billion-scale similarity search with GPUs. IEEE Trans Big Data 7(3):535–547. https://doi.org/10.1109/TBDATA.2019.2921572

  19. Ruder S, Constant N, Botha J, Siddhant A, Firat O, Fu J, Liu P, Hu J, Garrette D, Neubig G et al (2021) XTREME-R: towards more challenging and nuanced multilingual evaluation. arXiv preprint arXiv:2104.07412

  20. Sahlgren M, Holst A, Kanerva P (2008) Permutations as a means to encode order in word space. In: Proceedings of the 30th annual meeting of the cognitive science society. Washington DC


  21. Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620


  22. Forsyth R, Sharoff S (2014) Document dissimilarity within and across languages: a benchmarking study. Literary Linguist Comput 29:6–22


  23. Mikolov T, Le QV, Sutskever I (2013) Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168

  24. Wu S, Dredze M (2019) Beto, bentz, becas: the surprising cross-lingual effectiveness of BERT. arXiv preprint arXiv:1904.09077

  25. Conneau A, Lample G, Rinott R, Williams A, Bowman SR, Schwenk H, Stoyanov V (2018) XNLI: Evaluating cross-lingual sentence representations. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, pp 2475–2485. https://doi.org/10.18653/v1/D18-1269. https://www.aclweb.org/anthology/D18-1269

  26. Nivre J, de Marneffe M-C, Ginter F, Goldberg Y, Hajič J, Manning CD, McDonald R, Petrov S, Pyysalo S, Silveira N, Tsarfaty R, Zeman D (2016) Universal dependencies v1: a multilingual treebank collection. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16). European Language Resources Association (ELRA), Portorož


  27. Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51


  28. Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics. Berlin, pp 1715–1725. https://doi.org/10.18653/v1/P16-1162

  29. Pfeiffer J, Vulić I, Gurevych I, Ruder S (2020) UNKs everywhere: adapting multilingual language models to new scripts. arXiv preprint arXiv:2012.15562

  30. Pires T, Schlinger E, Garrette D (2019) How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502

  31. Rönnqvist S, Skantsi V, Oinonen M, Laippala V (2021) Multilingual and zero-shot is closing in on monolingual web register classification. In: Proceedings of the 23rd nordic conference on computational linguistics (NoDaLiDa). Linköping University Electronic Press, Reykjavik, pp 157–165. https://aclanthology.org/2021.nodalida-main.16

  32. Rose T, Stevenson M, Whitehead M (2002) The Reuters corpus volume 1 – from yesterday’s news to tomorrow’s language resources. In: Proceedings of the third international conference on language resources and evaluation (LREC). European Language Resources Association (ELRA), Las Palmas, Canary Islands. http://www.lrec-conf.org/proceedings/lrec2002/pdf/80.pdf

  33. Kurfalı M, Östling R (2021) Probing multilingual language models for discourse. arXiv preprint arXiv:2106.04832

  34. Chau EC, Smith NA (2021) Specializing multilingual language models: an empirical study. arXiv preprint arXiv:2106.09063

  35. Lewis M, Ghazvininejad M, Ghosh G, Aghajanyan A, Wang S, Zettlemoyer L (2020) Pre-training via paraphrasing. arXiv preprint arXiv:2006.15020

  36. Zhao W, Eger S, Bjerva J, Augenstein I (2020) Inducing language-agnostic multilingual representations. arXiv preprint arXiv:2008.09112

  37. Sharoff S (2020) Finding next of kin: cross-lingual embedding spaces for related languages. Nat Lang Eng 26:163–182


  38. Rios M, Sharoff S (2016) Language adaptation for extending post-editing estimates for closely related languages. Prague Bull Math Linguist 106:181–192


  39. Raina R, Battle A, Lee H, Packer B, Ng AY (2007) Self-taught learning: transfer learning from unlabeled data. In: Proceedings of the ICML


  40. Kilgarriff A, Charalabopoulou F, Gavrilidou M, Johannessen JB, Khalil S, Johansson Kokkinakis S, Lew R, Sharoff S, Vadlapudi R, Volodina E (2014) Corpus-based vocabulary lists for language learners for nine languages. Lang Resour Evaluat 48(1):121–163


  41. Bański P, Gozdawa-Gołębiowski R (2010) Foreign language examination corpus for L2-learning studies. In: Proceedings of the workshop on building and using comparable corpora, Malta


  42. Kupietz M, Witt A, Bański P, Tufiş D, Cristea D, Váradi T (2017) EuReCo – joining forces for a European reference corpus as a sustainable base for cross-linguistic research. In: Proceedings of the workshop on challenges in the management of large corpora and big data and natural language processing, Birmingham


  43. Kirk J, Čermáková A, Oksefjell Ebeling S, Ebeling J, Kren M, Aijmer K, Benko V, Garabik R, Gorski R, Kupietz M, Jantunen J et al (2018) Introducing the international comparable corpus. In: Using corpora in contrastive and translation studies conference, Louvain. https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/8248/file/Kirk_etal._Introducing_the_International_Comparable_Corpus_2018.pdf

  44. Čermáková A, Jantunen J, Jauhiainen T, Kirk J, Křen M, Kupietz M, Dhonnchadha EU (2021) The international comparable corpus: challenges in building multilingual spoken and written comparable corpora. Res Corpus Linguist 9(1):89–103


  45. Chen J, Nie J-Y (2000) Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. In: Sixth applied natural language processing conference. Association for Computational Linguistics, Seattle, pp 21–28. https://doi.org/10.3115/974147.974151. https://aclanthology.org/A00-1004

  46. Church K, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29


  47. Bernardini S, Ferraresi A (2013) Old needs, new solutions: comparable corpora for language professionals. In: Sharoff S, Rapp R, Zweigenbaum P, Fung P (eds) Building and using comparable corpora. Springer, pp 303–319


  48. Xiao R (ed) (2020) Using corpora in contrastive and translation studies. Cambridge Scholars Publishing


  49. Kunilovskaya M, Lapshinova-Koltunski E (2019) Translationese features as indicators of quality in English-Russian human translation. In: Proceedings of the human-informed translation and interpreting technology workshop (HiT-IT 2019). Incoma Ltd, Varna, pp 47–56. https://doi.org/10.26615/issn.2683-0078.2019_006. https://www.aclweb.org/anthology/W19-8706

  50. Kunilovskaya M, Sharoff S (2019) Towards functionally similar corpus resources for translation. In: Proceedings of the international conference on recent advances in natural language processing (RANLP 2019). INCOMA Ltd, Varna, pp 583–592. https://doi.org/10.26615/978-954-452-056-4_069. https://aclanthology.org/R19-1069

  51. Munday J (2011) Looming large: a cross-linguistic analysis of semantic prosodies in comparable reference corpora. In: Kruger A, Wallmach K, Munday J (eds) Corpus-based translation studies: research and applications. St Jerome, Manchester, pp 169–186


  52. Zanettin F (1998) Bilingual comparable corpora and the training of translators. Meta: Journal des traducteurs 43:12. https://doi.org/10.7202/004638ar


Author information

Corresponding author

Correspondence to Serge Sharoff.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Other Applications of Comparable Corpora. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_7
