Abstract
This section concerns applications of comparable corpora beyond pure machine translation. It has been argued [1, 2] that downstream applications such as cross-lingual document classification, information retrieval or natural language inference, apart from proving the practical utility of NLP methods
References
Klementiev A, Titov I, Bhattarai B (2012) Inducing crosslingual distributed representations of words. In: Proceedings of the COLING, Mumbai
Glavaš G, Litschko R, Ruder S, Vulić I (2019) How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, pp 710–721. https://doi.org/10.18653/v1/P19-1070. https://www.aclweb.org/anthology/P19-1070
Goldberg Y (2017) Neural network methods for natural language processing. Synthesis lectures on human language technologies. Morgan & Claypool Publishers
Grégoire F, Langlais P (2018) Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation. In: Proceedings of the 27th international conference on computational linguistics. Association for Computational Linguistics, Santa Fe, pp 1442–1453. https://aclanthology.org/C18-1122
Grover J, Mitra P (2017) Bilingual word embeddings with bucketed CNN for parallel sentence extraction. In: Proceedings of ACL 2017, student research workshop. Association for Computational Linguistics, Vancouver, pp 11–16. https://aclanthology.org/P17-3003
Daumé H III, Kumar A, Saha A (2010) Frustratingly easy semi-supervised domain adaptation. In: Proceedings of the 2010 workshop on domain adaptation for natural language processing, Uppsala, pp 53–59. https://aclanthology.org/W10-2608
Bengio Y (2012) Deep learning of representations for unsupervised and transfer learning. In: Guyon I, Dror G, Lemaire V, Taylor G, Silver D (eds) Proceedings of ICML workshop on unsupervised and transfer learning. Proceedings of machine learning research, vol 27. PMLR, Bellevue, pp 17–36. https://proceedings.mlr.press/v27/bengio12a.html
Wu S, Dredze M (2020) Are all languages created equal in multilingual BERT? In: Proceedings of the 5th workshop on representation learning for NLP. Association for Computational Linguistics, pp 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16. https://aclanthology.org/2020.repl4nlp-1.16
Conneau A, Wu S, Li H, Zettlemoyer L, Stoyanov V (2020) Emerging cross-lingual structure in pretrained language models. In: Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 6022–6034. https://www.aclweb.org/anthology/2020.acl-main.536
Hu J, Ruder S, Siddhant A, Neubig G, Firat O, Johnson M (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In: Daumé H III, Singh A (eds) Proceedings of the 37th international conference on machine learning. Proceedings of machine learning research, vol 119. PMLR, pp 4411–4421. https://proceedings.mlr.press/v119/hu20b.html
Huang E, Socher R, Manning C, Ng A (2012) Improving word representations via global context and multiple word prototypes. In: Proceedings of the 50th annual meeting of the association for computational linguistics (volume 1: long papers). Association for Computational Linguistics, Jeju Island, pp 873–882. https://aclanthology.org/P12-1092
Huang J, Cai X, Church K (2020) Improving bilingual lexicon induction for low frequency words. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, pp 1310–1314. https://doi.org/10.18653/v1/2020.emnlp-main.100. https://aclanthology.org/2020.emnlp-main.100
Irvine A, Callison-Burch C (2016) End-to-end statistical machine translation with zero or small parallel texts. Nat Lang Eng 22(4):517–548. https://doi.org/10.1017/S1351324916000127
Irvine A, Callison-Burch C (2017) A comprehensive analysis of bilingual lexicon induction. Comput Linguist 43(2):273–310. https://doi.org/10.1162/COLI_a_00284. https://aclanthology.org/J17-2001
Jakubina L, Langlais P (2016) BAD LUC@WMT 2016: a bilingual document alignment platform based on lucene. In: Proceedings of the first conference on machine translation: volume 2, shared task papers. Association for Computational Linguistics, Berlin, pp 703–709. https://doi.org/10.18653/v1/W16-2370. https://aclanthology.org/W16-2370
Jantunen J (2002) Comparable corpora in translation studies: strengths and limitations. SKY J Linguist 15(43):105–117
Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128. ISSN 0162-8828. https://doi.org/10.1109/TPAMI.2010.57
Johnson J, Douze M, Jégou H. Billion-scale similarity search with GPUs. IEEE Trans Big Data 7(3):535–547. https://doi.org/10.1109/TBDATA.2019.2921572
Ruder S, Constant N, Botha J, Siddhant A, Firat O, Fu J, Liu P, Hu J, Garrette D, Neubig G et al (2021) XTREME-R: towards more challenging and nuanced multilingual evaluation. arXiv preprint arXiv:2104.07412
Sahlgren M, Holst A, Kanerva P (2008) Permutations as a means to encode order in word space. In: Proceedings of the 30th annual meeting of the cognitive science society. Washington DC
Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620
Forsyth R, Sharoff S (2014) Document dissimilarity within and across languages: a benchmarking study. Literary Linguist Comput 29:6–22
Mikolov T, Le QV, Sutskever I (2013) Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168
Wu S, Dredze M (2019) Beto, bentz, becas: the surprising cross-lingual effectiveness of BERT. arXiv preprint arXiv:1904.09077
Conneau A, Lample G, Rinott R, Williams A, Bowman SR, Schwenk H, Stoyanov V (2018) XNLI: Evaluating cross-lingual sentence representations. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, pp 2475–2485. https://doi.org/10.18653/v1/D18-1269. https://www.aclweb.org/anthology/D18-1269
Nivre J, de Marneffe M-C, Ginter F, Goldberg Y, Hajič J, Manning CD, McDonald R, Petrov S, Pyysalo S, Silveira N, Tsarfaty R, Zeman D (2016) Universal dependencies v1: a multilingual treebank collection. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16). European Language Resources Association (ELRA), Portorož
Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51
Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics. Berlin, pp 1715–1725. https://doi.org/10.18653/v1/P16-1162
Pfeiffer J, Vulić I, Gurevych I, Ruder S (2020) UNKs everywhere: adapting multilingual language models to new scripts. arXiv preprint arXiv:2012.15562
Pires T, Schlinger E, Garrette D (2019) How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502
Rönnqvist S, Skantsi V, Oinonen M, Laippala V (2021) Multilingual and zero-shot is closing in on monolingual web register classification. In: Proceedings of the 23rd nordic conference on computational linguistics (NoDaLiDa). Linköping University Electronic Press, Reykjavik, pp 157–165. https://aclanthology.org/2021.nodalida-main.16
Rose T, Stevenson M, Whitehead M (2002) The Reuters corpus volume 1 - from yesterday’s news to tomorrow’s language resources. In: Proceedings of the third international conference on language resources and evaluation (LREC). European Language Resources Association (ELRA), Las Palmas, Canary Islands. http://www.lrec-conf.org/proceedings/lrec2002/pdf/80.pdf
Kurfalı M, Östling R (2021) Probing multilingual language models for discourse. arXiv preprint arXiv:2106.04832
Chau EC, Smith NA (2021) Specializing multilingual language models: an empirical study. arXiv preprint arXiv:2106.09063
Lewis M, Ghazvininejad M, Ghosh G, Aghajanyan A, Wang S, Zettlemoyer L (2020) Pre-training via paraphrasing. arXiv preprint arXiv:2006.15020
Zhao W, Eger S, Bjerva J, Augenstein I (2020) Inducing language-agnostic multilingual representations. arXiv preprint arXiv:2008.09112
Sharoff S (2020) Finding next of kin: cross-lingual embedding spaces for related languages. J Nat Lang Eng 26:163–182
Rios M, Sharoff S (2016) Language adaptation for extending post-editing estimates for closely related languages. Prague Bull Math Linguist 106:181–192
Raina R, Battle A, Lee H, Packer B, Ng AY (2007) Self-taught learning: transfer learning from unlabeled data. In: Proceedings of the ICML
Kilgarriff A, Charalabopoulou F, Gavrilidou M, Johannessen JB, Khalil S, Johansson Kokkinakis S, Lew R, Sharoff S, Vadlapudi R, Volodina E (2014) Corpus-based vocabulary lists for language learners for nine languages. Lang Resour Evaluat 48(1):121–163
Bański P, Gozdawa-Gołębiowski R (2010) Foreign language examination corpus for L2-learning studies. In: Proceedings of the workshop on building and using comparable corpora, Malta
Kupietz M, Witt A, Bański P, Tufiş D, Cristea D, Váradi T (2017) EuReCo – joining forces for a European reference corpus as a sustainable base for cross-linguistic research. In: Proceedings of the workshop on challenges in the management of large corpora and big data and natural language processing, Birmingham
Kirk J, Čermáková A, Oksefjell Ebeling S, Ebeling J, Kren M, Aijmer K, Benko V, Garabik R, Gorski R, Kupietz M, Jantunen J et al (2018) Introducing the international comparable corpus. In: Using corpora in contrastive and translation studies conference, Louvain. https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/8248/file/Kirk_etal._Introducing_the_International_Comparable_Corpus_2018.pdf
Čermáková A, Jantunen J, Jauhiainen T, Kirk J, Křen M, Kupietz M, Dhonnchadha EU (2021) The international comparable corpus: challenges in building multilingual spoken and written comparable corpora. Res Corpus Linguist 9(1):89–103
Chen J, Nie J-Y (2000) Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. In: Sixth applied natural language processing conference. Association for Computational Linguistics, Seattle, pp 21–28. https://doi.org/10.3115/974147.974151. https://aclanthology.org/A00-1004
Church K, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29
Bernardini S, Ferraresi A (2013) Old needs, new solutions: comparable corpora for language professionals. In: Sharoff S, Rapp R, Zweigenbaum P, Fung P (eds) Building and using comparable corpora. Springer, pp 303–319
Xiao R (ed) (2020) Using corpora in contrastive and translation studies. Cambridge Scholars Publishing
Kunilovskaya M, Lapshinova-Koltunski E (2019) Translationese features as indicators of quality in English-Russian human translation. In: Proceedings of the human-informed translation and interpreting technology workshop (HiT-IT 2019). Incoma Ltd, Varna, pp 47–56. https://doi.org/10.26615/issn.2683-0078.2019_006. https://www.aclweb.org/anthology/W19-8706
Kunilovskaya M, Sharoff S (2019) Towards functionally similar corpus resources for translation. In: Proceedings of the international conference on recent advances in natural language processing (RANLP 2019). INCOMA Ltd, Varna, pp 583–592. https://doi.org/10.26615/978-954-452-056-4_069. https://aclanthology.org/R19-1069
Munday J (2011) Looming large: a cross-linguistic analysis of semantic prosodies in comparable reference corpora. In: Kruger A, Wallmach K, Munday J (eds) Corpus-based translation studies: research and applications. St Jerome, Manchester, pp 169–186
Zanettin F (1998) Bilingual comparable corpora and the training of translators. Meta: Journal des traducteurs 43:12. https://doi.org/10.7202/004638ar
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Other Applications of Comparable Corpora. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_7
DOI: https://doi.org/10.1007/978-3-031-31384-4_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31383-7
Online ISBN: 978-3-031-31384-4
eBook Packages: Synthesis Collection of Technology (R0), eBColl Synthesis Collection 12