Abstract
As explained in Chap. 1 and later developed in Chap. 6, Machine Translation (MT) engines need to be trained on large numbers of parallel sentences or segments. However, the quantity and diversity of existing parallel text are limited. This motivates the search for parallel sentences in comparable corpora. By exploring a larger share of the levels of comparability introduced in Sect. 1.2, a much larger source of multilingual data can be obtained. Strongly comparable corpora such as Wikipedia entries [1, 62] or news text [2] are rife with parallel sentences and were among the first to be explored.
Notes
- 1. See, however, [11], who performed an audit of some web-mined text collections and uncovered a number of quality issues.
- 2.
References
Adafre SF, de Rijke M (2006) Finding similar sentences across multiple languages in Wikipedia. In: Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources. https://www.aclweb.org/anthology/W06-2810
Munteanu DS, Marcu D (2005) Improving machine translation performance by exploiting non-parallel corpora. Comput Linguist 31(4):477–504
Smith JR, Quirk C, Toutanova K (2010) Extracting parallel sentences from comparable corpora using document level alignment. In: Human language technologies: the 2010 annual conference of the North American chapter of the Association for Computational Linguistics, Los Angeles, CA, June 2010. Association for Computational Linguistics, pp 403–411. https://aclanthology.org/N10-1063
Kay M, Roscheisen M (1988) Text-translation alignment. Technical report, Xerox Palo Alto Research Center
Gale WA, Church KW (1993) A program for aligning sentences in bilingual corpora. Comput Linguist 19(1):75–102. https://aclanthology.org/J93-1004
Wu D (1994) Aligning a parallel English-Chinese corpus statistically with lexical criteria. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics (ACL ’94), Stroudsburg, PA, USA. Association for Computational Linguistics, pp 80–87
Melamed ID (1999) Bitext maps and alignments via pattern recognition. Comput Linguist 25(1):107–130
Moore RC (2002) Fast and accurate sentence alignment of bilingual corpora. In: Machine translation: from research to real users, 5th conference of the association for machine translation in the Americas, Heidelberg, Germany. Springer, pp 135–144
Sennrich R, Volk M (2011) Iterative, MT-based sentence alignment of parallel texts. In: Proceedings of the 18th Nordic conference of computational linguistics (NODALIDA 2011), Riga, Latvia, May 2011. Northern European Association for Language Technology (NEALT), pp 175–182. https://aclanthology.org/W11-4624
Fluhr C, Bisson F, Elkateb F (2000) Parallel text alignment using crosslingual information retrieval techniques. Springer Netherlands, Dordrecht, pp 187–200. ISBN 978-94-017-2535-4. https://doi.org/10.1007/978-94-017-2535-4_9
Caswell I, Kreutzer J, Wang L, Wahab A, van Esch D, Ulzii-Orshikh N, Tapo A, Subramani N, Sokolov A, Sikasote C et al (2021) Quality at a glance: an audit of web-crawled multilingual datasets. arXiv:2103.12028
Agirre E, Banea C, Cer D, Diab M, Gonzalez-Agirre A, Mihalcea R, Rigau G, Wiebe J (2016) SemEval-2016 Task 1: semantic textual similarity, monolingual and cross-lingual evaluation. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), San Diego, CA, June 2016. Association for Computational Linguistics, pp 497–511. https://doi.org/10.18653/v1/S16-1081
Zweigenbaum P, Sharoff S, Rapp R (2018) A multilingual dataset for evaluating parallel sentence extraction from comparable corpora. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA), pp 3828–3833. ISBN 979-10-95546-00-9. http://www.lrec-conf.org/proceedings/lrec2018/pdf/955.pdf
Artetxe M, Schwenk H (2019) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans Assoc Comput Linguist 7:597–610
Hu J, Ruder S, Siddhant A, Neubig G, Firat O, Johnson M (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In: Daumé H III, Singh A (eds), Proceedings of the 37th international conference on machine learning, volume 119 of Proceedings of machine learning research, pp 4411–4421. PMLR, 13–18 Jul 2020. https://proceedings.mlr.press/v119/hu20b.html
Huang E, Socher R, Manning C, Ng A (2012) Improving word representations via global context and multiple word prototypes. In: Proceedings of the 50th annual meeting of the association for computational linguistics (Volume 1: Long Papers), Jeju Island, Korea, Jul 2012. Association for Computational Linguistics, pp 873–882. https://aclanthology.org/P12-1092
Huang J, Cai X, Church K (2020) Improving bilingual lexicon induction for low frequency words. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), Online, Nov 2020. Association for Computational Linguistics, pp 1310–1314. https://doi.org/10.18653/v1/2020.emnlp-main.100. https://aclanthology.org/2020.emnlp-main.100
Irvine A, Callison-Burch C (2016) End-to-end statistical machine translation with zero or small parallel texts. Nat Lang Eng 22(4):517–548. https://doi.org/10.1017/S1351324916000127
Irvine A, Callison-Burch C (2017) A comprehensive analysis of bilingual lexicon induction. Comput Linguist 43(2):273–310
Jakubina L, Langlais P (2016) BAD LUC@WMT 2016: a bilingual document alignment platform based on lucene. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, Berlin, Germany, Aug 2016. Association for Computational Linguistics, pp 703–709. https://doi.org/10.18653/v1/W16-2370. https://aclanthology.org/W16-2370
Jantunen J (2002) Comparable corpora in translation studies: strengths and limitations. SKY J Linguist 15(43):105–117
Jegou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128. ISSN 0162-8828. https://doi.org/10.1109/TPAMI.2010.57
Johnson J, Douze M, Jégou H (2021) Billion-scale similarity search with GPUs. IEEE Trans Big Data 7(3):535–547. https://doi.org/10.1109/TBDATA.2019.2921572
Utiyama M, Isahara H (2003) Reliable measures for aligning Japanese-English news articles and sentences. In: Proceedings of the 41st annual meeting of the association for computational linguistics, Sapporo, Japan, Jul 2003. Association for Computational Linguistics, pp 72–79. https://doi.org/10.3115/1075096.1075106. https://aclanthology.org/P03-1010
Fung P, Cheung P (2004) Mining very-non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In: Proceedings of the 2004 conference on empirical methods in natural language processing, Barcelona, Spain, Jul 2004. Association for Computational Linguistics, pp 57–63. https://aclanthology.org/W04-3208
Zhao B, Vogel S (2002) Adaptive parallel sentences mining from web bilingual news collection. In: Proceedings of the 2002 IEEE international conference on data mining, pp 745–748. https://doi.org/10.1109/ICDM.2002.1184044
Schwenk H (2018) Filtering and mining parallel data in a joint multilingual space. In: Proceedings of the ACL, Melbourne, Australia, Jul 2018. Association for Computational Linguistics, pp 228–234. https://doi.org/10.18653/v1/P18-2037. https://www.aclweb.org/anthology/P18-2037
Hangya V, Braune F, Kalasouskaya Y, Fraser A (2018) Unsupervised parallel sentence extraction from comparable corpora. In: Proceedings of the 15th international workshop on spoken language translation (IWSLT), pp 7–13. https://www.cis.uni-muenchen.de/~fraser/pubs/hangya_iwslt2018.pdf
Harris Z (1954) Distributional structure. Word 10(2–3):146–162
Hermann KM, Blunsom P (2014) Multilingual models for compositional distributed semantics. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, Baltimore, Maryland, June 2014. Association for Computational Linguistics, pp 58–68. https://doi.org/10.3115/v1/P14-1006. https://aclanthology.org/P14-1006
Hoshen Y, Wolf L (2018) Non-adversarial unsupervised word translation. In: Proceedings of the 2018 conference on empirical methods in natural language processing, Brussels, Belgium, Oct 2018. Association for Computational Linguistics, pp 469–478. https://doi.org/10.18653/v1/D18-1043. https://aclanthology.org/D18-1043
Abdul-Rauf S, Schwenk H (2009) Exploiting comparable corpora with TER and TERp. In: Proceedings of the 2nd workshop on building and using comparable corpora: from parallel to non-parallel corpora (BUCC), Singapore, Aug 2009. Association for Computational Linguistics, pp 46–54. https://aclanthology.org/W09-3109
Etchegoyhen T, Azpeitia A (2016) Set-theoretic alignment for comparable corpora. In: Proceedings of the 54th annual meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, Aug 2016. Association for Computational Linguistics, pp 2009–2018. https://doi.org/10.18653/v1/P16-1189. https://aclanthology.org/P16-1189
Brown PF, Cocke J, Della Pietra SA, Della Pietra VJ, Jelinek F, Lafferty JD, Mercer RL, Roossin PS (1990) A statistical approach to machine translation. Comput Linguist 16(2):79–85
Buck C, Koehn P (2016) Findings of the WMT 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, Berlin, Germany, Aug 2016. Association for Computational Linguistics, pp 554–563. https://doi.org/10.18653/v1/W16-2347. https://aclanthology.org/W16-2347
Cai X, Huang J, Bian Y, Church K (2021) Isotropy in the contextual embedding space: clusters and manifolds. In: International conference on learning representations
Grover J, Mitra P (2017) Bilingual word embeddings with bucketed CNN for parallel sentence extraction. In: Proceedings of ACL 2017, student research workshop, Vancouver, Canada, Jul 2017. Association for Computational Linguistics, pp 11–16. https://aclanthology.org/P17-3003
Schwenk H, Douze M (2017) Learning joint multilingual sentence representations with neural machine translation. In: Proceedings of the 2nd workshop on representation learning for NLP, Vancouver, Canada, Aug 2017. Association for Computational Linguistics, pp 157–167. https://doi.org/10.18653/v1/W17-2619. https://aclanthology.org/W17-2619
Artetxe M, Schwenk H (2019) Margin-based parallel corpus mining with multilingual sentence embeddings. In: Proceedings of the 57th annual meeting of the association for computational linguistics, Florence, Italy, Jul 2019. Association for Computational Linguistics, pp 3197–3203. https://doi.org/10.18653/v1/P19-1309. https://aclanthology.org/P19-1309
Grégoire F, Langlais P (2018) Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation. In: Proceedings of the 27th international conference on computational linguistics, Santa Fe, New Mexico, USA, Aug 2018. Association for Computational Linguistics, pp 1442–1453. https://aclanthology.org/C18-1122
Kiros R, Zhu Y, Salakhutdinov RR, Zemel R, Urtasun R, Torralba A, Fidler S (2015) Skip-thought vectors. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R (eds), Advances in neural information processing systems, volume 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/f442d33fa06832082290ad8544a8da27-Paper.pdf
Klementiev A, Titov I, Bhattarai B (2012) Inducing crosslingual distributed representations of words. In: Proceedings of the COLING, Mumbai, India
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the MT Summit X
Koehn P (2010) Statistical machine translation. Cambridge University Press. ISBN 9780521874151. https://books.google.gr/books?id=4v_Cx1wIMLkC
Koehn P (2020) Neural machine translation. Cambridge University Press. ISBN 9781108497329. https://books.google.gr/books?id=mdDqygEACAAJ
Koppel M, Ordan N (2011) Translationese and its dialects. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, June 2011. Association for Computational Linguistics, pp 1318–1326. https://www.aclweb.org/anthology/P11-1132
McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. arXiv:1708.00107
Artetxe M, Labaka G, Agirre E (2017) Learning bilingual word embeddings with (almost) no bilingual data. In: Proceedings of the ACL, Vancouver, pp 451–462
Lample G, Conneau A, Ranzato MA, Denoyer L, Jégou H (2018) Word translation without parallel data. In: Proceedings of the international conference on learning representations
Laroche A, Langlais P (2010) Revisiting context-based projection methods for term-translation spotting in comparable corpora. In: Proceedings of the 23rd international conference on computational linguistics (Coling 2010), Beijing, China, Aug 2010. Coling 2010 Organizing Committee, pp 617–625. https://aclanthology.org/C10-1070
Alvarez-Melis D, Jaakkola T (2018) Gromov-Wasserstein alignment of word embedding spaces. In: Proceedings of the 2018 conference on empirical methods in natural language processing, Brussels, Belgium, Oct 2018. Association for Computational Linguistics, pp 1881–1890. https://doi.org/10.18653/v1/D18-1214. https://aclanthology.org/D18-1214
Sun Y, Zhu S, Yifan F, Mi C (2021) Parallel sentences mining with transfer learning in an unsupervised setting. In: Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: student research workshop, Online, June 2021. Association for Computational Linguistics, pp 136–142. https://doi.org/10.18653/v1/2021.naacl-srw.17. https://aclanthology.org/2021.naacl-srw.17
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019. Association for Computational Linguistics, pp 4171–4186. https://aclanthology.org/N19-1423
Hangya V, Fraser A (2019) Unsupervised parallel sentence extraction with parallel segment detection helps machine translation. In: Proceedings of the ACL, Florence, Italy, Jul 2019. Association for Computational Linguistics, pp 1224–1234. https://doi.org/10.18653/v1/P19-1118. https://www.aclweb.org/anthology/P19-1118
Artetxe M, Labaka G, Agirre E, Cho K (2018) Unsupervised neural machine translation. In: 6th international conference on learning representations, ICLR 2018, Vancouver, BC, Canada, Apr 30–May 3, 2018, conference track proceedings. OpenReview.net. https://openreview.net/forum?id=Sy2ogebAW
Artetxe M, Labaka G, Agirre E (2019) An effective approach to unsupervised machine translation. arXiv:1902.01313
Aston G, Burnard L (1998) The BNC handbook: exploring the British national corpus with SARA. Edinburgh University Press, Edinburgh
Baayen H (2008) Analyzing linguistic data. Cambridge University Press, Cambridge
Bański P, Gozdawa-Gołębiowski R (2010) Foreign language examination corpus for L2-learning studies. In: Proceedings of the workshop on building and using comparable corpora, Malta
Conneau A, Lample G (2019) Cross-lingual language model pretraining. In: Conference on neural information processing systems, Vancouver, Canada, pp 7059–7069
Nguyen TQ, Salazar J (2019) Transformers without tears: improving the normalization of self-attention. In: 16th international workshop on spoken language translation, Hong Kong, Nov 2019. Zenodo. https://doi.org/10.5281/zenodo.3525484
Schwenk H, Chaudhary V, Sun S, Gong H, Guzmán F (2021) WikiMatrix: mining 135M parallel sentences in 1620 language pairs from Wikipedia. In: Proceedings of the 16th conference of the European chapter of the Association for Computational Linguistics: main volume, Online, Apr 2021. Association for Computational Linguistics, pp 1351–1361. https://aclanthology.org/2021.eacl-main.115
Sarikaya R, Maskey S, Zhang R, Jan E-E, Wang D, Ramabhadran B, Roukos S (2009) Iterative sentence-pair extraction from quasi-parallel corpora for machine translation. In: Proceedings of the Interspeech 2009, pp 432–435. https://doi.org/10.21437/Interspeech.2009-156
Schwenk H, Wenzek G, Edunov S, Grave E, Joulin A, Fan A (2021) CCMatrix: mining billions of high-quality parallel sentences on the web. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers), Online, Aug 2021. Association for Computational Linguistics, pp 6490–6500. https://aclanthology.org/2021.acl-long.507
Munteanu DS, Fraser A, Marcu D (2004) Improved machine translation performance via parallel sentence extraction from comparable corpora. In: Proceedings of the human language technology conference of the North American chapter of the association for computational linguistics: HLT-NAACL 2004, Boston, MA, USA, May 2004. Association for Computational Linguistics, pp 265–272. https://aclanthology.org/N04-1034
Stefănescu D, Ion R, Hunsicker S (2012) Hybrid parallel sentence mining from comparable corpora. In: Proceedings of the 16th annual conference of the European association for machine translation, Trento, Italy, May 2012. European Association for Machine Translation, pp 137–144. https://aclanthology.org/2012.eamt-1.37
Sánchez-Cartagena VM, Bañón M, Ortiz-Rojas S, Ramírez G (2018) Prompsit’s submission to WMT 2018 parallel corpus filtering shared task. In: Proceedings of the third conference on machine translation: shared task papers, Brussels, Belgium, Oct 2018. Association for Computational Linguistics, pp 955–962. https://doi.org/10.18653/v1/W18-6488. https://aclanthology.org/W18-6488
Fan A, Bhosale S, Schwenk H, Ma Z, El-Kishky A, Goyal S, Baines M, Celebi O, Wenzek G, Chaudhary V, Goyal N, Birch T, Liptchinsky V, Edunov S, Auli M, Joulin A (2021) Beyond English-centric multilingual machine translation. J Mach Learn Res 22(107):1–48. http://jmlr.org/papers/v22/20-1307.html
Keung P, Salazar J, Lu Y, Smith NA (2020) Unsupervised bitext mining and translation via self-trained contextual embeddings. Trans Assoc Comput Linguist 8:828–841
España-Bonet C, Varga ÁC, Barrón-Cedeño A, van Genabith J (2017) An empirical analysis of NMT-derived interlingual embeddings and their use in parallel sentence identification. IEEE J Sel Top Signal Process 11(8):1340–1350. https://doi.org/10.1109/JSTSP.2017.2764273
Feng F, Yang Y, Cer D, Arivazhagan N, Wang W (2022) Language-agnostic BERT sentence embedding. In: Proceedings of the 60th annual meeting of the association for computational linguistics (Volume 1: Long Papers), Dublin, Ireland, May 2022. Association for Computational Linguistics, pp 878–891. https://doi.org/10.18653/v1/2022.acl-long.62. https://aclanthology.org/2022.acl-long.62
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Extraction of Parallel Sentences. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_4
DOI: https://doi.org/10.1007/978-3-031-31384-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31383-7
Online ISBN: 978-3-031-31384-4
eBook Packages: Synthesis Collection of Technology (R0), eBColl Synthesis Collection 12