
Abstract

As explained in Chap. 1 and later developed in Chap. 6, Machine Translation (MT) engines need to be trained on large numbers of parallel sentences or segments. However, the quantity and diversity of existing parallel text are limited, which motivates the search for parallel sentences in comparable corpora. Exploring a larger share of the levels of comparability introduced in Sect. 1.2 gives access to a much larger source of multilingual data. Strongly comparable corpora such as Wikipedia entries [1, 62] or news text [2] are rife with parallel sentences and were among the first to be explored.


Notes

  1. See however [11], who performed an audit of some web-mined text collections and uncovered a number of quality issues.

  2. https://github.com/facebookresearch/LASER.

References

  1. Adafre SF, de Rijke M (2006) Finding similar sentences across multiple languages in Wikipedia. In: Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources. https://www.aclweb.org/anthology/W06-2810

  2. Munteanu DS, Marcu D (2005) Improving machine translation performance by exploiting non-parallel corpora. Comput Linguist 31(4):477–504


  3. Smith JR, Quirk C, Toutanova K (2010) Extracting parallel sentences from comparable corpora using document level alignment. In: Human language technologies: the 2010 annual conference of the North American chapter of the Association for Computational Linguistics, Los Angeles, CA, June 2010. Association for Computational Linguistics, pp 403–411. https://aclanthology.org/N10-1063

  4. Kay M, Roscheisen M (1988) Text-translation alignment. Technical report, Xerox Palo Alto Research Center


  5. Gale WA, Church KW (1993) A program for aligning sentences in bilingual corpora. Comput Linguist 19(1):75–102. https://aclanthology.org/J93-1004

  6. Wu D (1994) Aligning a parallel English-Chinese corpus statistically with lexical criteria. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics (ACL ’94), Stroudsburg, PA, USA. Association for Computational Linguistics, pp 80–87


  7. Melamed ID (1999) Bitext maps and alignments via pattern recognition. Comput Linguist 25(1):107–130


  8. Moore RC (2002) Fast and accurate sentence alignment of bilingual corpora. In: Machine translation: from research to real users, 5th conference of the association for machine translation in the Americas, Heidelberg, Germany. Springer, pp 135–144


  9. Sennrich R, Volk M (2011) Iterative, MT-based sentence alignment of parallel texts. In: Proceedings of the 18th Nordic conference of computational linguistics (NODALIDA 2011), Riga, Latvia, May 2011. Northern European Association for Language Technology (NEALT), pp 175–182. https://aclanthology.org/W11-4624

  10. Fluhr C, Bisson F, Elkateb F (2000) Parallel text alignment using crosslingual information retrieval techniques. Springer Netherlands, Dordrecht, pp 187–200. ISBN 978-94-017-2535-4. https://doi.org/10.1007/978-94-017-2535-4_9

  11. Caswell I, Kreutzer J, Wang L, Wahab A, van Esch D, Ulzii-Orshikh N, Tapo A, Subramani N, Sokolov A, Sikasote C et al (2021) Quality at a glance: an audit of web-crawled multilingual datasets. arXiv:2103.12028

  12. Agirre E, Banea C, Cer D, Diab M, Gonzalez-Agirre A, Mihalcea R, Rigau G, Wiebe J (2016) SemEval-2016 Task 1: semantic textual similarity, monolingual and cross-lingual evaluation. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), San Diego, CA, June 2016. Association for Computational Linguistics, pp 497–511. https://doi.org/10.18653/v1/S16-1081

  13. Zweigenbaum P, Sharoff S, Rapp R (2018) A multilingual dataset for evaluating parallel sentence extraction from comparable corpora. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA), pp 3828–3833. ISBN 979-10-95546-00-9. http://www.lrec-conf.org/proceedings/lrec2018/pdf/955.pdf

  14. Artetxe M, Schwenk H (2019) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans Assoc Comput Linguist 7:597–610


  15. Hu J, Ruder S, Siddhant A, Neubig G, Firat O, Johnson M (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In: Daumé H III, Singh A (eds) Proceedings of the 37th international conference on machine learning, volume 119 of Proceedings of machine learning research, pp 4411–4421. PMLR, 13–18 Jul 2020. https://proceedings.mlr.press/v119/hu20b.html

  16. Huang E, Socher R, Manning C, Ng A (2012) Improving word representations via global context and multiple word prototypes. In: Proceedings of the 50th annual meeting of the association for computational linguistics (Volume 1: Long Papers), Jeju Island, Korea, Jul 2012. Association for Computational Linguistics, pp 873–882. https://aclanthology.org/P12-1092

  17. Huang J, Cai X, Church K (2020) Improving bilingual lexicon induction for low frequency words. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), Online, Nov 2020. Association for Computational Linguistics, pp 1310–1314. https://doi.org/10.18653/v1/2020.emnlp-main.100. https://aclanthology.org/2020.emnlp-main.100

  18. Irvine A, Callison-Burch C (2016) End-to-end statistical machine translation with zero or small parallel texts. Nat Lang Eng 22(4):517–548. https://doi.org/10.1017/S1351324916000127


  19. Irvine A, Callison-Burch C (2017) A comprehensive analysis of bilingual lexicon induction. Comput Linguist 43(2):273–310


  20. Jakubina L, Langlais P (2016) BAD LUC@WMT 2016: a bilingual document alignment platform based on lucene. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, Berlin, Germany, Aug 2016. Association for Computational Linguistics, pp 703–709. https://doi.org/10.18653/v1/W16-2370. https://aclanthology.org/W16-2370

  21. Jantunen J (2002) Comparable corpora in translation studies: strengths and limitations. SKY J Linguist I5(43):105–117


  22. Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128. ISSN 0162-8828. https://doi.org/10.1109/TPAMI.2010.57

  23. Johnson J, Douze M, Jégou H (2021) Billion-scale similarity search with GPUs. IEEE Trans Big Data 7(3):535–547. https://doi.org/10.1109/TBDATA.2019.2921572

  24. Utiyama M, Isahara H (2003) Reliable measures for aligning Japanese-English news articles and sentences. In: Proceedings of the 41st annual meeting of the association for computational linguistics, Sapporo, Japan, Jul 2003. Association for Computational Linguistics, pp 72–79. https://doi.org/10.3115/1075096.1075106. https://aclanthology.org/P03-1010

  25. Fung P, Cheung P (2004) Mining very-non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In: Proceedings of the 2004 conference on empirical methods in natural language processing, Barcelona, Spain, Jul 2004. Association for Computational Linguistics, pp 57–63. https://aclanthology.org/W04-3208

  26. Zhao B, Vogel S (2002) Adaptive parallel sentences mining from web bilingual news collection. In: Proceedings of the 2002 IEEE international conference on data mining, pp 745–748. https://doi.org/10.1109/ICDM.2002.1184044

  27. Schwenk H (2018) Filtering and mining parallel data in a joint multilingual space. In: Proceedings of the ACL, Melbourne, Australia, Jul 2018. Association for Computational Linguistics, pp 228–234. https://doi.org/10.18653/v1/P18-2037. https://www.aclweb.org/anthology/P18-2037

  28. Hangya V, Braune F, Kalasouskaya Y, Fraser A (2018) Unsupervised parallel sentence extraction from comparable corpora. In: Proceedings of the 15th international workshop on spoken language translation (IWSLT), pp 7–13. https://www.cis.uni-muenchen.de/~fraser/pubs/hangya_iwslt2018.pdf

  29. Harris Z (1954) Distributional structure. Word 10(2–3):146–162


  30. Hermann KM, Blunsom P (2014) Multilingual models for compositional distributed semantics. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, Baltimore, Maryland, June 2014. Association for Computational Linguistics, pp 58–68. https://doi.org/10.3115/v1/P14-1006. https://aclanthology.org/P14-1006

  31. Hoshen Y, Wolf L (2018) Non-adversarial unsupervised word translation. In: Proceedings of the 2018 conference on empirical methods in natural language processing, Brussels, Belgium, Oct 2018. Association for Computational Linguistics, pp 469–478. https://doi.org/10.18653/v1/D18-1043. https://aclanthology.org/D18-1043

  32. Abdul-Rauf S, Schwenk H (2009) Exploiting comparable corpora with TER and TERp. In: Proceedings of the 2nd workshop on building and using comparable corpora: from parallel to non-parallel corpora (BUCC), Singapore, Aug 2009. Association for Computational Linguistics, pp 46–54. https://aclanthology.org/W09-3109

  33. Etchegoyhen T, Azpeitia A (2016) Set-theoretic alignment for comparable corpora. In: Proceedings of the 54th annual meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, Aug 2016. Association for Computational Linguistics, pp 2009–2018. https://doi.org/10.18653/v1/P16-1189. https://aclanthology.org/P16-1189

  34. Brown PF, Cocke J, Della Pietra SA, Della Pietra VJ, Jelinek F, Lafferty JD, Mercer RL, Roossin PS (1990) A statistical approach to machine translation. Comput Linguist 16(2):79–85


  35. Buck C, Koehn P (2016) Findings of the WMT 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, Berlin, Germany, Aug 2016. Association for Computational Linguistics, pp 554–563. https://doi.org/10.18653/v1/W16-2347. https://aclanthology.org/W16-2347

  36. Cai X, Huang J, Bian Y, Church K (2021) Isotropy in the contextual embedding space: clusters and manifolds. In: International conference on learning representations


  37. Grover J, Mitra P (2017) Bilingual word embeddings with bucketed CNN for parallel sentence extraction. In: Proceedings of ACL 2017, student research workshop, Vancouver, Canada, Jul 2017. Association for Computational Linguistics, pp 11–16. https://aclanthology.org/P17-3003

  38. Schwenk H, Douze M (2017) Learning joint multilingual sentence representations with neural machine translation. In: Proceedings of the 2nd workshop on representation learning for NLP, Vancouver, Canada, Aug 2017. Association for Computational Linguistics, pp 157–167. https://doi.org/10.18653/v1/W17-2619. https://aclanthology.org/W17-2619

  39. Artetxe M, Schwenk H (2019) Margin-based parallel corpus mining with multilingual sentence embeddings. In: Proceedings of the 57th annual meeting of the association for computational linguistics, Florence, Italy, Jul 2019. Association for Computational Linguistics, pp 3197–3203. https://doi.org/10.18653/v1/P19-1309. https://aclanthology.org/P19-1309

  40. Grégoire F, Langlais P (2018) Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation. In: Proceedings of the 27th international conference on computational linguistics, Santa Fe, New Mexico, USA, Aug 2018. Association for Computational Linguistics, pp 1442–1453. https://aclanthology.org/C18-1122

  41. Kiros R, Zhu Y, Salakhutdinov RR, Zemel R, Urtasun R, Torralba A, Fidler S (2015) Skip-thought vectors. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R (eds), Advances in neural information processing systems, volume 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/f442d33fa06832082290ad8544a8da27-Paper.pdf

  42. Klementiev A, Titov I, Bhattarai B (2012) Inducing crosslingual distributed representations of words. In: Proceedings of the COLING, Mumbai, India


  43. Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the MT Summit X


  44. Koehn P (2010) Statistical machine translation. Cambridge University Press. ISBN 9780521874151. https://books.google.gr/books?id=4v_Cx1wIMLkC

  45. Koehn P (2020) Neural machine translation. Cambridge University Press. ISBN 9781108497329. https://books.google.gr/books?id=mdDqygEACAAJ

  46. Koppel M, Ordan N (2011) Translationese and its dialects. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, June 2011. Association for Computational Linguistics, pp 1318–1326. https://www.aclweb.org/anthology/P11-1132

  47. McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. arXiv:1708.00107

  48. Artetxe M, Labaka G, Agirre E (2017) Learning bilingual word embeddings with (almost) no bilingual data. In: Proceedings of the ACL, Vancouver, pp 451–462


  49. Lample G, Conneau A, Ranzato MA, Denoyer L, Jégou H (2018) Word translation without parallel data. In: Proceedings of the international conference on learning representations


  50. Laroche A, Langlais P (2010) Revisiting context-based projection methods for term-translation spotting in comparable corpora. In: Proceedings of the 23rd international conference on computational linguistics (Coling 2010), Beijing, China, Aug 2010. Coling 2010 Organizing Committee, pp 617–625. https://aclanthology.org/C10-1070

  51. Alvarez-Melis D, Jaakkola T (2018) Gromov-Wasserstein alignment of word embedding spaces. In: Proceedings of the 2018 conference on empirical methods in natural language processing, Brussels, Belgium, Oct 2018. Association for Computational Linguistics, pp 1881–1890. https://doi.org/10.18653/v1/D18-1214. https://aclanthology.org/D18-1214

  52. Sun Y, Zhu S, Yifan F, Mi C (2021) Parallel sentences mining with transfer learning in an unsupervised setting. In: Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: student research workshop, Online, June 2021. Association for Computational Linguistics, pp 136–142. https://doi.org/10.18653/v1/2021.naacl-srw.17. https://aclanthology.org/2021.naacl-srw.17

  53. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019. Association for Computational Linguistics, pp 4171–4186. https://aclanthology.org/N19-1423

  54. Hangya V, Fraser A (2019) Unsupervised parallel sentence extraction with parallel segment detection helps machine translation. In: Proceedings of the ACL, Florence, Italy, Jul 2019. Association for Computational Linguistics, pp 1224–1234. https://doi.org/10.18653/v1/P19-1118. https://www.aclweb.org/anthology/P19-1118

  55. Artetxe M, Labaka G, Agirre E, Cho K (2018) Unsupervised neural machine translation. In: 6th international conference on learning representations, ICLR 2018, Vancouver, BC, Canada, 30 Apr–3 May 2018, conference track proceedings. OpenReview.net. https://openreview.net/forum?id=Sy2ogebAW

  56. Artetxe M, Labaka G, Agirre E (2019) An effective approach to unsupervised machine translation. arXiv:1902.01313

  57. Aston G, Burnard L (1998) The BNC handbook: exploring the British national corpus with SARA. Edinburgh University Press, Edinburgh


  58. Baayen H (2008) Analyzing linguistic data. Cambridge University Press, Cambridge


  59. Bański P, Gozdawa-Gołębiowski R (2010) Foreign language examination corpus for l2-learning studies. In: Proceedings of the workshop on building and using comparable corpora, Malta


  60. Conneau A, Lample G (2019) Cross-lingual language model pretraining. In: Conference on neural information processing systems, Vancouver, Canada, pp 7059–7069


  61. Nguyen TQ, Salazar J (2019) Transformers without tears: improving the normalization of self-attention. In: 16th international workshop on spoken language translation, Hong Kong, Nov 2019. Zenodo. https://doi.org/10.5281/zenodo.3525484

  62. Schwenk H, Chaudhary V, Sun S, Gong H, Guzmán F (2021) WikiMatrix: mining 135M parallel sentences in 1620 language pairs from Wikipedia. In: Proceedings of the 16th conference of the European chapter of the Association for Computational Linguistics: main volume, Online, Apr 2021. Association for Computational Linguistics, pp 1351–1361. https://aclanthology.org/2021.eacl-main.115

  63. Sarikaya R, Maskey S, Zhang R, Jan E-E, Wang D, Ramabhadran B, Roukos S (2009) Iterative sentence-pair extraction from quasi-parallel corpora for machine translation. In: Proceedings of the Interspeech 2009, pp 432–435. https://doi.org/10.21437/Interspeech.2009-156

  64. Schwenk H, Wenzek G, Edunov S, Grave E, Joulin A, Fan A (2021) CCMatrix: mining billions of high-quality parallel sentences on the web. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers), Online, Aug 2021. Association for Computational Linguistics, pp 6490–6500. https://aclanthology.org/2021.acl-long.507

  65. Munteanu DS, Fraser A, Marcu D (2004) Improved machine translation performance via parallel sentence extraction from comparable corpora. In: Proceedings of the human language technology conference of the North American chapter of the association for computational linguistics: HLT-NAACL 2004, Boston, MA, USA, May 2004. Association for Computational Linguistics, pp 265–272. https://aclanthology.org/N04-1034

  66. Stefănescu D, Ion R, Hunsicker S (2012) Hybrid parallel sentence mining from comparable corpora. In: Proceedings of the 16th annual conference of the European association for machine translation, Trento, Italy, May 2012. European Association for Machine Translation, pp 137–144. https://aclanthology.org/2012.eamt-1.37

  67. Sánchez-Cartagena VM, Bañón M, Ortiz-Rojas S, Ramírez G (2018) Prompsit’s submission to WMT 2018 parallel corpus filtering shared task. In: Proceedings of the third conference on machine translation: shared task papers, Brussels, Belgium, Oct 2018. Association for Computational Linguistics, pp 955–962. https://doi.org/10.18653/v1/W18-6488. https://aclanthology.org/W18-6488

  68. Fan A, Bhosale S, Schwenk H, Ma Z, El-Kishky A, Goyal S, Baines M, Celebi O, Wenzek G, Chaudhary V, Goyal N, Birch T, Liptchinsky V, Edunov S, Auli M, Joulin A (2021) Beyond English-centric multilingual machine translation. J Mach Learn Res 22(107):1–48. http://jmlr.org/papers/v22/20-1307.html

  69. Keung P, Salazar J, Lu Y, Smith NA (2020) Unsupervised bitext mining and translation via self-trained contextual embeddings. Trans Assoc Comput Linguist 8:828–841


  70. España-Bonet C, Varga ÁC, Barrón-Cedeño A, van Genabith J (2017) An empirical analysis of NMT-derived interlingual embeddings and their use in parallel sentence identification. IEEE J Sel Top Signal Process 11(8):1340–1350. https://doi.org/10.1109/JSTSP.2017.2764273

  71. Feng F, Yang Y, Cer D, Arivazhagan N, Wang W (2022) Language-agnostic BERT sentence embedding. In: Proceedings of the 60th annual meeting of the association for computational linguistics (Volume 1: Long Papers), Dublin, Ireland, May 2022. Association for Computational Linguistics, pp 878–891. https://doi.org/10.18653/v1/2022.acl-long.62. https://aclanthology.org/2022.acl-long.62


Author information


Corresponding author

Correspondence to Serge Sharoff.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Sharoff, S., Rapp, R., Zweigenbaum, P. (2023). Extraction of Parallel Sentences. In: Building and Using Comparable Corpora for Multilingual Natural Language Processing. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-31384-4_4
