Parallel Corpora

A Practical Handbook of Corpus Linguistics

Abstract

This chapter gives an overview of parallel corpora, i.e. corpora containing source texts in a given language, aligned with their translations in another language. More specifically, it focuses on directional corpora, i.e. parallel corpora where the source and target languages are clearly identified. These types of corpora are widely used in contrastive linguistics and translation studies. The chapter first outlines the key features of parallel corpora (they typically contain written texts translated by expert translators working into their native language) and describes the main methods of parallel corpus analysis, including the combined use of parallel and comparable corpora. It then examines the major challenges linked with the design of parallel corpora (text availability, metadata collection, bitext alignment, and multilingual linguistic annotation) and with their analysis (data scarcity, interpretation of the results, and infelicitous translations). Finally, the chapter shows how these challenges can be overcome, most notably by compiling balanced, richly documented parallel corpora and by cross-fertilizing insights from cross-linguistic research and natural language processing.

Notes

  1. https://www.monde-diplomatique.fr/diplo/int/. Accessed 22 May 2019.

  2. https://uclouvain.be/en/research-institutes/ilc/cecl/pleci.html. Accessed 22 May 2019.

  3. https://uclouvain.be/en/research-institutes/ilc/cecl/mult-ed.html. Accessed 22 May 2019.

  4. https://www.alc.manchester.ac.uk/translation-and-intercultural-studies/research/projects/translational-english-corpus-tec/. Accessed 22 May 2019.

  5. Unsurprisingly, it is far from easy to obtain copyright clearance for texts to be included in parallel corpora. For this reason, many parallel corpora are not publicly available (e.g. ENPC, PLECI, Raf Salkie’s INTERSECT, P-ACTRES, CroCo).

  6. http://nl.ijs.si/ME/V4/msd/html/index.html. Accessed 22 May 2019.

  7. http://nl.ijs.si/spook/msd/html-en/. Accessed 22 May 2019.

  8. http://universaldependencies.org/. Accessed 22 May 2019.

  9. The practice of translating the European Parliament proceedings into all EU languages was discontinued in the second half of 2011. The verbatim reports of the plenary sittings are still made available on the European Parliament website, but the written-up versions of the speeches are only published in the languages in which the speeches were delivered.

  10. In this respect, it is important to stress that English is increasingly used as a lingua franca at the European Parliament. In other words, some of the speeches originally delivered in English are in fact given by non-native speakers of English (the same holds, albeit to a lesser extent, for other languages, such as French). This is not a trivial issue, as recent research indicates that the use of English as a Lingua Franca can have a considerable impact on translators’ (and interpreters’) outputs (see Albl-Mikasa 2017 for an overview of English as a Lingua Franca in translation and interpreting).

References

  • Aijmer, K., & Simon-Vandenbergen, A.-M. (2003). The discourse particle well and its equivalents in Swedish and Dutch. Linguistics, 41(6), 1123–1161.

  • Albl-Mikasa, M. (2017). ELF and translation/interpreting. In J. Jenkins, W. Baker, & M. Dewey (Eds.), The Routledge handbook of English as a Lingua Franca (pp. 369–384). London/New York: Routledge.

  • Altenberg, B. (1999). Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In H. Hasselgård & S. Oksefjell (Eds.), Out of corpora. Studies in honour of Stig Johansson (pp. 249–268). Amsterdam: Rodopi.

  • Altenberg, B., & Granger, S. (2002). Recent trends in cross-linguistic lexical studies. In B. Altenberg & S. Granger (Eds.), Lexis in contrast. Corpus-based approaches (pp. 3–48). Amsterdam/Philadelphia: John Benjamins.

  • Assis Rosa, A., Pięta, H., & Bueno Maia, R. (2017). Theoretical, methodological and terminological issues regarding indirect translation: An overview. Translation Studies, 10(2), 113–132.

  • Augustinus, L., Vandeghinste, V., & Vanallemeersch, T. (2016). Poly-GrETEL: Cross-lingual example-based querying of syntactic constructions. In Proceedings of the tenth international conference on language resources and evaluation (LREC 2016) (pp. 3549–3554). European Language Resources Association (ELRA).

  • Baisa, V., Michelfeit, J., Medved, M., & Jakubíček, M. (2016). European Union language resources in Sketch Engine. In Proceedings of tenth international conference on language resources and evaluation (LREC’16). European Language Resources Association (ELRA).

  • Baker, M. (1993). Corpus linguistics and translation studies. Implications and applications. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology. In honour of John Sinclair (pp. 233–250). Amsterdam: John Benjamins.

  • Baker, M. (1995). Corpora in translation studies: An overview and some suggestions for future research. Target, 7(2), 223–243.

  • Baños, R., Bruti, S., & Zanotti, S. (Eds.). (2013). Corpus linguistics and audiovisual translation: In search of an integrated approach. Special issue of Perspectives, 21(4).

  • Beeby Lonsdale, A. (2009). Directionality. In M. Baker & G. Saldanha (Eds.), Routledge encyclopedia of translation studies (pp. 84–88). Abingdon: Routledge.

  • Benko, V. (2016). Two years of Aranea: Increasing counts and tuning the pipeline. In Proceedings of 10th international conference on language resources and evaluation (LREC’16) (pp. 4245–4248). European Language Resources Association (ELRA).

  • Bernardini, S. (2011). Monolingual comparable corpora and parallel corpora in the search for features of translated language. SYNAPS, 26, 2–13.

  • Bernardini, S., Ferraresi, A., Russo, M., Collard, C., & Defrancq, B. (2018). Building interpreting and intermodal corpora: A How-to for a formidable task. In M. Russo, C. Bendazzoli, & B. Defrancq (Eds.), Making way in corpus-based interpreting studies (pp. 21–42). Springer.

  • Bojar, O., Žabokrtský, Z., Dušek, O., Galuščáková, P., Majliš, M., Mareček, D., Maršík, J., Novák, M., Popel, M., & Tamchyna, A. (2012). The joy of parallelism with CzEng 1.0. In Proceedings of the 8th international conference on language resources and evaluation (LREC-2012) (pp. 3921–3928). European Language Resources Association (ELRA).

  • Bowker, L., & Bennison, P. (2003). Student translation archive and student translation tracking system. Design, development and application. In F. Zanettin, S. Bernardini, & D. Stewart (Eds.), Corpora in translator education (pp. 103–117). Manchester: St. Jerome Publishing.

  • Cappelle, B., & Loock, R. (2013). Is there interference of usage constraints? A frequency study of existential there is and its French equivalent il y a in translated vs. non-translated texts. Target, 25(2), 252–275.

  • Cartoni, B., & Meyer, T. (2012). Extracting directional and comparable corpora from a multilingual corpus for translation studies. In Proceedings of the 8th international conference on language resources and evaluation (LREC-2012) (pp. 2132–2137). European Language Resources Association (ELRA).

  • Castagnoli, S., Ciobanu, D., Kübler, N., Kunz, K., & Volanschi, A. (2011). Designing a learner translator corpus for training purposes. In N. Kübler (Ed.), Corpora, language, teaching, and resources: From theory to practice (pp. 221–248). Bern: Peter Lang.

  • Čermák, F., & Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427.

  • Cettolo, M., Girardi, C., & Federico, M. (2012). WIT3: Web inventory of transcribed and translated talks. Proceedings of EAMT, 261–268.

  • De Sutter, G., Lefer, M.-A., & Delaere, I. (Eds.). (2017). Empirical translation studies: New methodological and theoretical traditions. Berlin/Boston: De Gruyter Mouton.

  • Delaere, I., & De Sutter, G. (2017). Variability of English loanword use in Belgian Dutch translations: Measuring the effect of source language, register, and editorial intervention. In G. De Sutter, M.-A. Lefer, & I. Delaere (Eds.), Empirical translation studies: New methodological and theoretical traditions (pp. 81–112). Berlin/Boston: De Gruyter Mouton.

  • Dupont, M., & Zufferey, S. (2017). Methodological issues in the use of directional parallel corpora. A case study of English and French concessive connectives. International Journal of Corpus Linguistics, 22(2), 270–297.

  • Dyer, C., Chahuneau, V., & Smith, N. A. (2013). A simple, fast, and effective reparameterization of IBM model 2. Proceedings of NAACL-HLT, 2013, 644–648.

  • Espunya, A. (2014). The UPF learner translation corpus as a resource for translator training. Language Resources and Evaluation, 48(1), 33–43.

  • Evans, N., & Levinson, S. C. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32, 429–492.

  • Evert, S., & Neumann, S. (2017). The impact of translation direction on characteristics of translated texts. A multivariate analysis for English and German. In G. De Sutter, M.-A. Lefer, & I. Delaere (Eds.), Empirical translation studies: New methodological and theoretical traditions (pp. 47–80). Berlin/Boston: De Gruyter Mouton.

  • Ferraresi, A., & Bernardini, S. (2019). Building EPTIC: A many-sided, multi-purpose corpus of EU parliament proceedings. In M. T. S. Nieto & I. Doval (Eds.), Parallel corpora: Creation and application. Amsterdam/Philadelphia: John Benjamins.

  • Fišer, D., Lenardič, J., & Erjavec, T. (2018). CLARIN’s key resource families. In Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) (pp. 1320–1325).

  • Fløttum, K., Dahl, T., Didriksen, A. A., & Gjesdal, A. M. (2013). KIAP – Reflections on a complex corpus. Bergen Language and Linguistics Studies, 3(1), 137–150.

  • Frankenberg-Garcia, A., & Santos, D. (2003). Introducing COMPARA: The Portuguese-English parallel corpus. In F. Zanettin, S. Bernardini, & D. Stewart (Eds.), Corpora in translator education (pp. 71–87). Manchester: St. Jerome Publishing.

  • Granger, S., & Lefer, M.-A. (2016). From general to learners’ bilingual dictionaries: Towards a more effective fulfillment of advanced learners’ phraseological needs. International Journal of Lexicography, 29(3), 279–295.

  • Granger, S., & Lefer, M.-A. (2020). The multilingual student translation corpus: A resource for translation teaching and research. Language Resources and Evaluation: Online First.

  • Granger, S., Lerot, J., & Petch-Tyson, S. (Eds.). (2003). Corpus-based approaches to contrastive linguistics and translation studies. Amsterdam/New York: Rodopi.

  • Halverson, S. L. (2015). The status of contrastive data in translation studies. Across Languages and Cultures, 16(2), 163–185.

  • Hansen-Schirra, S., Neumann, S., & Steiner, E. (2012). Cross-linguistic corpora for the study of translations. Insights from the language pair English-German. Berlin: De Gruyter.

  • Izquierdo, M., Hofland, K., & Reigem, Ø. (2008). The ACTRES parallel corpus: An English–Spanish translation corpus. Corpora, 3(1), 31–41.

  • Jimenez Hurtado, C., & Soler Gallego, S. (2013). Multimodality, translation and accessibility: A corpus-based study of audio description. Perspectives, 21(4), 577–594.

  • Johansson, S. (2007). Seeing through multilingual corpora. On the use of corpora in contrastive studies. Amsterdam/Philadelphia: John Benjamins.

  • Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. Proceedings of MT Summit X, 79–86.

  • Kruger, A., Wallmach, K., & Munday, J. (Eds.). (2011). Corpus-based translation studies. Research and applications. London/New York: Bloomsbury.

  • Kutuzov, A., & Kunilovskaya, M. (2014). Russian learner translator corpus. Design, research potential and applications. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, speech and dialogue. TSD 2014 (pp. 315–323). Springer.

  • Lapshinova-Koltunski, E. (2017). Exploratory analysis of dimensions influencing variation in translation. The case of text register and translation method. In G. De Sutter, M.-A. Lefer, & I. Delaere (Eds.), Empirical translation studies: New methodological and theoretical traditions (pp. 207–234). Berlin/Boston: De Gruyter Mouton.

  • Lefer, M.-A., & Grabar, N. (2015). Super-creative and over-bureaucratic: A cross-genre corpus-based study on the use and translation of evaluative prefixation in TED talks and EU parliamentary debates. Across Languages and Cultures, 16(2), 187–208.

  • Levshina, N. (2016). Verbs of letting in Germanic and Romance languages: A quantitative investigation based on a parallel corpus of film subtitles. Languages in Contrast, 16(1), 84–117.

  • Macken, L., De Clercq, O., & Paulussen, H. (2011). Dutch parallel corpus: A balanced copyright-cleared parallel corpus. Meta, 56(2), 374–390.

  • Mauranen, A. (2005). Contrasting languages and varieties with translational corpora. Languages in Contrast, 5(1), 73–92.

  • Meurant, L., Gobert, M., & Cleve, A. (2016). Modelling a parallel corpus of French and French Belgian sign language. In Proceedings of the 10th edition of the language resources and evaluation conference (LREC 2016) (pp. 4236–4240).

  • Mezeg, A. (2010). Compiling and using a French-Slovenian parallel corpus. In R. Xiao (Ed.), Proceedings of the international symposium on using corpora in contrastive and translation studies (UCCTS 2010) (pp. 1–27). Ormskirk: Edge Hill University.

  • Mikhailov, M., & Cooper, R. (2016). Corpus linguistics for translation and contrastive studies. A guide for research. London/New York: Routledge.

  • Neumann, S. (2013). Contrastive register variation. A quantitative approach to the comparison of English and German. Berlin/Boston: De Gruyter Mouton.

  • Noël, D. (2003). Translations as evidence for semantics: An illustration. Linguistics, 41(4), 757–785.

  • Obrusnik, A. (2014). Hypal: A user-friendly tool for automatic parallel text alignment and error tagging. In Eleventh international conference teaching and language corpora, Lancaster, 20–23 July 2014 (pp. 67–69).

  • Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

  • Padró, L., & Stanilovsky, E. (2012). FreeLing 3.0: Towards wider multilinguality. In Proceedings of the language resources and evaluation conference (LREC-2012). European Language Resources Association (ELRA).

  • Resnik, P., & Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349–380.

  • Rosen, A. (2010). Mediating between incompatible tagsets. NEALT Proceedings Series, 10, 53–62.

  • Russo, M., Bendazzoli, C., & Sandrelli, A. (2006). Looking for lexical patterns in a trilingual corpus of source and interpreted speeches: Extended analysis of EPIC. Forum, 4(1), 221–254.

  • Russo, M., Bendazzoli, C., & Defrancq, B. (Eds.). (2018). Making way in corpus-based interpreting studies. Springer.

  • Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of international conference on new methods in language processing.

  • Singh, S., McEnery, T., & Baker, P. (2000). Building a parallel corpus of English/Panjabi. In J. Véronis (Ed.), Parallel text processing. Alignment and use of translation corpora (pp. 335–346). Kluwer Academic Publishers.

  • Tiedemann, J. (2011). Bitext alignment. Morgan & Claypool Publishers.

  • Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th international conference on language resources and evaluation (LREC’2012) (pp. 2214–2218).

  • Tiedemann, J. (2016). OPUS – Parallel corpora for everyone. Baltic Journal of Modern Computing, 4(2), 384.

  • Toury, G. (2012). Descriptive translation studies – And beyond. Amsterdam/Philadelphia: John Benjamins.

  • Uzar, R. S. (2002). A corpus methodology for analysing translation. Cadernos de Tradução, 9(1), 235–263.

  • Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L., & Trón, V. (2007). Parallel corpora for medium density languages. In N. Nicolov, K. Bontcheva, G. Angelova, & R. Mitkov (Eds.), Recent advances in natural language processing IV: Selected papers from RANLP 2005 (pp. 247–258). Amsterdam & Philadelphia: John Benjamins.

  • Volk, M., Göhring, A., Rios, A., Marek, T., & Samuelsson, Y. (2015). SMULTRON (version 4.0) – The Stockholm MULtilingual parallel TReebank. An English-French-German-Quechua-Spanish-Swedish parallel treebank with sub-sentential alignments. Institute of Computational Linguistics, University of Zurich.

  • Vondřička, P. (2014). Aligning parallel texts with InterText. In Proceedings of the ninth international conference on language resources and evaluation (LREC 2014) (pp. 1875–1879).

  • Waldenfels, R. V. (2011). Recent developments in ParaSol: Breadth for depth and XSLT based web concordancing with CWB. In D. Majchráková & R. Garabík (Eds.), Natural Language Processing, Multilinguality. Proceedings of Slovko 2011, Modra, Slovakia, 20–21 October 2011 (pp. 156–162). Bratislava: Tribun EU.

  • Xiao, R. (Ed.). (2010). Using corpora in contrastive and translation studies. Newcastle upon Tyne: Cambridge Scholars Publishing.

  • Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (2016). The United Nations parallel corpus v1.0. Language Resources and Evaluation (LREC’16).

  • Zufferey, S., & Cartoni, B. (2012). English and French causal connectives in contrast. Languages in Contrast, 12(2), 232–250.

Author information

Corresponding author

Correspondence to Marie-Aude Lefer.

Further Reading

  • Johansson, S. 2007. Seeing through Multilingual Corpora. On the use of corpora in contrastive studies. Amsterdam/Philadelphia: John Benjamins.

Johansson’s monograph is a must-read for anyone interested in corpus-based contrastive linguistics. The book provides a highly readable introduction to corpus design and use in contrastive linguistics. It also offers a range of exemplary case studies contrasting lexis, syntax, and discourse on the basis of parallel corpus data.

  • Mikhailov, M., and Cooper, R. 2016. Corpus Linguistics for Translation and Contrastive Studies. A guide for research. London/New York: Routledge.

In this accessible guide for research, Mikhailov & Cooper provide detailed information on parallel corpus compilation and describe a wide range of search procedures that are commonly used in corpus-based contrastive and translation studies. The book also offers a useful survey of some of the available parallel corpora.

  • Zanettin, F. 2012. Translation-Driven Corpora. Corpus Resources for Descriptive and Applied Translation Studies. London/New York: Routledge.

Zanettin’s coursebook is a practical introduction to descriptive and applied corpus-based translation studies. In addition to providing clear background information on the study of translation features in the field, it offers a wealth of useful information on translation-driven (including parallel) corpus design, encoding, annotation, and analysis. Each chapter is enriched with insightful case studies and hands-on tasks.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Lefer, MA. (2020). Parallel Corpora. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_12
