Abstract
This chapter gives an overview of parallel corpora, i.e. corpora containing source texts in a given language, aligned with their translations in another language. More specifically, it focuses on directional corpora, i.e. parallel corpora where the source and target languages are clearly identified. These types of corpora are widely used in contrastive linguistics and translation studies. The chapter first outlines the key features of parallel corpora (they typically contain written texts translated by expert translators working into their native language) and describes the main methods of parallel corpus analysis, including the combined use of parallel and comparable corpora. It then examines the major challenges linked with the design of parallel corpora (text availability, metadata collection, bitext alignment, and multilingual linguistic annotation) and with their analysis (data scarcity, interpretation of the results, and infelicitous translations). Finally, the chapter shows how these challenges can be overcome, most notably by compiling balanced, richly documented parallel corpora and by cross-fertilizing insights from cross-linguistic research and natural language processing.
Notes
1. https://www.monde-diplomatique.fr/diplo/int/. Accessed 22 May 2019.
2. https://uclouvain.be/en/research-institutes/ilc/cecl/pleci.html. Accessed 22 May 2019.
3. https://uclouvain.be/en/research-institutes/ilc/cecl/mult-ed.html. Accessed 22 May 2019.
4.
5. Unsurprisingly, it is far from easy to obtain copyright clearance for texts to be included in parallel corpora. For this reason, many parallel corpora are not publicly available (e.g. ENPC, PLECI, Raf Salkie’s INTERSECT, P-ACTRES, CroCo).
6. http://nl.ijs.si/ME/V4/msd/html/index.html. Accessed 22 May 2019.
7. http://nl.ijs.si/spook/msd/html-en/. Accessed 22 May 2019.
8. http://universaldependencies.org/. Accessed 22 May 2019.
9. The practice of translating the European Parliament proceedings into all EU languages was discontinued in the second half of 2011. The verbatim reports of the plenary sittings are still made available on the European Parliament website, but the written-up versions of the speeches are only published in the languages in which the speeches were delivered.
10. In this respect, it is important to stress that English is increasingly used as a lingua franca at the European Parliament. In other words, some of the speeches originally delivered in English are in fact given by non-native speakers of English (the same holds, albeit to a lesser extent, for other languages, such as French). This is not a trivial issue, as recent research indicates that the use of English as a Lingua Franca can have a considerable impact on translators’ (and interpreters’) outputs (see Albl-Mikasa 2017 for an overview of English as a Lingua Franca in translation and interpreting).
References
Aijmer, K., & Simon-Vandenbergen, A.-M. (2003). The discourse particle well and its equivalents in Swedish and Dutch. Linguistics, 41(6), 1123–1161.
Albl-Mikasa, M. (2017). ELF and translation/interpreting. In J. Jenkins, W. Baker, & M. Dewey (Eds.), The Routledge handbook of English as a Lingua Franca (pp. 369–384). London/New York: Routledge.
Altenberg, B. (1999). Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In H. Hasselgård & S. Oksefjell (Eds.), Out of corpora. Studies in honour of Stig Johansson (pp. 249–268). Amsterdam: Rodopi.
Altenberg, B., & Granger, S. (2002). Recent trends in cross-linguistic lexical studies. In B. Altenberg & S. Granger (Eds.), Lexis in contrast. Corpus-based approaches (pp. 3–48). Amsterdam/Philadelphia: John Benjamins.
Assis Rosa, A., Pięta, H., & Bueno Maia, R. (2017). Theoretical, methodological and terminological issues regarding indirect translation: An overview. Translation Studies, 10(2), 113–132.
Augustinus, L., Vandeghinste, V., & Vanallemeersch, T. (2016). Poly-GrETEL: Cross-lingual example-based querying of syntactic constructions. In Proceedings of the tenth international conference on language resources and evaluation (LREC 2016) (pp. 3549–3554). European Language Resources Association (ELRA).
Baisa, V., Michelfeit, J., Medved, M., & Jakubíček, M. (2016). European Union language resources in sketch engine. In Proceedings of tenth international conference on language resources and evaluation (LREC’16). European Language Resources Association (ELRA).
Baker, M. (1993). Corpus linguistics and translation studies. Implications and applications. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology. In honour of John Sinclair (pp. 233–250). Amsterdam: John Benjamins.
Baker, M. (1995). Corpora in translation studies: An overview and some suggestions for future research. Target, 7(2), 223–243.
Baños, R., Bruti, S., & Zanotti, S. (Eds.). (2013). Corpus linguistics and audiovisual translation: In search of an integrated approach. Special issue of Perspectives, 21(4).
Beeby Lonsdale, A. (2009). Directionality. In M. Baker & G. Saldanha (Eds.), Routledge encyclopedia of translation studies (pp. 84–88). Abingdon: Routledge.
Benko, V. (2016). Two years of Aranea: Increasing counts and tuning the pipeline. In Proceedings of 10th international conference on language resources and evaluation (LREC’16) (pp. 4245–4248). European Language Resources Association (ELRA).
Bernardini, S. (2011). Monolingual comparable corpora and parallel corpora in the search for features of translated language. SYNAPS, 26, 2–13.
Bernardini, S., Ferraresi, A., Russo, M., Collard, C., & Defrancq, B. (2018). Building interpreting and intermodal corpora: A How-to for a formidable task. In M. Russo, C. Bendazzoli, & B. Defrancq (Eds.), Making way in corpus-based interpreting studies (pp. 21–42). Springer.
Bojar, O., Žabokrtský, Z., Dušek, O., Galuščáková, P., Majliš, M., Mareček, D., Maršík, J., Novák, M., Popel, M., & Tamchyna, A. (2012). The joy of parallelism with CzEng 1.0. In Proceedings of the 8th international conference on language resources and evaluation (LREC-2012) (pp. 3921–3928). European Language Resources Association (ELRA).
Bowker, L., & Bennison, P. (2003). Student translation archive and student translation tracking system. Design, development and application. In F. Zanettin, S. Bernardini, & D. Stewart (Eds.), Corpora in translator education (pp. 103–117). Manchester: St. Jerome Publishing.
Cappelle, B., & Loock, R. (2013). Is there interference of usage constraints? A frequency study of existential there is and its French equivalent il y a in translated vs. non-translated texts. Target, 25(2), 252–275.
Cartoni, B., & Meyer, T. (2012). Extracting directional and comparable corpora from a multilingual corpus for translation studies. In Proceedings of the 8th international conference on language resources and evaluation (LREC-2012) (pp. 2132–2137). European Language Resources Association (ELRA).
Castagnoli, S., Ciobanu, D., Kübler, N., Kunz, K., & Volanschi, A. (2011). Designing a learner translator corpus for training purposes. In N. Kübler (Ed.), Corpora, language, teaching, and resources: From theory to practice (pp. 221–248). Bern: Peter Lang.
Čermák, F., & Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427.
Cettolo, M., Girardi, C., & Federico, M. (2012). WIT3: Web inventory of transcribed and translated talks. Proceedings of EAMT, 261–268.
De Sutter, G., Lefer, M.-A., & Delaere, I. (Eds.). (2017). Empirical translation studies: New methodological and theoretical traditions. Berlin/Boston: De Gruyter Mouton.
Delaere, I., & De Sutter, G. (2017). Variability of English loanword use in Belgian Dutch translations: Measuring the effect of source language, register, and editorial intervention. In G. De Sutter, M.-A. Lefer, & I. Delaere (Eds.), Empirical translation studies: New methodological and theoretical traditions (pp. 81–112). Berlin/Boston: De Gruyter Mouton.
Dupont, M., & Zufferey, S. (2017). Methodological issues in the use of directional parallel corpora. A case study of English and French concessive connectives. International Journal of Corpus Linguistics, 22(2), 270–297.
Dyer, C., Chahuneau, V., & Smith, N. A. (2013). A simple, fast, and effective reparameterization of IBM model 2. Proceedings of NAACL-HLT, 2013, 644–648.
Espunya, A. (2014). The UPF learner translation corpus as a resource for translator training. Language Resources and Evaluation, 48(1), 33–43.
Evans, N., & Levinson, S. C. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32, 429–492.
Evert, S., & Neumann, S. (2017). The impact of translation direction on characteristics of translated texts. A multivariate analysis for English and German. In G. De Sutter, M.-A. Lefer, & I. Delaere (Eds.), Empirical translation studies: New methodological and theoretical traditions (pp. 47–80). Berlin/Boston: De Gruyter Mouton.
Ferraresi, A., & Bernardini, S. (2019). Building EPTIC: A many-sided, multi-purpose corpus of EU parliament proceedings. In M. T. S. Nieto & I. Doval (Eds.), Parallel corpora: Creation and application. Amsterdam/Philadelphia: John Benjamins.
Fišer, D., Lenardič, J., & Erjavec, T. (2018). CLARIN’s key resource families. In Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) (pp. 1320–1325).
Fløttum, K., Dahl, T., Didriksen, A. A., & Gjesdal, A. M. (2013). KIAP – Reflections on a complex corpus. Bergen Language and Linguistics Studies, 3(1), 137–150.
Frankenberg-Garcia, A., & Santos, D. (2003). Introducing COMPARA: The Portuguese-English parallel corpus. In F. Zanettin, S. Bernardini, & D. Stewart (Eds.), Corpora in translator education (pp. 71–87). Manchester: St. Jerome Publishing.
Granger, S., & Lefer, M.-A. (2016). From general to learners’ bilingual dictionaries: Towards a more effective fulfillment of advanced learners’ phraseological needs. International Journal of Lexicography, 29(3), 279–295.
Granger, S., & Lefer, M.-A. (2020). The multilingual student translation corpus: A resource for translation teaching and research. Language Resources and Evaluation: Online First.
Granger, S., Lerot, J., & Petch-Tyson, S. (Eds.). (2003). Corpus-based approaches to contrastive linguistics and translation studies. Amsterdam/New York: Rodopi.
Halverson, S. L. (2015). The status of contrastive data in translation studies. Across Languages and Cultures, 16(2), 163–185.
Hansen-Schirra, S., Neumann, S., & Steiner, E. (2012). Cross-linguistic corpora for the study of translations. Insights from the language pair English-German. Berlin: De Gruyter.
Izquierdo, M., Hofland, K., & Reigem, Ø. (2008). The ACTRES parallel corpus: An English–Spanish translation corpus. Corpora, 3(1), 31–41.
Jimenez Hurtado, C., & Soler Gallego, S. (2013). Multimodality, translation and accessibility: A corpus-based study of audio description. Perspectives, 21(4), 577–594.
Johansson, S. (2007). Seeing through multilingual corpora. On the use of corpora in contrastive studies. Amsterdam/Philadelphia: John Benjamins.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. Proceedings of MT Summit X, 79–86.
Kruger, A., Wallmach, K., & Munday, J. (Eds.). (2011). Corpus-based translation studies. Research and applications. London/New York: Bloomsbury.
Kutuzov, A., & Kunilovskaya, M. (2014). Russian learner translator corpus. Design, research potential and applications. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, speech and dialogue. TSD 2014 (pp. 315–323). Springer.
Lapshinova-Koltunski, E. (2017). Exploratory analysis of dimensions influencing variation in translation. The case of text register and translation method. In G. De Sutter, M.-A. Lefer, & I. Delaere (Eds.), Empirical translation studies: New methodological and theoretical traditions (pp. 207–234). Berlin/Boston: De Gruyter Mouton.
Lefer, M.-A., & Grabar, N. (2015). Super-creative and over-bureaucratic: A cross-genre corpus-based study on the use and translation of evaluative prefixation in TED talks and EU parliamentary debates. Across Languages and Cultures, 16(2), 187–208.
Levshina, N. (2016). Verbs of letting in Germanic and Romance languages: A quantitative investigation based on a parallel corpus of film subtitles. Languages in Contrast, 16(1), 84–117.
Macken, L., De Clercq, O., & Paulussen, H. (2011). Dutch parallel corpus: A balanced copyright-cleared parallel corpus. Meta, 56(2), 374–390.
Mauranen, A. (2005). Contrasting languages and varieties with translational corpora. Languages in Contrast, 5(1), 73–92.
Meurant, L., Gobert, M., & Cleve, A. (2016). Modelling a parallel corpus of French and French Belgian sign language. In Proceedings of the 10th edition of the language resources and evaluation conference (LREC 2016) (pp. 4236–4240).
Mezeg, A. (2010). Compiling and using a French-Slovenian parallel corpus. In R. Xiao (Ed.), Proceedings of the international symposium on using corpora in contrastive and translation studies (UCCTS 2010) (pp. 1–27). Ormskirk: Edge Hill University.
Mikhailov, M., & Cooper, R. (2016). Corpus linguistics for translation and contrastive studies. A guide for research. London/New York: Routledge.
Neumann, S. (2013). Contrastive register variation. A quantitative approach to the comparison of English and German. Berlin/Boston: De Gruyter Mouton.
Noël, D. (2003). Translations as evidence for semantics: An illustration. Linguistics, 41(4), 757–785.
Obrusnik, A. (2014). Hypal: A user-friendly tool for automatic parallel text alignment and error tagging. In Eleventh international conference teaching and language corpora, Lancaster, 20–23 July 2014 (pp. 67–69).
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Padró, L., & Stanilovsky, E. (2012). FreeLing 3.0: Towards wider multilinguality. In Proceedings of the language resources and evaluation conference (LREC-2012). European Language Resources Association (ELRA).
Resnik, P., & Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349–380.
Rosen, A. (2010). Mediating between incompatible tagsets. NEALT Proceedings Series, 10, 53–62.
Russo, M., Bendazzoli, C., & Sandrelli, A. (2006). Looking for lexical patterns in a trilingual corpus of source and interpreted speeches: Extended analysis of EPIC. Forum, 4(1), 221–254.
Russo, M., Bendazzoli, C., & Defrancq, B. (Eds.). (2018). Making way in corpus-based interpreting studies. Springer.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of international conference on new methods in language processing.
Singh, S., McEnery, T., & Baker, P. (2000). Building a parallel corpus of English/Panjabi. In J. Véronis (Ed.), Parallel text processing. Alignment and use of translation corpora (pp. 335–346). Kluwer Academic Publishers.
Tiedemann, J. (2011). Bitext alignment. Morgan & Claypool Publishers.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th international conference on language resources and evaluation (LREC’2012) (pp. 2214–2218).
Tiedemann, J. (2016). OPUS – Parallel corpora for everyone. Baltic Journal of Modern Computing, 4(2), 384.
Toury, G. (2012). Descriptive translation studies – And beyond. Amsterdam/Philadelphia: John Benjamins.
Uzar, R. S. (2002). A corpus methodology for analysing translation. Cadernos de Tradução, 9(1), 235–263.
Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L., & Trón, V. (2007). Parallel corpora for medium density languages. In N. Nicolov, K. Bontcheva, G. Angelova, & R. Mitkov (Eds.), Recent advances in natural language processing IV: Selected papers from RANLP 2005 (pp. 247–258). Amsterdam & Philadelphia: John Benjamins.
Volk, M., Göhring, A., Rios, A., Marek, T., & Samuelsson, Y. (2015). SMULTRON (version 4.0) – The Stockholm MULtilingual parallel TReebank. An English-French-German-Quechua-Spanish-Swedish parallel treebank with sub-sentential alignments. Institute of Computational Linguistics, University of Zurich.
Vondřička, P. (2014). Aligning parallel texts with InterText. In Proceedings of the ninth international conference on language resources and evaluation (LREC 2014) (pp. 1875–1879).
Waldenfels, R. V. (2011). Recent developments in ParaSol: Breadth for depth and XSLT based web concordancing with CWB. In D. Majchráková & R. Garabík (Eds.), Natural Language Processing, Multilinguality. Proceedings of Slovko 2011, Modra, Slovakia, 20–21 October 2011 (pp. 156–162). Bratislava: Tribun EU.
Xiao, R. (Ed.). (2010). Using corpora in contrastive and translation studies. Newcastle upon Tyne: Cambridge Scholars Publishing.
Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (2016). The United Nations parallel corpus v1.0. Language Resources and Evaluation (LREC’16).
Zufferey, S., & Cartoni, B. (2012). English and French causal connectives in contrast. Languages in Contrast, 12(2), 232–250.
Further Reading
- Johansson, S. 2007. Seeing through Multilingual Corpora. On the use of corpora in contrastive studies. Amsterdam/Philadelphia: John Benjamins.
  Johansson’s monograph is a must-read for anyone interested in corpus-based contrastive linguistics. The book provides a highly readable introduction to corpus design and use in contrastive linguistics. It also offers a range of exemplary case studies contrasting lexis, syntax, and discourse on the basis of parallel corpus data.
- Mikhailov, M., and Cooper, R. 2016. Corpus Linguistics for Translation and Contrastive Studies. A guide for research. London/New York: Routledge.
  In this accessible guide for research, Mikhailov & Cooper provide detailed information on parallel corpus compilation and describe a wide range of search procedures that are commonly used in corpus-based contrastive and translation studies. The book also offers a useful survey of some of the available parallel corpora.
- Zanettin, F. 2012. Translation-Driven Corpora. Corpus Resources for Descriptive and Applied Translation Studies. London/New York: Routledge.
  Zanettin’s coursebook is a practical introduction to descriptive and applied corpus-based translation studies. In addition to providing clear background information on the study of translation features in the field, it offers a wealth of useful information on translation-driven (including parallel) corpus design, encoding, annotation, and analysis. Each chapter is enriched with insightful case studies and hands-on tasks.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Lefer, MA. (2020). Parallel Corpora. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_12
DOI: https://doi.org/10.1007/978-3-030-46216-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy, Philosophy and Religion (R0)