Building a New-Generation Corpus for Empirical Translation Studies: The Dutch Parallel Corpus 2.0

  • Chapter
New Perspectives on Corpus Translation Studies

Part of the book series: New Frontiers in Translation Studies (NFTS)

Abstract

This chapter introduces a new, updated version of the Dutch Parallel Corpus, a bidirectional parallel corpus of expert translations for Dutch><English and Dutch><French language pairs. This revisited version of the corpus, which we dub Dutch Parallel Corpus 2.0, is dynamic in nature, and contains 2.75 million words at the time of writing. The corpus is sentence-aligned, lemmatized and POS-tagged using the state-of-the-art natural language processing toolkit Stanza. Compared to its predecessor, the Dutch Parallel Corpus 2.0 contains more metadata about the translators (e.g. gender, education, experience) and the translation projects (e.g. L1/L2 translation, software used, degree and type of revision), next to the traditional metadata about the texts themselves (e.g. source and target language, intended audience, intended goal, register). The availability of an extensive set of metadata is considered the main asset of this corpus, together with a more principled and flexible register classification, thus stimulating corpus-based translation scholars to answer more refined research questions about the linguistic and contextual factors that shape translated texts, and ultimately fostering ideas and theories about the social and cognitive processes involved in translation performance. The corpus is freely available for research purposes via https://www.dpc2.ugent.be/.
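
As a rough illustration of the annotation step mentioned in the abstract, the sketch below shows how lemmas and universal POS tags might be obtained with Stanza. The package names `ewt`, `alpino` and `gsd` are assumed to be Stanza's identifiers for the UD_English-EWT, UD_Dutch-Alpino and UD_French-GSD models named in the notes. The helper names `build_pipeline` and `to_rows` are hypothetical, and the chapter preview does not specify the actual pipeline configuration used for DPC 2.0.

```python
from typing import Dict, List, Tuple

# Assumed Stanza package names for the UD models listed in the notes
# (UD_English-EWT, UD_Dutch-Alpino, UD_French-GSD).
UD_PACKAGES: Dict[str, str] = {"en": "ewt", "nl": "alpino", "fr": "gsd"}


def build_pipeline(lang: str):
    """Return a Stanza pipeline for tokenisation, POS tagging and lemmatisation."""
    import stanza  # lazy import: the rest of the sketch works without Stanza installed

    # Models have to be downloaded once beforehand, e.g. with
    # stanza.download(lang, package=UD_PACKAGES[lang]).
    # Depending on the Stanza version, French may also need the "mwt"
    # processor listed explicitly.
    return stanza.Pipeline(
        lang=lang,
        package=UD_PACKAGES[lang],
        processors="tokenize,pos,lemma",
    )


def to_rows(doc) -> List[Tuple[str, str, str]]:
    """Flatten an annotated document into (token, lemma, UPOS) rows."""
    return [
        (word.text, word.lemma, word.upos)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

With the Dutch model installed, `to_rows(build_pipeline("nl")("Brugge is een mooie stad."))` would yield one row per token. Keeping the flattening step separate from the pipeline makes it easy to write the rows to a tabular format of one's choice.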

Notes

  1. The word count was calculated after the initial cleaning process of all texts (cf. Section 3). The eventual word count may deviate somewhat from this preliminary calculation.

  2. Translation agencies that could not provide specific details on their employees were counted as a single translator, although more translators may have been involved in the translation process. This is clearly marked in the corpus. The numbers between parentheses refer to the number of individual translators whose profile could be determined on the basis of all available metadata.

  3. Such literary texts were retrieved from Project Gutenberg, an online library of free eBooks: https://www.gutenberg.org/.

  4. In contrast to the texts of the original DPC project, most of which are outdated a decade after being produced, literary texts remain relevant for longer. As such, they are better suited for reintegration into DPC 2.0.

  5. Vanilla Aligner (Danielsson and Ridings 1997), Geometric Mapping and Alignment tool (Melamed 1997) and Microsoft Bilingual Aligner (Moore 2002).

  6. AlignFactory Light was developed by the software company Terminotix: http://www.terminotix.com/.

  7. With the exception of the last example (Ø: 1), which was taken from dpc2-img-000453-NL_EN, all alignment types were extracted from dpc2-vbr-000244-NL_EN, a tourist brochure on the city of Bruges.

  8. https://universaldependencies.org/u/pos/.

  9. https://universaldependencies.org/u/feat/index.html.

  10. https://universaldependencies.org/.

  11. For English, Dutch and French, the following language models were used, respectively: UD_English-EWT, UD_Dutch-Alpino and UD_French-GSD.

  12. Initially, we excluded, for instance, translators with a degree in interpreting and occasional translators.

  13. Translator-specific criteria, such as age or gender, were often left unspecified in the questionnaire, since we regularly obtained a general overview of a translation department instead of a unique questionnaire for each translator.

  14. As mentioned in Section 2, source texts had to be proper source texts, i.e. not translated from yet another source text. Nevertheless, DPC 2.0 contains four source texts which are themselves translations. In contrast with the original DPC, however, such texts were only included when the language of the original source text was known.

  15. The CAT tools mentioned by the text providers are MemoQ, SDL Trados Studio, Déjà Vu X3 Professional, XTM and Wordfast. Post-edited texts were generated by either DeepL or Google Translate.

  16. By domain expertise, we mean translators’ subjective estimation of their expertise regarding a particular translation task and its topic(s).

  17. These preliminary calculations were made on the basis of the main annotator’s initial labelling throughout the text-collection phase and do not account for doubtful cases, nor for hybrid contexts. The results of the inter-annotator agreement analysis are expected to lead to subtle modifications to the metadata fields channel, intended audience, communicative purpose and topic.

  18. In addition to texts produced for an external audience, we were able to gather texts written for an internal audience, which provide organization-internal information to a very specific readership. Texts produced for an internal audience are automatically classified as specialist.

  19. The calculations in Table 5 were based on each text’s main communicative purpose. However, the register classification in DPC 2.0 (cf. Section 5) takes into consideration the presence of additional communicative goals within a single text.

  20. Text provider and intended audience refer, respectively, to the addressor and addressee mentioned in Biber and Conrad (2009).

  21. As mentioned in the previous section, these criteria were determined on the basis of the main annotator’s initial labelling, in anticipation of the in-depth analysis of the students’ ratings. All ratings will be added to the final corpus in order to allow for a more nuanced, fine-grained interpretation of (hybrid) situational criteria, depending on the specific aim(s) of each research project.

  22. In order to retrieve literary or journalistic texts which discuss a touristic topic, we invite end-users of DPC 2.0 to further subdivide all registers according to this particular topic. Additionally, the flexibility of our approach also allows for a topic-based classification of texts, regardless of their predefined situational characteristics.
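
The alignment types referred to in note 7 (such as Ø: 1) can be read as counts of source and target sentences within one aligned unit. The following is a minimal sketch under that reading. The `AlignedUnit` structure and the function names are hypothetical and do not reflect the corpus's actual file format, and the labels are written without a space (`Ø:1`).

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlignedUnit:
    """One alignment link: zero or more source sentences paired with
    zero or more target sentences (hypothetical structure)."""
    source: List[str] = field(default_factory=list)
    target: List[str] = field(default_factory=list)


def alignment_type(unit: AlignedUnit) -> str:
    """Label a unit as 'n:m', writing Ø for an empty side (e.g. 'Ø:1')."""
    def side(sentences: List[str]) -> str:
        return str(len(sentences)) if sentences else "Ø"

    return f"{side(unit.source)}:{side(unit.target)}"


def type_distribution(units: List[AlignedUnit]) -> Counter:
    """Count how often each alignment type occurs in a list of units."""
    return Counter(alignment_type(unit) for unit in units)


# Invented placeholder sentences, not taken from the corpus.
units = [
    AlignedUnit(["s1"], ["t1"]),
    AlignedUnit(["s2"], ["t2a", "t2b"]),
    AlignedUnit([], ["t3"]),
]
distribution = type_distribution(units)
```

A distribution of this kind gives a quick overview of how often translators split, merge, add or omit sentences in a given text pair.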

References

  • Baker, M. 1993. Corpus linguistics and translation studies: Implications and applications. In Text and technology: In honour of John Sinclair, ed. M. Baker, G. Francis, and E. Tognini-Bonelli, 233–250. Philadelphia, PA/Amsterdam: Benjamins.

  • Baker, M. 1996. Corpus-based translation studies: the challenges that lie ahead. In Terminology, LSP and translation: Studies in language engineering in honour of Juan C. Sager, ed. H. Somers, 175–186. Amsterdam/Philadelphia: John Benjamins.

  • Baker, M. 2004. A corpus-based view of similarity and difference in translation. International Journal of Corpus Linguistics 9 (2): 167–193.

  • Biber, D., and S. Conrad. 2009. Register, genre, and style. Cambridge, UK: Cambridge University Press.

  • Corpas Pastor, G., and M. Seghiri. 2016. Corpus-based approaches to translation and interpreting: From theory to applications. Frankfurt am Main [etc]: Lang.

  • Danielsson, P., and D. Ridings. 1997. Practical presentation of a “vanilla aligner”. In Proceedings of the TELRI workshop on alignment and exploitation of texts, Ljubljana.

  • De Clercq, O., and M. Montero Perez. 2010. Data collection and IPR in multilingual parallel corpora: Dutch parallel corpus. In LREC 2010: Seventh conference on international language resources and evaluation, ed. N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, … D. Tapias, 3383–3388. Paris, France: European Language Resources Association (ELRA).

  • Delaere, I. 2015. Do translations walk the line?: Visually exploring translated and non-translated texts in search of norm conformity. Faculty of Arts and Philosophy, Ghent, Belgium: Ghent University.

  • Delaere, I., and G. De Sutter. 2017. Variability of English loanword use in Belgian Dutch translations: Measuring the effect of source language, register, and editorial intervention. In Empirical translation studies: New methodological and theoretical traditions, vol. 300, ed. G. De Sutter, M.-A. Lefer, and I. Delaere, 81–112. Berlin/Boston: De Gruyter Mouton.

  • De Sutter, G., P. Goethals, T. Leuschner, and S. Vandepitte. 2012. Towards methodologically more rigorous corpus-based translation studies. Across Languages and Cultures 13 (2): 137–143.

  • De Sutter, G., M.-A. Lefer, and I. Delaere (eds.). 2017. Empirical translation studies: New methodological and theoretical traditions. Berlin, Boston: De Gruyter.

  • De Sutter, G., and M.-A. Lefer. 2019. On the need for a new research agenda for corpus-based translation studies: A multi-methodological, multifactorial and interdisciplinary approach. Perspectives-Studies in Translation Theory and Practice 28 (1): 1–23.

  • De Swert, K. 2012. Calculating inter-coder reliability in media content analysis using Krippendorff’s Alpha. Unpublished manuscript University of Amsterdam. https://www.polcomm.org/wp-content/uploads/ICR01022012.pdf.

  • Egbert, J., D. Biber, and M. Davies. 2015. Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology 66 (9): 1817–1831.

  • Fantinuoli, C., and F. Zanettin. 2015. New directions in corpus-based translation studies (Translation and Multilingual Natural Language Processing 1). Berlin: Language Science Press.

  • Geyken, A. 2007. The DWDS corpus: A reference corpus for the German language of the 20th century. In Collocations and Idioms: Linguistic, lexicographic, and computational aspects, ed. C. Fellbaum, 23–41. Continuum Press.

  • Granger, S., and M.-A. Lefer. 2020. The multilingual student translation corpus: A resource for translation teaching and research. Language Resources & Evaluation 54: 1183–1199. https://doi.org/10.1007/s10579-020-09485-6.

  • Halverson, S.L. 2013. Implications of cognitive linguistics for translation studies. In Cognitive linguistics and translation: Advances in some theoretical models and applications, ed. A. Rojo, and I. Ibarretxe-Antuñano, 33–74. Berlin/Boston: Mouton de Gruyter.

  • Halverson, S.L. 2015. Cognitive translation studies and the merging of empirical paradigms: The case of ‘literal translation’. Translation Spaces 4 (2): 310–340.

  • Halverson, S.L. 2017. Gravitational pull in translation: Testing a revised model. In Empirical translation studies: New methodological and theoretical traditions, ed. G. De Sutter, M.-A. Lefer, and I. Delaere, 9–46. Berlin/Boston: Mouton De Gruyter.

  • Hayes, A.F., and K. Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1 (1): 77–89.

  • Ji, M. 2016. Empirical translation studies. Interdisciplinary methodologies explored. Sheffield, UK: Equinox.

  • Kotze, H. 2020. Converging what and how to find out why: An outlook on empirical translation studies. In New empirical perspectives on translation and interpreting, ed. L. Vandevoorde, J. Daems, and B. Defrancq, 333–371. Routledge.

  • Kruger, H., and G. De Sutter. 2018. Alternations in contact and non-contact varieties. Reconceptualising that-omission in translated and non-translated English using the MuPDAR approach. Translation, Cognition & Behavior 1 (2): 251–290.

  • Kruger, H., and B. Van Rooy. 2012. Register and the features of translated language. Across Languages and Cultures 13 (1): 33–65.

  • Kruger, H., and B. Van Rooy. 2016. Constrained language: A multidimensional analysis of translated English and non-native indigenised varieties of English. English World-Wide 37 (1): 26–57.

  • Laviosa, S. 2002. Corpus-based translation studies. Theory, findings, applications. Amsterdam/New York: Rodopi.

  • Lefer, M.-A. 2020. Parallel corpora. In A practical handbook of corpus linguistics, ed. M. Paquot and S. Th. Gries, 257–282. Springer.

  • Macken, L., O. De Clercq, and H. Paulussen. 2011. Dutch parallel corpus: A balanced copyright-cleared parallel corpus. Meta 56 (2): 374–390.

  • Malamatidou, S. 2018. Corpus triangulation: Combining data and methods in corpus-based translation studies. London: Routledge.

  • Marcus, M.P., B. Santorini, and M.A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19 (2): 313–330.

  • Melamed, D.I. 1997. A portable algorithm for mapping bitext correspondence. In Proceedings of the 35th annual meeting of the association of computational linguistics (ACL), 305–312. Madrid, Spain.

  • Mellinger, C.D., and T.A. Hanson. 2016. Quantitative research methods in translation and interpreting studies. Abingdon, Oxon: Routledge.

  • Moore, R.C. 2002. Fast and accurate sentence alignment of bilingual corpora. In Proceedings of the 5th conference of the association for machine translation in the Americas, 135–144. Tiburon, California.

  • Neumann, S. 2013. Contrastive register variation. A quantitative approach to the comparison of English and German. Berlin: de Gruyter.

  • Oakes, M.P., and M. Ji. 2012. Quantitative methods in corpus-based translation studies: A practical guide to descriptive translation research. Philadelphia, PA/Amsterdam: Benjamins.

  • Olohan, M. 2004. Introducing corpora in translation studies. London: Routledge.

  • Paulussen, H., L. Macken, W. Vandeweghe, and P. Desmet. 2013. Dutch parallel corpus: A balanced parallel corpus for Dutch-English and Dutch-French. In Essential speech and language technology for Dutch: Results by the STEVIN-programme, ed. P. Spyns, and J. Odijk, 185–199. Berlin, Germany: Springer.

  • Qi, P., Y. Zhang, Y. Zhang, J. Bolton, and C.D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th annual meeting of the association for computational linguistics: System demonstrations, 101–108. Online.

  • Vandevoorde, L., J. Daems, and B. Defrancq, eds. 2020. New empirical perspectives on translation and interpreting. Routledge.

  • Van Eynde, F., J. Zavrel, and W. Daelemans. 2000. Part of speech tagging and lemmatisation for the Spoken Dutch Corpus. In Proceedings of the second language resources and evaluation conference (LREC), ed. M. Gavrilidou et al., 1427–1433. Athens, Greece.

  • Xiao, R., and X. Hu. 2015. Corpus-based studies of translational Chinese in English-Chinese translation. Springer.


Author information

Corresponding author

Correspondence to Gert De Sutter.

Appendix

Questionnaire for translators

  1. Documents or websites translated (please mention the title of each text):

  2. Translation direction:

  3. Collaborative translation:
       • yes
       • no

  4. Translator’s gender:
       • m
       • f
       • x

  5. Translator’s degree:
       • no specific language degree
       • translation Master
       • translation Bachelor
       • language and literature
       • interpreting

  6. Experience as a translator (in years):

  7. Translator’s year of birth:

  8. Translation tools or memory involved:
       • none, manual translation
       • CAT-tool, i.e. ________________________________
       • post-editing (machine translation), i.e. ____________________________

  9. Translation directionality:
       • L1 (first language)
       • L2 (foreign language)

  10. Translator’s status:
       • freelance
       • in-house
       • both

  11. Use of style guides:
       • in-house guidelines
       • in-house glossary
       • both
       • none

  12. Domain expertise (regarding the text’s topic):
       • expert
       • non-expert

  13. External revision:
       • monolingual (only translation)
       • bilingual (source text and translation)
       • no revision
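
For readers who want to work with these questionnaire answers programmatically, the sketch below models a subset of the items as a typed record. It is an illustration only: the class and field names are hypothetical, the `"NL>EN"` direction notation is an assumption, and the actual metadata format of DPC 2.0 may differ. Optional fields default to `None` because, as the notes indicate, translator-specific criteria were often left unspecified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Gender(Enum):
    M = "m"
    F = "f"
    X = "x"


class Directionality(Enum):
    L1 = "L1"  # first language, as worded in the questionnaire
    L2 = "L2"  # foreign language, as worded in the questionnaire


class Revision(Enum):
    MONOLINGUAL = "monolingual"  # only the translation was revised
    BILINGUAL = "bilingual"      # source text and translation were compared
    NONE = "no revision"


@dataclass
class TranslationProjectRecord:
    """Hypothetical record holding a subset of the questionnaire items."""
    text_titles: List[str]
    direction: str  # e.g. "NL>EN"; this notation is an assumption
    collaborative: Optional[bool] = None
    gender: Optional[Gender] = None
    experience_years: Optional[int] = None
    year_of_birth: Optional[int] = None
    directionality: Optional[Directionality] = None
    revision: Optional[Revision] = None

    def unspecified_fields(self) -> List[str]:
        """Names of the items that were left blank in the questionnaire."""
        return [name for name, value in vars(self).items() if value is None]


# Invented example values, not taken from the corpus.
record = TranslationProjectRecord(
    text_titles=["Tourist brochure"],
    direction="NL>EN",
    gender=Gender.F,
    revision=Revision.BILINGUAL,
)
missing = record.unspecified_fields()
```

Listing the unspecified fields per record makes it straightforward to filter out texts that lack the metadata a particular research question depends on.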

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Reynaert, R., Macken, L., Tezcan, A., De Sutter, G. (2021). Building a New-Generation Corpus for Empirical Translation Studies: The Dutch Parallel Corpus 2.0. In: Wang, V.X., Lim, L., Li, D. (eds) New Perspectives on Corpus Translation Studies. New Frontiers in Translation Studies. Springer, Singapore. https://doi.org/10.1007/978-981-16-4918-9_4

  • DOI: https://doi.org/10.1007/978-981-16-4918-9_4

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-4917-2

  • Online ISBN: 978-981-16-4918-9

  • eBook Packages: Education, Education (R0)
