Abstract
This chapter introduces an updated version of the Dutch Parallel Corpus, a bidirectional parallel corpus of expert translations for the language pairs Dutch><English and Dutch><French. This revisited version, which we dub Dutch Parallel Corpus 2.0, is dynamic in nature and contains 2.75 million words at the time of writing. The corpus is sentence-aligned, lemmatized and POS-tagged with the state-of-the-art natural language processing toolkit Stanza. Compared to its predecessor, the Dutch Parallel Corpus 2.0 contains more metadata about the translators (e.g. gender, education, experience) and the translation projects (e.g. L1/L2 translation, software used, degree and type of revision), in addition to the traditional metadata about the texts themselves (e.g. source and target language, intended audience, intended goal, register). We consider this extensive set of metadata, together with a more principled and flexible register classification, to be the main asset of the corpus. It should enable corpus-based translation scholars to answer more refined research questions about the linguistic and contextual factors that shape translated texts, and ultimately foster ideas and theories about the social and cognitive processes involved in translation performance. The corpus is freely available for research purposes via https://www.dpc2.ugent.be/.
Notes
- 1.
The word count was calculated after the initial cleaning process of all texts (cf. Sect. 3). The eventual word count may deviate somewhat from this preliminary calculation.
- 2.
Translation agencies that could not provide us with specific details on their employees were counted as a single translator, although more translators may have been involved in the translation process. This is clearly marked in the corpus. The numbers in parentheses refer to the number of individual translators whose profile could be determined on the basis of all available metadata.
- 3.
Such literary texts were retrieved from Project Gutenberg, an online library of free eBooks: https://www.gutenberg.org/.
- 4.
In contrast with the texts of the original DPC project, which are largely outdated a decade after being produced, literary texts remain relevant to a greater extent. As such, they are better suited for reintegration into DPC 2.0.
- 5.
- 6.
AlignFactory Light was developed by the software company Terminotix: http://www.terminotix.com/.
- 7.
With the exception of the last example (Ø: 1), which was taken from dpc2-img-000453-NL_EN, all other alignment types were extracted from dpc2-vbr-000244-NL_EN, which is a tourist brochure on the city of Bruges.
- 8.
- 9.
- 10.
- 11.
For English, Dutch and French, the following language models were used, respectively: UD_English-EWT, UD_Dutch-Alpino and UD_French-GSD.
- 12.
Initially, we did not include translators with a degree in interpreting or occasional translators, for instance.
- 13.
Translator-specific criteria, such as age or gender, were often left unspecified in the questionnaire, since we regularly obtained a general overview of a translation department instead of a unique questionnaire for each translator.
- 14.
As mentioned in Sect. 2, source texts had to be proper source texts, i.e. not translated from yet another source text. Nevertheless, DPC 2.0 contains four source texts which are translations themselves. In contrast with the original DPC, however, the inclusion of these texts was only accepted when the language of the original source text was known.
- 15.
CAT-tools that were mentioned by the text providers are MemoQ, SDL Trados Studio, Déjà Vu X3 Professional, XTM and Wordfast. Post-edited texts were generated by either DeepL or Google Translate.
- 16.
By domain expertise, we mean translators’ subjective estimation of their expertise regarding a particular translation task and its topic(s).
- 17.
These preliminary calculations were made on the basis of the main annotator’s initial labelling throughout the text-collection phase and do not account for doubtful cases or for hybrid contexts. The results of the interannotator agreement analysis are expected to lead to subtle modifications of the metadata categories channel, intended audience, communicative purpose and topic.
- 18.
In addition to texts which were produced for an external audience, we were able to gather texts written for an internal audience, which provide organization-internal information to a very specific readership. Texts which were produced for an internal audience are automatically classified as specialist.
- 19.
- 20.
Text provider and intended audience, respectively, refer to addressor and addressee mentioned in Biber & Conrad (2009).
- 21.
As we mentioned in the previous section, these criteria were determined on the basis of the main annotator’s initial labelling, in anticipation of the in-depth analysis of the students’ ratings. All ratings will be added to the final corpus in order to allow for a more nuanced, fine-grained interpretation of (hybrid) situational criteria, depending on the specific aim(s) of each research project.
- 22.
In order to retrieve literary or journalistic texts which discuss a touristic topic, we invite end-users of DPC 2.0 to further subdivide all registers according to this particular topic. The flexibility of our approach also allows for a topic-based classification of texts, regardless of their predefined situational characteristics.
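Notes 17 and 21 refer to an interannotator agreement analysis without naming the coefficient. The reference list includes Hayes and Krippendorff (2007) and De Swert (2012), so the following is a minimal sketch of Krippendorff's alpha for nominal labels, on the assumption that this is the measure meant. The function name, the input layout (one list of labels per text) and the example labels are illustrative assumptions, not the authors' actual procedure or data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: each inner list holds the labels that the
    annotators assigned to one text (missing ratings are simply left out).
    """
    # Texts rated by fewer than two annotators carry no agreement information.
    pairable = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in pairable)
    if n < 2:
        raise ValueError("need at least one text with two or more ratings")

    # Coincidence matrix: every ordered pair of ratings within a text,
    # weighted by 1 / (number of ratings for that text - 1).
    coincidences = Counter()
    for u in pairable:
        weight = 1.0 / (len(u) - 1)
        for a, b in permutations(u, 2):
            coincidences[(a, b)] += weight

    # Marginal totals per label.
    totals = Counter()
    for (a, _), w in coincidences.items():
        totals[a] += w

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Hypothetical ratings for "intended audience" by two annotators on three texts.
ratings = [
    ["specialist", "general"],
    ["specialist", "specialist"],
    ["general", "general"],
]
alpha = krippendorff_alpha_nominal(ratings)
print(round(alpha, 3))
```

Because the function takes one list per text, texts with a missing rating can be passed as shorter lists, which is one reason alpha is often preferred over pairwise percentage agreement.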
References
Baker, M. 1993. Corpus linguistics and translation studies: Implications and applications. In Text and technology: In honour of John Sinclair, ed. M. Baker, G. Francis, and E. Tognini-Bonelli, 233–250. Philadelphia, PA/Amsterdam: Benjamins.
Baker, M. 1996. Corpus-based translation studies: the challenges that lie ahead. In Terminology, LSP and translation: Studies in language engineering in honour of Juan C. Sager, ed. H. Somers, 175–186. Amsterdam/Philadelphia: John Benjamins.
Baker, M. 2004. A corpus-based view of similarity and difference in translation. International Journal of Corpus Linguistics 9 (2): 167–193.
Biber, D., and S. Conrad. 2009. Register, genre, and style. Cambridge, UK: Cambridge University Press.
Corpas Pastor, G., and M. Seghiri. 2016. Corpus-based approaches to translation and interpreting: From theory to applications. Frankfurt am Main [etc]: Lang.
Danielsson, P., and D. Ridings. 1997. Practical presentation of a “vanilla aligner”. In Proceedings of the TELRI workshop on alignment and exploitation of texts, Ljubljana.
De Clercq, O., and M. Montero Perez. 2010. Data collection and IPR in multilingual parallel corpora: Dutch parallel corpus. In LREC 2010: Seventh conference on international language resources and evaluation, ed. N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, … D. Tapias, 3383–3388. Paris, France: European Language Resources Association (ELRA).
Delaere, I. 2015. Do translations walk the line?: Visually exploring translated and non-translated texts in search of norm conformity. Faculty of Arts and Philosophy, Ghent, Belgium: Ghent University.
Delaere, I., and G. De Sutter. 2017. Variability of English loanword use in Belgian Dutch translations: Measuring the effect of source language, register, and editorial intervention. In Empirical translation studies: New methodological and theoretical traditions, vol. 300, ed. G. De Sutter, M.-A. Lefer, and I. Delaere, 81–112. Berlin/Boston: De Gruyter Mouton.
De Sutter, G., P. Goethals, T. Leuschner, and S. Vandepitte. 2012. Towards methodologically more rigorous corpus-based translation studies. Across Languages and Cultures 13 (2): 137–143.
De Sutter, G., M.-A. Lefer, and I. Delaere (eds.). 2017. Empirical translation studies: New methodological and theoretical traditions. Berlin, Boston: De Gruyter.
De Sutter, G., and M.-A. Lefer. 2019. On the need for a new research agenda for corpus-based translation studies: A multi-methodological, multifactorial and interdisciplinary approach. Perspectives-Studies in Translation Theory and Practice 28 (1): 1–23.
De Swert, K. 2012. Calculating inter-coder reliability in media content analysis using Krippendorff’s Alpha. Unpublished manuscript University of Amsterdam. https://www.polcomm.org/wp-content/uploads/ICR01022012.pdf.
Egbert, J., D. Biber, and M. Davies. 2015. Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology 66 (9): 1817–1831.
Fantinuoli, C., and F. Zanettin. 2015. New directions in corpus-based translation studies (Translation and Multilingual Natural Language Processing 1). Berlin: Language Science Press.
Geyken, A. 2007. The DWDS corpus: A reference corpus for the German language of the 20th century. In Collocations and idioms: Linguistic, lexicographic, and computational aspects, ed. C. Fellbaum, 23–41. Continuum Press.
Granger, S., and M.-A. Lefer. 2020. The multilingual student translation corpus: A resource for translation teaching and research. Language Resources & Evaluation 54: 1183–1199. https://doi.org/10.1007/s10579-020-09485-6.
Halverson, S.L. 2013. Implications of cognitive linguistics for translation studies. In Cognitive linguistics and translation: Advances in some theoretical models and applications, ed. A. Rojo, and I. Ibarretxe-Antuñano, 33–74. Berlin/Boston: Mouton de Gruyter.
Halverson, S.L. 2015. Cognitive translation studies and the merging of empirical paradigms: The case of ‘literal translation’. Translation Spaces 4 (2): 310–340.
Halverson, S.L. 2017. Gravitational pull in translation: Testing a revised model. In Empirical translation studies: New methodological and theoretical traditions, ed. G. De Sutter, M.-A. Lefer, and I. Delaere, 9–46. Berlin/Boston: Mouton De Gruyter.
Hayes, A.F., and K. Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1 (1): 77–89.
Ji, M. 2016. Empirical translation studies. Interdisciplinary methodologies explored. Sheffield, UK: Equinox.
Kotze, H. 2020. Converging what and how to find out why: An outlook on empirical translation studies. In New empirical perspectives on translation and interpreting, ed. L. Vandevoorde, J. Daems, and B. Defrancq, 333–371. Routledge.
Kruger, H., and G. De Sutter. 2018. Alternations in contact and non-contact varieties. Reconceptualising that-omission in translated and non-translated English using the MuPDAR approach. Translation, Cognition & Behavior 1 (2): 251–290.
Kruger, H., and B. Van Rooy. 2012. Register and the features of translated language. Across Languages and Cultures 13 (1): 33–65.
Kruger, H., and B. Van Rooy. 2016. Constrained language: A multidimensional analysis of translated English and non-native indigenised varieties of English. English World-Wide 37 (1): 26–57.
Laviosa, S. 2002. Corpus-based translation studies. Theory, findings, applications. Amsterdam/New York: Rodopi.
Lefer, M.-A. 2020. Parallel corpora. In A practical handbook of corpus linguistics, ed. M. Paquot and S. Th. Gries, 257–282. Springer.
Macken, L., O. De Clercq, and H. Paulussen. 2011. Dutch parallel corpus: A balanced copyright-cleared parallel corpus. Meta 56 (2): 374–390.
Malamatidou, S. 2018. Corpus triangulation: Combining data and methods in corpus-based translation studies. London: Routledge.
Marcus, M.P., B. Santorini, and M.A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19 (2): 313–330.
Melamed, I.D. 1997. A portable algorithm for mapping bitext correspondence. In Proceedings of the 35th annual meeting of the association for computational linguistics (ACL), 305–312. Madrid, Spain.
Mellinger, C.D., and T.A. Hanson. 2016. Quantitative research methods in translation and interpreting studies. Abingdon, Oxon: Routledge.
Moore, R.C. 2002. Fast and accurate sentence alignment of bilingual corpora. In Proceedings of the 5th conference of the association for machine translation in the Americas, 135–144. Tiburon, California.
Neumann, S. 2013. Contrastive register variation. A quantitative approach to the comparison of English and German. Berlin: de Gruyter.
Oakes, M.P., and M. Ji. 2012. Quantitative methods in corpus-based translation studies: A practical guide to descriptive translation research. Philadelphia, PA/Amsterdam: Benjamins.
Olohan, M. 2004. Introducing corpora in translation studies. London: Routledge.
Paulussen, H., L. Macken, W. Vandeweghe, and P. Desmet. 2013. Dutch parallel corpus: A balanced parallel corpus for Dutch-English and Dutch-French. In Essential speech and language technology for Dutch: Results by the STEVIN-programme, ed. P. Spyns, and J. Odijk, 185–199. Berlin, Germany: Springer.
Qi, P., Y. Zhang, Y. Zhang, J. Bolton, and C.D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th annual meeting of the association for computational linguistics: System demonstrations, 101–108. Online.
Vandevoorde, L., J. Daems, and B. Defrancq, eds. 2020. New empirical perspectives on translation and interpreting. Routledge.
Van Eynde, F., J. Zavrel, and W. Daelemans. 2000. Part of speech tagging and lemmatisation for the spoken Dutch corpus. In Proceedings of the second language resources and evaluation conference (LREC), ed. M. Gavrilidou et al., 1427–1433. Athens, Greece.
Xiao, R., and X. Hu. 2015. Corpus-based studies of translational Chinese in English-Chinese translation. Springer.
Appendix
Questionnaire for translators
1. Documents or websites translated (please mention the title of each text):
2. Translation direction:
3. Collaborative translation:
  • yes
  • no
4. Translator’s gender:
  • m
  • f
  • x
5. Translator’s degree:
  • no specific language degree
  • translation Master
  • translation Bachelor
  • language and literature
  • interpreting
6. Experience as a translator (in years):
7. Translator’s year of birth:
8. Translation tools or memory involved:
  • none, manual translation
  • CAT-tool, i.e. ________________________________
  • post-editing of machine translation, i.e. ____________________________
9. Translation directionality:
  • L1 (first language)
  • L2 (foreign language)
10. Translator’s status:
  • freelance
  • in-house
  • both
11. Use of style guides:
  • in-house guidelines
  • in-house glossary
  • both
  • none
12. Domain expertise (regarding the text’s topic):
  • expert
  • non-expert
13. External revision:
  • monolingual (only translation)
  • bilingual (source text and translation)
  • no revision
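To illustrate how the questionnaire items above could serve as filterable metadata, the sketch below models one questionnaire response as a record and selects texts by any combination of criteria. The class, field names, value encodings and text identifiers are hypothetical and do not reflect the actual DPC 2.0 schema. Fields default to `None` because, as note 13 points out, translator-specific items were often left unspecified.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TranslationRecord:
    """One questionnaire response; all field names are illustrative."""
    text_id: str                              # hypothetical identifier
    direction: str                            # e.g. "NL_EN"
    collaborative: Optional[bool] = None
    gender: Optional[str] = None              # "m", "f" or "x"
    degree: Optional[str] = None
    experience_years: Optional[int] = None
    year_of_birth: Optional[int] = None
    tool: Optional[str] = None                # "none", "CAT" or "post-editing"
    directionality: Optional[str] = None      # "L1" or "L2"
    status: Optional[str] = None              # "freelance", "in-house", "both"
    style_guide: Optional[str] = None
    domain_expert: Optional[bool] = None
    revision: Optional[str] = None            # "monolingual", "bilingual", "none"


def select(records: List[TranslationRecord], **criteria) -> List[TranslationRecord]:
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Made-up example records, not taken from the corpus.
records = [
    TranslationRecord("demo-0001", "NL_EN", tool="CAT", directionality="L1",
                      revision="bilingual"),
    TranslationRecord("demo-0002", "NL_FR", tool="post-editing",
                      directionality="L2"),
    TranslationRecord("demo-0003", "EN_NL", tool="CAT", directionality="L1"),
]

l1_cat = select(records, tool="CAT", directionality="L1")
print([r.text_id for r in l1_cat])
```

A record with an unspecified field never matches a criterion on that field, so a query on, say, gender silently excludes texts for which gender is unknown. Whether that is the desired behaviour depends on the research question.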
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Reynaert, R., Macken, L., Tezcan, A., De Sutter, G. (2021). Building a New-Generation Corpus for Empirical Translation Studies: The Dutch Parallel Corpus 2.0. In: Wang, V.X., Lim, L., Li, D. (eds) New Perspectives on Corpus Translation Studies. New Frontiers in Translation Studies. Springer, Singapore. https://doi.org/10.1007/978-981-16-4918-9_4
DOI: https://doi.org/10.1007/978-981-16-4918-9_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-4917-2
Online ISBN: 978-981-16-4918-9
eBook Packages: Education (R0)