Building a New-Generation Corpus for Empirical Translation Studies: The Dutch Parallel Corpus 2.0

Reynaert, Ryan; Macken, Lieve; Tezcan, Arda; De Sutter, Gert

doi:10.1007/978-981-16-4918-9_4

Part of the book series: New Frontiers in Translation Studies (NFTS)

Abstract

This chapter introduces a new, updated version of the Dutch Parallel Corpus, a bidirectional parallel corpus of expert translations for Dutch><English and Dutch><French language pairs. This revised version of the corpus, which we dub Dutch Parallel Corpus 2.0, is dynamic in nature, and contains 2.75 million words at the time of writing. The corpus is sentence-aligned, lemmatized and POS-tagged using the state-of-the-art natural language processing toolkit Stanza. Compared to its predecessor, the Dutch Parallel Corpus 2.0 contains more metadata about the translators (e.g. gender, education, experience) and the translation projects (e.g. L1/L2 translation, software used, degree and type of revision), in addition to the traditional metadata about the texts themselves (e.g. source and target language, intended audience, intended goal, register). The availability of an extensive set of metadata is considered the main asset of this corpus, together with a more principled and flexible register classification, thus stimulating corpus-based translation scholars to answer more refined research questions about the linguistic and contextual factors that shape translated texts, and ultimately fostering ideas and theories about the social and cognitive processes involved in translation performance. The corpus is freely available for research purposes via https://www.dpc2.ugent.be/.

Notes

1. The word count was calculated after the initial cleaning process of all texts (cf. Section 3). The eventual word count may deviate somewhat from this preliminary calculation.
2. Translation agencies which were not fully able to provide us with specific details on their employees were counted as a single translator, although more translators may have been involved in the translation process. This is clearly marked in the corpus. The numbers between parentheses refer to the number of individual translators whose profile could be determined on the basis of all available metadata.
3. Such literary texts were retrieved from Project Gutenberg, an online library of free eBooks: https://www.gutenberg.org/.
4. In contrast with the texts of the original DPC project, which are largely outdated a decade after being produced, literary texts remain relevant to a greater extent. As such, they are better suited for reintegration in DPC 2.0.
5. Vanilla Aligner (Danielsson and Ridings 1997), Geometric Mapping and Alignment tool (Melamed 1997) and Microsoft Bilingual Aligner (Moore 2002).
6. AlignFactory Light was developed by the software company Terminotix: http://www.terminotix.com/.
7. With the exception of the last example (Ø: 1), which was taken from dpc2-img-000453-NL_EN, all other alignment types were extracted from dpc2-vbr-000244-NL_EN, which is a tourist brochure on the city of Bruges.
8. https://universaldependencies.org/u/pos/.
9. https://universaldependencies.org/u/feat/index.html.
10. https://universaldependencies.org/.
11. For English, Dutch and French, the following language models were used, respectively: UD_English-EWT, UD_Dutch-Alpino and UD_French-GSD.
12. Initially, we did not include translators with a degree in interpreting or occasional translators, for instance.
13. Translator-specific criteria, such as age or gender, were often left unspecified in the questionnaire, since we regularly obtained a general overview of a translation department instead of a unique questionnaire for each translator.
14. As mentioned in Section 2, source texts had to be proper source texts, i.e. not translated from yet another source text. Nevertheless, DPC 2.0 contains four source texts which are translations themselves. In contrast with the original DPC, however, the inclusion of these texts was only accepted when the language of the original source text was known.
15. CAT-tools that were mentioned by the text providers are MemoQ, SDL Trados Studio, Déjà Vu X3 Professional, XTM and Wordfast. Post-edited texts were generated by either DeepL or Google Translate.
16. With domain expertise, we refer to translators’ subjective estimation of their expertise regarding a particular translation task and its topic(s).
17. These preliminary calculations were made on the basis of the main annotator’s initial labelling throughout the text-collection phase and do not account for doubtful cases, nor for hybrid contexts. The results of the interannotator agreement analysis are expected to lead to subtle modifications of the metadata fields channel, intended audience, communicative purpose and topic.
18. In addition to texts produced for an external audience, we were able to gather texts written for an internal target audience, in which organization-internal information is provided to a very specific readership. Texts produced for an internal audience are automatically classified as specialist.
19. The calculations in Table 5 were based on each text’s main communicative purpose. However, the register classification in DPC 2.0 (cf. Section 5) takes into consideration the presence of additional communicative goals within a single text.
20. Text provider and intended audience, respectively, refer to addressor and addressee mentioned in Biber and Conrad (2009).
21. As we mentioned in the previous section, these criteria were determined on the basis of the main annotator’s initial labelling, in anticipation of the in-depth analysis of the students’ ratings. All ratings will be added to the final corpus in order to allow for a more nuanced, fine-grained interpretation of (hybrid) situational criteria, depending on the specific aim(s) of each research project.
22. In order to retrieve literary or journalistic texts which discuss a touristic topic, we invite end-users of DPC 2.0 to further subdivide all registers according to this particular topic. Additionally, the flexibility of our approach equally allows for a topic-based classification of texts, regardless of their predefined situational characteristics.
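
Notes 17 and 21 refer to interannotator agreement on the situational labels. The reference list includes Hayes and Krippendorff (2007) and De Swert (2012), which suggests that Krippendorff's alpha is the measure involved, but that is an inference rather than something stated here. The following is a minimal sketch of alpha for nominal labels only; the function name, the label values and the ratings are invented for illustration and are not taken from the corpus.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list with one entry per text; each entry lists the labels
    that the annotators assigned to that text (missing ratings are omitted).
    """
    # Coincidence matrix: every ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is the number of ratings in the unit.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating cannot be paired with anything
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one unit with two or more ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Hypothetical ratings: two annotators label the intended audience of four texts.
ratings = [
    ["general", "general"],
    ["general", "specialist"],
    ["specialist", "specialist"],
    ["specialist", "specialist"],
]
print(round(krippendorff_alpha_nominal(ratings), 3))  # 0.533
```

Values close to 1 indicate high agreement and values near 0 indicate agreement at chance level. Ordinal or interval labels would need a different distance function from the simple match or mismatch used in this sketch.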

References

Baker, M. 1993. Corpus linguistics and translation studies: Implications and applications. In Text and technology: In honour of John Sinclair, ed. M. Baker, G. Francis, and E. Tognini-Bonelli, 233–250. Philadelphia, PA/Amsterdam: Benjamins.
Baker, M. 1996. Corpus-based translation studies: the challenges that lie ahead. In Terminology, LSP and translation: Studies in language engineering in honour of Juan C. Sager, ed. H. Somers, 175–186. Amsterdam/Philadelphia: John Benjamins.
Baker, M. 2004. A corpus-based view of similarity and difference in translation. International Journal of Corpus Linguistics 9 (2): 167–193.
Biber, D., and S. Conrad. 2009. Register, genre, and style. Cambridge, UK: Cambridge University Press.
Corpas Pastor, G., and M. Seghiri. 2016. Corpus-based approaches to translation and interpreting: From theory to applications. Frankfurt am Main [etc]: Lang.
Danielsson, P., and D. Ridings. 1997. Practical presentation of a “vanilla aligner”. In Proceedings of the TELRI workshop on alignment and exploitation of texts, Ljubljana.
De Clercq, O., and M. Montero Perez. 2010. Data collection and IPR in multilingual parallel corpora: Dutch parallel corpus. In LREC 2010: Seventh conference on international language resources and evaluation, ed. N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, … D. Tapias, 3383–3388. Paris, France: European Language Resources Association (ELRA).
Delaere, I. 2015. Do translations walk the line?: Visually exploring translated and non-translated texts in search of norm conformity. Faculty of Arts and Philosophy, Ghent, Belgium: Ghent University.
Delaere, I., and G. De Sutter. 2017. Variability of English loanword use in Belgian Dutch translations: Measuring the effect of source language, register, and editorial intervention. In Empirical translation studies: New methodological and theoretical traditions, vol. 300, ed. G. De Sutter, M.-A. Lefer, and I. Delaere, 81–112. Berlin/Boston: De Gruyter Mouton.
De Sutter, G., P. Goethals, T. Leuschner, and S. Vandepitte. 2012. Towards methodologically more rigorous corpus-based translation studies. Across Languages and Cultures 13 (2): 137–143.
De Sutter, G., M.-A. Lefer, and I. Delaere (eds.). 2017. Empirical translation studies: New methodological and theoretical traditions. Berlin, Boston: De Gruyter.
De Sutter, G., and M.-A. Lefer. 2019. On the need for a new research agenda for corpus-based translation studies: A multi-methodological, multifactorial and interdisciplinary approach. Perspectives-Studies in Translation Theory and Practice 28 (1): 1–23.
De Swert, K. 2012. Calculating inter-coder reliability in media content analysis using Krippendorff’s Alpha. Unpublished manuscript University of Amsterdam. https://www.polcomm.org/wp-content/uploads/ICR01022012.pdf.
Egbert, J., D. Biber, and M. Davies. 2015. Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology 66 (9): 1817–1831.
Fantinuoli, C., and F. Zanettin. 2015. New directions in corpus-based translation studies (Translation and Multilingual Natural Language Processing 1). Berlin: Language Science Press.
Geyken, A. 2007. The DWDS corpus: A reference corpus for the German language of the 20th century. In Collocations and Idioms: Linguistic, lexicographic, and computational aspects, ed. C. Fellbaum, 23–41. Continuum Press.
Granger, S., and M.-A. Lefer. 2020. The multilingual student translation corpus: A resource for translation teaching and research. Language Resources & Evaluation 54: 1183–1199. https://doi.org/10.1007/s10579-020-09485-6.
Halverson, S.L. 2013. Implications of cognitive linguistics for translation studies. In Cognitive linguistics and translation: Advances in some theoretical models and applications, ed. A. Rojo, and I. Ibarretxe-Antuñano, 33–74. Berlin/Boston: Mouton de Gruyter.
Halverson, S.L. 2015. Cognitive translation studies and the merging of empirical paradigms. The case of ‘literal translation’. Translation Spaces 4 (2): 310–340.
Halverson, S.L. 2017. Gravitational pull in translation: Testing a revised model. In Empirical translation studies: New methodological and theoretical traditions, ed. G. De Sutter, M.-A. Lefer, and I. Delaere, 9–46. Berlin/Boston: Mouton De Gruyter.
Hayes, A.F., and K. Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1 (1): 77–89.
Ji, M. 2016. Empirical translation studies. Interdisciplinary methodologies explored. Sheffield, UK: Equinox.
Kotze, H. 2020. Converging what and how to find out why: An outlook on empirical translation studies. In New empirical perspectives on translation and interpreting, ed. L. Vandevoorde, J. Daems, and B. Defrancq, 333–371. Routledge.
Kruger, H., and G. De Sutter. 2018. Alternations in contact and non-contact varieties. Reconceptualising that-omission in translated and non-translated English using the MuPDAR approach. Translation, Cognition & Behavior 1 (2): 251–290.
Kruger, H., and B. Van Rooy. 2012. Register and the features of translated language. Across Languages and Cultures 13 (1): 33–65.
Kruger, H., and B. Van Rooy. 2016. Constrained language: A multidimensional analysis of translated English and non-native indigenised varieties of English. English World-Wide 37 (1): 26–57.
Laviosa, S. 2002. Corpus-based translation studies. Theory, findings, applications. Amsterdam/New York: Rodopi.
Lefer, M.-A. 2020. Parallel corpora. In A practical handbook of corpus linguistics, ed. M. Paquot and S. Th. Gries, 257–282. Springer.
Macken, L., O. De Clercq, and H. Paulussen. 2011. Dutch parallel corpus: A balanced copyright-cleared parallel corpus. Meta 56 (2): 374–390.
Malamatidou, S. 2018. Corpus triangulation: Combining data and methods in corpus-based translation studies. London: Routledge.
Marcus, M.P., B. Santorini, and M.A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19 (2): 313–330.
Melamed, D.I. 1997. A portable algorithm for mapping bitext correspondence. In Proceedings of the 35th annual meeting of the association of computational linguistics (ACL), 305–312. Madrid, Spain.
Mellinger, C.D., and T.A. Hanson. 2016. Quantitative research methods in translation and interpreting studies. Abingdon, Oxon: Routledge.
Moore, R.C. 2002. Fast and accurate sentence alignment of bilingual corpora. In Proceedings of the 5th conference of the association for machine translation in the Americas, 135–144. Tiburon, California.
Neumann, S. 2013. Contrastive register variation. A quantitative approach to the comparison of English and German. Berlin: de Gruyter.
Oakes, M.P., and M. Ji. 2012. Quantitative methods in corpus-based translation studies: A practical guide to descriptive translation research. Philadelphia, PA/Amsterdam: Benjamins.
Olohan, M. 2004. Introducing corpora in translation studies. London: Routledge.
Paulussen, H., L. Macken, W. Vandeweghe, and P. Desmet. 2013. Dutch parallel corpus: A balanced parallel corpus for Dutch-English and Dutch-French. In Essential speech and language technology for Dutch: Results by the STEVIN-programme, ed. P. Spyns, and J. Odijk, 185–199. Berlin, Germany: Springer.
Qi, P., Y. Zhang, Y. Zhang, J. Bolton, and C.D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th annual meeting of the association for computational linguistics: System demonstrations, 101–108. Online.
Vandevoorde, L., J. Daems, and B. Defrancq, eds. 2020. New empirical perspectives on translation and interpreting. Routledge.
Van Eynde, F., J. Zavrel, and W. Daelemans. 2000. Part of speech tagging and lemmatisation for the spoken Dutch corpus. In Proceedings of the second language resources and evaluation conference (LREC), ed. M. Gavrilidou et al., 1427–1433. Athens, Greece.
Xiao, R., and X. Hu. 2015. Corpus-based studies of translational Chinese in English-Chinese translation. Springer.

Author information

Authors and Affiliations

Empirical and Quantitative Translation and Interpreting Studies (EQTIS), Department of Translation, Interpreting and Communication, Ghent University, Groot-Brittanniëlaan 45, 9000, Ghent, Belgium
Ryan Reynaert & Gert De Sutter
Language and Translation Technology Team (LT3), Department of Translation, Interpreting and Communication, Ghent University, Groot-Brittanniëlaan 45, 9000, Ghent, Belgium
Lieve Macken & Arda Tezcan

Corresponding author

Correspondence to Gert De Sutter.

Editor information

Editors and Affiliations

Department of English, The University of Macau, Macao, Macao
Vincent X. Wang
School of Languages and Translation, Macao Polytechnic Institute, Macao, Macao
Lily Lim
Department of English, University of Macau, Macao, Macao
Defeng Li

Appendix

Questionnaire for translators

1. Documents or websites translated (please mention the title of each text):
2. Translation direction:
3. Collaborative translation:
   • yes
   • no
4. Translator’s gender:
   • m
   • f
   • x
5. Translator’s degree:
   • no specific language degree
   • translation Master
   • translation Bachelor
   • language and literature
   • interpreting
6. Experience as a translator (in years):
7. Translator’s year of birth:
8. Translation tools or memory involved:
   • none, manual translation
   • CAT-tool, i.e. ________________________________
   • post-editing of machine translation, i.e. ____________________________
9. Translation directionality:
   • L1 (first language)
   • L2 (foreign language)
10. Translator’s status:
   • freelance
   • in-house
   • both
11. Use of style guides:
   • in-house guidelines
   • in-house glossary
   • both
   • none
12. Domain expertise (regarding the text’s topic):
   • expert
   • non-expert
13. External revision:
   • monolingual (only translation)
   • bilingual (source text and translation)
   • no revision
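
To illustrate how the questionnaire answers could support metadata-based text selection, the sketch below models a few of the fields as a Python record and filters on them. The class name, field names, value spellings and the two example records are all hypothetical; they do not reflect the actual DPC 2.0 file format or any real corpus text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TranslationRecord:
    """One translated text with a subset of the questionnaire answers."""
    text_id: str                      # hypothetical identifier
    direction: str                    # e.g. "NL_EN"
    collaborative: bool               # question 3
    degree: Optional[str]             # question 5, None if left unspecified
    experience_years: Optional[int]   # question 6, None if left unspecified
    tool: str                         # question 8: "manual", "cat" or "post-editing"
    directionality: str               # question 9: "L1" or "L2"
    revision: str                     # question 13: "monolingual", "bilingual" or "none"


def select(records: List[TranslationRecord], **criteria) -> List[TranslationRecord]:
    """Return the records whose fields equal every given criterion."""
    return [
        record
        for record in records
        if all(getattr(record, name) == value for name, value in criteria.items())
    ]


# Invented example records, for illustration only.
records = [
    TranslationRecord("example-001", "NL_EN", False, "translation Master", 12,
                      "cat", "L2", "bilingual"),
    TranslationRecord("example-002", "NL_EN", False, None, None,
                      "post-editing", "L1", "none"),
]

post_edited = select(records, tool="post-editing")
print([record.text_id for record in post_edited])  # ['example-002']
```

Optional fields are kept as `None` rather than given a default, since note 13 reports that translator-specific answers were often left unspecified.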

About this chapter

Cite this chapter

Reynaert, R., Macken, L., Tezcan, A., De Sutter, G. (2021). Building a New-Generation Corpus for Empirical Translation Studies: The Dutch Parallel Corpus 2.0. In: Wang, V.X., Lim, L., Li, D. (eds) New Perspectives on Corpus Translation Studies. New Frontiers in Translation Studies. Springer, Singapore. https://doi.org/10.1007/978-981-16-4918-9_4

DOI: https://doi.org/10.1007/978-981-16-4918-9_4
Published: 12 October 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-4917-2
Online ISBN: 978-981-16-4918-9
eBook Packages: Education, Education (R0)