
Abstract

Since its establishment 10 years ago, the Center for Research and Innovation in Translation and Translation Technology (CRITT) at the Copenhagen Business School has been engaged in Translation Process Research (TPR). TPR data was initially collected with the Translog tool and released in 2012 as the Translation Process Research Database (TPR-DB). Since 2012 many more experiments have been conducted and more data has been added to the TPR-DB. In particular, within the CASMACAT project (Sanchis-Trilles et al. 2014) a large amount of TPR data for post-editing machine translation was recorded, and the TPR-DB has been made publicly available under a Creative Commons license. At the time of writing, the TPR-DB contains almost 30 studies of translation, post-editing, revision, authoring and copying tasks, recorded with Translog and with the CASMACAT workbench. Each study consists of between 8 and more than 100 recording sessions, and altogether more than 300 translators are involved. Currently, the data amounts to more than 500 h of text production time gathered in more than 1400 sessions, with more than 600,000 translated words in more than 10 different target languages.

This chapter describes the features and visualization options of the TPR-DB. The database contains the recorded logging data, as well as derived and annotated information assembled in seven kinds of simple and compound process- and product units, which are suited to investigate human and computer-assisted translation processes and advanced user modelling.


Notes

  1. The database is freely available under a Creative Commons license and can be downloaded free of charge from https://sites.google.com/site/centretranslationinnovation/tpr-db

  2. While the UAD logged by Translog-II and CASMACAT differs slightly, the structure of the generated tables is identical.

  3. The example is taken from a CASMACAT study. Tasks are R: revision, P: post-editing and PIO: interactive post-editing with online learning. A full list of task descriptions is given in Appendix 2.

  4. A large number of different pause thresholds have been suggested and used. Vandepitte et al. (2015) segment keystroke sequences at 200 ms, while Lacruz and Shreve (2014: 250) find that “complete editing events are separated by long pauses (5 s or more.) They normally contain short pauses (more than 0.5 s, but less than 2 s,) and more effortful complete editing events will often include multiple short pauses. Post-editors may make intermediate duration pauses (more than 2 s, but less than 5 s) during a complete editing event”. Jakobsen (2005) suggests 2.4 s for his definition of “peak performance”.
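These threshold choices can be made concrete with a small sketch that splits a keystroke log into bursts at a given pause threshold (the timestamps below are invented for illustration):

```python
def segment_by_pause(timestamps_ms, threshold_ms=200):
    """Split a sorted list of keystroke timestamps (in ms) into
    bursts: a new burst starts whenever the pause between two
    consecutive keystrokes reaches the threshold."""
    bursts, current = [], []
    prev = None
    for t in timestamps_ms:
        if prev is not None and t - prev >= threshold_ms:
            bursts.append(current)
            current = []
        current.append(t)
        prev = t
    if current:
        bursts.append(current)
    return bursts

keys = [0, 120, 180, 900, 950, 4000]
# 200 ms threshold (Vandepitte et al.): three bursts
print(segment_by_pause(keys, 200))   # [[0, 120, 180], [900, 950], [4000]]
# 5 s threshold (Lacruz and Shreve's long pause): one burst
print(segment_by_pause(keys, 5000))  # [[0, 120, 180, 900, 950, 4000]]
```

Varying the threshold between 200 ms and 5 s shows how strongly the resulting unit boundaries depend on this single parameter.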

  5. The letters in brackets in the list represent the file extensions in the TPR-DB. The section in italics points to the section where the table is described in more detail.

  6. British National Corpus: http://www.natcorp.ox.ac.uk/

  7. We follow the suggestions of the Kertz Lab, as in https://wiki.brown.edu/confluence/display/kertzlab/Eye-Tracking+While+Reading

  8. The InfuseIDFX.pl script is part of the TPR-DB and can be downloaded from the TPR-DB website, https://sites.google.com/site/centretranslationinnovation/tpr-db

  9. The script AddExtColumns.pl can be downloaded from https://svn.code.sf.net/p/tprdb/svn/ and called with the parameters AddExtColumns.pl -C ExtraColumnsFile -S Study_name

References

  • Alves, F., & Vale, D. C. (2011). On drafting and revision in translation: A corpus linguistics oriented analysis of translation process data. Translation: Corpora, Computation, Cognition. Special Issue on Parallel Corpora: Annotation, Exploitation, Evaluation, 1(1), 105–122. http://www.t-c3.org/

  • Carl, M. (2012a). Translog-II: A program for recording user activity data for empirical reading and writing research. In The eighth international conference on language resources and evaluation (pp. 2–6), May 21–27, 2012, Istanbul, Turkey.

  • Carl, M. (2012b). The CRITT TPR-DB 1.0: A database for empirical human translation process research. In S. O’Brien, M. Simard, & L. Specia (Eds.), Proceedings of the AMTA 2012 workshop on post-editing technology and practice (WPTP 2012) (pp. 9–18). Stroudsburg, PA: Association for Machine Translation in the Americas (AMTA).

  • Carl, M., & Kay, M. (2011). Gazing and typing activities during translation: A comparative study of translation units of professional and student translators. Meta, 56(4), 952–975.

  • Germann, U. (2008). Yawat: Yet another word alignment tool. In Proceedings of the ACL-08: HLT demo session (Companion Volume) (pp. 20–23). Columbus, OH: Association for Computational Linguistics.

  • Jakobsen, A. L. (2002). Translation drafting by professional translators and by translation students. In G. Hansen (Ed.), Empirical translation studies: Process and product (pp. 191–204). Copenhagen: Samfundslitteratur.

  • Jakobsen, A. L. (2005). Instances of peak performance in translation. Lebende Sprachen, 50(3), 111–116.

  • Jakobsen, A. L. (2011). Tracking translators’ keystrokes and eye movements with Translog. In C. Alvstad, A. Hild, & E. Tiselius (Eds.), Methods and strategies of process research: Integrative approaches in translation studies (Benjamins Translation Library, Vol. 94, pp. 37–55). Amsterdam: John Benjamins.

  • Jakobsen, A. L., & Schou, L. (1999). Translog documentation. In G. Hansen (Ed.), Probing the process in translation: Methods and results (pp. 1–36). Copenhagen: Samfundslitteratur.

  • Lacruz, I., & Shreve, G. M. (2014). Pauses and cognitive effort in post-editing. In S. O’Brien, L. W. Balling, M. Carl, M. Simard, & L. Specia (Eds.), Post-editing of machine translation: Processes and applications (pp. 246–274). Newcastle upon Tyne: Cambridge Scholars Publishing.

  • Leijten, M., & Van Waes, L. (2013). Keystroke logging in writing research: Using Inputlog to analyze and visualize writing processes. Written Communication, 30(3), 358–392.

  • Sanchis-Trilles, G., Alabau, V., Buck, C., Carl, M., Casacuberta, F., Martinez, M. G., et al. (2014). Interactive translation prediction versus conventional post-editing in practice: A study with the CasMaCat workbench. Machine Translation, 28(3–4), 217–235.

  • Vandepitte, S., Hartsuiker, R. J., & Van Assche, E. (2015). Process and text studies of a translation problem. In A. Ferreira & J. W. Schwieter (Eds.), Psycholinguistic and cognitive inquiries into translation and interpreting (pp. 127–143). Amsterdam: John Benjamins.


Acknowledgement

This work was supported by the CASMACAT project funded by the European Commission (7th Framework Programme). We are grateful to all contributors to the database and for allowing us to use their data.

Correspondence to Michael Carl.


Appendices

Appendix 1

Overall, the TPR-DB contains more than 580 h of text production time in terms of Fdur duration. The 1689 sessions involved 132 different translators, who together produced more than 660,000 words in 9 different languages.

The language pair en → es is by far the largest represented in the TPR-DB, with 660 sessions, 500,000 target words and more than 320 h of Fdur production time. The second most represented language pair is en → hi, with 161 sessions, more than 20,000 tokens in the Hindi translations and more than 46 h of Fdur production time. The third language pair is en → de, with 146 sessions, more than 24,000 tokens in the German translations and more than 24 h of Fdur production time, followed by en → da with 127 sessions, more than 18,000 tokens in the Danish translations and 12 h of Fdur production time. The remaining language pairs in the TPR-DB cover more than 20 translation directions with 7 different source and 16 target languages (this includes language directions not shown in Table 2.21). Please consult the TPR-DB website for an updated version of the database contents.

Table 2.21 Summary table for TPR-DB studies

Each study in the TPR-DB was conducted with a (set of) research question(s) in mind, which can be roughly summarized as follows:

  1. (A) The TPR-DB contains ten studies conducted with the three different CASMACAT workbenches:

    1. ALG14: This study compares professional translators and bilinguals while post-editing with the third prototype of the CASMACAT workbench, featuring visualization of word alignments.

    2. CEMPT13: This study contains post-editing recordings with the second prototype of the CASMACAT workbench, featuring interactive machine translation.

    3. CFT12: This study contains data of the first CASMACAT field trial from June 2012, comparing post-editing with from-scratch translation.

    4. CFT13: This study contains data of the second CASMACAT field trial from June 2013, comparing post-editing and interactive machine translation.

    5. CFT14: This study contains data of the third CASMACAT field trial from June 2014, comparing interactive machine translation and online learning.

    6. EFT14: The study compares active and online learning during interactive translation prediction.

    7. JN13: This study was recorded with the second prototype of the CASMACAT workbench, featuring interactive machine translation and word alignments.

    8. LS14: This study investigates learning effects with interactive post-editing over a period of 6 weeks (longitudinal study) with the third prototype of the CASMACAT workbench.

    9. PFT13: This study is a pre-field trial test prior to the second CASMACAT field trial.

    10. PFT14: This study is a pre-field trial test prior to the third CASMACAT field trial.

  2. (B) The aim of the MultiLingual experiment is to compare from-scratch translation (T), post-editing (P) and monolingual post-editing (E) for different translators and different languages. The six English source texts were translated by student and experienced translators; three texts (1–3) are news texts, and three texts (4–6) are sociological texts from an encyclopedia. Texts were permuted in a systematic manner so as to ensure that each text was translated by every translator and every translator translated two different texts in each translation mode.

    11. BML12: This study contains translating, post-editing and editing data of the six texts from English into Spanish.

    12. KTHJ08: This study contains only translation data for the news texts 1–3.

    13. MS12: This study contains translating, post-editing and editing of the six texts from English into Chinese.

    14. NJ12: This study contains translating, post-editing and editing of the six texts from English into Hindi by professional translators.

    15. SG12: This study contains translating, post-editing and editing of the six texts from English into German.

    16. TDA14: In this study participants were asked to copy the six English texts.

    17. WARDHA13: This study contains translating, post-editing and editing of the six texts from English into Hindi by students.

  3. (C) In addition, the TPR-DB contains a number of individual experiments that were conducted with Translog-II:

    18. ACS08: This study explores the way in which translators process the meaning of non-literal expressions by investigating the gaze times associated with these expressions.

    19. BD08: This study involves Danish professional translators working from English into Danish.

    20. BD13: This study involves secondary school students translating and post-editing from English into Danish.

    21. DG01: The study compares students, professional and non-professional translators with and without a representation of the text.

    22. GS12: This study contains post-editing data of four news texts from Spanish into English.

    23. HLR13: This is a translation study from English into Estonian (5 participants translating 3 different texts).

    24. JLG10: This study investigates L1 and L2 translations from/to English and Brazilian Portuguese.

    25. LWB09: This study reports on an eye-tracking experiment in which professional translators were asked to translate two texts from L1 Danish into L2 English.

    26. MS13: This study investigates translators’ behaviour when translating and post-editing between Portuguese and Chinese, in both language directions.

    27. RH12: This is an authoring study for the production of news by two Spanish journalists.

    28. ROBOT14: This study investigates the usage of external resources during translation and post-editing.

    29. ZHPT12: This study investigates translators’ behaviour when translating journalistic texts, with the specific aim of exploring how non-literal (metaphoric) expressions are processed during translation.

Appendix 2

During each session a particular Task is conducted, as follows:

  • A: Authoring of a journalistic text. Source and target languages are identical.

  • C: Copying a text (manually) from the source window into the target window. Source and target languages are identical.

  • E: Monolingual post-editing of MT output without access to the source text.

  • P: Traditional post-editing of MT output (no additional help is provided during the process).

  • R: Review of post-edited text.

  • T: Translation ‘from-scratch’.

Within the CASMACAT context, a large number of different post-editing settings were investigated:

  • PA: Traditional post-editing visualizing source (ST) and target (TT) alignment links (triggered by mouse or cursor).

  • PI: Advanced post-editing through interactive translation prediction (ITP) / interactive machine translation.

  • PIA: Advanced post-editing through ITP showing ST-TT alignments (visualization option).

  • PIC: Advanced post-editing through ITP showing ST-TT alignments (visualization option).

  • PIO: Advanced post-editing through ITP and online learning techniques.

  • PIL: Advanced post-editing through ITP showing the post-edited text (suffix) in grey (visualization option).

  • PIV: Advanced post-editing through ITP showing Search&Replace bar, alignments and mouse-triggered alternative ITP options.

  • PIVA: Advanced post-editing through ITP and active learning techniques.

  • PIVO: Advanced post-editing through ITP and online learning techniques.

Appendix 3

This appendix lists all features used in TPR-DB v2 to describe the unit tables: in total 275 feature occurrences, of which 111 are distinct, across the 11 different unit tables discussed in this chapter. The features are clustered here into 12 types, according to whether they describe a session, segment, token, keyboard or gaze behaviour, etc. The unit tables in which a feature appears are indicated in parentheses.
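Assuming the unit tables are plain tab-separated files with a header row naming the features (the concrete file layout should be checked against the TPR-DB documentation), a table can be read with nothing more than the Python standard library; the two-row sample below is invented for illustration:

```python
import csv
import io

def read_unit_table(f):
    """Read a TPR-DB-style unit table from a file object,
    assumed to be tab-separated with a header row, and return
    one dict per row keyed by feature name."""
    return list(csv.DictReader(f, delimiter="\t"))

# hypothetical mini ST table using a few of the features listed in this appendix
sample = io.StringIO(
    "Study\tSession\tSToken\tTToken\tDur\n"
    "BML12\tP01_1_T\thouse\tcasa\t540\n"
    "BML12\tP01_1_T\tred\troja\t380\n"
)
rows = read_unit_table(sample)
print(rows[0]["SToken"], rows[0]["TToken"])  # house casa
```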

  1. Session data: these features describe the sessions of a study:

    • Study: Study name as in the TPR-DB (AU, EX, PU, SG, SS, ST, TT)

    • Session: Session name, a composite of Participant, Text and Task (AU, CU, EX, PU, SG, SS, ST, TT)

    • Text: Text identifier in the study (AU, SS, ST, TT)

    • Task: Type of task, see Appendix 2 (AU, SS, ST, TT)

    • Part: Participant ID of study (AU, ST, TT, SS)

    • SL: Source text language (AU, SS, ST, TT)

    • TL: Target text language (AU, SS, ST, TT)

    • Break: Duration of session break (SS)

    • TimeR: Starting time of revision phase (SS)

    • TimeD: Starting time of drafting phase (SS)
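Since Session is described above as a composite of Participant, Text and Task, it can be unpacked programmatically; the underscore-separated shape assumed here (e.g. P01_1_T) is an illustration and should be checked against actual session names in the TPR-DB:

```python
def parse_session(name):
    """Split a session name assumed to have the shape
    '<Participant>_<Text>_<Task>', e.g. 'P01_1_T'."""
    part, text, task = name.split("_")
    return {"Part": part, "Text": text, "Task": task}

print(parse_session("P01_1_T"))  # {'Part': 'P01', 'Text': '1', 'Task': 'T'}
```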

  2. Segment: information related to segments:

    • Seg: Source or target segment identifier, depending on Win feature (FD)

    • STseg: Source segment identifier (AU, PU, SG, SS, ST)

    • Nedit: Number of times the segment was edited (SG)

    • TTseg: Target segment identifier (AU, CU, KD, PU, TT, SG, SS)

    • LenS: Length in characters of the source segment (SG, SS)

    • LenT: Length in characters of the target segment (SG, SS)

    • LenMT: Length in characters of the pre-filled MT segment (SG)

    • TokS: Number of source tokens in segment (SG, SS)

    • TokT: Number of target tokens in segment (SG, SS)

    • Literal: Degree of segment literality (SG)


  3. Tokens: information concerning source and target text tokens in the translation product:

    • STId: unique identifier of source text token (FD, KD, PU, ST, TT)

    • TTId: unique identifier of target text token (FD, KD, PU, ST, TT)

    • SAU: Source text segment string (AU)

    • TAU: Target text segment string (AU)

    • SAUnbr: Number of tokens in source side of alignment unit (AU, ST, TT)

    • TAUnbr: Number of tokens in target side of alignment unit (AU, ST, TT)

    • SToken: Source text token (ST, TT)

    • TToken: Target text token (ST, TT)

    • Lemma: Lemma of token (ST, TT)

    • PoS: Part-of-Speech of token (ST, TT)

    • PosS: Part-of-Speech of source token sequence (PU)

    • PosT: Part-of-Speech of target token sequence (PU)

    • Prob1: Probability of uni-gram occurrence (ST, TT)

    • Prob2: Probability of bi-gram occurrence (ST, TT)

  4. Translation literality metrics:

    • AltT: number of different translation alternatives (ST)

    • CountT: number of observed current translation choice (ST)

    • ProbT: Probability of current translation choice (ST)

    • HTra: Word translation entropy (SG, ST)

    • HSeg: Translation segmentation entropy (SG, ST)

    • Cross: Cross value of token (AU, ST, TT)

    • CrossS: Cross value for source tokens (PU, SG)

    • CrossT: Cross value for target tokens (PU, SG)

    • Literal: Degree of segment literality (SG)
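Word translation entropy (HTra) is computed from the distribution of observed translation choices. A minimal sketch, assuming HTra is the Shannon entropy over the translation probabilities (ProbT) of a source word's alternatives:

```python
from math import log2

def word_translation_entropy(counts):
    """Entropy over a source word's observed translation
    alternatives; `counts` maps each alternative (AltT) to its
    observed frequency (CountT)."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]  # ProbT per alternative
    return -sum(p * log2(p) for p in probs)

# one dominant translation -> low entropy (about 0.47 bits)
print(round(word_translation_entropy({"casa": 9, "hogar": 1}), 2))
# two equally frequent translations -> exactly 1 bit
print(word_translation_entropy({"casa": 5, "hogar": 5}))  # 1.0
```

The more evenly a source word's translations are distributed across alternatives, the higher the entropy, which is why HTra is used as a literality indicator.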

  5. Keystrokes: information concerning keystroke activities:

    • KDid: keystroke ID (KD)

    • Del: Number of manual and automatic deletions (AU, PU, ST, TT)

    • Ins: Number of manual and automatic insertions (AU, PU, ST, TT)

    • Adel: Number of automatically generated deletions (SG, SS)

    • Ains: Number of automatically generated insertions (SG, SS)

    • Mdel: Number of manually generated deletions (SG, SS)

    • Mins: Number of manually generated insertions (SG, SS)

    • Char: UTF8 character typed or deleted (KD)

    • Munit: Number of micro units (AU, ST, TT)

    • Edit: Sequence of keystrokes producing TT string (AU, EX, FD, PU, ST, TT)

    • Edit1: Sequence of keystrokes of the first micro unit (AU, ST, TT)

    • Edit2: Sequence of keystrokes of the second micro unit (AU, ST, TT)

    • InEff: Inefficiency measure for segment generation (AU, ST, TT)

    • Scatter: Amount of non-linear text production (PU, SG, SS)

  6. Gaze on source and target window:

    • Path: Sequence of fixations on source or target window (FU)

    • FFTime: Starting time of first fixation (ST, TT)

    • FFDur: Duration of first fixation (ST, TT)

    • FPDurS: First pass duration on source text unit (AU, ST, TT)

    • FPDurT: First pass duration on target text unit (AU, ST, TT)

    • FixS: Number of fixations on source text unit (AU, PU, SG, SS, ST, TT)

    • FixT: Number of fixations on target text unit (AU, PU, SG, SS, ST, TT)

    • TrtS: Total gaze time on source text unit (AU, SG, SS, ST, TT)

    • TrtT: Total gaze time on target text unit (AU, SG, SS, ST, TT)

    • FixS1: Number of fixations on source text unit during production of first micro unit (AU, ST, TT)

    • FixS2: Number of fixations on source text unit during production of second micro unit (AU, ST, TT)

    • FixT1: Number of fixations on target text unit during production of first micro unit (AU, ST, TT)

    • FixT2: Number of fixations on target text unit during production of second micro unit (AU, ST, TT)

    • RPDur: Regression path duration (ST, TT)

    • Regr: Boolean value indicating whether regression started from token (ST, TT)

  7. Concurrent keyboard and gaze activities:

    • ParalK: Parallel keyboard activity during gaze activity (FU, FD)

    • ParalS: Parallel source text gaze activity during typing (PU)

    • ParalT: Parallel target text gaze activity during typing (PU)

    • ParalS1: Parallel source text gaze activity during typing micro unit one (AU, ST, TT)

    • ParalS2: Parallel source text gaze activity during typing micro unit two (AU, ST, TT)

    • ParalT1: Parallel target text gaze activity during typing micro unit one (AU, ST, TT)

    • ParalT2: Parallel target text gaze activity during typing micro unit two (AU, ST, TT)

  8. Starting times and durations of units and phases:

    • Dur: Duration of unit production time (AU, CU, EX, FD, FU, PU, SG, SS, ST, TT)

    • Dur1: Duration of first micro unit production time (AU, ST, TT)

    • Dur2: Duration of second micro unit production time (AU, ST, TT)

    • Fdur: Duration of segment production time excluding keystroke pauses ≥ 200 s (SG, SS)

    • Kdur: Duration of coherent keyboard activity excluding keystroke pauses ≥5 s (SG, SS)

    • Pdur: Duration of coherent keyboard activity excluding keystroke pauses ≥ 1 s (SG, SS)

    • Pnum: Number of production units (SG, SS)

    • Time: Starting time of unit (CU, EX, FD, FU, KD, PU)

    • Time1: Starting time of first micro unit (AU, ST, TT)

    • Time2: Starting time of second micro unit (AU, ST, TT)

    • TimeR: Starting time of revision phase (SS)

    • TimeD: Starting time of drafting phase (SS)
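The pause-filtered durations above (Kdur, Fdur) follow a common pattern: sum the inter-keystroke intervals of a segment while discarding pauses at or above a cutoff. A simplified sketch of that idea (the actual TPR-DB computation may differ in detail):

```python
def duration_excluding_pauses(timestamps_ms, cutoff_ms):
    """Sum inter-keystroke intervals, dropping any pause that
    reaches the cutoff (e.g. 5000 ms for a Kdur-like measure)."""
    total = 0
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        gap = cur - prev
        if gap < cutoff_ms:
            total += gap
    return total

keys = [0, 1000, 2000, 10000, 11000]
print(duration_excluding_pauses(keys, 5000))  # 3000: the 8 s pause is dropped
```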

  9. Pausing before the starting time of a unit:

    • Pause: Pause between end of previous and start of current unit (FU, PU)

    • Pause1: Pause between end of previous unit and start of first micro unit (AU, ST, TT)

    • Pause2: Pause between end of previous unit and start of second micro unit (AU, ST, TT)

  10. GUI-related information:

    • Win: Window in which gaze activity was recorded; 1: source text, 2: target text window (FD)

    • Cursor: Character offset at which activity (keystrokes, fixations) was recorded (FD, KD)

  11. External resources:

    • Focus: Name of the window in focus (EX)

    • KDidL: ID of last keystroke before leaving Translog-II (EX)

    • KDidN: ID of next keystroke after returning to Translog-II (EX)

    • STidN: ID of next source token after returning to Translog-II (EX)

    • STidL: ID of last source token before leaving Translog-II (EX)

    • STsegL: Source segment identifier of last event (EX)

    • STsegN: Source segment identifier of next event (EX)

  12. Miscellaneous features:

    • Type: Type of keystroke: [AM]ins, [AM]del (KD)

    • Type: Type of activity unit, as discussed in Sect. 2.4.5 (CU)

    • Label: Label for activity units (CU)


Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Carl, M., Schaeffer, M., Bangalore, S. (2016). The CRITT Translation Process Research Database. In: Carl, M., Bangalore, S., Schaeffer, M. (eds) New Directions in Empirical Translation Process Research. New Frontiers in Translation Studies. Springer, Cham. https://doi.org/10.1007/978-3-319-20358-4_2


  • DOI: https://doi.org/10.1007/978-3-319-20358-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20357-7

  • Online ISBN: 978-3-319-20358-4
