Predictive Turn in Translation Studies: Review and Prospects

  • Moritz Schaeffer
  • Jean Nitzke
  • Silvia Hansen-Schirra
Living reference work entry


Translation studies – like other disciplines – is influenced by digitization, big data analytics, and artificial intelligence. Two major scientific developments exploit digital and data-driven methods and are currently triggering a “predictive turn” in translation studies: machine learning approaches to translation and computational modelling of the human translation process. This chapter presents a literature review of these two areas and explains the development of the “predictive turn.” The impact of machine translation on the market, practice, and theory of translation changes the way a translation is defined; it is no longer necessarily only a human product or service. Using behavioral and imaging techniques – e.g., eye tracking, electroencephalography (EEG), and functional magnetic resonance imaging (fMRI) – the hitherto speculatively investigated translation process is becoming increasingly predictable. Machine translation aims to predict the product of a process which used to be exclusively human and which is being modelled by translation process researchers. The ever-tighter integration of human and machine means that risks resulting from this interaction (e.g., in terms of error rates) must be calculated differently. While not yet possible, a fully implemented model of the translation process which can predict when and why a translator is having trouble carrying out the task could be integrated with a machine translation system that provides the information the human needs to solve the problem at hand. It is also very likely that machine translation systems will learn from the human process so that they will eventually become even better at predicting what and how the human would translate.
The integration between human and machine will, just like in all other aspects of life, become more intimate, radicalizing the risks and benefits associated with this conversation between mathematical models and human behavior and cognition.


Keywords: Translation process · Translation behavior · Machine translation · Post-editing


Long gone are the days when translators wrote their texts in longhand, and equally long gone are the days when they relied on hardcopies of dictionaries and other reference sources printed on paper. The impact of translation technologies that help the translator translate texts efficiently and consistently has been evident since the 1990s, and it is increasingly being felt by practitioners: Translation memory (TM) systems store aligned source and target text sentences in a database and suggest previously translated segments whenever a source text segment occurs which is the same as, or similar to, one that has already been translated (Heyn 1998; Dragsted 2004). Examples of proprietary systems are SDL Trados, Déjà Vu, Across, and Star Transit; tools accessible online include, for instance, Wordfast Anywhere, OmegaT, and Pootle. Terminology management systems help create powerful glossaries whose functions far exceed those of a mere term list (e.g., SDL MultiTerm, TermStar). In addition, more and more professional translators also work with integrated machine translation systems of one kind or another. Given that machine translation (MT) output is often not comparable in quality to purely human translation, humans correct the mistakes the machine makes, that is, translators post-edit the MT output.

Typically, MT systems are developed by computational linguists within large projects funded by the European Union (e.g., EuroMatrix; Eisele et al. 2008). They may, however, also be developed by companies or institutions that have access to big parallel corpora (e.g., Google) or have a high demand for translations into various languages (for instance, the European Union or companies in the area of technical documentation such as Siemens) (for an introduction to machine translation and its political dimension, see Hutchins and Somers 1992). Providers of translation tools implement MT systems into TMs and other translation databases and sell MT solutions as plug-ins within established translation tools. The cost of such a setting depends on the quality of the MT system, its accessibility (open access vs. proprietary), the number of licenses, and the data management strategy. If a company wants to use its own MT models, trained on its own domain-specific training data and terminology but maintained, supported, and hosted by the provider, the systems are quite expensive and may cost several thousand euros. However, depending on the modalities, MT may also be much cheaper or even free (e.g., Google Translate). Such systems may in turn have restricted usage or bear risks in terms of data security.

All these technological advances result in significant productivity gains which may vary according to the kind of text being translated, the quality of the MT system, and the purpose for which the translation is being produced. Depending on the use for which the text is intended, it may suffice to produce a translation which is factually correct and does not present major shortcomings in terms of grammar and syntax, but which may be considered substandard compared to a human translation because of stylistic discrepancies, inconsistent terminology, etc. This process is known as light (or minimal or rapid) post-editing. In general, this type of post-editing is not intended to produce a publishable version of the target text; instead, it facilitates the gathering of information. In contrast, full post-editing is expected to produce a text which is ready for publication. The grammar, syntax, terminology, and formatting must generally be correct. Stylistic perfection, however, is not expected even with this type of post-editing, as long as the flow of the target text is not significantly impaired. Translation problems typically arise when there are ambiguities in the source text. For instance, Čulo et al. (2014) discuss an example in which the generic English noun “nurse,” which referred to a male person in the source text, was translated with the word for a female nurse in the German MT output. The reason for this might lie in the fact that this translation equivalent was far more frequent in the training data of the MT system than the male alternative. Other translation problems include idioms, collocations, and typological differences between languages (Čulo et al. 2014).

Relating MT quality to post-editing effort, Krings (1986) distinguishes between technical, temporal, and cognitive effort. Technical effort in post-editing is related to the number of keystrokes; inserted and deleted characters can be recorded and quantified with keylogging software. Temporal effort refers simply to the amount of time needed to process a segment or a text. Although there are exceptions, keystrokes and time are generally related. Of course, cognitive effort is also related to keystrokes and time: the more typing occurs, the longer the process takes, and the more cognitive effort is typically entailed. However, significant cognitive effort can also be necessary to correct a single word if, for example, this requires substantial reading in the source text to understand the context or if the mistranslation is subtle and requires a great deal of research. Cognitive effort is usually measured on the basis of eye movements, which are detected by an eye tracker. While reading, the eyes normally move from word to word and remain on a given word for a certain time. This rest period is known as a fixation. The longer the eyes come to rest, the greater the cognitive effort tied to the processing of this word. The duration of a fixation and the number of fixations on a word, sentence, or text are typical measures of cognitive processing. The pupil size is also recorded: the larger the pupil (compared to a baseline), the greater the cognitive effort. Figure 1 shows a progression graph of the translation of the English sentence “This gives us a human touch you can feel right across the organisation” (on the left axis) into German. The final target sentence can be found on the right axis. Translation activities are mapped onto a timescale spanning from 286,000 to 310,000 milliseconds (ms).
The figure visualizes gaze fixations on the source text (ST) by means of blue boxes with dots, gaze fixations on the target text (TT) by means of green boxes with diamonds, and keystrokes, viz., insertions in black and deletions in red (the example in the figure does not, however, include any deletions). In this case, the translator first read the whole sentence within the first 3000 ms. This is what Carl and Kay (2011) identified as the orientation phase (see also below). Then the translator re-read the first six words and started the translation process by typing “Das verlei,” followed by reading the insertions just made and continuing with the verb “eiht” (which includes a typo that was corrected in the revision phase later on). The fixations on the target verb indicate processing effort, which can be interpreted in various ways (e.g., retrieval processes in the mental lexicon, monitoring, deep processing of semantic or syntactic structures). Progression graphs are used for the analysis of scanpaths and translation activity units as suggested by Schaeffer et al. (2016) and described below in further detail. They, in turn, shed light on simultaneous versus consecutive processing during translation.
Fig. 1

Progression graph including eye tracking and keylogging data. (Source: authors)
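The basic effort measures discussed above (fixation counts, total fixation durations, and pupil dilation against a baseline) can be computed in a few lines. The records below are invented, and the tuple layout is illustrative rather than that of any particular eye tracker or of the data behind Fig. 1.

```python
from collections import defaultdict

# Hypothetical fixation records: (fixated word, duration in ms, pupil size in mm)
fixations = [
    ("this", 210, 3.1), ("gives", 180, 3.2), ("touch", 420, 3.6),
    ("touch", 250, 3.5), ("feel", 190, 3.2),
]

counts = defaultdict(int)       # number of fixations per word
durations = defaultdict(int)    # total fixation duration per word
for word, dur, _pupil in fixations:
    counts[word] += 1
    durations[word] += dur

# Mean pupil dilation relative to a (hypothetical) baseline of 3.0 mm
baseline = 3.0
dilation = sum(p - baseline for _, _, p in fixations) / len(fixations)

print(counts["touch"], durations["touch"])  # 2 670
print(round(dilation, 2))                   # 0.32
```

In this toy data, “touch” attracts two fixations totalling 670 ms, marking it as the locus of greatest processing effort.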

The efficiency gains resulting from these technologies need to be counterbalanced against the risks associated with this process: a low-quality target text might “only” cause a loss of reputation in one case, but in other cases, it might also cause property damage or injuries (for instance, if marketing materials are translated improperly without taking cultural specificities into account). Canfora and Ottmann (2015: 315) argue that translations must also be incorporated into risk management since they can entail significant risks for companies. Examples would be the incorrect translation of disclaimers or dysfunctional operating instructions that cause personal injury or danger. The goal is to achieve an ideal balance between the level of risk incurred and the resulting opportunities, that is, translations. The use of MT must likewise be considered in the risk management of a company’s overall business processes (Nitzke et al. 2019).

For a long time, models of the translation process were speculative in nature. As a result of methods imported from psychology and the neurosciences, it is now possible to study the cognitive architectures that play a role during translation empirically. On the basis of behavioral data, translation process research sheds light on the cognitive processes involved in translation. Translators’ strategies, typical translation patterns, and cognitive effort become measurable and quantifiable. This, in turn, makes it possible for the first time to model the translation process empirically. As a consequence of digital developments including machine learning, it is, furthermore, now possible to predict the quality, efficiency, and productivity of post-editing machine-translated output. This effort results in integrated models of human and machine translation describing the implications for human cognitive resources. Adopting predictive methods and models constitutes a new paradigm in translation studies, which the authors of this chapter term the “predictive turn.” The following sections present a critical literature review describing the development of this adaptation and its impact on empirical translation studies.

Empirical Models of Translation Processes

An implemented cognitive model of translation should respond to two opposing but converging demands: It should be general enough to generalize across at least a large number of very different languages. It is estimated that there are around 5000 spoken languages across the globe; exhausting all combinations between them results in roughly 25 million translation directions. But an implemented cognitive translation model should also be specific enough to make predictions about a particular moment in time during a translation between two particular languages or, more precisely, it should be able to predict sequences of behavior by explaining them in terms of their cognitive architecture. Meeting the challenge arising from the need to both scale models and apply them to minute and radically varying contexts is obviously no mean feat, but it is one that is shared with attempts to automatize translation.

Modelling Translation Tasks and Strategies

One early study which deserves mentioning is that by Jensen et al. (2009). Professional translators translated two short texts while their eye movements were recorded. The dependent variable was the total reading time on critical phrases which, when translating from Danish into English, either required a change in word order or not: In one condition, the subject-verb order (SV condition) could be maintained, and in the other, the Danish constituent order verb-subject (VS condition) had to be inverted in English in order to conform to its norms. The empirical modelling was done in the form of a linear mixed effects model. Such a regression is a relatively simple (and linear) model, but it is interesting that the following variables had an effect: The manipulation (SV vs. VS) had an effect such that phrases which required reordering resulted in longer reading times. The position of a phrase in the text also had an effect, that is, the later in the text the phrase occurred, the longer the reading times. The more often a word had occurred prior to the current occurrence, the shorter the reading times, presumably because of a repetition priming effect. Finally, frequency also had an effect such that the more frequent the words in a phrase were, the shorter the reading times, as did the number of characters in a phrase (longer phrases had longer reading times).
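Jensen et al. fitted linear mixed effects models with random effects for participants and items; the sketch below drops the random effects and recovers analogous fixed effects from synthetic data with ordinary least squares. All predictor names and coefficients are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors mirroring the variables in the study
vs_condition = rng.integers(0, 2, n)   # 1 = word order must be changed
log_freq = rng.normal(3.0, 1.0, n)     # log frequency of the phrase's words
position = rng.uniform(0.0, 1.0, n)    # relative position of the phrase in the text

# Simulated reading times (ms): reordering and late position add time,
# higher frequency reduces it; all coefficients are invented
reading_time = (1200 + 400 * vs_condition - 150 * log_freq
                + 300 * position + rng.normal(0.0, 50.0, n))

# Fixed-effects-only least squares fit (a mixed model would add
# per-participant and per-item random intercepts on top of this)
X = np.column_stack([np.ones(n), vs_condition, log_freq, position])
beta, *_ = np.linalg.lstsq(X, reading_time, rcond=None)
print(np.round(beta))  # roughly [1200, 400, -150, 300]
```

The fitted coefficients recover the simulated effects, which is the sense in which such a regression “predicts” reading times from source text properties.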

In sum, the study by Jensen et al. (2009) is by no means a complete model as it only considers one language combination. It is based on a rather limited set of participants (16) and only models the effect of a small number of factors on one dependent variable, that is, the total reading time on the source. The fact that the model is “only” a regression should, however, not distract from the successful predictions this model makes, viz., that word order changes, the number of preceding words, the number of times a word is repeated in the prior context, and frequency all have an effect on reading times of the source. The study may serve as an example of a range of studies which employed regressions in order to statistically model the effect of a number of aspects of either the source or the target (or both) on a range of behavioral measures.

A study which employs even simpler statistical methods is that by Jakobsen and Jensen (2008): t-tests are employed in order to query whether the task (reading for comprehension, reading for later translation, reading during sight translation, and reading during translation) has an effect on eye movements. They identified an effect on the number of fixations on the whole source text, such that reading for comprehension produced fewer fixations than reading for later translation, which produced fewer fixations than sight translation, which, in turn, produced fewer fixations than translation. Perhaps the most interesting of these effects is that the mere instruction to read a text with the purpose of translating it afterward resulted in a higher number of fixations. This finding suggests that reading for translation differs from reading for comprehension, something that has been confirmed in other studies (Macizo and Bajo 2006; Schaeffer and Carl 2013; Winther Balling et al. 2014; Schaeffer et al. 2017). It means that implemented models of reading (Richter et al. 2006; Reichle et al. 2009) cannot be imported in their current forms – apart from the fact that they, of course, only model eye movements during reading and cannot be used to describe or predict the effects that writing a text in a different language may have on eye movements during translation.
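A task comparison of this kind can be illustrated with a two-sample t statistic. The per-participant fixation counts below are invented, and Welch's formula merely stands in for whichever exact test Jakobsen and Jensen applied.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical per-participant fixation counts on the same source text
reading = [210, 190, 205, 220, 198, 215]      # reading for comprehension
translation = [340, 310, 355, 330, 325, 348]  # reading during translation

def welch_t(a, b):
    """Welch's two-sample t statistic (robust to unequal variances)."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

t = welch_t(reading, translation)
print(round(t, 1))  # strongly negative: far fewer fixations during mere reading
```

A large negative t here corresponds to the reported effect that translation-oriented tasks attract substantially more fixations than reading for comprehension.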

Predicting Translation Efficiency

A similarly simple model is provided by a more recent study (Schaeffer and Carl 2017b), which attempts to find behavioral measures which do justice to the fact that, during translation, participants read and write in two languages. This study builds on previous studies (Dragsted 2005, 2010; Dragsted and Hansen 2008). Dragsted and colleagues took what is a standard behavioral measure in simultaneous interpreting, the ear-voice span (or décalage), and applied it to translation, given that translators, just like interpreters, need to coordinate reception and production activities. In particular, the later studies (Dragsted and Hansen 2008; Dragsted 2010) showed that the eye-key-span, that is, the time interval between the first (or last) fixation on a particular word and the first keystroke which contributed to the typing of equivalent words, could be predicted by the degree to which it was difficult to translate that word. Dragsted and Hansen (2008) showed that words which could be translated by many different target alternatives led to longer eye-key-spans than words which had only one possible translation. Neither the 2008 study nor the 2010 study provides even a simple statistical test to bolster the hypotheses put forward on the basis of descriptive means. However, the study by Carl and Schaeffer (2017) does exactly that. It is a replication of the previous studies: Schaeffer and Carl showed, on the basis of a much larger sample (Danish, Spanish, Chinese, Hindi, and German as target languages, 108 participants, and 3,242 unique source words), that what Dragsted and Hansen (2008) and Dragsted (2010) had found was upheld after more rigorous testing. Linear mixed effects models showed that the following variables had an effect on the eye-key-span (EKS): The more words a sentence contained, the longer the EKS. Word length in characters (as a proxy for word frequency) also had a positive effect, viz., the longer a word, the longer the EKS.
The position of the word in the text had a positive effect, that is, the further to the end of the text a word was positioned, the longer the EKS. Word order differences were operationalized in a measure called Cross (Carl and Schaeffer 2017). This concept describes the relative word order distortions between source and target sentences as measured against a situation in which all the words in both sentences are in the same positions, in which case all words receive a value of 1, and the larger the distances are by which the words are shifted from this monotone relationship, the larger the Cross values. Cross also had a positive effect, which may not be too surprising, viz., if a word is read and if this specific word is only translated after having typed, e.g., five intervening words, then this increases the eye-key-span, if it is measured from the first fixation on a word. More interesting is the effect of a measure called HCross on the EKS. HCross (Carl and Schaeffer 2017) is the entropy of the Cross value and describes to what extent different word orders are possible for a given item. The authors explain this effect on the basis of the following example (Carl and Schaeffer 2017: 148): “As a result, full-time leaders, bureaucrats, or artisans are rarely supported by hunter-gatherer societies.” The translation database includes 26 translations into German for this sentence. With seven different translation solutions for the verb phrase “are supported,” this verb has a high HCross value of 3.57. Of course, differences in syntactic choices, that is, changes in word order, result in higher HCross values. In terms of EKS, the effect of this variable was positive such that larger HCross values led to a longer EKS. Similar to what Dragsted (2005, 2010) found, the professional experience of participants played a role, that is, those who identified as professionals had a shorter EKS than those who identified as students. 
Again, this was a relatively simple regression and not a fully implemented model. But just like the studies discussed previously, they made rather specific predictions, and what is a model other than a set of, more or less specific, predictions?
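The intuition behind HCross, an entropy over the Cross values observed across alternative translations of the same item, can be sketched as follows. The Cross values are invented, and the function is a plain Shannon entropy rather than the TPR-DB implementation.

```python
from collections import Counter
from math import log2

def entropy(outcomes):
    """Shannon entropy (in bits) over the relative frequencies of the
    observed outcomes, mirroring how HCross is an entropy computed over
    the Cross values of alternative translations of one item."""
    counts = Counter(outcomes)
    n = len(outcomes)
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    return h if h > 0 else 0.0

# One (invented) Cross value per observed translation of a word:
fixed_order = [1, 1, 1, 1]       # every translation keeps the word order
variable_order = [1, -2, 3, 5]   # four different reorderings observed

print(entropy(fixed_order))      # 0.0: word order fully predictable
print(entropy(variable_order))   # 2.0: maximal for four distinct values
```

A word whose translations all preserve the source word order yields an entropy of 0, while a word that licenses many different reorderings yields a high HCross, which, as reported above, predicts a longer EKS.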

Balling and Carl (2014) take a large subset from the Translation Process Research Database (TPR-DB) (Carl et al. 2016b) in order to construct a complex linear mixed effects model based on alignment units. Alignment units are constituted by words in the source which are aligned with words in the target; all observations and measures are based on this unit of analysis. The dependent variable is the production time, that is, the time participants spent typing the words which constitute the alignment unit. The data consist of both translation from scratch and post-editing. Balling and Carl find that the more inefficient the typing process is, the longer participants needed to produce the words in the alignment unit. They further find that the more time participants spent reading the target, the shorter were the production times. Divided attention, that is, reading the source while typing, had a positive effect on production times. The percentage of time spent reading the target while typing had a positive effect for low and intermediate values and declined thereafter, presumably because participants were shifting back and forth between source and target. The effect of frequency was as expected, that is, more frequent words are translated faster, even though this effect was stronger for students. Further variables were the edit distance, that is, a measure of how many text manipulation operations (deletions, insertions, and substitutions) are necessary in order to transform the source into the target; Cross (as described above); and the number of alternative translations: given that the same source texts were translated several times, it was possible to see how many different target versions there are for a given source item.
The results here were not very clear, possibly because the underlying data were alignment units rather than source text words, that is, alignment units may differ substantially in size and nature between language combinations and between individual translations within one language. This study is similar in scope to the one by Schaeffer et al. (2016) and shows that the TPR-DB offers the possibility to model behavior during translation in a way that lends itself to generalizations across language combinations, something that was not as easily done prior to the existence of this database.
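The edit distance referred to above is conventionally computed as the Levenshtein distance. The textbook dynamic-programming version below is a generic sketch, not the TPR-DB's actual implementation, and works on characters or token sequences alike.

```python
def edit_distance(source, target):
    """Levenshtein distance: the minimal number of insertions, deletions,
    and substitutions needed to turn source into target."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of source's prefix
    for j in range(n + 1):
        d[0][j] = j                       # insert all of target's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

# Applied to token sequences, e.g., an MT output and its post-edited version:
mt_output = "das Haus ist gross".split()
post_edit = "das Haus ist sehr gross".split()
print(edit_distance(mt_output, post_edit))  # 1 (one inserted token)
```

Counting operations over tokens rather than characters is one common way of quantifying how much technical effort a post-edit required.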

Between every two keystrokes there is a small pause. Yet, not all pauses are meaningful, and there is a long tradition of investigating pauses during typing as indicators of cognitive effort (e.g., Schilperoord 1996; Lacruz et al. 2014). By setting the pause threshold to a certain value, that is, a value below which pauses are disregarded, what emerges are production bursts: sequences of keystrokes which are considered to occur without meaningful interruptions. Carl and Kay (2011) call these production units. In their study, they seek to find the optimal pause threshold and conclude that it is 1000 ms. While their study is purely descriptive with regard to the statistics and the modelling involved, this kind of study and the considerations involved represent the necessary precursors to later work (e.g., Schaeffer et al. 2016; but see also Alves and Vale 2009, 2011) which models the complex interactions between reading and writing during translation. Without descriptive measures which are unique to the task of translation, it is very difficult to model this behavior, given that eye movements alone cannot account for the complex interaction of tasks.
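Segmenting a keystroke log into production units at Carl and Kay's 1000 ms threshold can be sketched as follows; the keystroke timestamps are invented.

```python
# Hypothetical keystroke timestamps from a keylogger, in milliseconds
keystrokes = [0, 120, 250, 380, 2100, 2220, 2330, 5400, 5510]

def production_units(timestamps, threshold_ms=1000):
    """Split a keystroke log into production units (bursts) wherever the
    pause between two keystrokes meets or exceeds the threshold."""
    units, current = [], [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev >= threshold_ms:   # meaningful pause: close the unit
            units.append(current)
            current = []
        current.append(curr)
    units.append(current)
    return units

units = production_units(keystrokes)
print(len(units))               # 3 production units
print([len(u) for u in units])  # [4, 3, 2] keystrokes per unit
```

Lowering the threshold fragments the log into many short bursts, raising it merges bursts; the 1000 ms value is the optimum Carl and Kay report for translation data.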

Carl (2010) used a relatively simple model of typing, as proposed by John (1996), to discern to what extent it can predict writing behavior during translation. Carl (2010) describes John’s (1996) model, called TYPIST, which aims to simulate copying rather than other forms of writing, in the following way:

… an “engineering model” of typing texts which consists of three operators, a perceptual, a cognitive and a motor operator: the perceptual operator perceives a written word and encodes it into an ordered list of letters, the cognitive operator initiates the characters in the list and the motor operator executes the typing activity. (Carl 2010, n.p.)

On the basis of this, Carl (2010, n.p.) proposes a very similar simple model with five production rules: one which finds the “physical location on the screen” of the next word to translate, a second rule which shifts attention to this word, a third rule which “retrieve[s the] word from [the] mental dictionary,” a fourth rule which “retrieve[s the] associated translation,” and finally a fifth rule which “serialize[s the] spelling and type[s the] word.” However, applying this model to actual source texts produces behavior which is far too static in comparison to actual human behavior, which is why Carl (2010) proposes to model the translation process statistically, analogously to how this is done in statistical machine translation. In other words, given a source and a target text, the statistical model (built on the basis of actual translations produced by humans and recorded by keylogging and eye-tracking devices) seeks to find the most likely reading and typing behaviors. The very limited sample studied by Carl (2010) produces behavior which is strikingly similar to that produced by one of the 24 translators in the sample and does a much better job than the engineering model based on John (1996). While estimating probable eye movements is promising, estimating typing behavior is problematic as it involves predicting the product. As mentioned earlier, this is no easy task either (e.g., Wu et al. 2016). In addition to the challenges involved in predicting the final target text, predicting human typing behavior with a high degree of accuracy would involve modelling revisions, which is currently unthinkable. Given this situation, it may be understandable that a statistical model such as the one described by Carl (2010) has not received more attention. Nevertheless, Carl and Dragsted (2012) used John’s (1996) model more successfully.
The architecture proposed by John makes rather specific predictions about the times that the individual operators of the model are required to copy a word: A perceptual operator needs 340 ms to encode a six-letter word, a cognitive operator requires “50ms to retrieve the spelling and to activate the typing of the characters from the spelling list” (see Carl and Dragsted 2012: n.p.), and finally, the “motor operator needs 230 ms on an alphanumeric keyboard at a rate of about 30” (ibid) gross words per minute. On the basis of visualizations of the behavioral data, Carl and Dragsted (2012) identify stretches of target text production, which are unproblematic, and use John’s (1996) model to predict the total time required to produce the text. Both during copying and during translation, John’s model is highly successful; the error is generally less than 5%, suggesting that both copying and translating, when unproblematic, employ similar processes given that the same assumptions lead to highly successful predictions. However, Carl and Dragsted observe that participants engage in extended re-reading both during copying and translation which cannot be easily integrated in the rather mechanistic model John (1996) proposes.
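The operator parameters quoted above can be combined into a back-of-the-envelope time estimate. TYPIST actually schedules operators in parallel, so the strictly serial sum below is a deliberate simplification for illustration; the word list is invented.

```python
# Operator times quoted from John (1996) via Carl and Dragsted (2012)
PERCEPTUAL_MS = 340   # encode one (six-letter) word
COGNITIVE_MS = 50     # retrieve the spelling, activate the typing
MOTOR_MS = 230        # execute one keystroke

def naive_copy_time(words):
    """Predicted time (ms) to copy-type a list of words, assuming
    strictly serial operators and one motor act per character.
    An upper bound: in TYPIST the operators overlap in time."""
    total = 0
    for word in words:
        total += PERCEPTUAL_MS + COGNITIVE_MS + MOTOR_MS * len(word)
    return total

print(naive_copy_time(["simple", "typing"]))  # 3540 ms for two six-letter words
```

Even this crude serial estimate lands in the right order of magnitude for unproblematic typing, which is consistent with the under-5% error Carl and Dragsted report for the properly scheduled model.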

Jakobsen’s (2011) proposal of micro-cycles also qualifies as an algorithm, although, to the best of the authors’ knowledge, it has not been implemented. Jakobsen (2011: 48) proposes to view the translation process as a sequence of recurring actions (micro-cycles); these do not necessarily all occur in sequence, nor necessarily in the order in which they are listed here:
  1. Moving the gaze to read the next chunk of new source text (and constructing a translation of it)
  2. Shifting the gaze to the target text to locate the input area and read the current target text anchor word(s)
  3. Typing the translation of the source text chunk
  4. Monitoring the typing process and the screen outcome
  5. Shifting the gaze to the source text to locate the relevant reading area
  6. Reading the current source text anchor word(s)

If implemented, this model could prove useful because it integrates both reading and writing activities into a coherent framework which could be exploited and extended in several ways.

Modelling Translation Difficulty and Expertise

A computational model which has actually been implemented was presented by Mishra et al. (2013). It assumes that the time translators spend reading both the source and the target can be seen as an indicator of how difficult the translation is. The resulting Translation Difficulty Index (TDI) relates this reading time to three aspects of the source sentences: length (in number of words), degree of polysemy (based on the senses recorded in WordNet (Fellbaum 1998)), and structural complexity (the “…total length of dependency links in the dependency structure of the sentence…” (Mishra et al. 2013, n.p.), normalized by the number of words in the sentence, based on Lin (1996)). Mishra and colleagues applied a more complex regression using the machine learning approach Support Vector Regression (SVR) in order to predict the observed TDI, that is, the source and target reading times during translation normalized by the number of words in the sentence, on the basis of the three features observed in the sentences (length, degree of polysemy, and structural complexity). This makes it possible to predict how difficult it is to translate a sentence for which there are no behavioral data. The authors report an accuracy of 67.5%. In other words, the prediction of how much time actual translators spent reading the source and target sentences was accurate to a relatively high degree, though of course not perfect. Moreover, Schaeffer et al. (2016) have shown that the TDI is successful in predicting the pause-to-word ratio (PWR), which is calculated as the number of pauses during typing (at a certain threshold) normalized by the number of words in a sentence (Lacruz et al. 2014). In other words, the TDI is relatively successful in predicting cognitive effort during translation and does so on the basis of merely three features of the English source sentences (the measure is currently only available for English).
The possible advantages of this model are obvious. It is, however, only a partial model, because it operationalizes cognitive effort on the basis of eye movements alone and takes into account neither the relationship between input and output in terms of the product nor in terms of the process. It could find a range of applications, such as the selection of comparable stimulus texts to be used in experimental studies, predicting cognitive effort as one factor in determining how much a translator is paid, or the design of translation assignments in a didactic setting. However, despite these advantages, the model remains vague, given that it delivers predictions at the level of the sentence and is thus unsuitable for modelling processes at lower levels of analysis (phrase, word). In addition, while relatively accurate, the margin of error (32.5%) is not small.
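As an illustration of the TDI setup described above, the sketch below fits a regression from the three sentence features to a difficulty score. Ordinary least squares is substituted for the Support Vector Regression used by Mishra et al. to keep the example dependency-free, and all feature values and TDI scores are invented.

```python
import numpy as np

# Invented training sentences described by the three TDI features:
#            length  polysemy  complexity     -> observed TDI
features = np.array([
    [ 8.0, 1.2, 0.9], [15.0, 2.1, 1.4], [22.0, 3.0, 2.2],
    [11.0, 1.8, 1.1], [19.0, 2.6, 1.9], [ 6.0, 1.0, 0.7],
])
tdi = np.array([0.8, 1.5, 2.4, 1.1, 2.0, 0.6])

# Least squares fit in place of SVR (illustrative substitution)
X = np.column_stack([np.ones(len(features)), features])
w, *_ = np.linalg.lstsq(X, tdi, rcond=None)

def predict_tdi(length, polysemy, complexity):
    """Estimate translation difficulty for an unseen sentence."""
    return float(w @ [1.0, length, polysemy, complexity])

print(round(predict_tdi(18.0, 2.4, 1.8), 2))  # intermediate difficulty
```

The point of the setup is the last line: once the regression is trained on sentences with behavioral data, difficulty can be estimated for sentences that have never been translated.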

The study by Martínez-Gómez et al. (2014; see also 2018) further exemplifies the effort to explore new ways of describing behavior during translation while at the same time finding ways of modelling that behavior. First, the authors cluster the continuous stream of data from keylogging and eye-tracking recordings into events, the reasoning being that single fixations or single keystrokes hardly define a translator. Rather, a more or less short interleaved sequence of these activities may say something about the translator. However, classifying sequences of keystrokes and fixations into events is by no means an easy task, nor is it straightforward. Martínez-Gómez et al. (2014, n.p.) define eight activity types or events (“1. Source text reading. 2. Target text reading. 3. Source and target text reading. 4. Target text typing. 5. Target text typing and source text reading. 6. Target text typing and target text reading. 7. Target text typing, source text reading and target text reading. 8. Translator idle”). This classification follows a top-down perspective, but a bottom-up approach is also taken: The authors use K-means clustering to group the single observations into event types. In order to do so, 17 features are derived from the data which describe single observations and how they relate to previous and future behavior. Examples are the “Number of fixations alternating between the source and target text, in the last 5 seconds” or the “Number of insertion events in the next 10 events.” These features describe the current single observation (single keystroke or single fixation) and the context of this observation, both in terms of what happened before and after it. The two procedures for determining what constitutes an event (a sequence of single observations) have complementary advantages and disadvantages: K-means clustering is unsupervised and hence led by the data itself, uninfluenced by the possibly erroneous assumptions of the researchers.
The downside of this approach is that the resulting clusters are relatively opaque. It is difficult to determine why a number of observations belong to an event type or cluster. The advantage of the top-down approach is that it is known what constitutes an event type and how it is defined, but the definition may or may not correspond to how the data behave, that is, the resulting sequences do not emerge naturally.
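Both the appeal and the opacity of the bottom-up route can be shown with a minimal sketch: contextual features describe each observation, K-means groups them into latent activities, and interpreting the result means inspecting centroids. The two features and all values below are invented stand-ins for the 17 features used by Martínez-Gómez et al. (2014).

```python
# Sketch of the bottom-up event discovery: each keystroke/fixation
# observation is described by contextual features and grouped into latent
# activity types with K-means. The feature matrix is synthetic; the two
# columns stand in for 2 of the 17 features (gaze alternations in the last
# 5 s, insertion events in the next 10 events).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
reading = rng.normal([4.0, 0.5], 0.5, size=(30, 2))  # much gaze switching, little typing
typing = rng.normal([0.5, 8.0], 0.5, size=(30, 2))   # little switching, much typing
X = np.vstack([reading, typing])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# The clusters are data-driven but opaque: inspecting the centroids is the
# only way to interpret what each latent activity type represents.
print(np.round(kmeans.cluster_centers_, 1))
```

With well-separated synthetic data the clustering is trivial; with real mixed-activity streams, the difficulty of naming what a cluster "is" becomes exactly the interpretability problem described above.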

N-grams of these events (sequences of single observations) are then used to predict the expertise of the translators in terms of three different parameters: one binary distinction (certified vs. non-certified) and two continuous variables (years of training and years of experience). Martínez-Gómez et al. (2014) use random forests and regressions to predict the experience of translators. The results are highly accurate: The most successful model is the one with four latent activities (that is, those clustered with K-means). The error in predicting whether a translator is certified or not is astonishingly low at 19%. In other words, in the vast majority of cases, the algorithm successfully predicts whether a participant identifies as a certified translator. This compares to a baseline of 49%, which means that the features describing single observations in context are relevant predictors of whether someone has experience. The best model for predicting years of experience has an error of 4.15 years (against a baseline of 5.83 years), which is only a modest improvement. Further testing revealed that removing eye-tracking data from the observations resulted in a significant increase in the prediction error across models and parameters. The a priori classification into eight activity types performed worse than most of the latent activities (derived from the K-means clustering), with an error rate of 37%.
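The supervised step, predicting certification from activity-sequence features with a random forest, can be sketched as follows. The data and the two features (stand-ins for activity n-gram counts and durations) are synthetic, so the accuracy here illustrates the procedure, not the 19% error reported in the study.

```python
# Sketch of predicting translator certification from activity-sequence
# features with a random forest, in the spirit of Martínez-Gómez et al.
# (2014). Features are invented stand-ins (e.g., mean ST-reading event
# duration, frequency of a typing-while-reading bigram).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 80
certified = np.column_stack([rng.normal(2.0, 0.4, n), rng.normal(0.6, 0.1, n)])
non_certified = np.column_stack([rng.normal(3.0, 0.4, n), rng.normal(0.3, 0.1, n)])
X = np.vstack([certified, non_certified])
y = np.array([1] * n + [0] * n)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
acc = cross_val_score(clf, X, y, cv=5).mean()  # well above the ~0.5 chance baseline
print(f"mean CV accuracy: {acc:.2f}")
```

Cross-validation against a chance baseline mirrors the study's comparison of the 19% model error against the 49% baseline.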

What these models suggest is that, on the one hand, the predefined activity units do not have very high discriminatory power compared with the latent activities derived from K-means clustering; that is, the predefined units are not very natural. On the other hand, these models are highly successful in simulating competence, if certified status is taken as a sign of competence. Unfortunately, these models are relatively opaque.

Predicting Translation Phases

Similar to Martínez-Gómez et al. (2014, 2018), Schaeffer et al. (2016) and Schaeffer and Carl (2017a) also classify sequences of single observations into event types. Schaeffer et al. (2016) use six activity unit types (“1: ST reading, 2: TT reading, 4: translation typing (no gaze data recorded), 5: ST reading and typing (touch typing), 6: TT reading and typing (translation monitoring), 8: no gaze or typing activity recorded for more than 5 seconds”; Schaeffer et al. 2016: 337). Again, using simpler statistical models (linear mixed effects regressions), the authors investigate to what extent source text reading patterns can predict what happens subsequently. The authors use scanpath measures which broadly describe the linearity of a scanpath. A scanpath is a sequence of fixations which may occur in the order in which the text is written or may deviate from this order. The extent to which these fixations deviate from the order of the words on the screen correlates, to a large extent, with how long a scanpath lasts (Schaeffer et al. 2016: 338). The longer the scanpath, the more it deviates from the sequential word order, which may include repeated re-reading of short stretches of text. Schaeffer et al. (2016) find that the longer a scanpath on the ST is, the more likely it is that translators will read the TT next, while shorter ST reading durations are more likely to be followed by TT typing. Interestingly, this remains the case if two subsequent activity units are taken into account, and even in post-editing. What these findings suggest is that bigrams of activity units (ST reading and subsequent typing or TT reading) might be more or less self-contained sequences of behavior encapsulating information retrieval and text production cycles or problem identification and solution cycles.
When two subsequent activities after ST reading are considered, the effect is determined by the activity immediately following the ST reading and not by the one occurring after that. In other words, the larger context at times taken into account in the study by Martínez-Gómez et al. (2014, 2018) might be optimized by looking at simpler N-grams of tuples of activity units (ST reading, TT reading, or typing), that is, a smaller context than that proposed by Martínez-Gómez et al. (2014, 2018).
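One way to make the notion of scanpath (non-)linearity concrete is a toy deviation measure over fixated word indices. This is an illustrative stand-in, not the exact scanpath measure used by Schaeffer et al. (2016); the function name and the jump-based formula are assumptions for the sketch.

```python
# Toy operationalization of scanpath non-linearity: given a sequence of
# fixated word indices, sum how far each saccade departs from moving to
# the immediately following word. A strictly sequential reading scores 0;
# regressions and skips increase the score.
def scanpath_deviation(fixated_words):
    """Sum of jumps beyond simply moving to the next word."""
    return sum(abs(b - a - 1) for a, b in zip(fixated_words, fixated_words[1:]))

linear = [0, 1, 2, 3, 4, 5]        # reading in word order
regressive = [0, 1, 4, 2, 3, 1, 5] # skips ahead and re-reads earlier words

print(scanpath_deviation(linear))      # 0: perfectly sequential
print(scanpath_deviation(regressive))  # larger: deviates from word order
```

A measure of this kind, aggregated per sentence, is the sort of predictor that can then be related to what the translator does next (TT reading vs. typing).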

The study by Saikh et al. (2015) also employs machine learning (a Support Vector Machine). Saikh et al. aim to predict total reading time on the ST. Total reading time is the sum of all fixations on a particular word, irrespective of when these occurred. The authors build three models, the most successful of which only considers eye movements during the drafting phase. Carl and Kay (2011) identified three phases of the translation process as a whole: the orientation phase, during which the translator normally does not type and just reads the ST; the drafting phase, during which the TT is typed; and the revision phase, which starts once a complete first draft of the translation is produced and during which translators may read the ST and/or the TT and may correct the TT. Saikh et al. place the total reading times into four categories (0–159 ms, 160–440 ms, 441–970 ms, and 971–5000 ms), each of which contains roughly the same number of data points. The task is then to predict into which of these categories the durations fall. After iterating through a number of features extracted from the TPR-DB, the authors arrive at their final models. The most relevant features for the most successful model were the unigram frequency (the frequency with which the ST words appear in the British National Corpus (BNC 2007)), the bigram frequency (the frequency of two words occurring in sequence in the BNC), the number of TT words aligned to the ST word, the length of the ST word in characters, Cross, the first fixation duration, and the word translation entropy (see Carl and Schaeffer 2017). It is not too surprising that the first fixation duration is predictive of total reading time, nor are the effects of word length and frequency (Rayner 1998), or the effect of part-of-speech tags as they relate to word length.
The number of target words aligned to a single source word says something about the complexity of the alignment unit; it is thus a translation-specific effect, which is interesting and related to the effect of word translation entropy. In other words, both translation-specific and monolingual aspects are predictive of total reading time on the ST. What is surprising, however, is that the highest accuracy achieved by this model is 49.1%, which is rather low and leaves a rather large margin of error. Two explanations are worth mentioning: either the relevant features are not sufficient to explain total reading times, that is, too little is known about translation-relevant aspects of the process, or some inherent randomness in the data makes prediction difficult, whether due to noise or to behavior which is to a certain extent random.
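The binning-and-classification setup can be sketched with two of the features mentioned above. The data below are simulated under an assumed (and simplified) relation between frequency, length, and reading time; the real models used several further features (bigram frequency, alignment counts, Cross, first fixation duration, word translation entropy).

```python
# Sketch of the Saikh et al. (2015) setup: total reading times are binned
# into four categories and an SVM predicts the bin from word-level
# features. Here only frequency and word length are used, with synthetic
# values generated under an assumed linear relation plus noise.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 200
log_freq = rng.uniform(1, 6, n)    # log corpus frequency of the ST word
length = rng.integers(2, 12, n)    # word length in characters
# Assumed relation: reading time rises with length, falls with frequency
reading_time = 120 * length - 150 * log_freq + rng.normal(0, 150, n)
bins = np.digitize(reading_time, [160, 441, 971])  # the four categories

X = np.column_stack([log_freq, length])
clf = SVC(kernel="rbf").fit(X, bins)
acc = clf.score(X, bins)
print(f"training accuracy: {acc:.2f}")
```

Even on data generated from the features themselves, the noise term keeps accuracy well below 100%, which illustrates the second explanation above: inherent randomness caps what any feature set can achieve.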

Finally, the model developed by Läubli and Germann (Läubli 2014; Läubli and Germann 2016) takes a slightly different approach. Similar to what Martínez-Gómez et al. aimed for, Läubli and Germann want to infer what they call human translation processes (HTPs) from behavioral data (eye movements and keystrokes). HTPs are states “such as reading the source text, reading a draft translation, revising the draft translation, etc.” (Läubli and Germann 2016: 157). Using human annotations, the authors created a “gold standard”: The human annotators viewed replays of translators’ behavior and had to decide which of the predefined categories best described the activity they viewed. It was decided to work with only three labels: orientation, revision, and pause; in other words, the question was whether post-editors were mainly reading, mainly writing, or mainly not doing anything.

The model itself is developed using K-means clustering, similar to Martínez-Gómez et al. (2014). The features are the number of mouse clicks, insertions, deletions, and fixations on the source and target within certain temporal segments of the data stream. A number of parameters were varied in order to produce the best fit; these included the size of the temporal segment (from 500 ms to 5 s) and the number of clusters for the K-means, which simultaneously served as the hidden states in the Hidden Markov Model used. The best model fit was achieved with ten clusters (hidden states) and a segment size of 500 ms. The ten hidden states had, of course, to be simplified when compared against the gold standard, which consisted of just three states (orientation, revision, and pause). This was done manually by visually inspecting the probability distributions of the automatically learned states. If there was a clear and high probability that insertions or deletions occurred in a state, it was tagged “…as an instance of revision (R). Otherwise, we tagged it as orientation (O), unless the probability mass for all observable actions was centred around zero, in which case we tagged it as pause (P)” (Läubli and Germann 2016: 175). In other words, the fine-grained features were again made coarser in order to match those provided by the human annotations. The accuracy of the automatic prediction regarding whether post-editors were revising machine-translated text, orienting themselves in the source or target text, or doing nothing was very high compared to the human annotations. The best human annotation, compared to the gold standard, assigned the wrong label to 21.03 s out of a 5 min snippet, while the automatic labelling had an error of 68.04 s out of a 5 min snippet (an error of 22.68%). The theoretical insight gained from this model is, however, limited.
The model can predict to what extent it is likely that a post-editor is mainly reading, writing, or doing nothing. Buried in the model are ten different states which are clearly relevant for the model, but it again remains unclear how they differ and why they are relevant. In addition, this study exemplifies once again how much effort has gone into finding ways of describing, clustering, and processing the stream of data from eye tracking and keylogging (cf. Carl and Schaeffer 2018). These efforts go hand in hand with attempts to model the underlying cognitive processes.
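The segment-cluster-collapse pipeline can be sketched as follows. The sketch omits the Hidden Markov Model step (K-means stands in for the hidden-state induction), uses synthetic per-segment action counts, and collapses states onto the three gold labels by inspecting cluster means, analogous to the manual inspection of probability distributions described above.

```python
# Sketch of the preprocessing in Läubli and Germann (2016): the event
# stream is cut into 500 ms segments, each described by counts of
# observable actions, segments are grouped into latent states, and states
# are collapsed onto orientation (O), revision (R), or pause (P).
# Data are synthetic; K-means replaces the HMM's hidden-state induction.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Per-segment counts: [insertions + deletions, fixations]
revising = rng.poisson([5, 2], size=(40, 2))   # typing with some gaze
orienting = rng.poisson([0, 8], size=(40, 2))  # gaze activity, no edits
pausing = np.zeros((40, 2))                    # no recorded activity
X = np.vstack([revising, orienting, pausing]).astype(float)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

def collapse(center):
    edits, fixations = center
    if edits >= 1:       # clear mass on edit actions -> revision
        return "R"
    if fixations >= 1:   # gaze activity but no edits -> orientation
        return "O"
    return "P"           # everything near zero -> pause

print(sorted(collapse(c) for c in km.cluster_centers_))
```

As in the original study, the fine-grained latent states carry the predictive power, while the collapse step exists only to make them comparable to the coarse human annotations.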

The next section will discuss the impact of digitization and machine learning on human translation processes and whether the resulting implications can be predicted.

Predicting Human Resource Implications in the Digital Age

Since post-editing machine translation is now a key element in the everyday life of many translators, it is important to estimate the scope of a proposed post-editing task before accepting it and starting work on it. However, automatic metrics developed in computational linguistics, for example, BLEU (Bilingual Evaluation Understudy), do not always correlate well with actual cognitive effort or time spent post-editing (e.g., Tatsumi 2009; Doherty et al. 2010; O’Brien 2011; Koponen et al. 2012; Vieira 2014). A complementary method is to assess MT quality using human quality ratings or post-edits (Specia 2011; Moorkens et al. 2015), which are both time- and cost-intensive. Kim et al. (2017) and Wang et al. (2018) offer approaches which more or less automatically predict MT quality. The technical complexity of the framework presented by Kim and colleagues belies, however, the simplicity of the underlying assumptions, as will be discussed below. Almost all current MT systems use neural networks (Neural MT – NMT), with remarkable improvements in quality compared to the previous paradigms. Until recently, phrase-based systems calculated the probability that a given phrase was an accurate translation of a source phrase. While successful at the time and widely used, these systems often produced text which was very unnatural; it was obvious that a machine had produced it. NMT, in contrast, produces text which is sometimes indistinguishable from human translation (Klubička et al. 2017), although human and machine translation have not reached parity (Ruiz Toral et al. 2018). NMT systems encode the contextual semantics of both the source text and the target text in complex mathematical models, which is one reason why they are so much better than phrase-based statistical systems. The downside is that these models are not very transparent, that is, it is difficult to know why they are as good as they are (e.g., Belinkov et al. 2017).

The fact that neural networks encode such a wealth of information has been put to excellent use in the study by Kim et al. (2017). Their approach uses two different kinds of recurrent neural networks (RNNs): a predictor and an estimator. The predictor uses vectors to represent the source and target words of human translations, in addition to the context preceding and following each word of the MT output. In other words, the predictor is fed copious amounts of published translations in order to learn just one thing: context. The task of the predictor is, as the name indicates, to predict the final translation. Being able to predict how humans translate is, however, only interesting because it produces mathematical models of context in source and target texts as a by-product. These are then passed on to the estimator. The estimator is trained on labels for each MT-translated word (“Good,” “Bad”) produced by humans on a much smaller corpus. The simplicity of the idea behind this framework is that the predictor learns context from good translations produced by humans, and the estimator learns quality estimates from human labels. The estimator is fed the context model learned from the human translations by the predictor. The resulting quality estimates at word, phrase, sentence, and text level are rather accurate (Kim et al. 2017); in 2017 the system outperformed all other quality estimators.
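The division of labor in the estimator stage can be illustrated with a heavily simplified stand-in: given some context representation for each MT word (here a toy 2-dimensional vector instead of the predictor RNN's learned states), a classifier is trained on human "Good"/"Bad" word labels. Everything below is invented for illustration; the real framework uses RNNs and learned context vectors.

```python
# Highly simplified stand-in for the estimator in Kim et al. (2017):
# context vectors for MT words (here synthetic 2-d features instead of
# predictor RNN states) are classified against human word-level labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 100
good = rng.normal([1.0, 1.0], 0.3, size=(n, 2))    # contexts of correct words
bad = rng.normal([-1.0, -1.0], 0.3, size=(n, 2))   # contexts of MT errors
X = np.vstack([good, bad])
y = np.array(["Good"] * n + ["Bad"] * n)

estimator = LogisticRegression().fit(X, y)
# Word-level quality estimate for an unseen context vector
print(estimator.predict([[0.9, 1.1]])[0])
```

The key design idea survives the simplification: context is learned from abundant human translations, while the scarce, expensive human labels are needed only for the comparatively simple classification step.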

In a study by O’Brien (2007), participants were asked to translate a short text with the translation software SDL Trados (a TM system which also suggested MT solutions) while their pupil size was measured. The text consisted of sentences that were either an exact TM match, a fuzzy match (a partially correct translation), machine-translated, or presented without any assistance (“no match”). The results revealed that the participants’ pupil size was smallest for exact matches (least effort). Pupil size was similar for fuzzy matches and for machine-translated sentences, and largest for sentences that had to be translated traditionally with no assistance. These results suggest that machine-translated texts require a similar amount of cognitive effort to fuzzy matches from a TM.

The study by Daems et al. (2015) investigated the effect of the quality of machine-translated texts on a whole range of indicators for cognitive effort. The source texts were translated with Google Translate, and the errors in these translations were manually annotated. Furthermore, the errors were rated on a scale from 0 to 4 (0 means not really an error, and 4 means the error causes serious problems in understanding). These annotations were then correlated with various indicators for cognitive effort. The average error rating (between 0 and 4) had a significant effect on all the behavioral indicators; the more serious the error, the more effort is needed to process it. In a second analysis, the authors found that differences in meaning had no effect on the behavioral indicators and only acceptability had a significant effect. The most interesting findings were obtained in connection with the different types of errors: Mistranslations, grammatical and structural aspects, as well as word order errors affected above all the average number of production units (one production unit consists of coherent keystrokes and is bounded by pauses of more than 1 s) and the relationship between pauses and production units. These types of errors can be corrected with greater technical effort and low cognitive effort. Errors that require more cognitive effort such as coherence and mistranslations had a significant effect on the average fixation duration, the average number of fixations, and the average production duration per word. These results were confirmed and refined in a later study (Daems et al. 2017).

The study by Specia (2011) is a good example of how the quality of machine-translated text can be predicted on the basis of measures of cognitive effort. In this study, three measurements were used: the time post-editors needed to edit a sentence, a subjective judgment of the effort needed to post-edit the sentence (on a scale from 0 to 4, where 0 implies a totally new translation and 4 means that no change was required), and HTER (Human-targeted Translation Error Rate). HTER (Snover et al. 2006) is an automatically calculated metric on a scale from 0 to 1, where 0 means that the MT sentence and the post-edited sentence are identical; larger values indicate that more changes were made. The source and target texts were then annotated with 80 additional features (e.g., sentence and word length, frequencies of the words and expressions in large corpora, etc.). Next, algorithms were used to predict how long the post-editors would need to process individual sentences on the basis of these three metrics (time, human judgment, and HTER) along with the 80 features. The predictions based on these three models were relatively accurate. Based on time measurements per sentence, for example, it was possible to predict the time a translator would need to post-edit a sentence, and thus the quality of the MT sentence, with relatively high accuracy (with an error of less than 2 s). In a second step, Specia (2011) gave the same translators new sentences that were ordered on the basis of the three models so that simpler sentences came first and more difficult ones later. The question was thus: which ranking allows more sentences to be post-edited? With the model based on subjective judgment, the most sentences could be post-edited in 1 h (97–71). The time model came second (82–69 sentences per hour), and the HTER model took third place (65–38). As mentioned above, it is very costly and time-consuming to obtain human judgments, but time can be measured without any problem in a workbench.
These results mean that it should be feasible to integrate some time measurement per sentence along with the algorithms used by Specia (2011) into a workbench in order to provide customized instructions to post-editors about which MT sentences to translate manually and which are more suitable for post-editing.
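Since HTER plays a central role here, a minimal sketch may help. HTER normalizes the edit distance between the MT output and its post-edited version by the length of the post-edited sentence; a plain word-level Levenshtein distance stands in below for full TER, which additionally allows block shifts.

```python
# Sketch of HTER (Snover et al. 2006): minimum number of word-level edits
# turning the MT output into the post-edited version, normalized by the
# post-edited length. Plain Levenshtein stands in for full TER (no shifts).
def word_edit_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(a)][len(b)]

def hter(mt, post_edited):
    mt_words, pe_words = mt.split(), post_edited.split()
    return word_edit_distance(mt_words, pe_words) / len(pe_words)

print(hter("the house green is big", "the green house is big"))  # 2 edits / 5 words = 0.4
```

A per-sentence score of this kind is exactly the third metric Specia (2011) fed, together with the 80 features, into the prediction algorithms.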

The quality estimation model by Kim et al. (2017) rests on a number of premises, and the implementations of the model are rather complex. But it accomplishes one thing: it is rather accurate in predicting where the machine translation system made a mistake. It thus goes beyond predicting how a given source text is translated, which is what an MT system does. While it does not explicitly model human effort, it makes predictions about where human post-editors may want to focus their attention and where they may require more cognitive effort to fix the mistakes.

Future Perspectives

From a methodological point of view, the integration of methods from the cognitive sciences will become increasingly important. Using, for instance, EEG or fMRI, it will be possible to identify biological correlates of cognitive and motoric processes in translation (Hansen-Schirra 2017; Oster 2018). Still, these methods have not yet been integrated into multi-method approaches to translation. Focusing on single words as stimuli, or not being able to actually produce the translation orally or in writing, poses several problems for the investigation of cognitive translation processes: As translators usually translate whole texts in context and with respect to a given skopos, it is difficult to study complex translation processes or strategies using EEG or fMRI technologies (Göpferich 2008). Decoding the source text and encoding the target text involve such a wide range of information processing that it is difficult to isolate a single stimulus during an authentic translation task. This means that studies in which single words are translated, or in which the translation has to be produced mentally, are not ecologically valid because they neglect the complex problem-solving mechanisms employed during translation. Although the context problem has been relativized by van Hell and de Groot (2008), who found similar effects for context-free versus sentence-context conditions, there is still a huge gap between existing translation-oriented theories and models and their operationalization with methods from cognitive science. However, Nagels et al. (2013) integrate authentic text reception and production into fMRI methodology, which seems very promising for translation process research. Finally, it can be argued that it is beneficial to triangulate data collected in experimentally manipulated studies, on the basis of which clear effects can be derived, with sentence-context studies that corroborate the results gained from the controlled setting.
By interfacing different methods with different data sets, experimental control can be complemented with ecological validity.

So far, most empirical modelling has addressed the written mode of translation, that is, text-to-text translation. Future work will include the integration of speech data, such as interpreting, translation dictation, or sight translation, on the one hand (Gieshoff 2018; Seubert 2018; Carl et al. 2016a) and multimodal translation settings, such as subtitling, voice-over, or sign language interpreting, on the other (Diaz Cintas 2009). Mapping speech and written data onto audio, visual, and oral data is not a trivial problem. With multimodal annotation tools such as the web application developed by Alves and Vale (2009), it is possible to annotate and query user activity data in such corpora. Such tools facilitate the correlation of corpus annotations with eye-tracking and keylogging data and shed light on multimodal and simultaneous translation processes.

Another promising research area is user response or reception studies. Within audiovisual translation studies, the reception of subtitles has already been researched. Orrego-Carmona (2015), for example, contrasts the reception of professional and nonprofessional subtitles (in Spanish) for a US sitcom, showing that there was no difference in terms of attention distribution, although fixation durations were shorter for professional subtitles. In another study, Fox (2018) challenges the traditional positioning of subtitles and tests their optimal placement to ensure minimum time spent reading subtitles and maximum time spent enjoying the movie. So far, eye-tracking studies addressing translation reception have been carried out predominantly in front of computer or film screens, in controlled situations. Future research will include collecting data on print media or mobile devices using eye-tracking glasses or other methods (e.g., questionnaires, ratings). Furthermore, the scope of user response and reception research will be extended to other settings and text types, e.g., museum translation (Deane-Cox 2014).

Finally, production and reception processes of intralingual translation will challenge future translation research. Language varieties for target groups with specific language barriers will open new markets for professional translators. Intralingual translations into easy-to-read or plain language are nowadays demanded by European and national laws. Several sets of rules and regulations (e.g., by Inclusion Europe) help the translator to produce comprehensible texts (Bredel and Maaß 2016). However, these rules still lack empirical investigation (Bock et al. 2017). It is still unclear whether the reduction of complexity really leads to more comprehensibility for the target groups and whether the linguistic levels are competing against each other (e.g., low morphological complexity causes high phrase complexity). Furthermore, it remains unclear how difficult translation into easy or plain language might be; the empirical investigation of intralingual translation processes can still be seen as a general research desideratum.


In sum, the main obstacle to implementing a model which is capable of predicting both the product, that is, the target text, and the human cognitive processes which led to it is that those researchers who model the product chase the human gold standard in terms of its quality; that is, the dependent variable in these models is a quality metric of some sort. Translation process research, with a few exceptions, has largely disregarded quality, and those exceptions have in turn largely disregarded possible cognitive architectures. Translation process research has, to a large extent, been occupied with finding, on the one hand, descriptive tools with which to describe human behavior during translation and, on the other, with discovering ways of predicting cognitive effort during translation so as to model the architecture constraining and enabling the production of a target text.

MT research and associated fields on the one hand and translation process research on the other share the gargantuan task of having to produce one model which is so general that it fits 25 million language combinations and directions, or at least a representative subset of these, while at the same time being so accurate and particular that it is of use in a myriad of contexts involving an unknown number of variables which might impinge on the outcome. However, rather than pulling in the same direction, these disciplines talk at cross purposes. The one point where both strands of inquiry could converge is where the quality of the text produced by an MT system is accurately predicted. There is the tantalizing possibility that MT systems are approximate models of the human translation process, that is, quality estimation could be interpreted as a prediction of the cognitive effort expended by the MT system. This could easily be tested by investigating to what extent predictions by systems such as that of Kim et al. (2017) are indicative of human effort during translation. Should this be the case, then the comparison between machine translation and human translator would cease to be a crude analogy, and ways of exploring how the mathematical models could be exploited to predict human processes would be within reach. It is not unthinkable, conversely, that human cognition can be predicted on the basis of behavior, that is, that behavior and its associated models could be translated into a form which an MT system can read. As the interaction between mathematical models and humans becomes more intimate, the risks mentioned at the outset of this chapter remain essentially the same, although the upshot is, of course, that an intimate dialogue between these two actors is likely to further improve the quality of the resulting texts while at the same time increasing the responsibility of the human in the loop to guarantee that the final texts are fit for their overall purpose.


  1. Alves, F., & Vale, D. C. (2009). Probing the unit of translation in time: Aspects of the design and development of a web application for storing, annotating, and querying translation process data. Across Languages and Cultures: A Multidisciplinary Journal for Translation and Interpreting Studies, 10(2), 251–273. Scholar
  2. Alves, F., & Vale, D. C. (2011). On drafting and revision in translation: A corpus linguistics oriented analysis of translation process data. Translation: Corpora, Computation, Cognition. Special Issue on Parallel Corpora: Annotation, Exploitation, Evaluation, 1(1), 105–122. Available at: Scholar
  3. Balling, L. W., & Carl, M. (2014). Production time across languages and tasks: A large-scale analysis using the CRITT translation process database. In A. Ferreira & J. Schwieter (Eds.), Psycholinguistic and cognitive inquiries in translation and interpretation studies (pp. 239–268). Newcastle upon Tyne: Cambridge Scholars Publishing.Google Scholar
  4. Belinkov, Y. et al. (2017). Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (volume 1: Long Papers) (pp. 1–10). Asian Federation of Natural Language Processing. Available at:
  5. BNC. (2007). The British National Corpus, version 3. Available at:
  6. Bock, B. M., Lange, D., & Fix, U. (2017). Das Phänomen “Leichte Sprache” im Spiegel aktueller Forschung – Tendenzen, Fragestellungen und Herangehensweisen. In B. M. Bock, D. Lange, & U. Fix (Eds.), Leichte Sprache“ im Spiegel” theoretischer und angewandter Forschung (pp. 11–31). Berlin: Frank & Timme.Google Scholar
  7. Bredel, U., & Maaß, C. (2016). Ratgeber Leichte Sprache. Die wichtigsten Regeln und Empfehlungen für die Praxis. Berlin: Dudenverlag.Google Scholar
  8. Canfora, C., & Ottmann, A. (2015). Risikomanagement für Übersetzungen. trans-kom, 8(2), 314–346.Google Scholar
  9. Carl, M. (2010). A computational framework for a cognitive model of human translation processes. In Proceedings of Aslib translating and the computer 32. London.Google Scholar
  10. Carl, M., & Dragsted, B. (2012). Inside the monitor model: Processes of default and challenged translation production. TC3, Translation: Computation, Corpora, Cognition, 2(1), 127–145.
  11. Carl, M., & Kay, M. (2011). Gazing and typing activities during translation: A comparative study of translation units of professional and student translators. Meta: Translators’ Journal, 56(4), 952–975.
  12. Carl, M., & Schaeffer, M. J. (2017). Why translation is difficult: A corpus-based study of non-literality in post-editing and from-scratch translation. Hermes – Journal of Language and Communication Studies, 56, 43–57.
  13. Carl, M., & Schaeffer, M. J. (2018). The development of the TPR-DB as grounded theory method. Translation, Cognition & Behavior, 1(1), 168–193.
  14. Carl, M., Aizawa, A., & Yamada, M. (2016a). English-to-Japanese translation vs. dictation vs. post-editing. In LREC 2016 Proceedings: Tenth International Conference on Language Resources and Evaluation (pp. 4024–4031).
  15. Carl, M., Bangalore, S., & Schaeffer, M. (Eds.). (2016b). New directions in empirical translation process research: Exploring the CRITT TPR-DB (New Frontiers in Translation Studies). Cham: Springer.
  16. Čulo, O., et al. (2014). The influence of post-editing on translation strategies. In S. O’Brien, L. W. Balling, M. Carl, M. Simard, & L. Specia (Eds.), Post-editing of machine translation: Processes and applications (pp. 200–218). Cambridge, UK: Cambridge Scholars Publishing.
  17. Daems, J., et al. (2015). The impact of machine translation error types on post-editing effort indicators. In S. O’Brien & M. Simard (Eds.), Fourth Workshop on Post-Editing Technology and Practice, Proceedings (pp. 31–45). Association for Machine Translation in the Americas.
  18. Daems, J., et al. (2017). Identifying the machine translation error types with the greatest impact on post-editing effort. Frontiers in Psychology, 8, 1–15.
  19. Deane-Cox, S. (2014). Remembering Oradour-sur-Glane: Collective memory in translation. Translation and Literature, 23, 272–283.
  20. Diaz Cintas, J. (2009). New trends in audiovisual translation. Bristol/Buffalo/Toronto: Multilingual Matters.
  21. Doherty, S., O’Brien, S., & Carl, M. (2010). Eye tracking as an automatic MT evaluation technique. Machine Translation, 24(1), 1–13.
  22. Dragsted, B. (2004). Segmentation in translation and translation memory systems: An empirical investigation of cognitive segmentation and effects of integrating a TM system into the translation process. Copenhagen: Samfundslitteratur.
  23. Dragsted, B. (2005). Segmentation in translation: Differences across levels of expertise and difficulty. Target: International Journal on Translation Studies, 17(1), 49–70.
  24. Dragsted, B. (2010). Coordination of reading and writing processes in translation: An eye on unchartered territory. In G. M. Shreve & E. Angelone (Eds.), Translation and cognition. Amsterdam/Philadelphia: John Benjamins.
  25. Dragsted, B., & Hansen, I. G. (2008). Comprehension and production in translation: A pilot study on segmentation and the coordination of reading and writing processes. In S. Göpferich, A. L. Jakobsen, & I. M. Mees (Eds.), Looking at eyes: Eye-tracking studies of reading and translation processing (pp. 1–21). Copenhagen: Samfundslitteratur (Copenhagen Studies in Language).
  26. Eisele, A., et al. (2008). Hybrid machine translation architectures within and beyond the EuroMatrix project. In J. Hutchins & W. V. Hahn (Eds.), Hybrid MT methods in practice: Their use in multilingual extraction, cross-language information retrieval, multilingual summarization, and applications in hand-held devices (pp. 27–34). Proceedings of the European Machine Translation Conference. Hamburg: HITEC e.V., European Association for Machine Translation.
  27. Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
  28. Fox, W. (2018). Can integrated titles improve the viewing experience? Investigating the impact of subtitling on the reception and enjoyment of film using eye tracking and questionnaire data. Language Science Press.
  29. Gieshoff, A. C. (2018). The impact of audio-visual speech input on work-load in simultaneous interpreting. PhD thesis. Germersheim: Johannes Gutenberg-Universität Mainz.
  30. Göpferich, S. (2008). Translationsprozessforschung: Stand – Methoden – Perspektiven. Tübingen: Narr.
  31. Hansen-Schirra, S. (2017). EEG and universal language processing in translation. In J. W. Schwieter & A. Ferreira (Eds.), The handbook of translation and cognition (pp. 232–247). Malden/Oxford, UK: Wiley-Blackwell.
  32. Heyn, M. (1998). Translation memories: Insights and prospects. In L. Bowker et al. (Eds.), Unity in diversity? Current trends in translation studies (pp. 123–136). Manchester: St. Jerome.
  33. Hutchins, W. J., & Somers, H. L. (1992). An introduction to machine translation. London: Academic.
  34. Jakobsen, A. L. (2011). Tracking translators’ keystrokes and eye movements with Translog. In C. Alvstad, A. Hild, & E. Tiselius (Eds.), Methods and strategies of process research: Integrative approaches in translation studies (pp. 37–55). Amsterdam/Philadelphia: John Benjamins.
  35. Jakobsen, A. L., & Jensen, K. T. H. (2008). Eye movement behaviour across four different types of reading task. In S. Göpferich, A. L. Jakobsen, & I. M. Mees (Eds.), Looking at eyes: Eye-tracking studies of reading and translation processing (pp. 103–124). Copenhagen: Samfundslitteratur.
  36. Jensen, K. T. H., Sjørup, A. C., & Winther Balling, L. (2009). Effects of L1 syntax on L2 translation. In F. Alves, S. Göpferich, & I. M. Mees (Eds.), Methodology, technology and innovation in translation process research: A tribute to Arnt Lykke Jakobsen (pp. 319–336). Copenhagen: Samfundslitteratur.
  37. John, B. E. (1996). TYPIST: A theory of performance in skilled typing. Human-Computer Interaction, 11, 321–355.
  38. Kim, H., Lee, J.-H., & Na, S.-H. (2017). Predictor-estimator using multilevel task learning with stack propagation for neural quality estimation. In Proceedings of the Second Conference on Machine Translation (pp. 562–568).
  39. Klubička, F., Toral, A., & Sánchez-Cartagena, V. M. (2017). Fine-grained human evaluation of neural versus phrase-based machine translation. The Prague Bulletin of Mathematical Linguistics, 108, 121–132.
  40. Koponen, M., et al. (2012). Post-editing time as a measure of cognitive effort. In Proceedings of the AMTA 2012 Workshop on Post-Editing Technology and Practice (WPTP 2012) (pp. 11–20). San Diego.
  41. Krings, H. P. (1986). Was in den Köpfen von Übersetzern vorgeht: Eine empirische Untersuchung zur Struktur des Übersetzungsprozesses an fortgeschrittenen Französischlernern. Tübingen: Günter Narr Verlag.
  42. Lacruz, I., Denkowski, M., & Lavie, A. (2014). Cognitive demand and cognitive effort in post-editing. In Proceedings of the AMTA 2014 Workshop on Post-Editing Technology and Practice (pp. 73–84).
  43. Läubli, S. (2014). Statistical modelling of human translation processes. Edinburgh: University of Edinburgh.
  44. Läubli, S., & Germann, U. (2016). Statistical modelling and automatic tagging of human translation processes. In M. Carl, S. Bangalore, & M. Schaeffer (Eds.), New directions in empirical translation process research: Exploring the CRITT TPR-DB (New Frontiers in Translation Studies, pp. 155–181). Cham: Springer.
  45. Lin, D. (1996). On the structural complexity of natural language sentences. In Proceedings of the 16th Conference on Computational Linguistics (Vol. 2, p. 729). Copenhagen: Association for Computational Linguistics.
  46. Macizo, P., & Bajo, M. T. (2006). Reading for repetition and reading for translation: Do they involve the same processes? Cognition, 99(1), 1–34.
  47. Martínez-Gómez, P., et al. (2014). Recognition of translator expertise using sequences of fixations and keystrokes. In ETRA ’14: Proceedings of the Symposium on Eye Tracking Research and Applications (pp. 299–302).
  48. Martínez-Gómez, P., et al. (2018). Recognition and characterization of translator attributes using sequences of fixations and keystrokes. In C. Walker & F. M. Federici (Eds.), Eye tracking and multidisciplinary studies on translation (pp. 97–120). Amsterdam/Philadelphia: John Benjamins.
  49. Mishra, A., Bhattacharyya, P., & Carl, M. (2013). Automatically predicting sentence translation difficulty. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013) (Vol. 2, pp. 346–351).
  50. Moorkens, J., et al. (2015). Correlations of perceived post-editing effort with measurements of actual effort. Machine Translation, 29(1), 1–18.
  51. Nagels, A., et al. (2013). Neural substrates of figurative language during natural speech perception: An fMRI study. Frontiers in Behavioral Neuroscience, 7, 121.
  52. Nitzke, J., Hansen-Schirra, S., & Canfora, C. (2019). Risk management and post-editing competence. Journal of Specialised Translation, 31, 239–259.
  53. O’Brien, S. (2007). Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14(3), 185–205.
  54. O’Brien, S. (2011). Towards predicting post-editing productivity. Machine Translation, 25(3), 197–215.
  55. Orrego-Carmona, D. (2015). The reception of (non) professional subtitling. PhD thesis. Universitat Rovira i Virgili.
  56. Oster, K. (2018). Lexical activation and inhibition of cognates among translation students. PhD thesis. Germersheim: Johannes Gutenberg-Universität Mainz.
  57. Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3), 372–422.
  58. Reichle, E. D., Warren, T., & McConnell, K. (2009). Using E-Z Reader to model the effects of higher level language processing on eye movements during reading. Psychonomic Bulletin & Review, 16(1), 1–21.
  59. Richter, E. M., Engbert, R., & Kliegl, R. (2006). Current advances in SWIFT. Cognitive Systems Research, 7(1), 23–33.
  60. Toral, A., et al. (2018). Attaining the unattainable? Reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation (pp. 113–123). Brussels.
  61. Saikh, T., et al. (2015). Predicting source gaze fixation duration: A machine learning approach. In Proceedings of the 2015 International Conference on Cognitive Computing and Information Processing (CCIP 2015).
  62. Schaeffer, M. J., & Carl, M. (2013). Shared representations and the translation process: A recursive model. Translation and Interpreting Studies, 8(2), 169–190.
  63. Schaeffer, M. J., & Carl, M. (2017a). A minimal cognitive model for translating and post-editing. In S. Kurohashi & P. Fung (Eds.), Proceedings of MT Summit XVI (pp. 144–155). Japan: International Association for Machine Translation.
  64. Schaeffer, M. J., & Carl, M. (2017b). Language processing and translation: Translation and non-translational language use. In S. Hansen-Schirra, O. Czulo, & S. Hofmann (Eds.), Empirical modelling of translation and interpreting (Translation and Multilingual Natural Language Processing 7, pp. 117–154). Berlin: Language Science Press.
  65. Schaeffer, M. J., et al. (2016). Measuring cognitive translation effort with activity units. Baltic Journal of Modern Computing, 4(2), 331–345.
  66. Schaeffer, M. J., et al. (2017). Reading for translation. In A. L. Jakobsen & B. Mesa-Lao (Eds.), Translation in transition. Amsterdam/Philadelphia: John Benjamins.
  67. Schilperoord, J. (1996). It’s about time: Temporal aspects of cognitive processes in text production. Amsterdam: Rodopi.
  68. Seubert, S. (2018). Die Verarbeitung von visuellen Informationen beim Simultandolmetschen. PhD thesis. Germersheim: Johannes Gutenberg-Universität Mainz.
  69. Snover, M., et al. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (pp. 223–231).
  70. Specia, L. (2011). Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation (pp. 73–80). Leuven.
  71. Tatsumi, M. (2009). Correlation between automatic evaluation metric scores, post-editing speed, and some other factors. In Proceedings of MT Summit XII (pp. 332–339).
  72. van Hell, J. G., & de Groot, A. M. B. (2008). Sentence context modulates visual word recognition and translation in bilinguals. Acta Psychologica, 128, 431–451.
  73. Vieira, L. N. (2014). Indices of cognitive effort in machine translation post-editing. Machine Translation, 28(3–4), 187–216.
  74. Wang, J., et al. (2018). Alibaba submission for WMT18 quality estimation task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers (pp. 809–815). Association for Computational Linguistics.
  75. Winther Balling, L., Hvelplund, K. T., & Sjørup, A. C. (2014). Evidence of parallel processing during translation. Meta: Translators’ Journal, 59(2), 234–259.
  76. Wu, Y., et al. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv e-prints, 1–23.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Moritz Schaeffer
  • Jean Nitzke
  • Silvia Hansen-Schirra

  1. Johannes Gutenberg University Mainz, Germersheim, Germany