1 Introduction

The main objective of this study is to investigate the link between the source text’s linguistic features associated with translation difficulty and the quality of machine translation (MT). We first survey the rather separate literatures on measures of text-based translation difficulty and on automatic evaluation of MT quality. Different methods for automatic evaluation of MT quality have been proposed to provide a faster, cheaper, and more objective assessment of translation quality than manual human evaluation. For example, automatic evaluation can speed up the evaluation of MT systems and the process of identifying problems and weaknesses to improve them (Comelles & Atserias, 2019). However, most existing automatic evaluation methods rely on similarity measures between the machine output and a reference translation and therefore ignore possible variation in the translation difficulty of the source text. Unaccounted variation in the source text’s translation difficulty could bias automatic evaluation of MT quality.

Specifically, we aim to investigate whether metrics of translation difficulty based on linguistic features can provide additional information to enhance existing MT algorithms and their automatic quality evaluation methods. There are at least two reasons why MT research and development can benefit from reliable metrics of the source text’s translation difficulty. First, for MT system development, the information provided by such metrics can be crucial when comparing the performance of different MT algorithms. If there is a systematic negative correlation between linguistic features of the source text and the quality of MT, and if it is not possible to use the same source texts to compare different MT systems, then the MT quality scores obtained from the different texts can be adjusted based on the translation difficulty metrics. Second, for users of MT systems, if there is a systematic relationship between source text linguistic features (that is, translation difficulty) and MT quality, information on the “translation difficulty” level would aid in assessing the reliability of the output of MT systems and in deciding whether a professional human translator is required.

Surprisingly, while there are many studies which focus on the measurement of either translation difficulty or MT quality, fewer studies consider the potential link between the two. On the one hand, the translation difficulty literature suggests the importance of linguistic features in the process and product of translation. On the other hand, the literature on automatic quality evaluation of MT has not paid much attention to variation in the translation difficulty of the source text. In fact, in some WMT shared tasks, the quality of different MT systems may be measured and compared based on different source texts, with little consideration of how source text variation introduces additional variation in the MT performance metric beyond the variation arising from differences in the quality of the MT systems themselves. If one source text has a higher translation difficulty level than another, then we may incorrectly conclude that the system used to translate it produces more translation errors than another system processing a different but easier source text. Adjusting the translation quality metric to take into account the variation in the source text’s translation difficulty could help avoid such an incorrect conclusion. In other words, if MT system A is fed input text set 1, which is more (less) difficult to translate than input text set 2 fed into MT system B, then the automatic metrics measuring system A’s translation quality may need to be adjusted upward (downward) to reflect the fact that it has to process a more (less) difficult set of texts.

Furthermore, there is a serious criticism of the existing MT quality evaluation approach because of its reliance on the reference translation text (Lommel, 2016). In the real world of translation practice outside of experimental studies, users of MT systems are unlikely to have the reference translation text. Hence, it is often impossible for users to assess the quality of the output of MT systems. To address this criticism, researchers have focused on predicting the quality of MT output under the setting where no reference text is available (Specia et al., 2009, 2010; Almaghout & Specia, 2013). Our study is also aimed at contributing to this literature by investigating the potential use of linguistic features of the source text as inputs for MT quality predictive modeling.

Evaluation of the level of translation difficulty of the source text is an important item in translation education, accreditation, research, and the language industry. When tasked with translating a relatively easy-to-translate text, novice and professional human translators exhibit fundamentally different cognitive segmentation and translation speed, reflecting the higher capability of the latter. However, this difference disappears when the text to be translated is significantly more difficult, suggesting that even professional translators struggle and are likely to deliver lower-quality translations (Dragsted, 2004, 2005). We expect a similar relationship to exist between the translation difficulty of the source text and the quality of machine translation.

Without a reliable metric of the translation difficulty of the source text, it would be hard to objectively evaluate the quality of different MT systems based on their translation output and the extent of translation errors when the (potential) translation difficulty of the source texts varies. While there are only a few such studies compared to the many that have attempted to measure text readability, some researchers in the translation literature have investigated potential metrics for measuring translation difficulty based on the source text. For example, Campbell and Hale (1999, 2003) suggested essential features of the source text, such as effective passage length and the time it takes (a human translator) to complete a translation, as potential bases for such metrics, since difficulty in translation can be defined in terms of “the processing effort needed”. Hence, studies such as Jensen (2009) and Sun and Shreve (2014) have investigated whether readability metrics can be used to measure translation difficulty, given that reading and reverbalization are part of the translation process and hence a source of translation difficulty.

In this study, we investigate whether some of the proposed metrics of translation difficulty from the human/professional translation literature can be used to adjust for the effect of variation in the source text’s translation difficulty when assessing variation in MT systems’ quality. To the best of our knowledge, the link between metrics of translation difficulty and MT quality has not been studied in the literature before. Specifically, we investigate the relation between certain linguistic features of the source text (which have been identified in the literature on human/professional language translation as related to translation difficulty) and human judgement of the quality of machine translated text. All else equal, as in the case of human translators, we can expect that the more difficult a source text is for a human to translate (as reflected by longer sentence length, greater structural complexity, and a higher degree of polysemy), the lower the quality of the MT will be. Human judgement based on comparing the MT output with the original text in the source language or with the reference translation in the target language has been used as the main input for evaluating the performance of MT systems in the annual Conference on Machine Translation. In parallel, human judgement has also been used to guide the development of different metrics for automatic evaluation of MT quality. However, to our knowledge, the importance of variation in the source text’s translation difficulty has not received much consideration.

Our study is particularly inspired by the work of Mishra et al. (2013) on “automatically predicting sentence translation difficulty”. They developed a support vector machine (SVM) to predict the difficulty of translating a sentence, using three linguistic features of the sentence as inputs: “length”, “structural complexity”, and “average of words polysemy”. They found a positive correlation between these input features and the time spent on translation. To our knowledge, Mishra et al. (2013) is the first study which shows an automatic way of linking linguistic features and translation difficulty.

Other related studies on how translation difficulty can be assessed automatically have paid more attention to issues of assessment and scoring, such as Williams (2004), Secară (2005), and Angelelli (2009). However, none of these provides as clear and systematic a relationship between linguistic features and translation difficulty. For example, Williams (2004) provides a comprehensive discussion of different aspects of translation quality assessment and why they are important. Secară (2005) discusses various frameworks used in the process of translation evaluation in the translation industry and teaching institutions, with a special focus on error classification schemes covering wrong terms, misspellings, omissions, and other features. Finally, Angelelli (2009) suggests potential features to consider: 1. Grammar and Mechanics (target language); 2. Style and Cohesion (purpose); 3. Situational Appropriateness (includes audience, purpose, and domain terminology); 4. Source Text Meaning; 5. Translation Skill (evidence of effective use of resource materials). However, that study seeks to measure “translation ability” rather than translation quality, aiming to assess whether the translator has properly understood the audience and purpose.

Among machine translation studies, Costa et al. (2015) show that errors associated with structural complexity and polysemy are the most pervasive error types in their sample of English-Portuguese MT output. This finding is consistent with the findings of Mishra et al. (2013). Altogether, these and the studies discussed earlier are another reason why we propose to use the features and metrics of Mishra et al. (2013) in this study. Specifically, we implement the approach of Mishra et al. (2013) to measure three linguistic properties associated with the translation difficulty of the source texts, namely length (L), degree of polysemy (DP), and structural complexity (SC), on all English source texts used in the annual Conference on Machine Translation (WMT) over 2017–2019 (WMT 2017, WMT 2018, and WMT 2019). We then compute the pairwise Pearson correlation coefficients between these measures and human judgement scores of MT quality for translations of the English source texts into eleven different languages (Chinese, Czech, Estonian, Finnish, Latvian, Lithuanian, German, Gujarati, Kazakh, Russian, and Turkish). Our analysis shows mostly statistically significant (weak) negative correlations between our proxies of translation difficulty and the quality of MT systems.

The rest of the paper is structured as follows. In Sect. 2 we review the literature on translation difficulty measures and automatic evaluation of MT. In Sect. 3 we discuss the data we use for our analysis and the approach for analysing the relationship between linguistic features of the source text associated with translation difficulty and the human-evaluated quality of MT systems. In Sect. 4 we present and discuss the results. Finally, in Sect. 5, we provide some concluding remarks.

2 Related background

2.1 Measuring translation difficulty

In this section we review the literature on translation difficulty. Most studies on the relationship between the source text’s translation difficulty and the quality of translation are based on analyses of the translation process and product of human translators. However, as discussed later, recent studies have begun to explore the relationship between the source text’s translation difficulty and MT errors. For our purpose, we focus on Mishra et al. (2013)’s translation difficulty index (TDI) and how it can be useful in machine translation evaluation. However, it is plausible that other translation difficulty metrics developed in the literature are also relevant for the evaluation of machine translation quality.

For human translators, there are four measurement items that can be considered for measuring translation difficulty (Akbari & Segers, 2017): (1) the identification of sources of translation difficulty, (2) the measurement of text readability, (3) the measurement of translation difficulty by means of translation evaluation products such as holistic, analytic, calibrated dichotomous items, and preselected items evaluation methods, and (4) the measurement of mental workload. Studies such as Campbell and Hale (1999) have investigated whether certain essential linguistic features of the source text reflect “the processing effort needed” or the mental workload involved in translating a text. If features such as effective passage length and required translation time serve as potential sources of translation difficulty, they can be used for constructing translation difficulty metrics. Specifically, that study proposed a metric based on the cognitive effort required under time constraint to gauge source text difficulty “by identifying those lexical items that require higher amounts of cognitive processing”. However, the study admitted that such a difficulty criterion is not necessarily related to the idea of translation correctness, which is the focus of translation quality studies. In other words, it is possible for different translators to mistranslate, in a similar way, a segment of text that requires little cognitive effort. Hence, the suggested cognitive approach may fail to identify the real difficulty of the text.

In a subsequent study, Hale and Campbell (2002) investigated the relationship between source text linguistic features such as official terms, complex noun phrases, passive verbs, and metaphors and the accuracy of the translated text. However, they concluded that there is no clear correlation between these source text features and the accuracy of the translation.

Jensen (2009) investigated whether some of the standard readability indices of relative differences in the complexity of a text (such as word frequency and non-literalness) can be used as predictors of translation difficulty. Unfortunately, while the study explored the potential and weaknesses of different readability indicators for predicting text difficulty, it is not comprehensive or systematic enough to reach any conclusive finding.

A more recent study, Sun and Shreve (2014), investigated in a more systematic way how readability is related to translation difficulty, using experimental data from a sample of 49 third-year undergraduate students in translation/English and 53 first-year graduate students in translation. All the students in the experiment spoke Mandarin Chinese as their first language and had started learning English as a foreign language in the 6th grade. These students were assigned the task of translating short English texts (about 121 to 134 words) into Chinese. The Flesch Reading Ease (FRE) formula, one of the most popular and influential readability formulas (DuBay, 2004), was used to score the readability of the source English texts and to classify them into three categories: easy (FRE scores of 70–80), medium (46–55), and difficult (20–30). The study found a weak, negative correlation between readability and translation quality score (obtained by averaging human evaluation scores from three independent graders, all of whom had translation teaching experience and had translated at least two books). The study reported an R-square of 0.015, meaning that the translation quality scores and the translation difficulty scores of the source texts shared only 1.5% of their variance. Furthermore, the study found that the translation quality achieved by better translators (defined as students with higher grades) is not consistently different from that achieved by worse translators (lower-grade students). Finally, the study found that the time spent on translating the source text was positively (but weakly) and statistically significantly correlated with the translation difficulty level.
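
To make the banding concrete, below is a minimal Python sketch of FRE scoring using the standard Flesch formula together with the score bands reported for Sun and Shreve (2014); the naive syllable counter and the example text are our own simplifications, not part of the original study.

# Sketch of Flesch Reading Ease (FRE) scoring and the difficulty bands used in
# Sun and Shreve (2014). The vowel-group syllable counter is a rough stand-in
# for a proper one (e.g., a pronouncing-dictionary lookup).
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # FRE = 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_words = max(1, len(words))
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

def difficulty_band(fre: float) -> str:
    if 70 <= fre <= 80:
        return "easy"
    if 46 <= fre <= 55:
        return "medium"
    if 20 <= fre <= 30:
        return "difficult"
    return "outside the bands used in the study"

sample = "The cat sat on the mat. It was warm and quiet."
print(difficulty_band(flesch_reading_ease(sample)))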

In another related study, Sun (2015) provided a theoretical and methodological overview of translation difficulty and emphasized that an accurate assessment of the level of translation difficulty of a given source text is a critical factor to consider in translator training, accreditation and related research. Traditionally, people rely on their general impression from reading the source text to gauge its translation difficulty level. However, for a more effective evaluation process, Sun (2015) argued that a more objective and systematic instrument is required. For that purpose, there are two basic questions to answer: what to measure and how to measure it. The potential sources of translation difficulty can be classified into the translation factors and the translator factors. Accordingly, to measure translation difficulty, we need to be able to measure the source text difficulty, identify the translation-specific difficulty (e.g., non-equivalence, one-to-several equivalence, and one-to-part equivalence situations as mentioned by Baker (2011)), and assess the cognitive factors associated with the translation difficulty (such as the mental workload for the translator). The study mentioned that readability formulas are most often used to measure text difficulty and suggested that, for identifying translation-specific difficulty, grading translations, analysing verbal protocols, and recording and analysing translation behaviour are also required.

Howard (2016) argued that although short-passage test scores are frequently used for translator certification, we know little about how the text’s features and the test scores are linked to the objective or the purpose of the test. He analysed the text features associated with translation difficulty in Japanese-to-English short-passage translation test scores. The analysis revealed that, first, it is possible to link specific passage features such as implicit or explicit cohesion markers to a desired trait of a good translation such as the creation of a coherent target text. Second, there are elements in the text features that could signal coherence and be objectively scored as acceptable or unacceptable to be used for calculating facility values (percentage of correct responses) in the test population. Finally, these facility values can be used to create a profile of comparative passage difficulty and to quantitatively identify items of difficulty within the passage.

Wu (2019) explored the relationship between text characteristics, perceived difficulty, and task performance in sight translation. In the study, twenty-nine undergraduate interpreters were asked to sight-translate six texts with different properties. Correlation analysis showed that the students’ fluency and accuracy in performing the tasks were related to sophisticated word types, the mean length of T-units, and lexical and syntactic variables in the source texts.

Eye tracking is a process-based alternative to outcome-based measures, such as translation test scores, for measuring the effect of translation difficulty. In this case, translation difficulty is inferred from an analysis of the translator’s attention (as reflected by their eyes and gaze) on both the source and target texts during the actual translation process. The analysis assumes that, because “the eye remains fixated on a word as long as the word is being processed”, longer fixation reflects difficulty (Just & Carpenter, 1980). Thus, this approach rests on the theory that “the more complex texts require readers to make more regressions in order to grasp the meaning and produce a translation” (Sharmin et al., 2008).

Mishra et al. (2013) developed a translation difficulty index (TDI) based on the theory that “difficulty in translation stems from the fact that most words are polysemous and sentences can be long and have complex structure”. They illustrated this by comparing two simple sentences of eight words: (i) “The camera-man shot the policeman with a gun.” and (ii) “I was returning from my old office yesterday.” According to them, the first sentence is more difficult to process and translate because of the lexical ambiguity of the word “shoot”, which may mean taking a picture or firing a shot, and the structural ambiguity (policeman with a gun or shot with a gun). They argued that obtaining fluent and adequate translations requires the translator to analyse both the lexical and syntactic properties of the source text. They then constructed the TDI measure based on eye tracking cognitive data, defining the TDI value as the sum of the fixation (gaze) time and the saccade (rapid eye movement) time. In addition, they measured three linguistic features of the source text: (i) length (L), defined as the total number of words occurring in a sentence, (ii) degree of polysemy (DP), defined as the sum of senses possessed by each word in WordNet (Miller, 1995) normalized by the sentence length, and (iii) structural complexity (SC), defined as the total length of dependency links in the dependency structure of the sentence (Lin, 1996). The idea behind SC is that words, phrases, and clauses are syntactically attached to each other in a sentence, and the sentence has higher structural complexity if these units lie far from each other. For example, as shown in Fig. 1, the structural complexity of the sentence “The man who the boy attacked escaped.” is 1+5+4+3+1+1=15. They showed that the TDI and the linguistic measures (L, DP, and SC) are positively correlated, confirming the hypothesis that linguistic features can serve as indicators of translation difficulty.

Fig. 1 Dependency graph
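
As an illustration of how these three features can be computed in practice, the following is a minimal Python sketch using NLTK’s WordNet interface for sense counts and a spaCy dependency parse; these tool choices, the treatment of punctuation, and the handling of the example sentence are our own assumptions rather than the exact setup of Mishra et al. (2013).

# Minimal sketch of the three source-text features: length (L), degree of
# polysemy (DP), and structural complexity (SC). Requires the spaCy model
# "en_core_web_sm" and the NLTK WordNet data (nltk.download("wordnet")).
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")

def length(doc) -> int:
    # L: number of word tokens in the sentence (punctuation excluded here).
    return sum(1 for tok in doc if not tok.is_punct)

def degree_of_polysemy(doc) -> float:
    # DP: total number of WordNet senses of the words, normalised by length.
    words = [tok for tok in doc if not tok.is_punct]
    total_senses = sum(len(wn.synsets(tok.text)) for tok in words)
    return total_senses / max(1, len(words))

def structural_complexity(doc) -> int:
    # SC: total length of dependency links, i.e. the sum of distances (in
    # token positions) between each token and its syntactic head. The exact
    # value depends on the parser's dependency analysis.
    return sum(abs(tok.i - tok.head.i) for tok in doc if tok.head is not tok)

doc = nlp("The man who the boy attacked escaped.")
print(length(doc), degree_of_polysemy(doc), structural_complexity(doc))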

Interestingly, the finding that the degree of polysemy (DP) positively correlates with translation difficulty (Mishra et al., 2013) appears to be consistent with a separate machine translation study that analysed different types of MT errors (Costa et al., 2015). In that study, as many as seventeen machine translation error types were identified, including orthography errors such as misspelling, lexical errors such as omission, grammatical errors such as incorrect ordering, discourse style errors such as variety, and semantic errors such as confusion of senses. Based on a sample of 750 English-Portuguese sentence pairs, the study found that the lexical, grammatical, and semantic groups of errors are the most pervasive. Among individual error types, confusion of senses is one of the two most pervasive. The other is misselection, from the grammatical group of errors, which is most closely related to the source text’s structural complexity.

Furthermore, the findings from the following two studies are particularly relevant for our analysis because they help us understand why a metric based on sense ambiguity in the source text could indicate the degree of translation difficulty. First, Raganato et al. (2019)’s work focuses on word sense ambiguity. They presented MUCOW, a multilingual contrastive test suite that covers 16 language pairs with more than 200,000 contrastive sentence pairs, automatically built from word-aligned parallel corpora and the wide-coverage multilingual sense inventory of BabelNet. They then evaluated the quality of the ambiguity lexicons and of the resulting test suite on all submissions from nine language pairs in the WMT19 news shared translation task, plus on five other language pairs using pretrained NMT models. They used the proposed benchmark to assess the word sense disambiguation ability of neural machine translation systems. Their findings show that state-of-the-art and fine-tuned neural machine translation systems still have drawbacks in handling ambiguous words, especially when evaluated on out-of-domain data and when the encoder has to deal with a morphologically rich language.

Second, Popović (2021) carried out an extensive analysis of MT errors observed and highlighted by different human evaluators according to different quality criteria. Her analysis includes three language pairs, two domains and eleven NMT systems. The main findings of the work show that most perceived errors are caused by rephrasing, ambiguous words, noun phrases and mistranslations. Other important sources of errors include untranslated words and omissions.

2.2 Automatic machine translation evaluation

The measurement of translation quality, specifically when it comes to MT, is one of the most active areas in translation research. The use of automatic translation evaluation metrics has distinctly accelerated the development cycle of MT systems. Currently, one of the most widely used metrics for automated translation evaluation is BLEU, a string-matching metric based on the idea that “the closer a machine translation is to a professional human translation, the better it is” (Papineni et al., 2002). However, several problems have been identified in translation evaluation based on BLEU. Callison-Burch et al. (2006) and Koehn and Monz (2006) discussed possible disagreements between automatic system evaluation rankings produced by BLEU and those of human assessors. They argued that BLEU may not be reliable when the systems under evaluation are different in nature, such as rule-based versus statistical systems or human-aided versus fully automatic systems. Reiter (2018) reviewed 34 papers and, based on the 284 reported correlation coefficients between human scores of MT quality and BLEU, concluded that overall the evidence supports the use of BLEU as a diagnostic evaluation metric for MT systems. However, the review also concluded that the evidence does not support the use of BLEU outside of MT system evaluation, such as for evaluating individual texts or for scientific hypothesis testing.
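
To make the string-matching idea concrete, here is a minimal sentence-level BLEU computation using NLTK; the tokenised hypothesis and reference below are invented for illustration, and shared-task evaluation typically relies on corpus-level implementations rather than single-sentence scores.

# Minimal illustration of BLEU: modified n-gram precision of the hypothesis
# against the reference(s), combined with a brevity penalty. Smoothing avoids
# zero scores on short sentences that miss some higher-order n-grams.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["the", "cameraman", "photographed", "the", "policeman", "holding", "a", "gun"]]
hypothesis = ["the", "cameraman", "shot", "the", "policeman", "with", "a", "gun"]

score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"Sentence-level BLEU: {score:.3f}")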

Essentially, the automated evaluation metrics for MT such as BLEU and the other metrics discussed below are all based on the concept of lexical similarity between the reference translation and the MT systems’ output. These metrics assign higher translation quality scores to machine translated text having higher lexical similarity to the reference, human translated, text. The most basic lexical similarity metrics are based on edit distances, such as PER (Tillmann et al., 1997), WER (Nießen et al., 2000), and TER (Snover et al., 2006). Metrics based on lexical precision but without consideration of any linguistic information include BLEU and NIST (Doddington, 2002), while metrics such as ROUGE (Lin & Och, 2004) and CDER (Leusch et al., 2006) are based on lexical recall. Metrics that consider a balance between precision and recall include GTM (Melamed et al., 2003), METEOR (Banerjee & Lavie, 2005), BLANC (Lita et al., 2005), SIA (Liu & Gildea, 2006), and MAXSIM (Chan & Ng, 2008). Lexical information such as synonyms, stemming, and paraphrasing is considered by the following metrics: METEOR, M-BLEU and M-TER (Agarwal & Lavie, 2008), TERp (Snover et al., 2009), SPEDE (Wang & Manning, 2012), and MPEDA (Zhang et al., 2016). Popović (2015) proposed the use of a character n-gram F-score for automatic evaluation of machine translation output. Wang et al. (2016) proposed a translation edit rate at the character level (CharacTER), which calculates the character-level edit distance while performing the shift edit at the word level.

Although lexical-based measures appear to perform well across a variety of translation quality evaluation settings, there is broad criticism against their use for this purpose (see, for example, Coughlin (2003) and Culy and Riehemann (2003)). The main argument against lexical measures is that they are document similarity measures rather than translation quality measures. Hence, suggested improvements include the use of models of language variability that compare the syntactic and semantic structure of candidate and reference translations. Liu and Gildea (2005), for example, proposed different syntactic measures based on comparing head-word dependency chains and constituent subtrees. Popović and Ney (2007) introduced several measures based on edit distance over parts of speech. Owczarzak et al. (2007) proposed a measure based on comparing dependency structures from a probabilistic lexical-functional grammar parser. Mehay and Brew (2006) developed a measure based on combinatory categorial grammar parsing that parses only the reference translations, avoiding the need to parse the possibly ill-formed automatic candidate translations. Kahn et al. (2009) used a probabilistic context-free grammar parser and deterministic head-finding rules. Other proposed measures are based on morphological information such as suffixes, roots, and prefixes; among them we can mention AMBER (Chen et al., 2012) and INFER (Popović et al., 2012). There are also measures based on syntactic information such as part-of-speech tags, constituents, and dependency relations, such as HWCM (Liu & Gildea, 2005) and UOWREVAL (Gupta et al., 2015), and measures applying semantic information, such as SAGAN-STS (Castillo & Estrella, 2012), MEANT (Lo & Yu, 2013), and MEANT 2.0 (Lo, 2017).

Combining different evaluation methods using machine learning has also been proposed to improve automatic MT quality evaluation. The focus of such solutions is on evaluating the well-formedness of automatic translations. For example, Corston-Oliver et al. (2001) applied decision trees to distinguish human-generated translations from machine-generated ones. In their study, they extracted 46 features from each sentence by performing a syntactic parse using the Microsoft NLPWin natural language processing system (Heidorn, 2000) and language modeling tools. As another example, Akiba et al. (2001) proposed a ranking method that uses multiple edit distances to encode machine translated sentences with human-assigned ranks into multi-dimensional vectors, from which a rank classifier is learned in the form of a decision tree. Kulesza and Shieber (2004), on the other hand, used support vector machines and features inspired by BLEU, NIST, WER, and PER.

Gamon et al. (2005) presented a support vector classifier for identifying highly dysfluent and ill-formed sentences. Similar to Corston-Oliver et al. (2001), the classifier uses linguistic features obtained with the French NLPWin analysis system (Heidorn, 2000). The machine learning model is trained on features extracted from machine translated and human translated sentences. Quirk (2004) and Quirk et al. (2005) suggested the use of a variety of supervised machine learning algorithms, such as perceptron, support vector machines, decision trees, and linear regression, on a rich collection of features extracted by their system.

Ye et al. (2007) also considered MT evaluation as a ranking problem. They applied a ranking support vector machine algorithm to sort candidate translations based on several features extracted from three categories: n-gram-based features, dependency-based features, and translation perplexity according to a reference language model. The approach showed higher correlation with human assessment at the sentence level, even when an n-gram match score is used as a baseline feature. Other studies using machine learning techniques to combine different types of MT metrics include Yang et al. (2011), Gautam and Bhattacharyya (2014), Yu et al. (2015), and Ma et al. (2017).

Giménez and Màrquez (2010) argued that using only a limited number of linguistic features could bias the development cycle, with negative consequences for MT quality. They introduced an automatic MT quality evaluation metric based on a rich set of specialized similarity measures operating at different linguistic dimensions. The approach can analyse both individual and collective behaviour over a wide range of evaluation scenarios. However, instead of using machine learning techniques, their proposed method is based on uniformly averaged linear combinations of measures (ULC). That is, the combined score from various metrics is the normalised arithmetic mean of the individual measures:

$$\begin{aligned} ULC_M(t,R) = \frac{1}{|M|}\sum \limits _{m \in M} m(t,R) \end{aligned}$$
(1)

where M is the measure set and m(t, R) is the normalised similarity between the automatic translation t and the set of references R, for the given test case, according to the measure m. Normalised scores are computed by dividing actual scores by the maximum score attained over the set of all test cases.
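
A minimal sketch of this combination is shown below, assuming each individual metric has already produced a raw score per test case; the metric names and scores are hypothetical.

# Sketch of a uniformly averaged linear combination (ULC): each metric's raw
# scores are normalised by the maximum that metric attains over all test
# cases, and the normalised scores are then averaged uniformly across metrics.
from typing import Dict, List

def ulc(raw_scores: Dict[str, List[float]]) -> List[float]:
    n_cases = len(next(iter(raw_scores.values())))
    normalised = {
        metric: [s / (max(scores) or 1.0) for s in scores]  # guard against all-zero scores
        for metric, scores in raw_scores.items()
    }
    return [sum(normalised[m][i] for m in raw_scores) / len(raw_scores)
            for i in range(n_cases)]

# Hypothetical per-segment scores from three metrics on four test cases.
print(ulc({"BLEU":   [0.20, 0.40, 0.10, 0.30],
           "METEOR": [0.50, 0.60, 0.30, 0.40],
           "GTM":    [0.45, 0.70, 0.20, 0.50]}))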

Based on their implementation results, presented as an online tool called Asiya (Gimenez & Marquez, 2010), Giménez and Màrquez (2010) concluded that measures based on syntactic and semantic information can provide a more reliable metric for MT system ranking than lexical measures, especially when the systems under evaluation are based on different paradigms. They further showed that, while certain linguistic measures perform better than most lexical measures at the sentence level, some others perform worse when there are parsing problems. However, they argued that combining different measures is still suitable and can yield a substantially improved evaluation quality metric.

Comelles and Atserias (2019) introduced VERTa, an MT evaluation metric based on linguistic information inspired by Giménez and Màrquez (2010)’s approach, except that they used correlation with human judgements and different datasets to find the best combination of linguistic features. Comelles and Atserias (2019) argued that VERTa checks the suitability of the selected linguistic features and how they should interact to better measure adequacy and fluency in English. In essence, VERTa is a modular model which includes lexical, morphological, dependency, n-gram, semantic, and language model modules. VERTa uses the Fmean to combine precision and recall measures. If there is more than one reference text, the maximum Fmean among all references is returned as the score. Once the per-module scores are calculated, the final score is a weighted average of the modules’ Fmean scores.
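
A small sketch of this kind of score combination follows, assuming the standard harmonic F-measure for Fmean; the module names, precision/recall values, and weights are hypothetical, and the actual VERTa weighting scheme may differ.

# Sketch of a VERTa-like combination: per module, take the best Fmean over all
# references, then combine the module scores with a weighted average.
from typing import Dict, List, Tuple

def fmean(precision: float, recall: float) -> float:
    # Standard harmonic mean of precision and recall (assumed for Fmean).
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def combined_score(module_pr: Dict[str, List[Tuple[float, float]]],
                   weights: Dict[str, float]) -> float:
    # module_pr maps a module name to (precision, recall) pairs, one pair per reference.
    total_weight = sum(weights.values())
    return sum(weights[m] * max(fmean(p, r) for p, r in prs)
               for m, prs in module_pr.items()) / total_weight

# Hypothetical lexical and dependency module results against two references.
print(combined_score(
    {"lexical": [(0.80, 0.70), (0.75, 0.72)], "dependency": [(0.60, 0.50), (0.65, 0.55)]},
    {"lexical": 0.6, "dependency": 0.4}))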

Most recently, some researchers have applied neural based machine learning models in their works such as Thompson and Post (2020) who framed the task of machine translation evaluation as one of scoring machine translation output with a sequence-to-sequence paraphraser, conditioned on a human reference. They proposed training the paraphraser as a multilingual NMT system, treating paraphrasing as a zero-shot translation task (e.g., Czech to Czech). This results in the paraphraser’s output mode being centred around a copy of the input sequence, which represents the best-case scenario where the MT system output matches a human reference. As another example, Rei et al. (2020) presented COMET, a neural framework for training multilingual machine translation evaluation models. Their framework leverages cross-lingual pretrained language modelling resulting in multilingual and adaptable MT evaluation models that exploit information from both the source input and a target-language reference translation in order to more accurately predict MT quality.

3 Data and approach

3.1 Data

The data for our empirical analysis come from the publicly available data used in the series of workshops on machine translation (WMT), an annual international workshop on various topics related to machine translation and the automatic evaluation of MT quality, going back to 2006. Specifically, we used data on human judgement scores of MT outputs for a given set of English source texts translated into eleven different languages (Chinese, Czech, Estonian, Finnish, German, Gujarati, Kazakh, Latvian, Lithuanian, Russian, and Turkish). We focused our analysis on translation from English as the source language because of the need to identify linguistic features using readily available and well-developed tools. We used the NLTK Natural Language Toolkit (Loper & Bird, 2002) and the Stanford CoreNLP natural language processing parser (Manning et al., 2014) to extract the linguistic features from these English source texts. We selected the sample period of the 2017, 2018, and 2019 workshop years to ensure that we can use the absolute human quality scores instead of the relative scores provided in earlier years. Because different workshop years cover different sets of language pairs, our estimating sample is an unbalanced panel of 11 target languages over three years. For the 2017 workshop, our sample contains the target languages Chinese, Czech, Latvian, Finnish, German, Russian, and Turkish. The 2018 data contain all of the previous year’s target languages except Latvian, which was replaced with Estonian. The 2019 data add Lithuanian, Gujarati, and Kazakh.

3.2 Approach

Our analytical approach is based on a correlation analysis between the translation difficulty level of the source English text and the quality of MT systems’ output in each of the eleven target languages. We used WMT’s human evaluator scoring data as the metric of MT quality. For the translation difficulty metrics of the source English text, we considered a set of linguistic features similar to those in Mishra et al. (2013):

Length: the number of words in the sentence.

Polysemy: the sum of senses, based on WordNet, of the non-stop words in the sentence.

Structural complexity: the total length of dependency links in the dependency structure of the sentence.

In their paper, the authors showed a positive correlation between L, DP, and SC and the time spent on translation, which is known to be an important indicator in studies measuring translation difficulty. In our implementation, we constructed these linguistic features for each sentence in the set of English source texts (using the same measure definitions as in Mishra et al. (2013)) and computed their correlation coefficients with MT quality scores. The computation is done separately for each available target language.
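
A minimal sketch of this correlation step is shown below, assuming the per-sentence feature values and averaged human quality scores have already been assembled into a table; the column names are illustrative, not those of the WMT data files.

# Sketch of the correlation analysis: pairwise Pearson correlations between the
# average human judgement score per source sentence and each linguistic
# feature, computed separately for each language pair (and data year).
import pandas as pd
from scipy.stats import pearsonr

FEATURES = ("length", "polysemy", "structural_complexity")

def feature_quality_correlations(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for lang_pair, group in df.groupby("lang_pair"):
        for feature in FEATURES:
            r, p = pearsonr(group[feature], group["avg_quality"])
            rows.append({"lang_pair": lang_pair, "feature": feature,
                         "pearson_r": r, "p_value": p})
    return pd.DataFrame(rows)

# df is expected to hold one row per source sentence and language pair, with
# the three feature values and the sentence's average human judgement score:
# df = pd.read_csv("wmt_scores_with_features.csv")   # hypothetical file name
# print(feature_quality_correlations(df))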

4 Results and discussion

Table 1 provides summary statistics (sample mean and standard deviation) of the human judgement scores of MT quality and of each of the three linguistic features. In WMT2019, the source English texts were the same for all target languages; hence, the linguistic features are identical across language pairs. In WMT2017 and 2018, the sets of source English texts vary because they include both genuine English source texts and English texts that were themselves translations from the target language.

To ensure that we use evaluation scores that reflect the difficulty of translating the source English text, we restrict our sample as follows. First, we only consider source sentences which have been translated by more than one MT system. This ensures that our analysis does not capture system variation instead of source text variation. Second, we exclude evaluated translation scores marked as “REF”, “BAD-REF”, or “REPEAT” because they are inserted quality-control pairs rather than true machine translation output. For example, “BAD-REF” indicates the use of deliberately damaged MT outputs to check whether the human judge behaves as expected by assigning them significantly worse scores. Records with missing translation scores are also removed. In addition, because the structural complexity measure is computed at the sentence level, we keep only source segments consisting of a single sentence. Because of these restrictions, the sample size in Table 1 differs from the original sample size in the WMT submitted data. Last, but not least, to take into account the fact that the WMT2017 and 2018 samples used both genuine English source sentences and target sentences pre-translated into English, we analyse both the full sample and the subsample which only includes genuine English source text.
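
The sample restrictions above can be summarised in the following sketch, applied to a hypothetical table of WMT human judgements; apart from the “REF”/“BAD-REF”/“REPEAT” labels named in the text, all column names are our own illustrative choices.

# Sketch of the sample restrictions: drop quality-control items and missing
# scores, keep single-sentence source segments, and keep only source segments
# translated by more than one MT system.
import pandas as pd

def restrict_sample(df: pd.DataFrame) -> pd.DataFrame:
    df = df[~df["item_type"].isin(["REF", "BAD-REF", "REPEAT"])]
    df = df.dropna(subset=["score"])
    df = df[df["n_sentences"] == 1]
    systems_per_segment = df.groupby("source_id")["system"].nunique()
    multi_system_ids = systems_per_segment[systems_per_segment > 1].index
    return df[df["source_id"].isin(multi_system_ids)]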

Table 1 Summary statistics

First, Table 1 shows significant variation in translation quality (Q) and, for 2017 and 2018, in the linguistic features (L, DP, and SC) across language pairs within the same study year. For example, in the 2017 data, the lowest MT quality is observed for the EN-TR pair (32.7), whereas the highest is observed for EN-ZH at 65.9 (around 100% higher). However, because each EN-XX language pair in that year may contain different English source texts (as confirmed by the variation in each linguistic feature measure), we cannot be sure whether the variation in MT quality is purely due to variation in the quality of MT algorithms across target languages or whether it is also due to variation in source text translation difficulty. For 2019, however, all language pairs use the same source English text. Hence, there is no variation in linguistic features across language pairs, and the variation in translation quality is likely due to cross-language variation in MT algorithms and training data.

We also see from Table 1 significant variation across the years in translation quality and in the linguistic features. For example, in the 2017 data, the highest levels of length, polysemy, and structural complexity are approximately 25%, 28%, and 35% higher than the lowest levels, respectively. The degree of variation for 2018 is slightly lower, but it is not trivial either. Unconditional comparison of the translation quality of the same language pair across years may therefore be confounded by cross-year variation in linguistic features. However, as shown in Table 1, the higher translation quality (Q) in 2019 compared to the other two years seems to indicate a genuine increase in MT quality, since most of the linguistic features representing translation difficulty in 2019 are at least as high as those in the earlier years.

Table 2 presents the correlation coefficients between the average translation quality score over all MT systems’ outputs for each sentence in a language pair and each linguistic feature of the source English text, for each data year. As expected, with a few exceptions particularly for the EN-ZH pair, linguistic features associated with higher translation difficulty are negatively correlated with translation quality. Most of the correlation coefficients are significantly different from zero with a p-value of less than one per cent. Ignoring the non-statistically significant coefficients and the positive correlation coefficients displayed by the EN-ZH data in 2017 and 2018 (discussed separately in a subsequent paragraph), the strength of the negative correlation between translation quality and linguistic features ranges from −0.07 (EN-FI 2017; L) to −0.32 (EN-LT 2019; DP).

Table 2 Pairwise Pearson correlation of machine translation quality and linguistic features

Comparing the correlation coefficients across linguistic features within each language pair and year in Table 2, polysemy appears to be the most strongly correlated with translation quality. Excluding all non-statistically significant coefficients and the coefficients for EN-ZH and EN-RU in 2017 and 2018, the average correlation coefficients for L, DP, and SC are, respectively, −0.186, −0.215, and −0.156. The evidence that polysemy presents the most important translation difficulty for MT systems appears to be consistent with the works of Costa et al. (2015) and Popović (2021) summarised earlier, which suggest the presence of ambiguous words as a potentially important source of translation errors.

Table 2 also shows significant variation in the link between translation difficulty linguistic features and MT quality, particularly for 2017 and particularly for the EN-ZH pair. We consider several plausible reasons behind such variation: (1) variation in the linguistic features of the source text, as shown in Table 1; (2) variation in the quality and sensitivity of MT systems; and (3) variation in the quality of human assessment. Without a more extensive data analysis, possibly in a controlled experimental setting, it is difficult to identify which of these reasons is the most important. However, Barrault et al. (2019) highlighted a significant problem in the quality of the Mechanical Turk workers used for the 2017 EN-RU and EN-ZH evaluations, arising from higher rates of gaming. For example, only eight out of the original 43 workers were retained as providing the “good” human judgement data we analyse. In their accompanying data notes, Bojar et al. (2017) suggested a minimum of 15 human assessments to obtain an accurate sentence-level score. In 2018, the gaming problem for both language pairs was still the worst among all pairs, but it was not as severe as in 2017. Only in the 2019 data did the extent of gaming for the EN-RU and EN-ZH pairs appear comparable to that of the rest of the language pairs (Barrault et al., 2019). Therefore, we believe the positive correlations are anomalies, indicative of problems with the human judgement data.

Furthermore, the most consistently negative correlations across the three years of data are exhibited by the 2019 data. This is possibly due to the use of genuine English source text in 2019, as opposed to a mix of genuine and ‘translationese’ (target-language sentences pre-translated into English) sentences (Barrault et al., 2019; Bojar et al., 2017). Graham et al. (2019), as cited in Barrault et al. (2019), argued that the inclusion of such test data (that is, pre-translated target sentences used as source sentences) could introduce inaccuracies in the evaluation of MT quality. We believe the pre-translated text may also affect our computed linguistic features and the human judgement scores. To verify this, we redo the correlation analysis for the 2017 and 2018 data, excluding source sentences whose original language is not English. The results, summarised in Table 3, show that the statistically significant positive correlation coefficients discussed above were indeed driven by the ‘translationese’. In other words, the use of ‘translationese’ may also affect the accuracy of the estimated relationship between the linguistic measures of translation difficulty and the quality of MT.

Table 3 Pairwise Pearson correlation of machine translation quality and linguistic features; Excluding ‘translationese’ (that is, source sentences which are not English original)

5 Conclusions

The quality of a translation depends on the capability of the translator, human or machine, and on the level of difficulty of the source text. One may argue that the link between the translation difficulty of the source text and human translation quality may be weak because, provided with enough time, a human translator may always be able to deliver a high level of translation quality. However, there is virtually no “translation time” parameter for an MT system, since MT output is delivered almost instantaneously. In other words, variation in source text translation difficulty is much more likely to be directly reflected in the translation quality produced by an MT system than in that produced by a human translator.

Surprisingly, when we surveyed the existing studies on linguistic measures of translation difficulty and on the quality of MT, we did not find many articles covering both topics. In the MT literature, most attention has been focused on developing evaluation metrics for translation quality based on test case comparisons of machine translated text and the “reference” translation. This focus is reasonable given that the main objective is to improve the algorithms behind MT systems. However, a more comprehensive understanding of the determinants of translation quality is potentially valuable for refining existing algorithms to reduce the most pervasive MT error types, those related to grammar and, particularly, confusion of word senses.

Hence, in this paper, in addition to providing a survey of the relevant literature, we aimed to contribute to such understanding by empirically investigating the relationship between measurable linguistic features that reflect the translation difficulty of the source text and the quality of MT. Specifically, we constructed measures of translation difficulty that have been shown to be correlated with cognitive measures capturing the effort required by human translators to complete a given translation task in an experimental setting, that is, the extent of translation difficulty. These measures are the length of the sentence, the degree of polysemy, and the sentence’s structural complexity. We found mostly negative correlations between each of these translation difficulty measures and MT quality as assessed by human judges/evaluators for the full sample of English-to-other-language test sets in WMT2017, WMT2018, and WMT2019. This finding is consistent with the existing evidence that MT systems tend to suffer mostly from the types of errors associated with grammar and sense confusion.

In summary, the results of our analysis suggest that there are measurable linguistic features that can be used to measure translation difficulty and perhaps even to predict translation quality. The ability to measure translation difficulty is thus important, for example, for normalising source texts when translation quality has to be compared across different translators and different levels of source text difficulty. Furthermore, we found anomalies in the relationship between translation difficulty and MT quality which, upon closer inspection, appeared to be caused by inaccurate human judgement data. In other words, the linguistic features we evaluated might also potentially be used to identify gaming problems in human judgement of MT quality. Finally, the fact that we found a systematic relationship between translation quality and translation difficulty in terms of word senses suggests a high potential reward from developing better sense disambiguation algorithms in MT systems.

Finally, there are several areas in which our analysis could be fruitfully extended, two of which we discuss below. First, while our evidence suggests a negative relationship between translation difficulty features of the source text and the quality of machine translation, the negative correlation is weak. One possible reason is that different linguistic features of the source text are likely to be associated with different types of translation errors, whereas the human judges’ scores of MT quality in the WMT data may reflect various types of translation errors at once. Thus, a further analysis of error types, distinguishing between errors associated with accuracy (such as mistranslation) and fluency (such as incorrect word order), as examined by, for example, Carl and Băez (2019), would likely provide a better understanding of the relationship between source text translation difficulty features and MT quality. Second, information about translation difficulty could be useful for automatic evaluation metrics, so one direction for future work is to investigate incorporating these linguistic features into existing metrics and/or developing new ones. We leave these for future research.