Machine translation systems and quality assessment: a systematic review

In today's globalised context, language barriers can still be an obstacle to accessing information. It is sometimes impossible to satisfy the demand for translation by relying only on human translators; therefore, tools such as Machine Translation (MT) are gaining popularity due to their potential to overcome this problem. Consequently, research in this field is constantly growing and new MT paradigms are emerging. In this paper, a systematic literature review has been carried out in order to identify which MT systems are currently most employed, their architecture, the quality assessment procedures applied to determine how they work, and which of these systems offer the best results. The study is focused on the specialised literature produced by translation experts, linguists, and specialists in related fields that include the English–Spanish language combination. Research findings show that neural MT is the predominant paradigm in the current MT scenario, with Google Translator being the most used system. Moreover, most of the analysed works used one type of evaluation (either automatic or human) to assess machine translation, and only 22% of the works combined these two types of evaluation. However, more than half of the works included error classification and analysis, an essential aspect for identifying flaws and improving the performance of MT systems.


Introduction
Language barriers can be an obstacle to accessing information in today's globalised context. Such is the abundance of information generated that it is sometimes impossible to satisfy the demand for translations by relying solely on professional human translators (Lagarda et al., 2015; Way, 2018). One of the implications of this situation is the growing demand for tools that provide different types of audiences with multilingual access to information. Machine translation (MT) is therefore emerging as one of the resources with the greatest potential for solving this problem and has been a point of focus both in research and in professional settings. Consequently, new MT paradigms and systems frequently emerge that could also integrate other resources such as translation memories or terminology databases to optimise the effectiveness of the professional translation process (Koponen, 2016).
One of the contested issues in MT, however, is the quality of the output it produces, which is generally inferior to that of professional human translations. In this regard, measuring the quality of an MT system can present great difficulties, given that in the majority of cases there is not just one translation of an original text that may be considered correct (Mauser et al., 2008; Shaw & Gros, 2007). Despite this, it should be possible to determine objectively how well an MT system works and what impact it has on the workflow of professional translators. This will require both automated and human metrics, which also need to take into account the human post-editing that is usually necessary for MT. Moreover, the annotation and classification of translation errors is fundamental for contributing to the improvement of MT systems, for understanding the criteria of human metrics (given that this type of assessment has an element of subjectivity) and for optimising the post-editing process (Costa et al., 2015; Popović, 2018).
The aim of this study is to identify which MT systems are currently most employed, their architecture, the quality assessment procedures applied to determine how they work, and which of these systems offer the best results. The methodology is based on a systematic review of the specialised literature produced by translation experts, linguists and specialists in related fields. Thus, the approach taken in this study seeks to complement others that are frequently centred on the computational sciences. We start from the consideration that, in order to determine translation quality, it is essential to incorporate the perspective of experts in translation or in areas related to language study, because human evaluation and error annotation are extremely relevant when measuring MT quality, and both processes must be carried out by evaluators trained in the field of translation.
The state of the art is presented below, including the evolution that machine translation has undergone and the main proposals concerning its assessment. The next section details the methodology employed for the systematic literature review, specifying both the sample selection process and its analysis. The results obtained are then presented and discussed. Lastly, the conclusions that can be drawn from the study are set out.

Evolution of MT
MT has traditionally followed two different approaches: the rule-based approach (RBMT, Rule-Based Machine Translation) and the corpus-based approach (Hutchins, 2007). Nevertheless, the last few years have seen the development of new architectures, giving rise to hybrid approaches and, most recently, neural MT (Castilho et al., 2017; España-Bonet & Costa-jussà, 2016).
RBMT systems use bilingual and monolingual dictionaries, grammars and transfer rules to create translations (Castilho et al., 2017; España-Bonet & Costa-jussà, 2016). The problem with these systems is that they are extremely costly to maintain and update and, furthermore, due to language ambiguity they can run into problems when translating, for example, idiomatic expressions (Charoenpornsawat et al., 2002). At the end of the 1980s, corpus-based systems began to gain in popularity (Hutchins, 2007). These systems employ bilingual corpora of parallel texts to create translations (Hutchins, 1995). Corpus-based systems are divided into statistical MT systems (SMT) and example-based systems (EBMT); however, both approaches converge in many respects, and isolating the distinctive characteristics of each one is very complicated (Hutchins, 2007). Moreover, the statistical approach was the predominant model until the recent emergence of neural MT systems (Bojar et al., 2015; España-Bonet & Costa-jussà, 2016; Hutchins, 2007). The advantage of statistical systems over RBMT systems is their solid performance when selecting the lexicon, especially if focusing on a thematic domain, and the little human effort they require in order to be trained automatically (Hutchins, 2007; Koehn, 2010). However, they can sometimes produce translations that are badly structured or contain grammatical errors, to which must be added the difficulty of finding corpora for certain thematic domains or language pairs (España-Bonet & Costa-jussà, 2016; Habash et al., 2009).
Hybrid approaches arose with the objective of overcoming the problems of RBMT and SMT systems and combining the advantages of both, in order to improve translation quality and precision (Hunsicker et al., 2012; Tambouratzis et al., 2014; Thurmair, 2009). The hybrid approach can be implemented in different ways and, generally speaking, a distinction can be made between architectures with an RBMT system at their core and those with an SMT system at their core. Thus, in some cases the output of an RBMT system is adjusted and corrected using statistical information, while in others rules are employed to process both the input and the output of an SMT system (España-Bonet & Costa-jussà, 2016).
Neural MT currently dominates the machine translation landscape. This kind of MT "attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation" (Bahdanau et al., 2015, p. 1). These systems use neural networks to create translations by means of a recurrent neural architecture based on the encoder-decoder model, in which the encoder reads and encodes the source sentence into a fixed-length vector while the decoder produces a translation output from the encoded vector (Bahdanau et al., 2015; Cho et al., 2014). Consequently, this architecture is simpler than previous paradigms, given that it uses fewer components and processing steps. Moreover, these systems require much less memory than SMT and allow human and data resources to be used more efficiently than RBMT (Bentivogli et al., 2016; Cho et al., 2014). Such has been the success of these models that the main MT companies (Google, Systran, Microsoft, etc.) have already integrated them into their machine translators.
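The encoder-decoder idea described above can be illustrated with a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the vocabulary size, the vector dimensions, the token ids and the random, untrained weights are hypothetical, so the decoded output is meaningless. The only point is that source sentences of any length are compressed into a vector of the same fixed size, from which the decoder generates tokens one at a time.

```python
import math
import random

# Hypothetical toy dimensions and special token ids (not from the reviewed studies).
VOCAB, EMBED, HIDDEN = 6, 4, 5
BOS, EOS = 0, 1

rng = random.Random(0)


def make_matrix(rows, cols):
    """Random, untrained weights: this sketch shows data flow, not learning."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


EMB = make_matrix(VOCAB, EMBED)     # one embedding vector per token id
W_IN = make_matrix(HIDDEN, EMBED)   # input-to-hidden weights
W_H = make_matrix(HIDDEN, HIDDEN)   # hidden-to-hidden (recurrent) weights
W_OUT = make_matrix(VOCAB, HIDDEN)  # hidden-to-vocabulary scores


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(token_id, hidden):
    """One recurrent step: combine the current token with the previous state."""
    from_input = matvec(W_IN, EMB[token_id])
    from_state = matvec(W_H, hidden)
    return [math.tanh(a + b) for a, b in zip(from_input, from_state)]


def encode(source_ids):
    """Read the whole source sentence into one fixed-length vector."""
    hidden = [0.0] * HIDDEN
    for token_id in source_ids:
        hidden = rnn_step(token_id, hidden)
    return hidden


def decode(context, max_len=10):
    """Greedily emit target tokens starting from the encoded vector."""
    hidden, token_id, output = context, BOS, []
    for _ in range(max_len):
        hidden = rnn_step(token_id, hidden)
        scores = matvec(W_OUT, hidden)
        token_id = max(range(VOCAB), key=scores.__getitem__)
        if token_id == EOS:
            break
        output.append(token_id)
    return output
```

Calling `encode([2, 3])` and `encode([2, 3, 4, 5, 2, 3, 4])` both return a list of `HIDDEN` numbers, which is the fixed-length bottleneck the paragraph refers to. A real system would learn the weights from parallel corpora rather than drawing them at random.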

MT quality assessment
It is essential to measure the quality of an MT system in order to improve how it performs. There is, however, a great lack of consensus and standardisation in relation to translation quality assessment, both human and machine, given the complicated cognitive, linguistic, social, cultural and technical process this entails (Castilho et al., 2018). According to House (2014), translation quality assessment means constantly moving back and forth between a macro-analytical approach, in which questions of ideology, function, genre or register are considered, and a micro-analytical one, in which the value of collocations and individual linguistic units is considered. Nevertheless, it should be taken into account that these approaches can differ enormously according to the individuals, groups or contexts in which quality is assessed. Thus, quality assessment in the industry is normally focused on the final product or customer, whereas in the field of research the purpose can be to demonstrate significant improvements over prior studies or different translation processes (Castilho et al., 2018).
Taking into consideration this panorama, together with the difficulty and lack of consensus regarding MT quality assessment, a general distinction can be made between human (or manual) and automated metrics. It is worth mentioning, though, that there are other ways to evaluate quality that focus on the human revision process rather than on the translation output, for example by measuring the post-editing effort in temporal, technical and cognitive terms.

Automated and human metrics
Automated metrics generally compare the output of an MT system with one or more reference translations (Castilho et al., 2018; Han, 2016). One of the first metrics used, Word Error Rate (WER), was based on the Levenshtein or edit distance (Levenshtein, 1966; Nießen et al., 2000). This measurement does not allow for the reordering of words, and substitutions, deletions and insertions are weighted equally. The number of edit operations is divided by the number of words in the reference translation (Castilho et al., 2018; Han, 2016; Mauser et al., 2008). The PER (Position-Independent Word Error Rate, Tillmann et al., 1997) and TER (Translation Error Rate, Snover et al., 2006) metrics attempt to solve the problem created by WER not allowing the reordering of words. Thus, PER compares the words in the two sentences without taking into account their order, and TER counts the reordering of words as a further edit (Han, 2016; Mauser et al., 2008; Nießen et al., 2000; Snover et al., 2006). The most popular metric is Bilingual Evaluation Understudy (BLEU), a precision measurement carried out at the level of n-grams (contiguous sequences of n words). It employs a modified precision that takes into account the maximum number of times each n-gram appears in the reference translation and applies a brevity penalty that is incorporated into the calculation of the score (Papineni et al., 2002). This measurement became very popular as it showed good correlations with human evaluations, and its usage spread across different MT evaluation workshops (Castilho et al., 2018). There are also other precision-centred metrics such as NIST (Doddington, 2002), ROUGE (Lin & Hovy, 2003), F-measure (Turian et al., 2003) and METEOR (Banerjee & Lavie, 2005), amongst others.
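The following is a minimal sketch of two of the calculations described above, written for a single hypothesis and a single reference and assuming simple whitespace tokenisation (both simplifications are assumptions, as are the function names). It computes WER from the word-level edit distance, and the two building blocks of BLEU: the modified (clipped) n-gram precision and the brevity penalty. It is not a full BLEU implementation, which combines the clipped precisions for several n-gram orders and multiplies the result by the brevity penalty.

```python
import math
from collections import Counter


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance. Substitutions, deletions and
    insertions each cost 1; reordering is not modelled."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,           # delete a hypothesis word
                            curr[j - 1] + 1,       # insert a reference word
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def wer(hypothesis, reference):
    """Edit operations divided by the number of reference words."""
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / len(ref)


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, reference, n=1):
    """Clipped n-gram precision: each hypothesis n-gram is credited at most
    as many times as it appears in the reference."""
    hyp_counts = ngram_counts(hypothesis.split(), n)
    ref_counts = ngram_counts(reference.split(), n)
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(hypothesis, reference):
    """Penalise hypotheses that are shorter than the reference."""
    c, r = len(hypothesis.split()), len(reference.split())
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)
```

For the reference "the cat is on the mat", the hypothesis "the cat sat on mat" needs one substitution and one insertion, so its WER is 2/6. The degenerate hypothesis "the the the the" shows why clipping matters: its plain unigram precision would be 1.0, whereas the modified precision is 0.5 because "the" occurs only twice in the reference.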
Human evaluation is usually carried out in terms of adequacy and fluency. Adequacy evaluates semantic quality, that is, whether or not the information has been correctly conveyed, which requires comparison with reference translations (monolingual) or with the original text (bilingual). Fluency, for its part, evaluates syntactic quality; in this case, comparison with the original text is unnecessary and evaluation is monolingual. Moreover, other methods can be employed to evaluate the readability, comprehension, usability and acceptability of translations (Castilho et al., 2018). A number of instruments can be employed to measure these aspects, such as Likert-type ordinal scales, rankings (either selecting the best translation from several, or ordering the different options from best to worst according to specific criteria) or other methods such as error correction or gap-filling tasks, the latter not involving direct judgement from the evaluator (Castilho et al., 2018; Chatzikoumi, 2020). Further, it is important to mention that error identification, annotation and classification is another widely used human evaluation method, which is looked at in more detail in the following section.
There are pros and cons to both human and automatic evaluation. On the one hand, automatic evaluation requires less human effort and is more objective and less costly than human evaluation. Nevertheless, it must be taken into account that the majority of automated metrics require reference translations created by humans and, in many cases, the quality of these translations is assumed but not verified, which could introduce an element of subjectivity and variability (Castilho et al., 2018). In addition, these metrics evaluate a translation in relation to its similarity to the reference translations, despite there being no single correct translation, and their capacity to evaluate syntactic and semantic equivalence is extremely limited (Castilho et al., 2018; Han, 2016). Finally, the large majority of these metrics were designed for systems with architectures that are now outdated, so they are sometimes ill-suited to current paradigms (Way, 2018).
In contrast, although the results of human metrics are considered to be more reliable than those provided by automated metrics, the disadvantages of this method include large demands on time and resources and a lack of reproducibility (Han, 2016). Furthermore, human evaluators (or annotators) must fulfil certain criteria in order to assure the reliability of the results. In the same vein, evaluators need training, evaluation criteria and familiarity with the subject area of the texts. Additionally, the ideal procedure includes more than one evaluator and the calculation of inter-annotator agreement (Chatzikoumi, 2020; Han, 2016). It is worth mentioning that some authors have addressed the time and cost drawbacks of human evaluation by means of crowdsourcing (Graham et al., 2013, 2015). In the face of this situation, the combination of both human and automatic metrics appears to be one of the most reliable methods for MT evaluation (Chatzikoumi, 2020).

Classification and analysis of MT errors
It is frequently necessary to investigate the strengths and weaknesses of MT systems and the errors they produce, along with their impact on the post-editing process. In this regard it is very difficult to find a relationship between these aspects and the quality scores obtained by both automated and human metrics (Popović, 2018). Thus, the identification, classification and analysis of errors are fundamental to determining the failures of an MT system and being able to improve its performance.
Error classification can be implemented automatically, manually or using combined methods. These methods have advantages and drawbacks in the same way as human and automated metrics do. Manual error classification, as in the case of human metrics, is costly, requires a lot of time and effort and, additionally, normally presents problems regarding inter-annotator agreement. In contrast, automatic methods can overcome these problems, but they give rise to confusion between the different types of errors, especially for very detailed typologies, and they also require human reference translations (Popović, 2018).
As with quality assessment, there has traditionally been a lack of standardisation for MT error analysis and classification (Lommel, 2018). In this regard, many authors (Costa-Jussà & Farrús, 2015; Costa et al., 2015; Farrús et al., 2010; Gutiérrez-Artacho et al., 2019; Krings, 2001; Laurian, 1984; Schäfer, 2003; Vilar et al., 2006) have proposed different typologies and classifications for errors related to MT and, generally speaking, the majority of these distinguish errors at different levels (spelling, vocabulary, grammar, discourse), divided into subcategories that include, for example, errors relating to agreement, style, confusion in word meaning with various exceptions, etc. Nevertheless, the last decade has seen the appearance of projects seeking to standardise these methods with the objective of facilitating adaptation to different tasks and language pairs and reducing effort and inconsistencies when developing an error typology (Popović, 2018). This is the case for the Multidimensional Quality Metrics (MQM) framework, created by the QTLaunchPad project, and the Dynamic Quality Framework (DQF), developed by the Translation Automation User Society (TAUS), which started independently and were integrated in 2014 into the "DQF/MQM Error Typology" (Görög, 2014; Lommel, 2018).

Methodology
The research methodology is based on the systematic review of specialised literature from 2016 onwards. This methodology consists of analysing scientific publications using explicit and rigorous methods that allow the results to be summarised, with the aim of responding to specific research questions (Gough et al., 2012).
The study undertaken endeavours to respond to the following research questions:
1. What MT systems that include the English-Spanish language combination are the most analysed in the specialised literature?
2. What procedures are being applied to measure MT quality in the field of translation?
3. What MT systems are obtaining the best results?
The publications that comprise the study sample, which formed the basis for the analysis carried out, originate from the different bibliographical databases that were queried. The methodology therefore has different stages, and the procedure applied is set out below.

Database selection
The study sample was obtained from different bibliographical databases, both general and specialised. The typology and number of databases queried ensured an adequate representation of articles on MT published by translation specialists, linguists and experts in related fields. Searches were carried out on 10 specialised databases:
- Dialnet is a bibliographical database focused on Hispanic scientific literature in the spheres of Human, Legal and Social Sciences.
- Hispanic American Periodical Index (HAPI) includes bibliographical references on political, economic, social, art and humanities subjects in scientific publications from Latin America and the Caribbean from 1960 onwards.
- Humanities Full Text includes complete texts from the Humanities field.
- InDICEs is a bibliographical resource that compiles research articles published in Spanish scientific journals.
- International Bibliography of the Social Sciences (IBSS)-Proquest includes bibliographical references from the field of Social Sciences from 1951 onwards.
- Library and Information Science Abstracts (LISA)-Proquest includes bibliographical references from Library and Information Science and other related fields.
- Library, Information Science and Technology Abstracts (LISTA) is a bibliographical database developed by EBSCO that includes references from the fields of Library and Information Science.
- Linguistics Collection-Proquest also includes the Linguistics and Language Behavior Abstracts (LLBA) collection and compiles bibliographical references related to all aspects of the study of language.
- MLA International Bibliography is a bibliographical database developed by EBSCO that includes references from all fields relating to modern languages and literature.
- Social Science Database-Proquest is a database that includes the full text of scientific and academic documents relating to the Social Sciences disciplines.
The search was also carried out on two of the main multidisciplinary databases:
- Scopus is a database edited by Elsevier that includes bibliographical references from scientific literature belonging to all fields of science, including Social Sciences, Art and Humanities. Scopus is, according to the information posted in its official blog, "the largest abstract and citation database of peer-reviewed literature".
- Web of Science is a platform managed by Clarivate Analytics that includes references from the main scientific publications in all fields of knowledge from 1945 onwards.

Keywords and search string
Taking into account the research questions and objectives set by our study, the main search terms that best represent the concepts involved in our analysis were identified, both in Spanish and English (Table 1). The search string was then created and adapted to the characteristics of each of the databases with the objective of retrieving as many relevant documents as possible (Table 2). Table 3 shows the number of documents retrieved from the different databases queried. The search string used returned 318 results and allowed Scopus to be identified as the database with the most exhaustive coverage of the subject of our study.
The results were filtered by applying the following inclusion criteria: (i) publication language, (ii) publication date, (iii) document type and (iv) speciality of the authors.
Thus, publications in Spanish and English were considered, in line with the language combination contemplated in the objectives of this work. Moreover, as MT technologies are in constant development, only recently published documents (2016 onwards) were included to guarantee that the MT systems were up to date. Reviews and essays were also rejected, and only empirical research papers were selected. Lastly, the speciality of the authors was taken into account, in that at least one author was required to be from the field of languages (translation, linguistics or similar), given that human evaluation and error annotation should be carried out by evaluators with specific training in the field of translation. The application of these criteria resulted in 19 documents that were all relevant for our study and permitted the identification of the most prominent authors in relation to the subject in question. Hence, a second query on authors in Scopus allowed us to complete the initial sample with a further 8 documents that refer to languages close to Spanish (Portuguese and Catalan) and could enrich the study. The final sample therefore comprised 27 documents (Fig. 1).
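The four inclusion criteria can be read as a simple filter over the retrieved records. The sketch below is only an illustration of that logic: the `Record` fields, the category labels and the sample records are hypothetical and do not reproduce the data or the procedure actually used in the review, where screening was carried out on the database results.

```python
from dataclasses import dataclass

# Hypothetical encodings of the four criteria described in the text.
ACCEPTED_LANGUAGES = {"en", "es"}                 # (i) publication language
MIN_YEAR = 2016                                   # (ii) publication date
ACCEPTED_DOC_TYPES = {"empirical"}                # (iii) document type
LANGUAGE_FIELDS = {"translation", "linguistics"}  # (iv) speciality of the authors


@dataclass(frozen=True)
class Record:
    title: str
    language: str
    year: int
    doc_type: str
    author_fields: tuple  # one field label per author


def meets_inclusion_criteria(record):
    """A record is kept only if it satisfies all four criteria."""
    return (
        record.language in ACCEPTED_LANGUAGES
        and record.year >= MIN_YEAR
        and record.doc_type in ACCEPTED_DOC_TYPES
        # At least one author must come from a language-related field.
        and any(field in LANGUAGE_FIELDS for field in record.author_fields)
    )


# Invented records, used only to exercise the filter.
retrieved = [
    Record("A", "en", 2018, "empirical", ("translation", "computer science")),
    Record("B", "en", 2014, "empirical", ("linguistics",)),        # too old
    Record("C", "es", 2019, "review", ("translation",)),           # not empirical
    Record("D", "en", 2020, "empirical", ("computer science",)),   # no language specialist
]
selected = [r for r in retrieved if meets_inclusion_criteria(r)]
```

With these invented records only record "A" passes, since each of the others fails exactly one criterion.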

Qualitative analysis process
The study sample documents were ordered alphabetically and identified by an ID composed of the word Item followed by its corresponding number, e.g., Item 12 (Annex 1). All the documents were then stored in a bibliographical reference manager (Mendeley) and it was verified that their metadata were correct. The documents were then exported together with their metadata in order to facilitate a qualitative analysis of the content via the NVivo software package (Release 1.0). Content analysis allows for the application of systematic and objective procedures for describing the content of messages (Bardin, 1996; Mayring, 2000).
To afford greater rigour to the analysis, two researchers with experience in qualitative research using the NVivo package identified and defined the categories. As this was a content analysis, it was determined that the categories should be mutually exclusive, as they were required to be formed by stable units of meaning (Trigueros-Cervantes et al., 2018; Weber, 1990). The objective of this initial coding was to identify what systems and architectures, evaluation measurements and MT error analysis processes were employed in the different studies, and to determine whether there had been comparisons between MT systems or architectures in any of the studies analysed. Following consultation with experts in qualitative research, a representative sample of the documents was selected (approximately 20%) and independently coded by both researchers to identify the underlying categories. After agreement was reached, the definitive category system was created (Fig. 2).
All of the documents were subsequently read in depth, analysed and coded into categories or nodes by both researchers, in accordance with the previously established category system. The coding was carried out in two independent NVivo projects which, when merged, allowed the coding to be compared by means of the Kappa index, which measures inter-annotator agreement. In this regard, as shown by Fig. 3, there is a very high level of agreement in the large majority of categories. Those categories with a percentage of disagreement over 10% were reviewed and agreement was reached on their coding. The high percentage of agreement is due to the use of very specific concepts and exclusive categories, as recommended for this type of analysis.
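As an illustration of the Kappa index mentioned above, the following sketch computes Cohen's kappa for two coders who each decide, for a series of text units, whether a given category applies. The unit of comparison (one decision per unit) and the example codings are assumptions made for illustration; NVivo's coding comparison reports its own figures per category and source, and its exact calculation is not reproduced here.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance given each coder's label frequencies."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of units")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codings: 1 = unit coded at the category, 0 = not coded.
researcher_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
researcher_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(researcher_1, researcher_2)
```

In this invented example the coders agree on 9 of 10 units (90%), chance agreement is 50%, and kappa is therefore (0.9 - 0.5) / (1 - 0.5) = 0.8. This shows why kappa is more conservative than raw percentage agreement.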
Once the final categorisation was complete, different coding matrices were generated to carry out a meticulous analysis of all of the coded references in the different categories comprising the object of the study. The use of these matrices permits the exploration of the relationships between different categories and the studies analysed. These matrices were subsequently exported for the creation of tables and graphs in MS Excel.
Fig. 2 Category system

Clarifications on the coding
Regarding the MT systems used in the empirical studies from the analysed publications, it should be pointed out that in the category design they were grouped by type. As Fig. 2 shows, different resources belonging to the same company were grouped in a single category. This is the case for the categories Google (or Google API/products) and Microsoft (or Microsoft API/products), which include not only these companies' machine translators but also other resources that they offer as application programming interfaces (APIs). All those systems that were specifically trained with toolkits such as OpenNMT, Moses and Nematus to carry out the studies were grouped into the Ad hoc trained systems category. Finally, those MT systems with coding frequencies under 2 were included in the Other category; this is the case for the DeepL, Reverso and Yandex systems.
Regarding the MT quality evaluation measurements, three large categories were created, which were then divided into subcategories. On the one hand, the automated metrics (BLEU, METEOR, TER, etc.) were classified in the Automatic evaluation category and the manual methods such as questionnaires and interviews were classified in Human evaluation. On the other hand, the Other category was created to include those measurements not directly related to MT quality such as, for example, the post-editing effort (technical, temporal and cognitive) or measurements orientated towards information retrieval.
Finally, for the typologies and error classifications, despite the fact that the MQM and DQF reference frameworks were integrated into a combined typology in 2014, the distinction between the two has been maintained, as the analysed works referenced them individually. Furthermore, the DQF Taus category includes those studies employing the DQF platform, even if they did not expressly mention the error typology, as the aim was to distinguish between those works that used standardised methods and those that did not.
Fig. 3 Inter-annotator agreement

Machine Translation architectures and systems employed
In terms of the architectures employed (Fig. 4), close to 89% of the works (24) used some type of MT system. The 3 remaining studies correspond to Items 6, 7 and 8, and focus on the description and validation of a new MT measurement.
A total of 37 architecture uses were recorded across the 24 articles, indicating that more than one type of architecture was studied in some of them. Thus, 45% of the studies analysed (Items 2, 3, 4, 5, 10, 16, 17, 20, 21, 22 and 27) used two different types of architecture, whereas only a single study (Item 18) used three types. The most used architectures were statistical MT, in 66.7% of the works analysed, and neural MT, in 62.5%. In addition, 41.7% of these works combined both architectures. The use of rule-based or hybrid architectures drops to 12.5% of works in both cases. Finally, it is worth mentioning that in the case of Item 22 it was not possible to accurately define what type of architecture the systems used in the analysis employed. Although the publication date was taken as a reference to determine it, the date on which the study was conducted is not stated, so the architecture of the systems may have changed between the analysis and publication dates.
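The figures in this paragraph can be cross-checked with a few lines of arithmetic. In the sketch below, the absolute counts per architecture are not stated in the text: they are recovered by rounding the reported percentages over the 24 studies, which is an assumption. Under that assumption, both ways of counting (per architecture and per study) give the same total of 37.

```python
TOTAL_SAMPLE = 27      # documents in the final sample
STUDIES_WITH_MT = 24   # works that used some type of MT system

# Percentages reported in the text, over the 24 studies that used MT systems.
reported_share = {
    "statistical": 66.7,
    "neural": 62.5,
    "rule-based": 12.5,
    "hybrid": 12.5,
}

# Assumption: absolute counts are recovered by rounding the percentages.
studies_per_architecture = {
    name: round(STUDIES_WITH_MT * share / 100)
    for name, share in reported_share.items()
}

# Count by architecture: each study counts once per architecture it used.
uses_by_architecture = sum(studies_per_architecture.values())

# Count by study: 11 items are listed as using two types and 1 as using three,
# so the remaining studies used a single type.
two_types, three_types = 11, 1
one_type = STUDIES_WITH_MT - two_types - three_types
uses_by_study = one_type * 1 + two_types * 2 + three_types * 3

share_with_mt = round(100 * STUDIES_WITH_MT / TOTAL_SAMPLE, 1)
share_two_types = round(100 * two_types / STUDIES_WITH_MT, 1)
```

Here `studies_per_architecture` comes out as 16, 15, 3 and 3 studies, `uses_by_architecture` and `uses_by_study` both equal 37, `share_with_mt` is 88.9 and `share_two_types` is 45.8, which the text reports as 45%.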
Regarding the MT systems used, the Google translator (or other products offered by Google) is the MT system employed by over half of the articles. This is followed by the translators and products offered by Microsoft and by MT systems that were specifically trained via Moses, Nematus and OpenNMT, each accounting for 25% of the analysed works.

Evaluation metrics
For the evaluation metrics employed (Fig. 5), again, close to 89% of the works analysed used some type of evaluation metric, either automatic (Aut.), human (Hum.) or other (Oth.) type of metric (measurements related to post-editing effort, information retrieval, etc.). In contrast, Items 7, 13 and 26 did not employ any type of evaluation metric for MT quality, although they did employ error detection and classification methods. Of the studies that used evaluation metrics, 50% used only one type. Of these 12 works, five employed automated metrics, five human metrics and two other types of measurement. The other 50% employed two or three types of evaluation. Of these, 10 studies used two types of evaluation metric: four combined automated and human metrics, one combined automated metrics with another type of evaluation and five combined human metrics with another type. Finally, only two of the works analysed employed all three types of evaluation metric: automated, human and other, in this case focused on the post-editing effort.

Automated evaluation metrics
Automated evaluation metrics (Fig. 6) were employed by 44.4% of the works analysed. On average, these works used three automated evaluation metrics, with Item 12 being the study that used the most (7 metrics) and Items 6, 8 and 19 being those that used the fewest (1 metric). The most used metric is BLEU, employed by nine of the 12 works, followed by TER (eight works) and METEOR (seven works); 50% of these works employed these three metrics combined or together with others (Items 4, 10, 12, 14, 15 and 18).

Human evaluation metrics
Over 55% of the works analysed (15) employed human evaluation metrics (Fig. 7). Of these 15, 86.7% used closed questionnaires with Likert-type scales, either as a single evaluation method (Items 5, 9, 12, 15 and 20) or combined with others such as ranking or the assignment of scores, pre- and post-tests, or qualitative methods such as open questionnaires or interviews (Items 3, 4, 10, 11, 16, 21, 23 and 24).

Classification and analysis of errors
Regarding the classification and analysis of errors (  4, 7, 13 and 20) developed their own methods of error annotation or typologies.

Combination of evaluation metrics and error analysis
Of the 14 works that included error analysis, all but three (Items 7, 13 and 26) complemented these analyses with automatic, human or other evaluation types.
Figure 8 shows that the works that employed the TAUS DQF platform, with the exception of Item 26, employed at least human evaluation and other types of measurement. In contrast, all of the articles that employed the MQM framework complemented the error analysis with MT evaluation; in this case, 4 of the 5 works that used MQM employed automated metrics together with error analysis. Finally, 3 of the 5 works that developed their own error analyses or typologies combined them with other types of evaluation metrics.

Regarding the comparison between different systems (Table 5), 59.3% of the studies made comparisons between systems or architectures. Of these, 62.5% (Items 3, 4, 5, 10, 16, 17, 20, 21, 25 and 27) established that neural MT was better than statistical, rule-based or hybrid systems. Items 4, 5, 16, 25 and 27, however, offer a number of clarifications regarding the results of the comparison of neural MT with other architectures. In Item 4, following a description of the results of three different studies, it is concluded that neural MT obtains better results than statistical MT with automated metrics; the results of human metrics, however, are not so evident. Item 5 shows that, despite the good general results of neural MT compared to statistical MT, the outcomes for certain error categories and for post-editing time and effort are not so evident. This is similar to what is described in Items 16, 25 and 27, which highlight a shorter edit distance for neural than for statistical MT; however, their post-editing time is greater. Finally, Item 27 draws attention to the fact that the results for fluency, adequacy and productivity were greater in neural than in statistical translation.

Conversely, 18.8% of the works (Items 9, 18 and 22) established that statistical MT was better than the rule-based or hybrid kind. It is worth pointing out that in these three cases no comparisons with neural technology systems were made.
As regards MT systems, Items 12 and 13 determined that ProMT obtained better results than another hybrid technology translator, Systran, in terms of automatic and human evaluation, and in error analysis. For their part, Items 1, 9 and 22 highlight the performance of the Google translator compared to other systems. In the case of Items 9 and 22, it should be noted that Google had not yet adopted the neural system in its machine translation engine.
Finally, given that the majority of the works highlighted the results of neural MT against statistical systems, the type of evaluation employed in those studies that determined that neural MT was better than other architectures was also analysed. In Fig. 9 it can be observed that all 10 studies employed human evaluation: in 3 of them (Items 20, 21 and 27) this was the only evaluation method employed; in 4 (Items 3, 16, 17 and 25) it was combined with another type of evaluation measurement, such as post-editing effort or measurements related to information retrieval effectiveness; in 1 (Item 5) automatic evaluation was combined with human evaluation; and, lastly, in 2 cases (Items 4 and 10) the three evaluation methods were employed.

Fig. 9 Types of evaluation employed to determine that NMT obtained better performance
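The breakdown described for Fig. 9 can be expressed as a small tally. The Python sketch below encodes the item numbers and evaluation types exactly as listed in the paragraph above; the dictionary layout and variable names are hypothetical and serve only to make the counts easy to verify.

```python
from collections import Counter

# Evaluation types used by each study that found neural MT to be better,
# transcribed from the text (item number -> set of evaluation types).
evaluation_by_item = {
    20: {"human"}, 21: {"human"}, 27: {"human"},
    3: {"human", "other"}, 16: {"human", "other"},
    17: {"human", "other"}, 25: {"human", "other"},
    5: {"automatic", "human"},
    4: {"automatic", "human", "other"},
    10: {"automatic", "human", "other"},
}

# Count how many studies used each combination of evaluation types.
combinations = Counter(frozenset(kinds) for kinds in evaluation_by_item.values())

# Check whether every study involved human evaluation, and whether any
# study relied on automatic evaluation alone.
all_use_human = all("human" in kinds for kinds in evaluation_by_item.values())
automatic_only = [item for item, kinds in evaluation_by_item.items()
                  if kinds == {"automatic"}]

for kinds, count in combinations.items():
    print(sorted(kinds), count)
print(all_use_human, automatic_only)
```

The tally makes explicit that none of the ten studies relied on automatic evaluation alone.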

Discussion and conclusions
Following the systematic review of the publications that make up our study sample it is observed, firstly, that neural MT is the predominant model in the current MT scenario. Thus, despite statistical MT being employed in one more study than neural MT, when both architectures were compared the latter obtained better results than the former in all of the studies analysed. These results are in line with those obtained in one of the main MT evaluation forums (WMT 2015), which confirmed the better performance of neural MT compared to the statistical model that had been predominant up to that point (Bojar et al., 2015). In the same vein, despite certain caveats regarding neural MT performance in relation to word order or the treatment of long sentences, other later studies confirm this change of paradigm (Bentivogli et al., 2016; Toral & Sánchez-Cartagena, 2017). Moreover, the adoption of neural technologies by the main MT companies such as Google, Systran or Microsoft, among others, confirms that the predominance of neural MT in today's MT scenario is beyond doubt.

In relation to the systems employed, Google (or products or APIs offered by Google) is the most used MT system, followed by Microsoft or MT systems that were specifically trained for the objectives of the studies. In this regard, it is noteworthy that despite the current widespread adoption and popularity of DeepL (Schmitt, 2019), only one of these studies employed this machine translator, which, furthermore, registered a somewhat lower performance than Google. Therefore, it would be advisable to include DeepL in similar research and to compare its results with those of the currently predominant system: Google.
As far as the way of assessing MT is concerned, although the recommendation is to combine human and automated metrics to obtain the most reliable results possible (Chatzikoumi, 2020; Way, 2018), only 22% of the works analysed combined these two types of measurement. This shows that research in the MT field involving translation and language specialists is still scarce and that human evaluation requires a considerable investment of time and resources. It should be mentioned that 2 of these studies, as well as employing both types of evaluation, also utilised other metrics related to post-editing effort. The most used automatic metric is BLEU, which is to be expected given that it is the most popular automatic metric (Castilho et al., 2018), despite the suggestion by some authors that such metrics may not be adequate for measuring the performance of new neural MT systems, and despite the fact that this type of metric does not measure the quality of translations but rather their similarity to reference translations (Boitet et al., 2006; Castilho et al., 2018; Way, 2018). Thus, given that MT is a constantly evolving field in which new technologies may go beyond the current paradigms, further research is needed to develop new MT evaluation methods specifically adapted to these systems.
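The point that metrics of this kind measure similarity to a reference rather than translation quality can be illustrated with clipped n-gram precision, the core ingredient of BLEU. The Python sketch below is a simplification under stated assumptions: full BLEU combines the precisions for n = 1 to 4 through a geometric mean and applies a brevity penalty, which are omitted here, and the function names and example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n=1):
    """Share of hypothesis n-grams also found in the reference.

    Each n-gram is credited at most as often as it occurs in the reference.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / total


# Hypothetical sentences: one output changes the meaning but copies the
# reference wording; the other is a loose paraphrase of the reference.
reference = "the meeting was postponed until next week"
close_copy = "the meeting was postponed until next month"
paraphrase = "they delayed the meeting by seven days"

print(clipped_precision(close_copy, reference))
print(clipped_precision(paraphrase, reference))
```

In this constructed example, the output that alters the meaning ("month" instead of "week") scores far higher than the paraphrase, simply because it shares more words with the reference.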
As regards human evaluation, the majority of the studies employed Likert-type questionnaires and rankings; these results are likewise unsurprising, as they are the most habitual way of carrying out this type of assessment (Castilho et al., 2018; Chatzikoumi, 2020). Nevertheless, it is of note that 4 of the works included open questions or participant interviews, introducing a qualitative analysis approach that is somewhat unusual in this type of evaluation. This type of approach can be of real interest, even though its analysis requires considerable effort, as it often yields information that is richer and more extensive than that provided by the statistical analysis of questionnaire items.

Concerning the evaluation techniques employed in the papers that compared neural and statistical MT, it is worth mentioning that all of them involved human assessment, whether alone or along with other kinds of procedure. From this it can be concluded that, in the analysed publications, automatic evaluation by itself was not enough to determine which system worked better and that, consequently, human evaluation plays an essential role in determining whether neural techniques perform better than statistical ones. The situation changes when there is no comparison between systems or architectures: in this case, human assessment is included in fewer than half of the works. Given that the objective of these works was not to establish comparisons, human involvement does not seem essential to evaluate MT quality when only one system or architecture is involved.

Finally, it should be pointed out that around 52% of the works analysed included error classification and analysis, a fundamental aspect for identifying flaws and improving the performance of MT systems and, in addition, over half of these works carried out this analysis employing standardisation frameworks such as DQF and MQM. Although this is a notable number of works, given the paramount importance of error analysis for MT improvement (Popović, 2018), more research including this aspect is needed, especially for the EN-ES language pair, which necessarily entails the involvement of trained professionals from the translation and language fields.
The reduced sample size of this study could be its major limitation. However, this limitation is due to the selection criteria involved, wherein two restrictions were applied that affected the final sample substantially. One of these is the chosen language pair (English and Spanish) and the other, which possibly imposed a greater restriction, is the requirement that at least one of the authors belong to the field of translation or similar. Regardless, these results draw attention to the need to involve people with training in translation or related spheres in this type of study given that, as indicated earlier, the most reliable way of evaluating an MT system is by combining automatic and human methods and, as regards the implementation of the latter, there is a need for evaluators who are trained and who have wide knowledge of both languages involved and the translation process. Furthermore, these results could be the product of the traditional rejection of MT on the part of the language

1. Selection of bibliographical databases
2. Undertaking of bibliographical queries to determine the final sample
   a. Identification of search keywords (terms, synonyms, variants)
   b. Creation of the search string (Boolean operators)
   c. Filtering of results (document type, publication date, amongst others)
3. Analysis of documents from the sample with the NVivo software package (Release 1.0)

Fig. 1 Document search and selection process

Fig. 4 MT architectures and systems

Fig. 5 MT evaluation

Fig. 6 Automated evaluation metrics

Fig. 8 Combination of evaluation metrics and error classification and analysis

Table 1 Query keywords

Table 2 Search string
("machine translation" OR "automated translation" OR "automatic translation" OR "traducción automática") AND (quality OR error* OR evaluat* OR assess* OR evaluación OR calidad OR) AND ((Spanish AND English) OR (español AND inglés))