A roadmap to neural automatic post-editing: an empirical approach

Shterionov, Dimitar; Carmo, Félix do; Moorkens, Joss; Hossari, Murhaf; Wagner, Joachim; Paquin, Eric; Schmidtke, Dag; Groves, Declan; Way, Andy

doi:10.1007/s10590-020-09249-7

Open access
Published: 03 September 2020

Volume 34, pages 67–96, (2020)

Machine Translation

Dimitar Shterionov ORCID: orcid.org/0000-0001-6300-797X^1,4,
Félix do Carmo^2,4,
Joss Moorkens³,
Murhaf Hossari⁴,
Joachim Wagner⁴,
Eric Paquin⁴,
Dag Schmidtke⁵,
Declan Groves⁵ &
Andy Way⁴

Abstract

In a translation workflow, machine translation (MT) is almost always followed by a human post-editing step, where the raw MT output is corrected to meet required quality standards. To reduce the number of errors human translators need to correct, automatic post-editing (APE) methods have been developed and deployed in such workflows. With the advances in deep learning, neural APE (NPE) systems have outranked more traditional, statistical, ones. However, the plethora of options, variables and settings, as well as the relation between NPE performance and train/test data makes it difficult to select the most suitable approach for a given use case. In this article, we systematically analyse these different parameters with respect to NPE performance. We build an NPE “roadmap” to trace the different decision points and train a set of systems selecting different options through the roadmap. We also propose a novel approach for APE with data augmentation. We then analyse the performance of 15 of these systems and identify the best ones. In fact, the best systems are the ones that follow the newly-proposed method. The work presented in this article follows from a collaborative project between Microsoft and the ADAPT centre. The data provided by Microsoft originates from phrase-based statistical MT (PBSMT) systems employed in production. All tested NPE systems significantly increase the translation quality, proving the effectiveness of neural post-editing in the context of a commercial translation workflow that leverages PBSMT.

1 Introduction

Machine Translation (MT) is widely employed in industrial translation workflows. MT for dissemination is an intermediate step which generates a raw translation of a given source document or a sentence, followed by a post-editing step that ensures that the quality of the final translation meets required quality standards. Automatic Post-editing (APE) is an area of research aiming at exploring methods that apply editing operations on an MT output to produce a better translation and thus reduce the human effort in the translation workflow.

APE covers a wide range of post-editing approaches, from regular expressions applied on the MT output to post-editing simple error patterns, to deep learning techniques that can transform complete sentences, paragraphs or even documents into a more correct variant. Needless to say, while APE aims to reduce certain MT errors, it is up to the human translator to accept or further post-edit the output. In this article, we focus on APE with deep neural networks—neural APE or simply neural PE (NPE)—and the sentence-to-sentence post-editing case.

In the rest of this article we use the following abbreviations and notations:

SRC: segment(s) in the source language;
MT: the output of a non-specified MT system;
PE: the version of the MT segment(s) after post-editing by professional translators;
SMT: the output of an SMT system, usually the baseline;
NMT: the output of an NMT system;
NPE: the version of the MT segment(s) after post-editing with an NPE system;
TER(npe, pe), TER(smt, pe): the TER score between NPE or SMT (the hypothesis) and PE (the reference), where a human post-edited version of the machine translation output is used as the reference.^{Footnote 1}
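To make the TER(hyp, ref) notation concrete, the following is a minimal sketch of a simplified, word-level edit rate: the number of insertions, deletions and substitutions divided by the reference length. It rests on two assumptions that are ours and not part of the article: tokens are obtained by whitespace splitting, and block shifts, which full TER (Snover et al. 2006) also counts, are ignored. The function name `simple_ter` is hypothetical.

```python
def simple_ter(hyp: str, ref: str) -> float:
    """Word-level edit distance (no shifts) divided by the reference length.

    Assumption: whitespace tokenisation; real TER additionally models shifts.
    """
    h, r = hyp.split(), ref.split()
    # prev[j] = edit distance between the hyp words seen so far and the first j ref words
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, start=1):
        curr = [i]
        for j, rw in enumerate(r, start=1):
            cost = 0 if hw == rw else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / max(len(r), 1)
```

With this sketch, TER(smt, pe) would be approximated by `simple_ter(smt_segment, pe_segment)`; because shifts are not modelled, reordered segments would score worse than under full TER.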

1.1 Automatic post editing and its parallelism with machine translation

APE systems convert a segment e in the target language L2 to a corrected variant \(e'\) in the same language. The APE task can be seen as a monolingual translation task where the source and the target language are the same. As such, APE implementations are rather similar to MT systems and even employ similar methodologies. However, while in an MT scenario a system is trained on pairs of sentences (f, e) in two different languages, in an APE scenario the available data includes MT input and output, as well as the human post-edited variant of it. That is, an APE system is trained on triplets of sentences—\((f, e, e')\)—where \(e'\) is the post-edited variant of e. Such triplets (i) reveal the transformations of e into \(e'\) that should be learned by the APE, for it to correct automatically any new data and (ii) allow for a consistency check with the source sentence f. The learning process depends on the availability of enough triplets.

1.2 Data demands and data scarcity

While the collection of parallel data (for training MT engines, for example) has been an ongoing process since the beginning of SMT, APE is a recently-emerged approach. Thus, for many language pairs, enough parallel data is available. However, that is not the case for triplets with human post-edited data, which data-driven APE approaches require.

To mitigate this issue for the open development of APE systems, datasets of artificially generated triplets have been produced and made publicly available. Junczys-Dowmunt and Grundkiewicz (2016) describe one such set for the English → German language direction. It is generated via round-trip translation using two PBSMT engines, one for the German → English and another for the English → German language direction. The synthetic post-edit triplets are composed of the German source data as the post-edited data, the German → English translated data as the English source, and the round-trip translation output as the uncorrected MT data. Subsequently, the data is filtered according to TER to mimic the quality of the provided APE data. More recently, Negri et al. (2018) presented the eSCAPE corpora covering multiple language pairs. Their method is to translate freely-available data and use the target side as a human post-edited version of this translation, thus creating triplets of sentences.
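The round-trip procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline of Junczys-Dowmunt and Grundkiewicz (2016): the two translation engines and the TER scorer are passed in as plain callables (`back_translate`, `forward_translate` and `ter` are hypothetical names), and the TER-based filtering is reduced to a single upper threshold `max_ter`.

```python
from typing import Callable, Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (src, mt, pe)


def round_trip_triplets(
    target_sentences: Iterable[str],
    back_translate: Callable[[str], str],     # assumed: German -> English engine
    forward_translate: Callable[[str], str],  # assumed: English -> German engine
    ter: Callable[[str, str], float],         # assumed: TER(hypothesis, reference)
    max_ter: float,
) -> List[Triplet]:
    """Build synthetic (src, mt, pe) triplets from monolingual target-side text."""
    triplets = []
    for pe in target_sentences:
        src = back_translate(pe)     # the synthetic "source" sentence
        mt = forward_translate(src)  # round-trip output plays the uncorrected MT
        # Simplified filter: keep triplets whose MT is close enough to the "PE".
        if ter(mt, pe) <= max_ter:
            triplets.append((src, mt, pe))
    return triplets
```

The original German sentence takes the role of the post-edit because it is the only side of the triplet known to be human-written.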

Exploiting synthetic data has been shown to lead to improvements in APE (Bojar et al. 2017; Chatterjee et al. 2018) and in NMT systems (Sennrich et al. 2016b; Poncelas et al. 2018). A detailed summary is presented in do Carmo et al. (2020). However, Poncelas et al. (2018) show that using excessive synthetic (i.e. backtranslated) data can lead to a deterioration of quality. Nevertheless, in an industry environment where human post-editing is a standard procedure, a sufficient quantity of closed-access triplet data is often available (Crego et al. 2016; Mathur et al. 2017). This was the case in the collaborative project between the ADAPT centre and Microsoft described in this article. This data typically (i) is optimised towards the domain of application (i.e. with respect to terminology, style, etc.) and (ii) conforms with the quality standard requirements.

In this article, we present NPE systems trained on industry-standard data for the English–German (EN–DE) and English–Spanish (EN–ES) language pairs. And while the data we used is not publicly available, this article aims to convey our knowledge and experience on such data, making it easier for researchers to understand industry requirements and to provide solutions that apply not only in academic but also in commercial conditions.

1.3 The ADAPT-Microsoft APE project

The systems we present in this article are the result of a collaboration between the ADAPT Centre and the Microsoft GSX Language Technology group that took place between May and September 2018.

This collaborative project aimed to test the use of NPE in a commercial environment and with industry-standard data. The data, provided by Microsoft, is part of the production data. Within the scope of this project, we considered two language pairs: English–German (EN–DE) and English–Spanish (EN–ES), exploited in two rounds.

The project was divided into two main stages: stage i, state-of-the-art review and analysis, and stage ii, implementation and empirical evaluation.

In stage i we conducted a review of the state-of-the-art of APE systems. The purpose of this first stage was to inform the empirical one, i.e. stage ii, and guarantee that the best technology available was employed for the purposes of the project. A summary and analysis of state-of-the-art APE systems is presented in do Carmo et al. (2020).

Stage ii started with a full analysis of the data provided by Microsoft. Section 3 of this article summarises our findings with respect to the data. This analysis allowed us to identify specific features in the data that shaped the models tested at the next stage.

The implementation and evaluation part was divided in two. First, we trained and evaluated NPE systems with EN–DE data. This round ran as a standalone project encompassing data analysis, data preprocessing, deciding on NPE systems to train, and training and assessment, all using only EN–DE data. Based on the results, observations and acquired knowledge from the first round, we conducted a second set of experiments with EN–ES data. That is, we selected some of the best approaches and we ran experiments on EN–ES data, with fewer systems involved. Overall we trained 15 different systems (11 systems with EN–DE data and 4 with EN–ES data), exploring diverse setups with and without augmentation of the input data. These systems are described in Sect. 4.

In the evaluation part we collected standard edit scores—TER (Snover et al. 2006) and BLEU (Papineni et al. 2002)—of the different systems under comparison. A detailed analysis of the results was also performed. Our evaluation results are reported in Sect. 5.

2 Neural APE

In this work, we analyse and empirically evaluate different NPE approaches and present the most efficient ones. The ultimate goal of this work is to inform the reader about the end-to-end process for achieving high-quality APE output, along with the conditions and limitations of the various approaches. In addition, we exploit industry data composed of triplets (\(\{src, mt, pe\}\)) where the post-edited segments originate from professional translators. Thus, we aim to draw a roadmap over existing neural APE techniques. The major decision points are related to (i) the neural architecture and (ii) how the data is used for training the NPE system. We do not investigate the effects of adapting low-level settings of the neural systems, such as the learning optimiser, the size of the neural networks, etc., as they are outside the scope of this project; we consider the default settings to be effective for our tasks. Junczys-Dowmunt and Grundkiewicz (2017) present an analysis of various sequence-to-sequence architectures for APE, covering the differences between the various attention mechanisms and their effects on APE quality. Our work aims to explore various architectures in a commercial environment and it focuses on the dependencies between data and model architectures.

As noted in Sect. 1, the APE task involves handling multi-source input and a single-source output (Input: \(\{src, mt\}\), Output: \(pe\)). In a sequence-to-sequence encoder-decoder architecture, the multi-source input can be handled with either a single encoder or with multiple encoders. That would impose different requirements towards preprocessing the data and building dictionaries. For example, if a single encoder is used, then the SRC and MT need to be concatenated and a joint dictionary needs to be built.

A double-encoder NMT model, where SRC and MT inputs are encoded separately and the corresponding context vectors are used together as input to the decoder, would require separate dictionaries: two for the SRC and MT and one for the PE data. In the latter case the dictionary size would be smaller than the one of a joint dictionary in the former (single-encoder) case.

Another decision on how to approach NPE is whether to use ensembles of models or a single-model NMT system. To handle different types of input/output pairs (e.g. based on character count or word count) we can either implement an ensemble NMT system, where different networks will be trained on the different input/output pairs, or one single network trained on data carrying extra information regarding its type. In this case, a prefix token can be added to each input pair or triplet, which identifies the type(s) of input. While ensembling is widely used for NMT, quality estimation, and APE, it complicates the architecture, adding an extra layer of training and optimisation. This type of engineering overhead makes it prohibitive for large-scale QE employment that is required in a commercial workflow. For example, Microsoft operates with more than 80 languages, therefore it is easy to see how important it is to choose an efficient and scalable approach for production. The latter approach, i.e. adding a prefix token to the input sequence, follows from transfer learning and has been employed successfully on multi-language MT (Johnson et al. 2017; Mattoni et al. 2017), gender identification in MT (Vanmassenhove et al. 2018) and controllability in MT (e.g., to manage forms of politeness (Sennrich et al. 2016a)). Using a prefix token allows only one system to be trained jointly to perform APE on different types of input. To our knowledge, at the time of conducting the experiments, our work was the first to employ such an approach for APE.

For NMT, splitting words into subword units has led to state-of-the-art results. The most common method is unsupervised Byte Pair Encoding (Sennrich et al. 2016c), BPE in short, a fast and language-independent method. Other methods based on morphology (e.g. based on Morfessor (Creutz et al. 2005; Smit et al. 2014)) have also led to good results for specific languages (Ataman and Federico 2018). Their main drawback is the language dependency. Segmenting tokens into their basic building blocks, i.e. characters, has been explored for character-based NMT (Lee et al. 2017) and also for character-based APE (Junczys-Dowmunt and Grundkiewicz 2017). For our NPE systems, we considered words and subword units generated with BPE. Increasing the granularity of subword units to characters would imply that the APE system would have to learn how to correct the spelling of specific words, a task that is computationally more expensive and unnecessary for our use case (the dataset originates from SMT systems and it is therefore expected to be correctly spelled).
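For readers unfamiliar with BPE, the toy sketch below learns merge operations by repeatedly fusing the most frequent adjacent symbol pair, which is the core idea of Sennrich et al. (2016c). It is only an illustration: the function name `learn_bpe` and the end-of-word marker `</w>` are our choices, ties are broken by first occurrence, and it is not the tooling used to produce the subword units in our experiments.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` BPE merge operations from a word-frequency table."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the best pair fused into one symbol.
        new_vocab: Dict[Tuple[str, ...], int] = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges
```

The number of merge operations controls the granularity: few merges leave the segmentation close to characters, while many merges approach a word-level vocabulary.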

Another decision point is whether to extend the training data with data-specific features. Similar to Hokamp (2017), an option is to add syntactic features as factors and train a factored NMT system.

These different decision points are mapped in Fig. 1 in the form of a roadmap. We follow this roadmap to systematically construct our experiments and train the corresponding systems. Such a systematic empirical evaluation aims to inform the reader of the possible options and their implications in constructing APE systems for other use cases.

3 Data analysis

The data provided by Microsoft (actual production data) constitutes a collection of 201,000 triplets, translated from English into two different target languages: German (EN–DE) and Spanish (EN–ES). These triplets include the source English string (SRC), its machine translation (MT), originating from an SMT system at different moments in time, and the corrected version created by human post-editing (PE). The data consisted of user interface (UI) strings, that is, menu entries, help messages, etc. from different Microsoft software products.

The following analysis was produced only for the EN–DE language pair, during the first round of stage ii of the project.

3.1 Pre-processing

First, we performed a data analysis on the English-German data to identify issues and irregularities that might impede the performance of an APE system. We investigated strings such as untranslatable items, file names/paths/locations, hyperlinks, markup, structured alphanumericals, and so on, which are very frequent in UI data. The inappropriate handling of these strings, e.g., incorrect tokenisation, could trigger an APE system to post-edit an already correct translation, i.e. the problem of overcorrection. As a result, we proposed and implemented a pre-processing step to clean and normalise the data. This step included the normalisation of spaces, punctuation, quotes, and other special characters.

Segment duplication was also analysed. It is a typical situation in production scenarios, where the same segment may be produced in different projects and translated repeatedly. We identified 0.4% of the data as full duplicates—same SRC, MT and PE—and removed them.
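The two cleaning steps described above, normalisation and removal of full duplicates, can be sketched as below. The concrete rules are assumptions for illustration only: the article does not list the exact characters that were normalised, so the quote and space mappings in `QUOTE_MAP` and the function names are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (src, mt, pe)

# Assumed normalisation table: typographic quotes and non-breaking spaces.
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u201e": '"',  # double typographic quotes
    "\u2018": "'", "\u2019": "'",                  # single typographic quotes
    "\u00a0": " ",                                 # non-breaking space
})


def normalise(text: str) -> str:
    """Map typographic quotes to ASCII and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.translate(QUOTE_MAP)).strip()


def drop_full_duplicates(triplets: Iterable[Triplet]) -> List[Triplet]:
    """Keep the first occurrence of each (src, mt, pe); later copies are dropped."""
    seen = set()
    unique = []
    for triplet in triplets:
        key = tuple(triplet)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Only triplets that match on all three fields are treated as duplicates, so the same source with a different MT output or post-edit is retained.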

We also noted that some segments in the provided data form small paragraphs, containing more than one sentence. We analysed the distribution of such segments and identified that 22% of the segments contained more than one sentence, and only 0.4% of the segments contained more than five sentences. We assessed their structure and found it impractical to segment these to the sentence level. It is typical in NMT and APE to cut sentences to around 60 tokens for efficiency and performance purposes. To accommodate these long sentences, we extended this cut-off limit to 300 and 150 tokens, for concatenated and multi-source systems respectively (see Sect. 4).

We implemented a pre-processing step that tackles the aforementioned issues and further cleans the data. While some of the pre-processing is language independent, e.g., file names/paths/locations, other modifications are language dependent. In Sect. 4.1 we discuss the employment of this step in our experiments for the EN–DE data.

3.2 Partitions

Next, we analysed the data with respect to different common characteristics that might show similar patterns and potentially guide the APE systems. Besides the triplets of SRC, MT, PE, the provided dataset included information such as the software package that was translated, the translation project, a timestamp and other metadata. We hypothesised that each of these metadata categories could act as a factor in grouping the data into partitions that share common features. These common features in turn would help the decoding process to find a better post-edited candidate. We studied several ways to partition the data and used a criterion of relevance, based on the distribution of data across the classes, to decide which would be used in the experiments. Below, we present the main partitions we considered, together with the factors and reasons for focusing on them.

Length: Microsoft provided a word count of each source segment, produced with Microsoft’s internal tools. Microsoft wanted to receive results and observations for subsets of the data based on the following length intervals: (i) 0–4; (ii) 5–9; (iii) 10–30; (iv) 31–\(\infty\). We distributed the data into 4 partitions based on the length of the source segments: \(Len_1, Len_2, Len_3, Len_4\).
Tenant: another useful metadata label is the tenant description. A tenant is a grouping of projects, according to Microsoft organisation. We used this information to form 14 partitions of our data according to the tenant label.
TenantPartition: because some partitions based on the tenant label contained a very small number of segments, we further organised the data into 6 partitions—the top five tenants as specific partitions, and the others into one single partition called “Other”. In the rest of this article we refer to this partitioning as “TenantPartition”.
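The Length and TenantPartition schemes above can be expressed as two small functions. This is a sketch under assumptions: the word count is taken as a given integer (in the project it came from Microsoft's internal tools), and the label strings `Len_1` to `Len_4` and the helper names are ours.

```python
from collections import Counter
from typing import List, Sequence


def length_partition(src_word_count: int) -> str:
    """Map a source word count to the intervals 0-4, 5-9, 10-30 and 31+."""
    if src_word_count <= 4:
        return "Len_1"
    if src_word_count <= 9:
        return "Len_2"
    if src_word_count <= 30:
        return "Len_3"
    return "Len_4"


def tenant_partitions(tenants: Sequence[str], top_n: int = 5) -> List[str]:
    """Keep the `top_n` most frequent tenant labels and map the rest to 'Other'."""
    top = {tenant for tenant, _ in Counter(tenants).most_common(top_n)}
    return [tenant if tenant in top else "Other" for tenant in tenants]
```

With `top_n=5` this yields the six TenantPartition classes, while leaving the labels untouched corresponds to the 14-way Tenant partitioning.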

Aside from these three classes for partitioning the data, we also considered Project and Number of sentences in a segment as candidate criteria, in addition to different ways to calculate word and token counts. However, these were ultimately not considered relevant because segments were distributed very unevenly according to these criteria. For example, there was a high number of projects (583), many of which contained a small number of segments:

The largest project, “DevSuitePortal”, had 16.5k segments (8.2%).
The second largest, “word-Office-ios”, had 9.6k segments (4.8%).
The top 5 cover 47.9k (23.8%), the top 10 cover 73.2k (36.4%).
350 projects had fewer than 100 segments.
94 projects had fewer than 5 segments.

As for the number of sentences in a segment, we analysed their distribution, and we identified that 88% of the segments contained 1 sentence, and there were only 0.4% segments with more than five sentences. Due to the skewed distribution, we decided not to use this as an informative feature for data partitioning.

3.3 Editing patterns

A brief analysis of some of the editing patterns in the data was also done at this stage, although these patterns were not intended to be used as components of the trained systems. A type/token ratio analysis and an analysis of unique trigrams showed that PE sentences had a richer vocabulary, which was used more consistently than in MT output. These observations agree with the findings presented in Vanmassenhove et al. (2019) and Toral (2019) about the differences in lexical richness between human and machine translated text.
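The two lexical measures mentioned above can be computed as in the following sketch. Whitespace tokenisation and the function names are assumptions made for illustration; the article does not specify the tooling behind the reported analysis.

```python
from typing import Iterable


def type_token_ratio(sentences: Iterable[str]) -> float:
    """Number of distinct tokens divided by the total number of tokens."""
    tokens = [token for sentence in sentences for token in sentence.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def unique_trigrams(sentences: Iterable[str]) -> int:
    """Count the distinct word trigrams observed within sentence boundaries."""
    trigrams = set()
    for sentence in sentences:
        tokens = sentence.split()
        trigrams.update(zip(tokens, tokens[1:], tokens[2:]))
    return len(trigrams)
```

Applying both functions to the MT side and to the PE side of the same triplets gives a directly comparable pair of values.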

We also observed that in 27.4% of the segments the MT output had not been post-edited. A close analysis of some of these segments shows segments composed of placeholders, numbers, URLs, or other non-translatable elements. The number of segments in which the PE content is the same as the SRC content is around 10%, again with some cases of untranslatable elements or placeholders.

The distribution of editing operations observed in the training data was as follows: a higher number of substitutions (22%), followed by deletions (15.4%), with insertions and shifts at similar proportions (ca. 4%). This distribution is typical of PE scenarios and it is more or less reproduced by the best APE systems (do Carmo et al. 2020).

3.4 Training, development and test data

The provided data consisted of 180,198 triplets of segments as training data (SRC,SMT → PE), 10,000 triplets as the test set and 10,000 as the development set. For the two language pairs the source side of the training data is the same; however, the test and development sets are different, since the sampling method used features not only of the source segments but also of the target segments.

Our data analysis also guided the selection of the development and test sets. We took into account the main features and partitions identified in the data. The features that were considered relevant for extracting a balanced sample of the dataset were: length of the segment (source words), token count of the PE data, the tenant, the TER scores estimated between MT and PE, and number of sentences in a segment. We selected randomised and stratified subsets for the training, development, and test sets.
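A randomised, stratified selection of this kind can be sketched as follows. The sketch assumes that each segment is assigned to a stratum by a user-supplied `key` function (for instance a combination of length class and tenant) and that development and test sets are drawn as fixed fractions of each stratum; both choices, and the function name, are hypothetical simplifications of the sampling actually used.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def stratified_split(
    items: Sequence[T],
    key: Callable[[T], Hashable],  # assumed: maps an item to its stratum
    dev_frac: float,
    test_frac: float,
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle within each stratum, then cut dev/test/train slices from it."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    train, dev, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_dev = round(len(group) * dev_frac)
        n_test = round(len(group) * test_frac)
        dev.extend(group[:n_dev])
        test.extend(group[n_dev:n_dev + n_test])
        train.extend(group[n_dev + n_test:])
    return train, dev, test
```

Sampling per stratum keeps the proportions of each class roughly equal across the three sets, which a plain random split does not guarantee for small classes.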

For both training and translation, we use tokenised, normal-cased sentences. We applied neither lowercase nor truecase to the tokens, but used their original form, in order to account for casing errors. By using original-case data, the NPE system would learn to recognise casing errors we aim to fix along with everything else.

4 Experiments setup and tested systems

We followed the roadmap of Fig. 1 and built 15 NPE systems, alternating between the different options on the choice points. We first trained and evaluated 11 systems for the EN–DE language pair. Following their evaluation, we used the parameters that led to the best performance to train 4 systems for the EN–ES language direction. Our empirical assessment then aimed to: (i) identify the best system for our use case and for the two language pairs and (ii) identify how different system variables affect the NPE performance.

4.1 Systems

Pre-processing: Following the discussion in Sect. 3 with respect to the EN–DE data, we added an extra decision point on our roadmap regarding the pre-processing of the data. There were two pre-processing methods—we could either use the original data as preprocessed by Microsoft, or the data that resulted from applying ADAPT’s pre-processing which fixes spaces, quotes, and other issues as presented in Sect. 3.1. That gave rise to two different types of systems.

We ought to note that while the ADAPT pre-processing led to better results (see later in Sect. 5) we decided to use Microsoft’s (pre-processed) data in the second experiment round with EN–ES data. The reason is two-fold: (i) a lot of the pre-processing is language dependent and (ii) the observed improvements are not big enough to justify the manual labour required to identify a good pre-processing procedure for the EN–ES data.

Tokens and dictionaries: The choice of word segmentation granularity impacts not only the data vocabulary (i.e., the system’s dictionary), but also the choice of encoder/decoder. We considered three different strategies to build dictionaries: (i) character-based; (ii) BPE, with 50k BPE operations; (iii) word-based. Each of these methods has been shown in the literature to have a positive impact on translation quality for particular use cases, and under different conditions any one of them may be preferred over the others. In a post-editing scenario it is important to learn how to correct complete words rather than sub-word particles (characters or BPE-based subwords). However, an important shortcoming of word-based dictionaries is their large vocabularies, which, if reduced for the model to fit in memory, may result in out-of-vocabulary (OOV) issues. At the other extreme, using characters as basic tokens implies long sequences that are hard to process with sequence-to-sequence models (Pascanu et al. 2012), both in terms of time and resources and because the performance of LSTM models diminishes on long sequences. To target this issue, convolutional neural networks (CNNs) have been successfully employed in MT (Lee et al. 2017) and APE (Varis and Bojar 2017). In this work we aim to address the performance of the more mainstream LSTM and Transformer models.

In our experiments we look into BPE- and word-based dictionaries.

Data augmentation: At this stage we had to decide whether to add extra input information and, if so, which. In Sect. 3.2 we identified several data partitions based on certain properties of the input/output data. We consider three partitions, (i) Length, (ii) TenantPartition and (iii) Tenant, as the most characteristic and use them to augment our data. To do so, we introduce an extra token in front of the input sequence that states the partition it belongs to. We refer to systems trained with extra information about the Length, TenantPartition and Tenant as Augmented 1, Augmented 2 and Augmented 3 respectively (see Tables 1 and 2).
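The prefix-token augmentation can be illustrated with the sketch below. The mapping from system name to metadata field follows the description above, but the token format `<field:label>`, the metadata field names and the function name are hypothetical; the article does not specify how the tokens were spelled.

```python
from typing import Dict

# System name -> metadata field used for the prefix token (field names assumed).
AUGMENTATIONS: Dict[str, str] = {
    "Augmented 1": "length",
    "Augmented 2": "tenant_partition",
    "Augmented 3": "tenant",
}


def augment(input_sequence: str, metadata: Dict[str, str], system: str) -> str:
    """Prepend one token naming the partition the segment belongs to."""
    field = AUGMENTATIONS[system]
    # Keep the label a single whitespace-free token so the tokeniser cannot split it.
    label = str(metadata[field]).replace(" ", "_")
    return f"<{field}:{label}> {input_sequence}"
```

For example, `augment("Save file", {"length": "Len_1"}, "Augmented 1")` returns `"<length:Len_1> Save file"`. Because the token is part of the input vocabulary, one model can condition on the partition without any change to the network architecture.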

Similar to Hokamp (2017), we also explored features based on part-of-speech (POS) tags and dependency parses. However, our preliminary experiments showed that in a scenario where the SRC and/or MT segments do not constitute well-formed sentences, as is the case for the UI data in our use case, adding such features impedes the performance of the system. Our results were below an acceptable level and, after the preliminary tests, we discontinued experimenting with word-level linguistic features and focused on the prefix-token augmentation. We ought to note that while our results in this specific use case led us to discard linguistic features, other types of word-level features may contribute to the overall performance. However, this is beyond the scope of our work and we did not pursue this direction.

Input representation: Given that the input consists of two types of sequences, the SRC and the SMT, we can choose between (i) single-sequence input, where SRC and SMT are concatenated, and (ii) multi-sequence input, where SRC and SMT are fed separately. The former case implies the use of one encoder, while the latter requires two separate encoders, one for the SRC and another for the SMT sequences.

Furthermore, in the single sequence case, it is important to consider the order in which sentences are presented—either the SRC is the first part of the sequence followed by the SMT, or the other way round. We used concatenated input, i.e. single sequence input, with the SRC being the first part of the input. We also tested multi-sequence input, where SRC and SMT are provided as two different inputs to a multi-encoder architecture.
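The two input representations can be sketched as below, together with the token limits from Sect. 3.1 (300 for concatenated and 150 for multi-source input). Several details are assumptions: the separator token `<sep>`, the decision to discard over-length examples rather than truncate them, and the function names.

```python
from typing import List, Optional, Tuple

SEP = "<sep>"  # hypothetical separator between SRC and SMT


def single_sequence_input(
    src_tokens: List[str], smt_tokens: List[str], max_len: int = 300
) -> Optional[List[str]]:
    """Concatenate SRC first, then SMT, for a single-encoder model."""
    sequence = src_tokens + [SEP] + smt_tokens
    # Assumption: over-length examples are skipped (None) rather than truncated.
    return sequence if len(sequence) <= max_len else None


def multi_sequence_input(
    src_tokens: List[str], smt_tokens: List[str], max_len: int = 150
) -> Optional[Tuple[List[str], List[str]]]:
    """Keep SRC and SMT as separate inputs for a multi-encoder model."""
    if len(src_tokens) > max_len or len(smt_tokens) > max_len:
        return None
    return src_tokens, smt_tokens
```

The concatenated variant needs a joint dictionary covering both languages, whereas the multi-source variant can use one dictionary per encoder.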

Ensembles: As justified in Sect. 2, we do not explore system combination with ensembles. In our exploration we focus on single-model systems only.

Based on these variables, we implemented a set of APE systems using the Marian-NMT toolkit. The set of tested systems contains 8 Vanilla systems (when we only train with the provided language data and we test different NMT implementations) and three Augmented systems (when extra information is added to the data). This strategy allowed us to incrementally test characteristics from different state-of-the-art systems, and to explore knowledge retrieved during the data analysis, as presented in Sect. 3. Table 1 summarises the systems and the system options we trained for the EN–DE language pair.

Table 1 NPE systems and the choice of variable options for the EN–DE language pair
