1 Introduction

Neural machine translation (briefly, NMT), which translates a sentence from a source language into a target language with deep neural networks, has made great progress in recent years (Wu et al. 2016; Hassan et al. 2018; Luong et al. 2016). Despite the success of previous works, most of them focus only on improving the general translation quality, where each word contributes equally to the evaluation metrics.

However, not all words in a sentence are equally important, and different words can have different impacts on translation quality as judged by humans. Intuitively, entities are likely to carry the critical information in a sentence and are therefore more important for translation quality (Li et al. 2018; Post et al. 2019; Niehues and Cho 2017); detailed statistics are given in Sect. 5.1. As a simple example, consider two translation systems A and B and a source sentence with ground-truth translation “Both Alice and Bob live in Washington State.” Suppose the translation result of system A is “Both Alice and Bob live in Wisconsin State.”, and that of system B is “Both Alice and Bob live at Washington State.”. Although each system makes a mistake on a single word, we clearly judge system B to be the better one, since it correctly conveys the important information about the location. That is, named entities (e.g., locations, numbers) are important for the user experience.

Therefore, in this work, we focus on improving the translation quality of named entities. Unfortunately, entities are not easy to translate: previous studies (Hassan et al. 2018) have shown that today’s NMT systems do not perform well on entity translation.

In previous work, the general solution is to add extra entity information to the NMT system at both training and inference time. This solution requires multiple steps: first, entities are detected in both the source and target sentences with a named entity recognition (briefly, NER) tool; second, the entities are tagged, e.g., replaced with special placeholders (Wang et al. 2017; Post et al. 2019; Li et al. 2018), wrapped with special tokens that indicate their boundaries (Li et al. 2018; Modrzejewski et al. 2020), rewritten with a code-switching method (Song et al. 2019), or labeled with entity embeddings to enhance the translation (Sennrich and Haddow 2016; Niehues and Cho 2017; Ugawa et al. 2018); finally, NMT models are trained on the processed sentences. At inference time, the input must be processed by the NER tool, translated into the target language, and postprocessed (e.g., replacing placeholders back, removing extra tags) to obtain the final translation.

The accuracy of recognizing named entities greatly affects the translation quality. With the development of pre-training (Devlin et al. 2019; Liu et al. 2019), NER tools have been significantly improved (Burtsev et al. 2018; Luo et al. 2020). However, NER modules built upon pre-trained models are even heavier than the NMT models themselves. For instance, the number of parameters of DeepPavlov (Burtsev et al. 2018), one of the state-of-the-art NER models, is more than ten times that of the Transformer models used for NMT in industry (Kim et al. 2019), because the DeepPavlov models are built on large-scale pre-training. As a result, it is not feasible to directly integrate such a heavy NER module at inference time due to the large overhead.

In this work, we design an end-to-end entity-aware NMT model in which both the encoder and the decoder can serve as named entity recognizers, yet there is no extra cost at inference time. During training, similar to the aforementioned works, we leverage an NER tool to provide entity tags for the source and target sentences in the training corpus. When training the translation model, in addition to the translation loss, we add an NER loss to both the encoder and the decoder so that they learn to correctly recognize the entities tagged by the NER tool. Furthermore, to pay more attention to named entities and differentiate them from other words, we assign the entities in the target sequence larger weights, inspired by the focal loss (Lin et al. 2017). In this way, the NER module and the translation network are closely coupled and collaborate through end-to-end training, boosting the performance of both tasks. The inference process is the same as that of standard NMT and incurs no additional cost. This allows us to use an arbitrarily heavy, high-quality NER model during training without hurting inference efficiency.

To summarize, the main contributions of this work are three-fold:

  • We introduce a novel end-to-end method to improve the translation quality, especially for the accurate translation of named entities in sentences.

  • Compared with previous methods, we keep the one-pass decoding process without depending on a heavy NER model at inference time.

  • Experiments on six translation tasks extensively verify the effectiveness of our method. According to the results, our method improves both the BLEU score and the entity \(F_{1}\) score.

2 Related work

Enhancing NMT systems with external knowledge has been a promising research direction in recent years. For example, in Lu et al. (2018), Zhao et al. (2020), knowledge graphs are incorporated into the machine translation task, and in Zhu et al. (2020), Clinchant et al. (2019), Yang et al. (2020), Shavarani and Sarkar (2021), pre-trained language models are used to enhance translation. In this work, named entity information is leveraged to boost performance. Named entities are an important topic in NLP, and many previous works aim to improve entity translation quality. The common approach is to introduce entity information into the NMT system, so that the translation model can handle the entities in input sentences better with the help of such information. Previous work uses the entity information in different ways, detailed as follows:

Placeholder In this category of methods, entities in the source sentences are masked by placeholders. Wang et al. (2017) use a $TERM token to mask person names, and Post et al. (2019) mask various entity tokens such as numbers, names, cities, and emoji. In Li et al. (2018), entities are masked by type and index, e.g., LOC1, LOC2, etc. After translation, the placeholders are replaced back with target-language entities, either by entity index or by alignment.

Special tokens In Li et al. (2018), Modrzejewski et al. (2020), special tokens are used to indicate the beginning and end of entities in source sentences. For example, “Hyrule” in a source sentence becomes “<LOC>Hyrule</LOC>” after preprocessing, indicating that the word is a location. After translation, the extra tokens are removed from the model output.

Code-switching In Song et al. (2019), the authors apply a code-switching method to entity words: source-side entities are replaced by their corresponding translations in the target language. After such preprocessing, the input to the model is a mixture of the source and target languages, so the NMT model only needs to copy those tokens.

Entity embedding Embedding-based methods are another important direction in previous work. In Sennrich and Haddow (2016), Niehues and Cho (2017), linguistic input features are used to improve model quality, and in Ugawa et al. (2018), an entity embedding is added to the token embedding to enhance the sentence representation.

Despite the success of previous work, the complexity of those methods remains an obstacle to their use in real scenarios. In particular, the extra cost of NER is not negligible and significantly increases decoding latency. Compared with the existing methods, our system has almost zero extra cost at inference time and achieves better performance.

3 Our methods

In this section, we first introduce the notation, then describe the network architecture in Sect. 3.1 and the training strategy in Sect. 3.2.

Notations Let \(X=(X_0, X_1, \cdots , X_{M-1})\) denote a source sequence of length M, and let \(Y=(Y_0, Y_1, \cdots , Y_{N-1})\) denote the corresponding target sequence of length N. \(X_i\) and \(Y_i\) represent the i-th tokens of X and Y, which can be words or subwords (Sennrich et al. 2016) in natural language. Let \(X^{\text {ne}}\) and \(Y^{\text {ne}}\) denote the entity sequences for X and Y, where \(X^{\text {ne}}_i\) and \(Y^{\text {ne}}_i\) are the named entity tags of \(X_i\) and \(Y_i\), respectively. The entities are represented with IOB tagging, where “B” marks the beginning of an entity, “I” marks the tokens inside an entity up to its end, and “O” means that the token is outside any entity.
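For illustration, the introduction’s example sentence would be tagged as follows (a minimal sketch; in practice the IOB prefix is combined with an entity type, and the exact tag strings here are our assumption):

```python
# Hypothetical IOB tags for the example sentence from the introduction.
tokens = ["Both", "Alice", "and", "Bob", "live", "in", "Washington", "State", "."]
tags   = ["O", "B-PERSON", "O", "B-PERSON", "O", "O", "B-GPE", "I-GPE", "O"]
```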

Following the multilingual version of the DeepPavlov NER model,Footnote 1 we have 19 different kinds of entities in total, which constitute a set \(\mathbb {N}\). There is a special token O in \(\mathbb {N}\), which indicates that a token is not part of a named entity. The full list of supported entity types can be found in the Appendix.

3.1 Network architecture

We use Transformer (Vaswani et al. 2017) as the backbone of our model, where the encoder and decoder are modified to be an entity-enhanced version. However, our technique can be easily integrated into other encoder-decoder based models as well. The network architecture is shown in Fig. 1.

Fig. 1 Network architecture

Entity-enhanced encoder Let \(\texttt {enc}\) denote the encoder of the standard Transformer, made up of several stacked blocks, each consisting of a self-attention layer and a feed-forward layer. Given the input X, \(\texttt {enc}\) processes it into hidden representations, mathematically defined as \(H^{\text {src}}=\texttt {enc}(X)\). \(H^{\text {src}}\), the output of the last block of \(\texttt {enc}\), is an \(M\times d\) matrix whose i-th row \(H^{\text {src}}_i\) is the representation of token \(X_i\), where d is the embedding dimension.

After that, the encoder works as follows:

$$\begin{aligned} \begin{aligned} H^{\text {ne}}&= \texttt {ReLU}(H^{\text {src}}W_{\text {ne}}^{src}), \\ H^{\text {enc}}&= H^{\text {src}} + H^{\text {ne}}, \\ \hat{X}^{\text {ne}}&=\texttt {softmax}(H^{\text {ne}}E^{\text {s-ne}}), \end{aligned} \end{aligned}$$
(1)

where \(W_{\text {ne}}^{src}\) is a \(d\times d\) matrix to be learned, \(E^{\text {s-ne}}\) is the entity embedding of the source language with size \(d\times \vert \mathbb {N}\vert \), and \(\hat{X}^{\text {ne}}\) is the matrix of predicted entity tags for X. \(\hat{X}^{\text {ne}}\) is required only during training; we do not need it at inference time.

In Eqn. (1), the representation \(H^{\text {src}}\) is fed into a feed-forward layer to obtain \(H^{\text {ne}}\). After applying an affine transformation and a softmax operation to \(H^{\text {ne}}\), we obtain the predicted entities \(\hat{X}^{\text {ne}}\) of the input sequence X. We minimize the difference between \(\hat{X}^{\text {ne}}\) and \(X^{\text {ne}}\) (i.e., the tags output by the NER tool) so that \(H^{\text {ne}}\) can be regarded as entity features. We add \(H^{\text {src}}\) and \(H^{\text {ne}}\) together as the final output of the encoder and feed it into the decoder. In this way, both the named entity information and the semantic information carried by the natural words are passed to the decoder.
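A minimal PyTorch sketch of the encoder-side entity head in Eqn. (1) is given below; the class and variable names are ours, not from any released code.

```python
import torch
import torch.nn as nn

class EntityEnhancedEncoderHead(nn.Module):
    """Sketch of Eqn. (1): an entity feature branch on top of the
    standard Transformer encoder output H^src (an M x d matrix)."""

    def __init__(self, d_model: int, num_entity_tags: int):
        super().__init__()
        self.w_ne = nn.Linear(d_model, d_model, bias=False)              # W_ne^src
        self.ent_emb = nn.Linear(d_model, num_entity_tags, bias=False)   # E^{s-ne}

    def forward(self, h_src: torch.Tensor):
        h_ne = torch.relu(self.w_ne(h_src))     # entity features H^ne
        h_enc = h_src + h_ne                    # fed to the decoder
        ent_logits = self.ent_emb(h_ne)         # softmax is applied in the loss
        return h_enc, ent_logits
```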

Entity-enhanced decoder Similarly, we define \(\texttt {dec}\) as the decoder of the standard Transformer, which is also made up of a series of blocks. Besides a self-attention layer and a feed-forward layer, each block contains an additional encoder-decoder attention layer, which aggregates the information from the encoder, i.e., \(H^{\text {enc}}\). Let \(Y_{<t}\) denote the sub-sequence \((Y_0,Y_1,\cdots ,Y_{t-1})\), where \(Y_0\) is a special token indicating the beginning of a sentence. The decoder works as follows:

$$\begin{aligned} \begin{aligned} H^{\text {tgt}}_t&=\texttt {dec}(H^{\text {enc}}, Y_{<t}),\; \\ \hat{Y}^{\text {ne}}_t&=\texttt {softmax}(\texttt {ReLU}(H^{\text {tgt}}_t W^{\text {tgt}}_{\text {ne}})E^{\text {t-ne}}),\; \\ \hat{Y}_t&= \texttt {softmax}(H^{\text {tgt}}_tE^{\text {t}}), \end{aligned} \end{aligned}$$
(2)

where \(E^{\text {t-ne}}\) is the entity embedding of the target language, \(E^{\text {t}}\) is the embedding of target words, and \(W_{\text {ne}}^{\text {tgt}}\) is a \(d\times d\) affine matrix. Specifically, we take the representation from the last block of the decoder, transform it into the entity embedding space, and then apply separate softmax operations to obtain the predicted translated token and the predicted entity tag, respectively.

At inference time, we can skip the entity tag prediction and keep only the token prediction, which leads to a decoding cost similar to that of standard NMT methods.
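A matching sketch of the decoder-side heads in Eqn. (2) follows; the `predict_entities` flag mirrors the option to skip the entity head at inference time (names are again ours).

```python
import torch
import torch.nn as nn

class EntityEnhancedDecoderHead(nn.Module):
    """Sketch of Eqn. (2): a token-prediction head plus an optional
    entity-tag head on top of the decoder output H^tgt."""

    def __init__(self, d_model: int, vocab_size: int, num_entity_tags: int):
        super().__init__()
        self.w_ne = nn.Linear(d_model, d_model, bias=False)              # W_ne^tgt
        self.ent_emb = nn.Linear(d_model, num_entity_tags, bias=False)   # E^{t-ne}
        self.tok_emb = nn.Linear(d_model, vocab_size, bias=False)        # E^t

    def forward(self, h_tgt: torch.Tensor, predict_entities: bool = True):
        tok_logits = self.tok_emb(h_tgt)        # token prediction Y_hat
        ent_logits = None
        if predict_entities:                    # skipped at inference time
            ent_logits = self.ent_emb(torch.relu(self.w_ne(h_tgt)))
        return tok_logits, ent_logits
```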

Discussion The key challenge in this task is that no entity labels are available at inference time. To solve this, we adopt a multi-task framework to build the enhanced encoder and decoder, where the primary task is machine translation and the two auxiliary tasks are source-side and target-side named entity detection. Previous work (Niehues and Cho 2017) shows that multi-task learning can improve performance. Therefore, we not only improve the accuracy on named entities but also regularize the training. For the NER classification, we could also leverage the outputs of internal blocks of the encoder and decoder; we empirically verified the effect of these different choices and found no significant difference compared with using the output of the last block.

3.2 Training and inference strategies

Let \(\theta \) denote the parameters of \(\texttt {enc}\), \(\texttt {dec}\) and word embeddings. Let \(\theta _{\text {s-ne}}\) and \(\theta _{\text {t-ne}}\) denote the parameters related to source-side NER and target-side NER.

The training loss consists of the following three parts:

$$\begin{aligned} \begin{aligned}&\ell _{\text {s-ne}} = -\frac{1}{M}\sum _{i=0}^{M-1}\log P(X^{\text {ne}}_i|X;\theta ,\theta _{\text {s-ne}}),\; \\&\ell _{\text {t-ne}} = -\frac{1}{N}\sum _{j=0}^{N-1}\log P(Y^{\text {ne}}_j|X,Y_{<j};\theta ,\theta _{\text {s-ne}},\theta _{\text {t-ne}}),\\&\ell _{\text {mt}} = -\frac{1}{N}\sum _{t=0}^{N-1}(1 + P_{\text {NE},t}^\gamma )\log P(Y_t|Y_{<t},X;\theta ,\theta _{\text {s-ne}}),\, \\&P_{\text {NE},t}=P(Y^{\text {ne}}_t\ne \texttt {O}|X,Y_{<t};\theta ,\theta _{\text {s-ne}},\theta _{\text {t-ne}}), \end{aligned} \end{aligned}$$
(3)

where \(\ell _{\text {s-ne}}\) and \(\ell _{\text {t-ne}}\) are the two named entity recognition losses and \(\ell _{\text {mt}}\) is the translation loss. Because human annotations of source and target entity labels are unavailable, the labels extracted by DeepPavlov are used to compute both \(\ell _{\text {s-ne}}\) and \(\ell _{\text {t-ne}}\). For the translation loss \(\ell _{\text {mt}}\), since we want to emphasize entity tokens, we design an adaptive weighting inspired by the focal loss (Lin et al. 2017): \(P_{\text {NE},t}\) is the probability that \(Y_t\) is an entity token (instead of \(\texttt {O}\)). The more likely a token is an entity, the larger the weight we assign to it. Following Lin et al. (2017), the weight is controlled by a positive hyper-parameter \(\gamma \) for flexibility. The weight of each token is at least one, which stabilizes training.

The final training objective of the entity-enhanced NMT model on a data pair (X, Y) is

$$\begin{aligned} \ell = \ell _{\text {mt}} + \alpha \ell _{\text {s-ne}} + \beta \ell _{\text {t-ne}}, \end{aligned}$$
(4)

where \(\alpha \) and \(\beta \) are hyper-parameters tuned according to validation performance. In practice, the hyper-parameter setting in all our experiments is \(\gamma =1.0,\alpha =0.5,\beta =0.5\).
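The combined objective of Eqns. (3) and (4) can be sketched as follows (batch dimension omitted; tensor shapes, the tag index for \(\texttt {O}\), and the detachment of the weight are our assumptions, since the text does not specify them):

```python
import torch
import torch.nn.functional as F

O_TAG = 0  # assume the non-entity tag "O" has index 0 (an assumption)

def entity_aware_loss(tok_logits, src_ent_logits, tgt_ent_logits,
                      y, x_ne, y_ne, gamma=1.0, alpha=0.5, beta=0.5):
    """Sketch of Eqns. (3)-(4).
    tok_logits: N x |V|; src_ent_logits: M x |N|; tgt_ent_logits: N x |N|;
    y: N target token ids; x_ne, y_ne: entity tag ids from the NER tool."""
    # Source- and target-side NER losses (mean cross-entropy = -1/M sum log P).
    l_s_ne = F.cross_entropy(src_ent_logits, x_ne)
    l_t_ne = F.cross_entropy(tgt_ent_logits, y_ne)

    # Focal-style weight 1 + P_NE,t^gamma, where P_NE,t = P(tag != O).
    # Detaching the weight is our implementation choice; the paper does
    # not say whether gradients flow through it.
    p_ne = 1.0 - F.softmax(tgt_ent_logits, dim=-1)[:, O_TAG]
    weight = 1.0 + p_ne.detach() ** gamma

    nll = F.cross_entropy(tok_logits, y, reduction="none")
    l_mt = (weight * nll).mean()

    return l_mt + alpha * l_s_ne + beta * l_t_ne
```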

At inference time, we are interested only in the translation, so we ignore the named entity recognition modules. Specifically, the aforementioned \(\hat{X}^{\text {ne}}\) and \(\hat{Y}^{\text {ne}}_t\) only affect the training process; our method therefore maintains efficiency at inference.

The NER module in the decoder is likewise disabled, and we only generate the translated sentence.

4 Experiments

Data processing We conduct experiments on the translation of four languages: English, German, Chinese, and Japanese, denoted En, De, Zh, and Ja, respectively. This covers both a linguistically close language pair, En\(\leftrightarrow \)De, and more distant pairs such as En\(\leftrightarrow \)Zh and En\(\leftrightarrow \)Ja. We follow Ott et al. (2019) to process the data for IWSLT’14 En\(\leftrightarrow \)De, where all words are lowercased and tokenized. We follow Zhu et al. (2020) to process the data for IWSLT’17 En\(\leftrightarrow \)Zh. For En\(\leftrightarrow \)Ja, we follow Michel and Neubig (2018), Wang et al. (2019) and combine the training sets of KFTT, JESC, and TED talks (Neubig 2011; Pryzant et al. 2018; Cettolo et al. 2012), testing on the corresponding test sets separately. For En\(\leftrightarrow \)Zh, we use the Moses and Jieba tokenizers, respectively, and then apply BPE to split words into subwords. For Ja, we use SentencePiece directly. Detailed information is given in Table 1, and the URLs of the data and tools are in the Appendix.

Table 1 Dataset statistics

In practice, the tokenizer used by the NER tool differs from the one used in NMT preprocessing. To solve this problem, we leverage the fact that tokenization only affects the non-space characters: we align the entity tags with the NMT data by character overlap and adjust the IOB notation accordingly. An example is shown in Table 2, where the location entity “Winterfell” is split into three parts by BPE and then assigned tags accordingly.

Table 2 Example of subword entity tags assignment
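A simplified sketch of the tag projection illustrated in Table 2 follows. It assumes word boundaries already match and uses the common `@@` BPE continuation marker; the paper’s actual alignment works by character overlap across different tokenizers, and the helper name is ours.

```python
def align_tags_to_subwords(subwords, word_tags):
    """Project word-level IOB tags onto BPE subwords, where subwords
    use the '@@' continuation marker, e.g. ['Win@@', 'ter@@', 'fell']."""
    sub_tags, word_idx, continuing = [], 0, False
    for piece in subwords:
        tag = word_tags[word_idx]
        if continuing and tag.startswith("B-"):
            tag = "I-" + tag[2:]   # only the first piece keeps the B- prefix
        sub_tags.append(tag)
        continuing = piece.endswith("@@")
        if not continuing:
            word_idx += 1          # move on once the word is complete
    return sub_tags

# e.g. align_tags_to_subwords(['Win@@', 'ter@@', 'fell', 'is', 'north'],
#                             ['B-LOC', 'O', 'O'])
# -> ['B-LOC', 'I-LOC', 'I-LOC', 'O', 'O']
```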

Entity-rich test sets To better evaluate the performance of our method, we build two extra entity-rich test sets, where ER is short for entity-rich: (1) ER-IWSLT: for En\(\rightarrow \)De, we concatenate the IWSLT test sets from 2010 to 2017 and the IWSLT-10 validation sets into one larger set. (2) ER-WMT: for En\(\rightarrow \)De, we concatenate the WMT test sets from 2014 to 2019. We then filter the sentences with different thresholds on the number of entities per sentence on the English side and report the corresponding scores.
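The filtering step is a simple threshold; a sketch, with names of our choosing:

```python
def filter_entity_rich(sent_pairs, en_entity_counts, min_entities):
    """Keep pairs whose English side has at least `min_entities` entities;
    the per-sentence counts come from the NER tool."""
    return [pair for pair, n in zip(sent_pairs, en_entity_counts)
            if n >= min_entities]
```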

Configuration The backbone models consist of six layers in both the encoder and the decoder. In the transformer_small configuration, the embedding dimension, feed-forward layer dimension, and dropout rate are 256, 1024, and 0.3; in the transformer_base setting, they are 512, 2048, and 0.1, respectively. Following Vaswani et al. (2017), all models are trained with learning rate \(5\times 10^{-4}\) using the Adam optimizer (Kingma and Ba 2015) with the inverse_sqrt learning rate scheduler (Vaswani et al. 2017) and 4096 tokens per GPU. The transformer_small models are trained on a single P40 GPU, while the transformer_base models are trained on 4 P40 GPUs.

Evaluation We evaluate both translation quality and entity accuracy. For En \(\leftrightarrow \) De, we use the multi-bleu.perl scriptFootnote 2 to compute the BLEU score, for fair comparison with previous works. For the other language pairs, we use sacreBLEU (Post 2018). We use beam size 5 and length penalty 1.0 for all language pairs. For entity accuracy, we use the entity \(F_{1}\) score as the metric: we use the DeepPavlov NER model to extract the entities from both the reference and the translation, and then calculate the \(F_{1}\) score between them by exact matching. To avoid bias toward DeepPavlov, we also measure entity quality with the Stanford NER tagger (Finkel et al. 2005).
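The entity \(F_{1}\) computation can be sketched as follows: entities extracted from the reference and the hypothesis are matched exactly as strings, with precision and recall micro-averaged over the corpus. This is a sketch; the exact matching granularity used in the official evaluation may differ.

```python
from collections import Counter

def entity_f1(ref_entities, hyp_entities):
    """Micro-averaged F1 over exactly-matched entity strings.
    ref_entities / hyp_entities: one list of entity strings per sentence."""
    tp = fp = fn = 0
    for ref, hyp in zip(ref_entities, hyp_entities):
        ref_c, hyp_c = Counter(ref), Counter(hyp)
        overlap = sum((ref_c & hyp_c).values())  # exact matches
        tp += overlap
        fp += sum(hyp_c.values()) - overlap
        fn += sum(ref_c.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```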

5 Results and discussions

This section is organized as follows. First, we show the relation between entity translation quality and human evaluation, which indicates the importance of entity translation. Then we report model performance on various language pairs and datasets, together with a case study that better demonstrates the effects. Finally, we present a comprehensive study of the effects of entity types, model configuration, NER tools, and decoding loss.

5.1 Entity and human evaluation

We first study how entity translation quality affects human evaluation. We collect 70 translation submissions for 5 language pairs from the WMT19 website,Footnote 3 containing 131k sentences. We also collect the corresponding official human evaluation scores from the WMT19 machine translation challenge report (Bojar et al. 2017) and calculate the Pearson correlation coefficient between the entity quality (in terms of \(F_{1}\) score) and the human evaluation results, shown in Table 3. The results indicate that the entity translation quality measured by DeepPavlov is consistent with human judgments of sentence quality.

Table 3 Correlation between entity \(F_{1}~\) score and human evaluation
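The correlation itself is a standard Pearson coefficient computed over per-system scores; a sketch with scipy, where the arrays are placeholder values for illustration only:

```python
from scipy.stats import pearsonr

# Hypothetical per-system scores, for illustration only.
entity_f1_scores = [52.1, 48.7, 55.3, 50.2]  # entity F1 per submission
human_scores = [71.0, 66.5, 74.2, 69.1]      # official human evaluation scores
r, p_value = pearsonr(entity_f1_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```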

5.2 Translation quality on normal test sets

The results on 10 normal test sets are listed in Tables 4 and 5. For En \(\leftrightarrow \) {De, Zh}, we compare our system with the standard Transformer and with placeholder-based methods (Post et al. 2019), where entities in the data are replaced with special tokens indicating the entity type and index (e.g., “\(\langle \text {PER-0} \rangle \)”, “\(\langle \text {PER-1} \rangle \)”, etc.). To simulate the inference process of these methods, we first build an entity mapping table from the training data for each language pair with DeepPavlov and Fast Align (Dyer et al. 2013), and then replace the entities back either by encoder-decoder attention (denoted PH_Align) or by entity index (denoted PH_Index). We also compare our method with the code-switching method of Song et al. (2019) and the entity tagging method of Li et al. (2018). For En \(\leftrightarrow \) Ja, due to limited computation resources, we only compare with LSTM and the standard Transformer.
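For concreteness, the index-based placeholder scheme (PH_Index) can be sketched as follows; the span format and helper name are illustrative, not the baseline’s actual code:

```python
def mask_entities(tokens, entities):
    """Replace each detected entity span with a typed, indexed placeholder.
    `entities` is a list of (start, end, type) spans from the NER tool."""
    out, counts, last = [], {}, 0
    for start, end, etype in sorted(entities):
        out.extend(tokens[last:start])
        idx = counts.get(etype, 0)
        counts[etype] = idx + 1
        out.append(f"<{etype}-{idx}>")  # e.g. <PER-0>, <PER-1>, <GPE-0>
        last = end
    out.extend(tokens[last:])
    return out

# e.g. mask_entities("Both Alice and Bob live in Washington State .".split(),
#                    [(1, 2, "PER"), (3, 4, "PER"), (6, 8, "GPE")])
# -> ['Both', '<PER-0>', 'and', '<PER-1>', 'live', 'in', '<GPE-0>', '.']
```

After translation, the placeholders are mapped back to target-language entities via the mapping table, by index or by alignment.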

Table 4 Experiment results backboned on transformer_small on En \(\leftrightarrow \) {Zh, De}
Table 5 Experiment results backboned on transformer_base on TED, KFTT and JESC En \(\leftrightarrow \) Ja test sets.

From these tables, we can see that our models achieve improvements in both BLEU score and entity \(F_{1}\) score with the help of end-to-end training of the NMT and NER tasks. Compared with the standard Transformer, the entity \(F_{1}\) score is improved by 0.7 to 4.6 points on the various test sets. For the BLEU score, we achieve at most a 1.7-point improvement, on JESC En\(\rightarrow \)Ja. On the En \(\leftrightarrow \) {De, Zh} data, we additionally use the paired bootstrap resampling method (Koehn 2004) to test statistical significance and report the p-value of the BLEU score comparing our system against the Transformer baseline.Footnote 4 The results suggest that the improvements are statistically significant: the best p-value is 0.001 and the worst is about 0.1. We also report the BLEU scores of the Transformer from previous works, which show that our reproduction of the baseline system is comparable to or stronger than before. Another finding is that the placeholder-based methods hurt both BLEU and entity translation performance. Index-based replacement is usually better than alignment-based replacement but still worse than the baseline. We suspect the reason is the difficulty of obtaining high-quality entity translation pairs without a large amount of human effort.

Compared with the entity tag method, our system yields similar or even better results, despite not having the help of the DeepPavlov NER model at inference time. We also record the relative latency of our method and the other methods against the baseline Transformer, as well as the number of model parameters, for further comparison. Our model incurs only 5.24% of the latency overhead and 2.78% of the parameter overhead of the entity tag method; the additional cost is almost negligible compared with the standard Transformer. These results indicate that our end-to-end method, which saves both time and memory, is more appropriate for the practical deployment of NMT systems.

Table 6 Compare with previous works on De \(\rightarrow \) En

Moreover, in Table 6 we compare our method with previous non-entity methods on De \(\rightarrow \) En, such as the Joint Attention Transformer (Fonollosa et al. 2019), LightConv, and DynamicConv (Wu et al. 2019). Even though these methods also improve the BLEU score, their entity accuracy scores are all below our method’s. This shows that simply improving the general translation quality cannot guarantee an improvement in entity translation quality.

5.3 Translation quality on entity-rich test sets

Fig. 2 Translation results on entity-rich test sets. The values on the horizontal axes are the thresholds on the number of entities per sentence. The scores on the vertical axes are corpus BLEU (left) or entity \(F_{1}\) (right) for the selected sentences

To further assess the entity translation ability of our method, we also test our system on the entity-rich test sets described in Sect. 4. The evaluation results are shown in Fig. 2, where the x-axes give the minimum number of entities per sentence and the y-axes give the BLEU score (left) and entity \(F_{1}\) score (right). The dashed lines are the baseline systems and the solid lines are our method; colors distinguish models and test sets, i.e., orange and red for our method on ER-WMT and ER-IWSLT, respectively, and purple and blue for the baseline Transformer on the same datasets. Our model consistently outperforms the baseline in terms of both BLEU and \(F_{1}\). Although the WMT test set is in the news domain, which differs from the domain of the training data, our method still improves cross-domain translation over the baseline Transformer system, demonstrating that our method generalizes better across domains.

5.4 Case study

To examine how our method improves entity quality, we also conduct human evaluations of translation quality with the source-based direct assessment (DA) method (Bojar et al. 2017) on the Zh \(\rightarrow \) En test set. The detailed human evaluation results are given in the Appendix. Our results outperform the Transformer by 2.1% in terms of the average score judged by human annotators. Some of the cases are illustrated in Table 7.

Table 7 Examples of Zh \(\rightarrow \) En translation

Here “Src” and “Ref” denote the source and reference sentences, and \(\text {H}_{\text {TR}}\) and \(\text {H}_{\text {ET}}\) denote the hypotheses generated by the standard Transformer and the entity tag system; our results are shown in the rows starting with “Ours”. Considering that entity quality is a useful indicator of human evaluation, as suggested in Table 3, promoting entity quality as our method does is an appropriate way to enhance the user experience in practical NMT deployments. By enhancing the named entities, our method can alleviate the following problems:

  • Entity under-translation In the second example, the location “Galapagos” appears in \(\text {H}_{\text {ET}}\) and our output, but it is missing in \(\text {H}_{\text {TR}}\).

  • Entity over-translation In the third example, \(\text {H}_{\text {TR}}\) contains an extra “decades”, and in the fourth example it generates an entity “Apple” for the Apple company, although the source sentence carries no such meaning.

  • Entity error-translation In the first example, the correct name is “M. Scott Peck”, but the baseline systems translate it as “Scott Papuk” and “MM Papker”. For the numbers in the fourth example, the correct translation is “Ninety percent”, but the \(\text {H}_{\text {ET}}\) result is “Ninety-nine percent”. In the fifth example, none of the systems gets the name “Azuri” correct.

From these examples, we can see that all NMT systems still have difficulty guaranteeing that every entity is correctly translated. Fully solving this problem remains challenging, and there is much potential in this research topic.

5.5 Ablation study

To study the importance of the different parts of our system, we conduct an ablation study on the De \(\rightarrow \) En translation task; the results are shown in Table 8. The minus symbol “–” means that we remove the corresponding component from the system, and the indentation level indicates the removal order. As the numbers show, when the components are gradually removed, the BLEU score and entity \(F_{1}\) score become worse, indicating that all parts of the system are necessary for high translation performance.

Table 8 Results of ablation study on De \(\rightarrow \) En

Furthermore, we studied the effect of the weights of the encoder and decoder NER losses, which are controlled by the hyper-parameters \(\alpha \) and \(\beta \), respectively, and analyzed how the hyper-parameter \(\gamma \) affects the translation quality. The experiments are based on the IWSLT’14 De\(\rightarrow \)En dataset, and the results are summarized in Tables 9, 10, and 11.

Table 9 BLEU scores for different \(\alpha \) and \(\beta \)
Table 10 Entity \(F_{1}\) for different \(\alpha \) and \(\beta \)
Table 11 BLEU scores and entity \(F_{1}\) for different \(\gamma \) value on De \(\rightarrow \) En

As can be seen from Tables 9 and 10, models trained with different combinations of \(\alpha \) and \(\beta \) had different BLEU and entity \(F_{1}\) scores, but the overall variance is small, which shows that our method is not sensitive to \(\alpha \) and \(\beta \). From Table 11, we can see that performance degrades when \(\gamma \) is too large or too small; setting \(\gamma = 1\) is a good choice for this task.

5.6 Encoder entity recognition ability

We measure the encoder’s NER ability because high-quality entity translation relies on accurate entity information extracted by the encoder. To this end, we extract the encoder’s NER output on our test sets and compare it with the ground truth extracted by DeepPavlov. Table 12 reports the accuracy over all tokens (TACC), the accuracy over entity tokens only (ETACC), where all ‘\(\texttt {O}\)’ tags are ignored because the labels are imbalanced, and the entity \(F_{1}\) score. X denotes the other language translated from/to English. As the table shows, the encoders of all models have plausible NER ability on their inputs, reaching up to 98.36 TACC and 89.34 ETACC. Therefore, we can remove NER tools during inference, since our encoder can detect entity tokens and their entity types from the source sentences.

Table 12 Results on encoder NER ability
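Given the encoder’s predicted tags and the DeepPavlov tags, TACC and ETACC can be sketched as follows (function and argument names are ours):

```python
def tag_accuracies(pred_tags, gold_tags, o_tag="O"):
    """Token accuracy over all tokens (TACC) and over entity tokens
    only (ETACC), where tokens with gold tag 'O' are ignored."""
    total = correct = ent_total = ent_correct = 0
    for p, g in zip(pred_tags, gold_tags):
        total += 1
        correct += p == g
        if g != o_tag:            # entity tokens only
            ent_total += 1
            ent_correct += p == g
    tacc = correct / total if total else 0.0
    etacc = ent_correct / ent_total if ent_total else 0.0
    return tacc, etacc
```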

5.7 Entity accuracy for different types

To study the translation quality of different entity types, we collect data from the En \(\leftrightarrow \) De and En \(\leftrightarrow \) Zh test sets and sort the types by \(F_{1}\) score. The top and bottom three types for each language pair are shown in Table 13, where a positive rank marks the better-translated types and a negative rank the worse-translated ones.

Geopolitical entities (GPE), e.g., country or city names, are well translated in all language pairs, which may suggest that this type of entity is easier to learn. The “LANGUAGE” entities in En \(\leftrightarrow \) De and the “PERCENT” entities in En \(\leftrightarrow \) Zh also perform well. However, the “PERSON” and “EVENT” entities are not handled well in En \(\leftrightarrow \) Zh. We suspect this is caused by the diversity of human names and the large linguistic difference between English and Chinese. We discuss some cases involving names in Sect. 5.4 and leave further improvements for future study.

Table 13 Top and bottom three entity types in terms of \(F_{1}\) on different language pairs

5.8 Test with other NER tool

Moreover, we measure the entity translation quality with the Stanford NER tagger (Finkel et al. 2005), which detects three entity types for English: PERSON, ORGANIZATION, and LOCATION.Footnote 5 Although its target entity types and detection algorithm differ from those of DeepPavlov NER, we still obtain a one-point improvement in entity \(F_{1}\) score (from 27.00 to 28.06) on the De \(\rightarrow \) En dataset over the standard Transformer. This implies that our method enhances entity translation performance under the evaluation of different NER tools.

5.9 The gap between training and decoding loss

Our method benefits from the entity loss during training, but this loss is removed at decoding time. It is therefore natural to ask whether this gap hurts decoding performance, especially since we use beam search and different loss functions lead to different rankings of the hypotheses. We conduct experiments on the En \(\leftrightarrow \) De dataset with two decoding strategies: translation loss only (denoted NMT) and translation loss plus entity loss (denoted NMT + NE). The results are shown in Table 14.

Table 14 Different decoding loss on De \(\leftrightarrow \) En translation

Decoding with the NMT loss only yields results similar to using both losses. Consequently, we simply decode with the NMT loss in all experiments for efficiency.

6 Conclusions and future work

In this work, we propose a novel system to improve the translation quality of named entities in NMT, which is important for human evaluation but not well handled in previous works. The experimental results on four languages and six translation tasks demonstrate that, by enhancing the encoder and the decoder with NER ability, together with the entity-weighted loss, we can improve both the entity \(F_{1}\) score and the BLEU score. In addition to the quality improvement, our end-to-end inference algorithm keeps one-pass decoding with little extra inference cost. This is the key difference from previous works, which rely on NER models at translation time. It allows us to use high-quality, heavy NER models while remaining essentially cost-free in real-world usage.

In the future, there are many important directions related to this work. First, we will explore how to solve the entity translation disambiguation problem, which is important for improving translation quality. Second, we plan to study how to import external entity information, e.g., a multilingual knowledge graph, to further improve entity translation. Finally, more formal theoretical analysis of using entity information in machine translation is an important direction.