1 Introduction

Neural machine translation (briefly, NMT), which translates a sentence from a source language into a target language with deep neural networks, has made great progress in recent years (Wu et al. 2016; Hassan et al. 2018; Luong et al. 2016). Despite the success of previous works, most of them focus only on improving the general translation quality, where each word contributes equally to the evaluation metrics.

However, not all words in a sentence are equally important, and different words can have different impacts on translation quality as judged by humans. Intuitively, entities are likely to carry the critical information in a sentence and are therefore more important for translation quality (Li et al. 2018; Post et al. 2019; Niehues and Cho 2017); detailed statistics are given in Sect. 5.1. As a simple example, consider two translation systems A and B and a source sentence with ground-truth translation “Both Alice and Bob live in Washington State.” Suppose the translation result of system A is “Both Alice and Bob live in Wisconsin State.”, and that of system B is “Both Alice and Bob live at Washington State.”. Although each system makes a mistake on a single word, we clearly judge system B to be the better one, since it correctly conveys the important information about the location. That is, named entities (e.g., locations, numbers) are important for the user experience.

Therefore, in this work, we focus on improving the translation quality of named entities. Unfortunately, entities are not easy to translate: previous studies (Hassan et al. 2018) have shown that today’s NMT systems do not perform well on entity translation.

In previous work, the general solution is to add extra entity information to the NMT system at both training and inference time. This solution requires multiple steps: first, entities are detected in both the source and target sentences with a named entity recognition (briefly, NER) tool; second, the entities are tagged, e.g., replaced with special placeholders (Wang et al. 2017; Post et al. 2019; Li et al. 2018), wrapped with special tokens that indicate their boundaries (Li et al. 2018; Modrzejewski et al. 2020), rewritten with a code-switching method (Song et al. 2019), or labeled with entity embeddings to enhance the translation (Sennrich and Haddow 2016; Niehues and Cho 2017; Ugawa et al. 2018); finally, NMT models are trained on the processed sentences. At inference time, the input must be processed by the NER tool, translated into the target language, and postprocessed (e.g., replacing placeholders back, removing extra tags) to obtain the final translation.

The accuracy of recognizing named entities greatly affects the translation quality. With the development of pre-training (Devlin et al. 2019; Liu et al. 2019), NER tools have been significantly improved (Burtsev et al. 2018; Luo et al. 2020). However, NER modules built upon pre-trained models are even heavier than the NMT models themselves. For instance, the number of parameters of DeepPavlov (Burtsev et al. 2018), one of the state-of-the-art NER models, is more than ten times that of the Transformer models used for NMT in industry (Kim et al. 2019), because the DeepPavlov models are built on large-scale pre-training. As a result, it is not feasible to directly integrate such a heavy NER module at inference time due to the large overhead.

In this work, we design an end-to-end entity-aware NMT model in which both the encoder and the decoder can serve as named entity recognizers, yet there is no extra cost at inference time. During training, similar to the aforementioned works, we leverage an NER tool to provide entity tags for the source and target sentences in the training corpus. When training the translation model, in addition to the translation loss, we add an NER loss to both the encoder and the decoder so that they learn to correctly recognize the entities tagged by the NER tool. Furthermore, to pay more attention to named entities and differentiate them from other words, we assign the entities in the target sequence larger weights, inspired by the focal loss (Lin et al. 2017). In this way, the NER module and the translation network are closely coupled and collaborate through end-to-end training, boosting the performance of both tasks. The inference process is the same as that of standard NMT and incurs no additional cost. This allows us to use an arbitrarily heavy, high-quality NER model during training without hurting inference efficiency.

To summarize, the main contributions of this work are three-fold:

  • We introduce a novel end-to-end method to improve the translation quality, especially for the accurate translation of named entities in sentences.

  • Compared with previous methods, we keep the one-pass decoding process without depending on a heavy NER model at inference time.

  • Experiments on six translation tasks extensively verify the effectiveness of our method. According to the results, our method improves both the BLEU score and the entity \(F_{1}\) score.

2 Related work

Enhancing NMT systems with external knowledge has been a promising research direction in recent years. For example, in Lu et al. (2018), Zhao et al. (2020), knowledge graphs are incorporated into the machine translation task, and in Zhu et al. (2020), Clinchant et al. (2019), Yang et al. (2020), Shavarani and Sarkar (2021), pre-trained language models are used to enhance translation. In this work, named entity information is leveraged to boost performance. Named entities are an important topic in NLP, and many previous works aim to improve entity translation quality. The common approach is to introduce entity information into the NMT system, so that the translation model can handle the entities in input sentences better with the help of such information. Previous work uses the entity information in different ways, detailed as follows:

Placeholder In this category of methods, entities in the source sentences are masked by placeholders. Wang et al. (2017) use a $TERM token to mask person names, and Post et al. (2019) mask various entity tokens such as numbers, names, cities, and emoji. In Li et al. (2018), entities are masked by type and index, e.g., LOC1, LOC2, etc. After translation, the placeholders are replaced back with target-language entities, either by entity index or by alignment.

Special tokens In Li et al. (2018), Modrzejewski et al. (2020), special tokens are used to indicate the beginning and end of entities in source sentences. For example, “Hyrule” in a source sentence becomes “<LOC>Hyrule</LOC>” after preprocessing, indicating that the word is a location. After translation, the extra tokens are removed from the model output.

Code-switching In Song et al. (2019), the authors apply a code-switching method to entity words: source-side entities are replaced by their corresponding translations in the target language. After such preprocessing, the input to the model is a mixture of the source and target languages, so the NMT model only needs to copy those tokens.

Entity embedding Embedding-based methods are another important direction in previous work. In Sennrich and Haddow (2016), Niehues and Cho (2017), linguistic input features are used to improve model quality, and in Ugawa et al. (2018), an entity embedding is added to the token embedding to enhance the sentence representation.

Despite the success of previous work, the complexity of those methods remains an obstacle to their use in real scenarios. In particular, the extra cost of NER is not negligible and significantly increases decoding latency. Compared with the existing methods, our system has almost zero extra cost at inference time and achieves better performance.

3 Our methods

In this section, we first introduce the notation, then describe the network architecture in Sect. 3.1 and the training strategy in Sect. 3.2.

Notations Let \(X=(X_0, X_1, \cdots , X_{M-1})\) denote a source sequence of length M, and let \(Y=(Y_0, Y_1, \cdots , Y_{N-1})\) denote the corresponding target sequence of length N. \(X_i\) and \(Y_i\) represent the i-th tokens of X and Y, which can be words or subwords (Sennrich et al. 2016) in natural language. Let \(X^{\text {ne}}\) and \(Y^{\text {ne}}\) denote the entity sequences for X and Y, where \(X^{\text {ne}}_i\) and \(Y^{\text {ne}}_i\) are the named entity tags of \(X_i\) and \(Y_i\), respectively. The entities are represented with IOB tagging, where “B” marks the beginning of an entity, “I” marks the tokens inside an entity up to its end, and “O” means that the token is outside any entity.
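For illustration, the introduction’s example sentence would be tagged as follows (a minimal sketch; in practice the IOB prefix is combined with an entity type, and the exact tag strings here are our assumption):

```python
# Hypothetical IOB tags for the example sentence from the introduction.
tokens = ["Both", "Alice", "and", "Bob", "live", "in", "Washington", "State", "."]
tags   = ["O", "B-PERSON", "O", "B-PERSON", "O", "O", "B-GPE", "I-GPE", "O"]
```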

Following the multilingual version of the DeepPavlov NER model,Footnote 1 we have 19 different kinds of entities in total, which constitute a set \(\mathbb {N}\). There is a special token O in \(\mathbb {N}\), which indicates that a token is not part of a named entity. The full list of supported entity types can be found in the Appendix.

3.1 Network architecture

We use Transformer (Vaswani et al. 2017) as the backbone of our model, where the encoder and decoder are modified to be an entity-enhanced version. However, our technique can be easily integrated into other encoder-decoder based models as well. The network architecture is shown in Fig. 1.

Fig. 1 Network architecture

Entity-enhanced encoder Let \(\texttt {enc}\) denote the encoder of the standard Transformer, made up of several stacked blocks, each consisting of a self-attention layer and a feed-forward layer. Given the input X, \(\texttt {enc}\) processes it into hidden representations, mathematically defined as \(H^{\text {src}}=\texttt {enc}(X)\). \(H^{\text {src}}\), the output of the last block of \(\texttt {enc}\), is an \(M\times d\) matrix whose i-th row \(H^{\text {src}}_i\) is the representation of token \(X_i\), where d is the embedding dimension.

After that, the encoder works as follows:

$$\begin{aligned} \begin{aligned} H^{\text {ne}}&= \texttt {ReLU}(H^{\text {src}}W_{\text {ne}}^{src}), \\ H^{\text {enc}}&= H^{\text {src}} + H^{\text {ne}}, \\ \hat{X}^{\text {ne}}&=\texttt {softmax}(H^{\text {ne}}E^{\text {s-ne}}), \end{aligned} \end{aligned}$$
(1)

where \(W_{\text {ne}}^{src}\) is a \(d\times d\) matrix to be learned, \(E^{\text {s-ne}}\) is the entity embedding of the source language with size \(d\times \vert \mathbb {N}\vert \), and \(\hat{X}^{\text {ne}}\) is the matrix of predicted entity tags for X. \(\hat{X}^{\text {ne}}\) is required only during training; we do not need it at inference time.

In Eqn. (1), the representation \(H^{\text {src}}\) is fed into a feed-forward layer to obtain \(H^{\text {ne}}\). After applying an affine transformation and a softmax operation to \(H^{\text {ne}}\), we obtain the predicted entities \(\hat{X}^{\text {ne}}\) of the input sequence X. We minimize the difference between \(\hat{X}^{\text {ne}}\) and \(X^{\text {ne}}\) (i.e., the tags output by the NER tool) so that \(H^{\text {ne}}\) can be regarded as entity features. We add \(H^{\text {src}}\) and \(H^{\text {ne}}\) together as the final output of the encoder and feed it into the decoder. In this way, both the named entity information and the semantic information carried by the natural words are passed to the decoder.
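A minimal PyTorch sketch of the encoder-side entity head in Eqn. (1) is given below; the class and variable names are ours, not from any released code.

```python
import torch
import torch.nn as nn

class EntityEnhancedEncoderHead(nn.Module):
    """Sketch of Eqn. (1): an entity feature branch on top of the
    standard Transformer encoder output H^src (an M x d matrix)."""

    def __init__(self, d_model: int, num_entity_tags: int):
        super().__init__()
        self.w_ne = nn.Linear(d_model, d_model, bias=False)              # W_ne^src
        self.ent_emb = nn.Linear(d_model, num_entity_tags, bias=False)   # E^{s-ne}

    def forward(self, h_src: torch.Tensor):
        h_ne = torch.relu(self.w_ne(h_src))     # entity features H^ne
        h_enc = h_src + h_ne                    # fed to the decoder
        ent_logits = self.ent_emb(h_ne)         # softmax is applied in the loss
        return h_enc, ent_logits
```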

Entity-enhanced decoder Similarly, we define \(\texttt {dec}\) as the decoder of the standard Transformer, which is also made up of a series of blocks. Besides a self-attention layer and a feed-forward layer, each block contains an additional encoder-decoder attention layer, which aggregates the information from the encoder, i.e., \(H^{\text {enc}}\). Let \(Y_{<t}\) denote the sub-sequence \((Y_0,Y_1,\cdots ,Y_{t-1})\), where \(Y_0\) is a special token indicating the beginning of a sentence. The decoder works as follows:

$$\begin{aligned} \begin{aligned} H^{\text {tgt}}_t&=\texttt {dec}(H^{\text {enc}}, Y_{<t}),\; \\ \hat{Y}^{\text {ne}}_t&=\texttt {softmax}(\texttt {ReLU}(H^{\text {tgt}}_t W^{\text {tgt}}_{\text {ne}})E^{\text {t-ne}}),\; \\ \hat{Y}_t&= \texttt {softmax}(H^{\text {tgt}}_tE^{\text {t}}), \end{aligned} \end{aligned}$$
(2)

where \(E^{\text {t-ne}}\) is the entity embedding of the target language, \(E^{\text {t}}\) is the embedding of target words, and \(W_{\text {ne}}^{\text {tgt}}\) is a \(d\times d\) affine matrix. Specifically, we take the representation from the last block of the decoder, transform it into the entity embedding space, and then apply separate softmax operations to obtain the predicted translated token and the predicted entity tag, respectively.

At inference time, we can skip the entity tag prediction and keep only the token prediction, which leads to a decoding cost similar to that of standard NMT methods.
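A matching sketch of the decoder-side heads in Eqn. (2) follows; the `predict_entities` flag mirrors the option to skip the entity head at inference time (names are again ours).

```python
import torch
import torch.nn as nn

class EntityEnhancedDecoderHead(nn.Module):
    """Sketch of Eqn. (2): a token-prediction head plus an optional
    entity-tag head on top of the decoder output H^tgt."""

    def __init__(self, d_model: int, vocab_size: int, num_entity_tags: int):
        super().__init__()
        self.w_ne = nn.Linear(d_model, d_model, bias=False)              # W_ne^tgt
        self.ent_emb = nn.Linear(d_model, num_entity_tags, bias=False)   # E^{t-ne}
        self.tok_emb = nn.Linear(d_model, vocab_size, bias=False)        # E^t

    def forward(self, h_tgt: torch.Tensor, predict_entities: bool = True):
        tok_logits = self.tok_emb(h_tgt)        # token prediction Y_hat
        ent_logits = None
        if predict_entities:                    # skipped at inference time
            ent_logits = self.ent_emb(torch.relu(self.w_ne(h_tgt)))
        return tok_logits, ent_logits
```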

Discussion The key challenge in this task is that no entity labels are available at inference time. To solve this, we adopt a multi-task framework to build the enhanced encoder and decoder, where the primary task is machine translation and the two auxiliary tasks are source-side and target-side named entity detection. Previous work (Niehues and Cho 2017) shows that multi-task learning can improve performance. Therefore, we not only improve the accuracy on named entities but also regularize the training. For the NER classification, we could also leverage the outputs of internal blocks of the encoder and decoder; we empirically verified the effect of these different choices and found no significant difference compared with using the output of the last block.

3.2 Training and inference strategies

Let \(\theta \) denote the parameters of \(\texttt {enc}\), \(\texttt {dec}\) and word embeddings. Let \(\theta _{\text {s-ne}}\) and \(\theta _{\text {t-ne}}\) denote the parameters related to source-side NER and target-side NER.

The training loss consists of the following three parts:

$$\begin{aligned} \begin{aligned}&\ell _{\text {s-ne}} = -\frac{1}{M}\sum _{i=0}^{M-1}\log P(X^{\text {ne}}_i|X;\theta ,\theta _{\text {s-ne}}),\; \\&\ell _{\text {t-ne}} = -\frac{1}{N}\sum _{j=0}^{N-1}\log P(Y^{\text {ne}}_j|X,Y_{<j};\theta ,\theta _{\text {s-ne}},\theta _{\text {t-ne}}),\\&\ell _{\text {mt}} = -\frac{1}{N}\sum _{t=0}^{N-1}(1 + P_{\text {NE},t}^\gamma )\log P(Y_t|Y_{<t},X;\theta ,\theta _{\text {s-ne}}),\, \\&P_{\text {NE},t}=P(Y^{\text {ne}}_t\ne \texttt {O}|X,Y_{<t};\theta ,\theta _{\text {s-ne}},\theta _{\text {t-ne}}), \end{aligned} \end{aligned}$$
(3)

where \(\ell _{\text {s-ne}}\) and \(\ell _{\text {t-ne}}\) are the two named entity recognition losses and \(\ell _{\text {mt}}\) is the translation loss. Because human annotations of source and target entity labels are unavailable, the labels extracted by DeepPavlov are used to compute both \(\ell _{\text {s-ne}}\) and \(\ell _{\text {t-ne}}\). For the translation loss \(\ell _{\text {mt}}\), since we want to emphasize entity tokens, we design an adaptive weighting inspired by the focal loss (Lin et al. 2017): \(P_{\text {NE},t}\) is the probability that \(Y_t\) is an entity token (instead of \(\texttt {O}\)). The more likely a token is an entity, the larger the weight we assign to it. Following Lin et al. (2017), the weight is controlled by a positive hyper-parameter \(\gamma \) for flexibility. The weight of each token is at least one, which stabilizes training.

The final training objective of the entity-enhanced NMT model on a data pair (X, Y) is

$$\begin{aligned} \ell = \ell _{\text {mt}} + \alpha \ell _{\text {s-ne}} + \beta \ell _{\text {t-ne}}, \end{aligned}$$
(4)

where \(\alpha \) and \(\beta \) are hyper-parameters tuned according to validation performance. In practice, the hyper-parameter setting in all our experiments is \(\gamma =1.0,\alpha =0.5,\beta =0.5\).
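The combined objective of Eqns. (3) and (4) can be sketched as follows (batch dimension omitted; tensor shapes, the tag index for \(\texttt {O}\), and the detachment of the weight are our assumptions, since the text does not specify them):

```python
import torch
import torch.nn.functional as F

O_TAG = 0  # assume the non-entity tag "O" has index 0 (an assumption)

def entity_aware_loss(tok_logits, src_ent_logits, tgt_ent_logits,
                      y, x_ne, y_ne, gamma=1.0, alpha=0.5, beta=0.5):
    """Sketch of Eqns. (3)-(4).
    tok_logits: N x |V|; src_ent_logits: M x |N|; tgt_ent_logits: N x |N|;
    y: N target token ids; x_ne, y_ne: entity tag ids from the NER tool."""
    # Source- and target-side NER losses (mean cross-entropy = -1/M sum log P).
    l_s_ne = F.cross_entropy(src_ent_logits, x_ne)
    l_t_ne = F.cross_entropy(tgt_ent_logits, y_ne)

    # Focal-style weight 1 + P_NE,t^gamma, where P_NE,t = P(tag != O).
    # Detaching the weight is our implementation choice; the paper does
    # not say whether gradients flow through it.
    p_ne = 1.0 - F.softmax(tgt_ent_logits, dim=-1)[:, O_TAG]
    weight = 1.0 + p_ne.detach() ** gamma

    nll = F.cross_entropy(tok_logits, y, reduction="none")
    l_mt = (weight * nll).mean()

    return l_mt + alpha * l_s_ne + beta * l_t_ne
```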

At inference time, we are interested only in the translation, so we ignore the named entity recognition modules. Specifically, the aforementioned \(\hat{X}^{\text {ne}}\) and \(\hat{Y}^{\text {ne}}_t\) only affect the training process; our method therefore maintains efficiency at inference.

The NER module in the decoder is likewise disabled, and we only generate the translated sentence.

4 Experiments

Data processing We conduct experiments on the translation of four languages: English, German, Chinese, and Japanese, denoted En, De, Zh, and Ja, respectively. This covers both a linguistically close language pair, En\(\leftrightarrow \)De, and more distant pairs such as En\(\leftrightarrow \)Zh and En\(\leftrightarrow \)Ja. We follow Ott et al. (2019) to process the data for IWSLT’14 En\(\leftrightarrow \)De, where all words are lowercased and tokenized. We follow Zhu et al. (2020) to process the data for IWSLT’17 En\(\leftrightarrow \)Zh. For En\(\leftrightarrow \)Ja, we follow Michel and Neubig (2018), Wang et al. (2019) and combine the training sets of KFTT, JESC, and TED talks (Neubig 2011; Pryzant et al. 2018; Cettolo et al. 2012), testing on the corresponding test sets separately. For En\(\leftrightarrow \)Zh, we use the Moses and Jieba tokenizers, respectively, and then apply BPE to split words into subwords. For Ja, we use SentencePiece directly. Detailed information is given in Table 1, and the URLs of the data and tools are in the Appendix.

Table 1 Dataset statistics

In practice, the tokenizer used by the NER tool differs from the one used in NMT preprocessing. To solve this problem, we leverage the fact that tokenization only affects the non-space characters: we align the entity tags with the NMT data by character overlap and adjust the IOB notation accordingly. An example is shown in Table 2, where the location entity “Winterfell” is split into three parts by BPE and then assigned tags accordingly.

Table 2 Example of subword entity tags assignment
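A simplified sketch of the tag projection illustrated in Table 2 follows. It assumes word boundaries already match and uses the common `@@` BPE continuation marker; the paper’s actual alignment works by character overlap across different tokenizers, and the helper name is ours.

```python
def align_tags_to_subwords(subwords, word_tags):
    """Project word-level IOB tags onto BPE subwords, where subwords
    use the '@@' continuation marker, e.g. ['Win@@', 'ter@@', 'fell']."""
    sub_tags, word_idx, continuing = [], 0, False
    for piece in subwords:
        tag = word_tags[word_idx]
        if continuing and tag.startswith("B-"):
            tag = "I-" + tag[2:]   # only the first piece keeps the B- prefix
        sub_tags.append(tag)
        continuing = piece.endswith("@@")
        if not continuing:
            word_idx += 1          # move on once the word is complete
    return sub_tags

# e.g. align_tags_to_subwords(['Win@@', 'ter@@', 'fell', 'is', 'north'],
#                             ['B-LOC', 'O', 'O'])
# -> ['B-LOC', 'I-LOC', 'I-LOC', 'O', 'O']
```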

Entity-rich test sets To better evaluate the performance of our method, we build two extra entity-rich test sets, where ER is short for entity-rich: (1) ER-IWSLT: for En\(\rightarrow \)De, we concatenate the IWSLT test sets from 2010 to 2017 and the IWSLT-10 validation sets into one larger set. (2) ER-WMT: for En\(\rightarrow \)De, we concatenate the WMT test sets from 2014 to 2019. We then filter the sentences with different thresholds on the number of entities per sentence on the English side and report the corresponding scores.
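The filtering step is a simple threshold; a sketch, with names of our choosing:

```python
def filter_entity_rich(sent_pairs, en_entity_counts, min_entities):
    """Keep pairs whose English side has at least `min_entities` entities;
    the per-sentence counts come from the NER tool."""
    return [pair for pair, n in zip(sent_pairs, en_entity_counts)
            if n >= min_entities]
```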

Configuration The backbone models consist of six layers in both the encoder and the decoder. In the transformer_small configuration, the embedding dimension, feed-forward layer dimension, and dropout rate are 256, 1024, and 0.3; in the transformer_base setting, they are 512, 2048, and 0.1, respectively. Following Vaswani et al. (2017), all models are trained with learning rate \(5\times 10^{-4}\) using the Adam optimizer (Kingma and Ba 2015) with the inverse_sqrt learning rate scheduler (Vaswani et al. 2017) and 4096 tokens per GPU. The transformer_small models are trained on a single P40 GPU, while the transformer_base models are trained on 4 P40 GPUs.

Evaluation We evaluate both translation quality and entity accuracy. For En \(\leftrightarrow \) De, we use the multi-bleu.perl scriptFootnote 2 to compute the BLEU score, for fair comparison with previous works. For the other language pairs, we use sacreBLEU (Post 2018). We use beam size 5 and length penalty 1.0 for all language pairs. For entity accuracy, we use the entity \(F_{1}\) score as the metric: we use the DeepPavlov NER model to extract the entities from both the reference and the translation, and then calculate the \(F_{1}\) score between them by exact matching. To avoid bias toward DeepPavlov, we also measure entity quality with the Stanford NER tagger (Finkel et al. 2005).
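The entity \(F_{1}\) computation can be sketched as follows: entities extracted from the reference and the hypothesis are matched exactly as strings, with precision and recall micro-averaged over the corpus. This is a sketch; the exact matching granularity used in the official evaluation may differ.

```python
from collections import Counter

def entity_f1(ref_entities, hyp_entities):
    """Micro-averaged F1 over exactly-matched entity strings.
    ref_entities / hyp_entities: one list of entity strings per sentence."""
    tp = fp = fn = 0
    for ref, hyp in zip(ref_entities, hyp_entities):
        ref_c, hyp_c = Counter(ref), Counter(hyp)
        overlap = sum((ref_c & hyp_c).values())  # exact matches
        tp += overlap
        fp += sum(hyp_c.values()) - overlap
        fn += sum(ref_c.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```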

5 Results and discussions

This section is organized as follows. First, we show the relation between entity translation quality and human evaluation, which indicates the importance of entity translation. Then we report model performance on various language pairs and datasets, together with a case study that better demonstrates the effects. Finally, we present a comprehensive study of the effects of entity types, model configuration, NER tools, and decoding loss.

5.1 Entity and human evaluation

We first study how entity translation quality affects human evaluation. We collect 70 translation submissions for 5 language pairs from the WMT19 website,Footnote 3 containing 131k sentences. We also collect the corresponding official human evaluation scores from the WMT19 machine translation challenge report (Bojar et al. 2017) and calculate the Pearson correlation coefficient between the entity quality (in terms of \(F_{1}\) score) and the human evaluation results, shown in Table 3. The results indicate that the entity translation quality measured by DeepPavlov is consistent with human judgments of sentence quality.

Table 3 Correlation between entity \(F_{1}~\) score and human evaluation
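The correlation itself is a standard Pearson coefficient computed over per-system scores; a sketch with scipy, where the arrays are placeholder values for illustration only:

```python
from scipy.stats import pearsonr

# Hypothetical per-system scores, for illustration only.
entity_f1_scores = [52.1, 48.7, 55.3, 50.2]  # entity F1 per submission
human_scores = [71.0, 66.5, 74.2, 69.1]      # official human evaluation scores
r, p_value = pearsonr(entity_f1_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```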

5.2 Translation quality on normal test sets

The results on 10 normal test sets are listed in Tables 4 and 5. For En \(\leftrightarrow \) {De, Zh}, we compare our system with the standard Transformer and with placeholder-based methods (Post et al. 2019), where entities in the data are replaced with special tokens indicating the entity type and index (e.g., “\(\langle \text {PER-0} \rangle \)”, “\(\langle \text {PER-1} \rangle \)”, etc.). To simulate the inference process of these methods, we first build an entity mapping table from the training data for each language pair with DeepPavlov and Fast Align (Dyer et al. 2013), and then replace the entities back either by encoder-decoder attention (denoted PH_Align) or by entity index (denoted PH_Index). We also compare our method with the code-switching method of Song et al. (2019) and the entity tagging method of Li et al. (2018). For En \(\leftrightarrow \) Ja, due to limited computation resources, we only compare with LSTM and the standard Transformer.
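For concreteness, the index-based placeholder scheme (PH_Index) can be sketched as follows; the span format and helper name are illustrative, not the baseline’s actual code:

```python
def mask_entities(tokens, entities):
    """Replace each detected entity span with a typed, indexed placeholder.
    `entities` is a list of (start, end, type) spans from the NER tool."""
    out, counts, last = [], {}, 0
    for start, end, etype in sorted(entities):
        out.extend(tokens[last:start])
        idx = counts.get(etype, 0)
        counts[etype] = idx + 1
        out.append(f"<{etype}-{idx}>")  # e.g. <PER-0>, <PER-1>, <GPE-0>
        last = end
    out.extend(tokens[last:])
    return out

# e.g. mask_entities("Both Alice and Bob live in Washington State .".split(),
#                    [(1, 2, "PER"), (3, 4, "PER"), (6, 8, "GPE")])
# -> ['Both', '<PER-0>', 'and', '<PER-1>', 'live', 'in', '<GPE-0>', '.']
```

After translation, the placeholders are mapped back to target-language entities via the mapping table, by index or by alignment.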

Table 4 Experiment results backboned on transformer_small on En \(\leftrightarrow \) {Zh, De}
Table 5 Experiment results backboned on transformer_base on TED, KFTT and JESC En \(\leftrightarrow \) Ja test sets.

From these tables, we can see that our models achieve improvements in both BLEU score and entity \(F_{1}\) score with the help of end-to-end training of the NMT and NER tasks. Compared with the standard Transformer, the entity \(F_{1}\) score is improved by 0.7 to 4.6 points on the various test sets. For the BLEU score, we achieve at most a 1.7-point improvement, on JESC En\(\rightarrow \)Ja. On the En \(\leftrightarrow \) {De, Zh} data, we additionally use the paired bootstrap resampling method (Koehn 2004) to test statistical significance and report the p-value of the BLEU score comparing our system against the Transformer baseline.Footnote 4 The results suggest that the improvements are statistically significant: the best p-value is 0.001 and the worst is about 0.1. We also report the BLEU scores of the Transformer from previous works, which show that our reproduction of the baseline system is comparable to or stronger than before. Another finding is that the placeholder-based methods hurt both BLEU and entity translation performance. Index-based replacement is usually better than alignment-based replacement but still worse than the baseline. We suspect the reason is the difficulty of obtaining high-quality entity translation pairs without a large amount of human effort.

Compared with the entity tag method, our system yields similar or even better results, despite not having the help of the DeepPavlov NER model at inference time. We also record the relative latency of our method and the other methods against the baseline Transformer, as well as the number of model parameters, for further comparison. Our model incurs only 5.24% of the latency overhead and 2.78% of the parameter overhead of the entity tag method; the additional cost is almost negligible compared with the standard Transformer. These results indicate that our end-to-end method, which saves both time and memory, is more appropriate for the practical deployment of NMT systems.

Table 6 Compare with previous works on De \(\rightarrow \) En

Moreover, in Table 6 we compare our method with previous non-entity methods on De \(\rightarrow \) En, such as the Joint Attention Transformer (Fonollosa et al. 2019), LightConv, and DynamicConv (Wu et al. 2019). Even though these methods also improve the BLEU score, their entity accuracy scores are all below our method’s. This shows that simply improving the general translation quality cannot guarantee an improvement in entity translation quality.

5.3 Translation quality on entity-rich test sets

Fig. 2 Translation results on entity-rich test sets. The values on the horizontal axes are the thresholds on the number of entities per sentence. The scores on the vertical axes are corpus BLEU (left) or entity \(F_{1}\) (right) for the selected sentences

To further assess the entity translation ability of our method, we also test our system on the entity-rich test sets described in Sect. 4. The evaluation results are shown in Fig. 2, where the x-axes give the minimum number of entities per sentence and the y-axes give the BLEU score (left) and entity \(F_{1}\) score (right). The dashed lines are the baseline systems and the solid lines are our method; colors distinguish models and test sets, i.e., orange and red for our method on ER-WMT and ER-IWSLT, respectively, and purple and blue for the baseline Transformer on the same datasets. Our model consistently outperforms the baseline in terms of both BLEU and \(F_{1}\). Although the WMT test set is in the news domain, which differs from the domain of the training data, our method still improves cross-domain translation over the baseline Transformer system, demonstrating that our method generalizes better across domains.

5.4 Case study

To examine how our method improves entity quality, we also conduct human evaluations of translation quality with the source-based direct assessment (DA) method (Bojar et al. 2017) on the Zh \(\rightarrow \) En test set. The detailed human evaluation results are given in the Appendix. Our results outperform the Transformer by 2.1% in terms of the average score judged by human annotators. Some of the cases are illustrated in Table 7.

Table 7 Examples of Zh \(\rightarrow \) En translation

Here “Src” and “Ref” denote the source and reference sentences, and \(\text {H}_{\text {TR}}\) and \(\text {H}_{\text {ET}}\) denote the hypotheses generated by the standard Transformer and the entity tag system; our results are shown in the rows starting with “Ours”. Considering that entity quality is a useful indicator of human evaluation, as suggested in Table 3, promoting entity quality as our method does is an appropriate way to enhance the user experience in practical NMT deployments. By enhancing the named entities, our method can alleviate the following problems:

  • Entity under-translation In the second example, the location “Galapagos” appears in \(\text {H}_{\text {ET}}\) and our output, but it is missing in \(\text {H}_{\text {TR}}\).

  • Entity over-translation In the third example, \(\text {H}_{\text {TR}}\) contains an extra “decades”, and in the fourth example it generates an entity “Apple” for the Apple company, although the source sentence carries no such meaning.

  • Entity error-translation In the first example, the correct name is “M. Scott Peck”, but the baseline systems translate it as “Scott Papuk” and “MM Papker”. For the numbers in the fourth example, the correct translation is “Ninety percent”, but the \(\text {H}_{\text {ET}}\) result is “Ninety-nine percent”. In the fifth example, none of the systems gets the name “Azuri” correct.

From these examples, we can see that all NMT systems still have difficulty guaranteeing that every entity is correctly translated. Fully solving this problem remains challenging, and there is much potential in this research topic.

5.5 Ablation study

To study the importance of the different parts of our system, we conduct an ablation study on the De \(\rightarrow \) En translation task; the results are shown in Table 8. The minus symbol “–” means that we remove the corresponding component from the system, and the indentation level indicates the removal order. As the numbers show, when the components are gradually removed, the BLEU score and entity \(F_{1}\) score become worse, indicating that all parts of the system are necessary for high translation performance.

Table 8 Results of ablation study on De \(\rightarrow \) En

Furthermore, we studied the effect of the weights of the encoder and decoder NER losses, which are controlled by the hyper-parameters \(\alpha \) and \(\beta \), respectively, and analyzed how the hyper-parameter \(\gamma \) affects the translation quality. The experiments are based on the IWSLT’14 De\(\rightarrow \)En dataset, and the results are summarized in Tables 9, 10, and 11.

Table 9 BLEU scores for different \(\alpha \) and \(\beta \)
Table 10 Entity \(F_{1}\) for different \(\alpha \) and \(\beta \)
Table 11 BLEU scores and entity \(F_{1}\) for different \(\gamma \) value on De \(\rightarrow \) En

As can be seen from Tables 9 and 10, models trained with different combinations of \(\alpha \) and \(\beta \) had different BLEU and entity \(F_{1}\) scores, but the overall variance is small, which shows that our method is not sensitive to \(\alpha \) and \(\beta \). From Table 11, we can see that performance degrades when \(\gamma \) is too large or too small; setting \(\gamma = 1\) is a good choice for this task.

5.6 Encoder entity recognition ability

We measure the encoder’s NER ability because high-quality entity translation relies on accurate entity information extracted by the encoder. To this end, we extract the encoder’s NER output on our test sets and compare it with the ground truth extracted by DeepPavlov. Table 12 reports the accuracy over all tokens (TACC), the accuracy over entity tokens only (ETACC), where all ‘\(\texttt {O}\)’ tags are ignored because the labels are imbalanced, and the entity \(F_{1}\) score. X denotes the other language translated from/to English. As the table shows, the encoders of all models have plausible NER ability on their inputs, reaching up to 98.36 TACC and 89.34 ETACC. Therefore, we can remove NER tools during inference, since our encoder can detect entity tokens and their entity types from the source sentences.

Table 12 Results on encoder NER ability
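Given the encoder’s predicted tags and the DeepPavlov tags, TACC and ETACC can be sketched as follows (function and argument names are ours):

```python
def tag_accuracies(pred_tags, gold_tags, o_tag="O"):
    """Token accuracy over all tokens (TACC) and over entity tokens
    only (ETACC), where tokens with gold tag 'O' are ignored."""
    total = correct = ent_total = ent_correct = 0
    for p, g in zip(pred_tags, gold_tags):
        total += 1
        correct += p == g
        if g != o_tag:            # entity tokens only
            ent_total += 1
            ent_correct += p == g
    tacc = correct / total if total else 0.0
    etacc = ent_correct / ent_total if ent_total else 0.0
    return tacc, etacc
```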

5.7 Entity accuracy for different types

To study the translation quality of different entity types, we collect data from the En \(\leftrightarrow \) De and En \(\leftrightarrow \) Zh test sets and sort the types by \(F_{1}\) score. The top and bottom three types for each language pair are shown in Table 13, where a positive rank marks the better-translated types and a negative rank the worse-translated ones.

Geopolitical entities (GPE), e.g., country or city names, are well translated in all language pairs, which may suggest that this type of entity is easier to learn. The “LANGUAGE” entities in En \(\leftrightarrow \) De and the “PERCENT” entities in En \(\leftrightarrow \) Zh also perform well. However, the “PERSON” and “EVENT” entities are not handled well in En \(\leftrightarrow \) Zh. We suspect this is caused by the diversity of human names and the large linguistic difference between English and Chinese. We discuss some cases involving names in Sect. 5.4 and leave further improvements for future study.

Table 13 Top and bottom three entity types in terms of \(F_{1}\) on different language pairs

5.8 Test with other NER tool

Moreover, we measure the entity translation quality with the Stanford NER tagger (Finkel et al. 2005), which detects three entity types for English: PERSON, ORGANIZATION, and LOCATION.Footnote 5 Although its target entity types and detection algorithm differ from those of DeepPavlov NER, we still obtain a one-point improvement in entity \(F_{1}\) score (from 27.00 to 28.06) on the De \(\rightarrow \) En dataset over the standard Transformer. This implies that our method enhances entity translation performance under the evaluation of different NER tools.

5.9 The gap between training and decoding loss

Our method benefits from the entity loss during training, but this loss is removed at decoding time. It is therefore natural to ask whether this gap hurts decoding performance, especially since we use beam search and different loss functions lead to different rankings of the hypotheses. We conduct experiments on the En \(\leftrightarrow \) De dataset with two decoding strategies: translation loss only (denoted NMT) and translation loss plus entity loss (denoted NMT + NE). The results are shown in Table 14.

Table 14 Different decoding loss on De \(\leftrightarrow \) En translation

Decoding with the NMT loss only yields results similar to using both losses. Consequently, we simply decode with the NMT loss in all experiments for efficiency.

6 Conclusions and future work

In this work, we propose a novel system to improve the translation quality of named entities in NMT, which is important for human evaluation but not well handled in previous works. The experimental results on four languages and six translation tasks demonstrate that, by enhancing the encoder and the decoder with NER ability, together with the entity-weighted loss, we can improve both the entity \(F_{1}\) score and the BLEU score. In addition to the quality improvement, our end-to-end inference algorithm keeps one-pass decoding with little extra inference cost. This is the key difference from previous works, which rely on NER models at translation time. It allows us to use high-quality, heavy NER models while remaining essentially cost-free in real-world usage.

In the future, there are many important directions related to this work. First, we will explore how to solve the entity translation disambiguation problem, which is important for improving translation quality. Second, we plan to study how to import external entity information, e.g., a multilingual knowledge graph, to further improve entity translation. Finally, more formal theoretical analysis of using entity information in machine translation is an important direction.