1 Introduction

Natural language processing (NLP) models are crucial for numerous AI-related applications, including sentiment analysis (Li et al., 2021; Xue et al., 2022), knowledge tracing (Song et al., 2021, 2022), question answering (Berant et al., 2013), and machine translation (Luong et al., 2017; Dzmitry Bahdanau & Bengio, 2015). These models exploit contextual information in textual sequences, which makes them vulnerable to text perturbation. Among NLP tasks, neural machine translation (NMT) is particularly sensitive to adversarial examples, such as malicious tampering and input typos, because the sequence-to-sequence mapping relies on both the accuracy of individual word translations and the contextual correlations within a sentence. As a practical application broadly deployed for commercial purposes, NMT is therefore expected to be robust against adversarial attacks, which motivates the study of NMT-targeted attacks.

Existing attack methods on NLP models can generally be divided into character-level and word-level attacks. Character-level attacks, which manipulate informative letters within a word to feed the victim NLP model incorrectly spelled examples, have been explored and proven effective in both white-box and black-box settings (Belinkov & Bisk, 2017; Ebrahimi et al., 2018). However, character-level attacks can be easily defended against by spelling auto-correction methods. In contrast, word-level attack methods hold that an adversary should locate the vulnerable words and manipulate them, for instance by swapping, inserting, deleting, or substituting, to deceive the NLP models (Cheng et al., 2019; Alzantot et al., 2018). However, word-level attacks on NLP models usually face a trade-off in which the attacking performance depends on the number of perturbed words (Michel et al., 2019). Despite constant efforts to improve NLP attack methods, striking a balance between the number of perturbed words and attack effectiveness remains challenging in existing work, and this is particularly true for attacks on NMTs, which have not been well studied in the literature.

To this end, we argue that it is necessary to have an attack method that maximizes the attacking performance without having to increase the number of word perturbations. Therefore, we propose an attack strategy with a two-step approach: (1) a hybrid attention attack strategy to locate the top vulnerable words (i.e., victim words). This strategy combines two types of attention weights: a language-specific attention that examines the correlation of words between the source and target languages, and a sequence-centered self-attention that focuses on the language understanding of the source sentence itself. (2) a pre-trained Masked Language Model (MLM) to make semantic-aware substitutions for the victim words discovered in (1), ensuring that the generated adversarial examples are semantically correct. With the proposed strategy, we can mount high-quality word-level attacks on NMTs with only a small number of perturbations.

Specifically, the main contributions of this paper are as follows:

  • We propose a novel Hybrid Attentive Attack (HAA) method which identifies the most influential words in an input sequence based on language-specific and sequence-centered attentions.

  • We introduce a semantic-aware word substitution strategy for the proposed HAA method to strike a balance between attack effectiveness and imperceptibility.

  • We conduct extensive experiments on real-world datasets with three state-of-the-art victim NMTs. Experimental results demonstrate that our proposed method achieves the best performance with a small number of perturbed words.

2 Related Work

In this section, we introduce previous work on textual attacks against NLP models, attention mechanisms, and BERT variants.

2.1 Word-level attacks to NLP models

Word-level attacks pose non-trivial threats to NLP models by locating victim words and manipulating them for targeted or untargeted purposes. With the help of an adapted FGSM (Goodfellow et al., 2015), Papernot (2016) was the first to generate word-level adversarial examples against classifiers: they replaced randomly chosen words and found substitutions with the help of the gradient to pose an adversarial threat. Notably, although textual data is naturally discrete, many gradient-based victim-word selection methods are inherited from computer vision (Chivukula & Liu, 2018; Yin et al., 2018; Chivukula & Liu, 2017), which leaves locating victim words a challenging problem (Yang et al., 2020, 2021). Many existing methods randomly select the victim words, without considering gradient or contextual information, and focus on word manipulations (Zang et al., 2020; Cheng et al., 2020; Wang et al., 2021), whereas Liang argues that the selection of victim words is also important. To make this concrete, they performed a white-box attack and proposed the concepts of Hot Training Phrase (HTP) and Hot Sample Phrase (HSP) to select victim words with the help of backpropagated cost gradients (Liang et al., 2017). For a more practical black-box setting, Gao (2018) proposed a gradient-free criterion for locating victim words to attack classifiers, greedily searching for the word with the highest score on the criterion. Furthermore, Li (2020) defined a score function based on the logits from BERT (Devlin et al., 2019) to select victim words, and then substituted them with BERT to attack BERT-based downstream tasks.

NMT, a type of NLP model, is an approach to machine translation that uses deep learning techniques to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model (Kalchbrenner & Blunsom, 2013). Since NMT is based on deep learning techniques and is widely used for commercial purposes, a growing number of researchers are concerned that the security and fairness of NMT can be abused. Attacks on NMT were first introduced by Belinkov (2017), who worked with character-based neural machine translation and attacked NMT with natural typos without assuming access to any gradients. In addition to the attacks, they explored two approaches to increase model robustness: structure-invariant word representations and robust training on noisy texts. Ebrahimi (2018) provided white-box and black-box attack techniques and showed that white-box attacks are more damaging than black-box attacks, while the black-box setting is more practical. For the white-box attack, they tried to mute or push a particular word in a translation task using gradient-based optimization; for the black-box attack, they simply picked a character at random and modified it. Different from these two pioneers, Cheng (2019) proposed a gradient-based white-box attack technique called AdvGen to attack NMT at the sentence level. Guided by the training loss, they used a greedy-choice-based approach to find the best solution, and they used the resulting adversarial examples both for attack generation and for improving the robustness and security of the model. Michel (2019) also worked on textual white-box attacks on NMTs at the sentence level and proposed a natural criterion for untargeted attacks: adversarial examples should be meaning-preserving on the source side but meaning-destroying on the target side. They used the gradients of the model to replace one word in the sentence so as to maximize the loss, and used kNN to determine the top 10 words most similar to the victim word in order to preserve the semantics. Besides, it has also been proposed to attack NMTs via data poisoning (i.e., changing the training data) (Xu et al., 2021).

The aforementioned attacking strategies operate at the character and sentence levels, and they each have drawbacks, such as being detectable by word-correction systems or being too perceptible to human eyes. Different from these pioneers who attacked NMT at the character and sentence levels, Tan (2020) proposed to attack NMT at the word level under a black-box setting. They applied BLEU as a score function to locate victim words by measuring the difference between the original sentence and the sentence with the target word replaced by a special token, and then replaced these victim words with synonyms.

2.2 Attention in NMT

Attention was first derived from human intuition about human activities and was later adapted to machine translation for automatic token alignment (Hu, 2019). The attention mechanism, a simple method for encoding sequence data based on an importance score assigned to each element, has been widely applied to, and has attained significant improvements on, various natural language processing tasks, including sentiment classification, text summarization, question answering, and dependency parsing. In this section, we introduce related work on attention mechanisms in NLP.

Early neural machine translation models (Kalchbrenner & Blunsom, 2013) are built on an encoder-decoder architecture in which both components are recurrent neural networks. An input sequence of source tokens is first fed into the encoder, which transforms the tokens into hidden representations; the decoder then uses these hidden representations as its initial input and outputs a sequence of dependent tokens. Such an encoder-decoder framework achieved higher performance than purely statistical machine translation models. However, the architecture suffers from two serious drawbacks. First, RNNs are forgetful: old information is washed out after being propagated over multiple time steps. Second, there is no explicit word alignment during decoding, so the focus is scattered across the entire sequence. To this end, the concept of attention was first introduced for an encoder-decoder structured NMT by Bahdanau (2015) and has since become popular in the NMT community as an essential component of sequence-to-sequence models. Bahdanau's attention mechanism models word alignments between input and output sequences, an essential aspect of structured output tasks such as translation and text summarization. Building on Bahdanau's attention, Luong (2015) proposed two attention models, global and local, in the context of machine translation. The global attention model is similar to Bahdanau's attention, while the local attention is computed over a window of hidden states from the encoder output. Luong's attention achieved better performance than Bahdanau's attention and provided a way of making NMTs more transparent.

Recurrent architectures rely on sequential processing of the input at the encoding step, which results in computational inefficiency because the processing cannot be parallelized (Vaswani et al., 2017). To address this, Vaswani proposed the Transformer architecture, which eliminates sequential processing and recurrent connections. Transformer-based architectures, which are primarily used for language understanding tasks, avoid recurrent structures in neural networks and instead rely entirely on self-attention mechanisms to draw global dependencies between inputs and outputs. More specifically, the Transformer views the encoded representation of the input as a set of key-value pairs (K, V), whose number equals the input sequence length. For the decoder, the previous output is compressed into a query Q, and the next output is produced by mapping this query against the set of keys and values. Building on Bahdanau's and Luong's attention, the Transformer adopts scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot product of the query with all the keys:

$$\begin{aligned} {\text {Attention}}({\mathbf {Q}}, {\mathbf {K}}, {\mathbf {V}})={\text {softmax}}\left( \frac{{\mathbf {Q}} {\mathbf {K}}^{\top }}{\sqrt{d_k}}\right) {\mathbf {V}}, \end{aligned}$$

where \(d_k\) is the dimension of the key vectors.

The Transformer architecture achieves significant parallel processing, shorter training time, and higher accuracy for machine translation without any recurrent component. Besides, self-attention provides correlations among the contextual words for NLP models, which we will utilize in our proposed algorithm.
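To make the computation concrete, the following minimal NumPy sketch implements the scaled dot-product attention above; the array shapes and variable names are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays Q, K, V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # raw matching scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V, weights

Q = np.random.randn(4, 8)    # 4 query positions, d_k = 8
K = np.random.randn(6, 8)    # 6 key positions, d_k = 8
V = np.random.randn(6, 16)   # 6 value positions, d_v = 16
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)  # (4, 16) (4, 6)
```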

2.3 BERT and its variations

BERT and its descendants have spread into diverse domains over time. A descendant of the Transformer architecture, BERT is a Bidirectional Encoder Representation trained with two unsupervised tasks: masked language modeling and next sentence prediction. BERT models are heavily pre-trained on millions to billions of unannotated texts, which allows us to fine-tune the model on custom tasks and specific datasets through transfer learning. Owing to its model structure and large training data, BERT achieved state-of-the-art results on many NLP benchmarks such as GLUE (Wang et al., 2018), SQuAD v1.1 (Rajpurkar et al., 2016), SQuAD v2.0 (Rajpurkar et al., 2018), and SWAG (Zellers et al., 2018). In addition to its performance in language understanding, BERT has also become a ground-breaking framework for many natural language processing tasks such as sentiment analysis, sentence prediction, abstractive summarization, question answering, natural language inference, and more. BERT has various model configurations: BERT-Base, the most basic model with 12 encoder layers, and BERT-Large, with additional layers.

Over time, many new models have been inspired by the BERT architecture but trained on different languages or optimized on domain-specific datasets. One well-known BERT variant is RoBERTa (Liu et al., 2019), a Robustly Optimized BERT Pretraining Approach, which was developed to enhance the pre-training phase. RoBERTa was obtained by training the BERT model longer, on more data with longer sequences and larger mini-batches. With this setting, RoBERTa achieved substantially improved results through modifications of BERT's hyper-parameters. In addition, RoBERTa removes the next sentence prediction (NSP) objective and uses dynamic word masking.

A Lite BERT (ALBERT) (Lan et al., 2020) is another well-known BERT variant. It was proposed to improve the training and results of the BERT architecture by using parameter-sharing and factorization techniques to reduce the number of parameters. A BERT model contains millions of parameters; BERT-Base alone holds about 110 million, which makes it hard to train and computationally expensive. To overcome these challenges, ALBERT was introduced with far fewer parameters than BERT.

3 Methodology

In this section, we first introduce and formulate the attention mechanism in NMT. Then, we elaborate on the proposed two-step attentive adversarial attack on NMTs, which features attentive word location and semantic-aware word substitution. Specifically, we first calculate the Hybrid Attention weights, consisting of the language-specific translation attention and the sequence-centered self-attention, to locate the sensitive words. Then, we find replacement words using custom-designed selection steps to ensure grammatical correctness and semantic preservation.

3.1 Attentions in NMT

Bahdanau (2015) proposed the attention mechanism to help with word alignments, especially for long sentences. We argue that such an attention mechanism reflects the contribution of each input word to the translated result; therefore, a small perturbation to the most contributing word can heavily influence the translation. The attention model adopts an encoder-decoder framework: at each decoding step j, it computes an attention score \(\alpha _{ji}\) over the hidden representation \({\varvec{h}}_i\) of each input token i, formulated as follows:

$$\begin{aligned} e_{j i}=&a\left( {\varvec{s}}_{j-1}, {\varvec{h}}_{i}\right) \end{aligned}$$
(1)
$$\begin{aligned} \alpha _{j i}=&\frac{\exp \left( e_{j i}\right) }{\sum _{k=1}^{T} \exp \left( e_{j k}\right) }\end{aligned}$$
(2)
$$\begin{aligned} c_{j}=&\sum _{i=1}^{T} \alpha _{j i} {\varvec{h}}_{i}, \end{aligned}$$
(3)

where \(e_{ji}\) is the output of an alignment model a, usually a feed-forward neural network, \({\varvec{s}}_{j-1}\) is the decoder RNN hidden state from the previous decoding step, and \({\varvec{h}}_i\) is the encoder hidden state of the i-th input token. Using \(e_{ji}\), one can score how well the input around position i matches the output at position j. \(c_{j}\) is the context vector that encodes the input sequence with respect to the current decoding step and is used to produce the output sequence (\(\mathrm {y}_{\mathrm {1}}\), \(\mathrm {y}_{\mathrm {2}}\), ..., \(\mathrm {y}_{\mathrm {t}}\)), where \(\mathrm {y}_{\mathrm {t}}\) is the t-th output token. A diagram of this attention model is shown in Fig. 1.
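For illustration, the following minimal NumPy sketch computes Eqs. (1)-(3) for a single decoding step j, with the alignment model a realized as a small additive (feed-forward) scorer; the parameters, shapes and names here are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import numpy as np

T, d = 5, 8                          # source length and hidden size (illustrative)
H = np.random.randn(T, d)            # encoder hidden states h_1 .. h_T
s_prev = np.random.randn(d)          # previous decoder state s_{j-1}
W_h, W_s = np.random.randn(d, d), np.random.randn(d, d)
v = np.random.randn(d)               # parameters of the alignment model a

e = np.tanh(H @ W_h + s_prev @ W_s) @ v   # Eq. (1): scores e_{ji} for i = 1..T
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                      # Eq. (2): softmax over source positions
c = alpha @ H                             # Eq. (3): context vector c_j
print(alpha.round(3), c.shape)
```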

Self-attention (Vaswani et al., 2017) can be applied to many NLP tasks besides machine translation. Different from a translation task, the goal here is to learn the dependencies between the words in a given sentence and use that information to capture the internal structure of the sentence. In self-attention there are three important variables, Q, K and V, which are vectors used to obtain better encodings for both the source and target words; all three are hidden representations produced by linear layers. Furthermore, the attention weights in self-attention are calculated differently from Bahdanau's attention, as formulated below:

$$\begin{aligned} {\text {Self-Attention}}({\mathbf {Q}}, {\mathbf {K}}, {\mathbf {V}})={\text {softmax}}\left( \frac{{\mathbf {Q}} {\mathbf {K}}^{\top }}{\sqrt{d_k}}\right) {\mathbf {V}}, \end{aligned}$$
(4)

where \(d_k\) is the number of dimensions of the key vector K. We argue that NMTs can also be attacked using self-attention, as a disturbance to the dependencies within the source language can likewise degrade the translation quality.

3.2 Problem formulation

Denoting the source sequence as S and the translated target sequence as Y, an NMT model can be defined as \(f(S): S \rightarrow {Y}\). We denote \(S=[w_1, \ldots , w_n]\) and \(Y=[h_1, \ldots , h_k]\), where w and h denote the words in the source and target sequences, and n and k are the numbers of words in each respective sequence. To ensure the attack's applicability, we assume a black-box setting where the attacker can only query the NMT model for the translated result of a given input and does not have access to the model parameters, gradients, or training data. For an input pair (S, Y), we want to generate an adversarial example \(S_{adv}\) such that \(f(S_{adv})\) has an obvious semantic difference from Y. Additionally, we want \(S_{adv}\) to be grammatically correct and semantically similar to S.

Fig. 1

Illustration of an attention-based NMT model (Dzmitry Bahdanau & Bengio, 2015) with an RNN-based encoder-decoder structure, generating the t-th target token \(y_t\) given an input sentence (\(\hbox {x}_1\), \(\hbox {x}_2\),..., \(\hbox {x}_T\))

3.3 Attentive word location

Attention weights in NMT models can be seen as the strength of semantic association between the source and target tokens, and adopting such a mechanism boosts the performance of NMTs (Dzmitry Bahdanau & Bengio, 2015). Hence, we argue that NMTs can be broken if the attention mechanism is tampered with, and the best way of tampering with attention is to adopt the attention mechanism itself. In this subsection, we introduce the proposed attentive word location scheme and present different attentive NMT attack implementations based on language-specific and sequence-centered attentions.

3.3.1 Translation attentive attack

Since translation is a cross-language task defined by the source and target languages, it is intuitive to pose language-specific attacks to challenge NMTs’ robustness. To this end, we propose a Translation Attentive Attack (TAA) mechanism that focuses on influential words in the translation towards a certain target language. Concretely, we obtain such an attention \({\mathcal {A}}\) that measures word-wise importance in a specific translation task based on a contextual NMT model (Dzmitry Bahdanau & Bengio, 2015).

To calculate \({\mathcal {A}}\), we feed the NMT model with the source sequence to get the translated result \({\hat{Y}}=[{\hat{h}}_1, \ldots ,{\hat{h}}_{k'}]\), where \(k'\) is the number of words in the attacked target sentence. We then extract a correlation matrix \({\mathcal {A}}\) from the softmax layer in the model’s decoder, thereby formulating the process as \({\mathcal {T}}(S): S \rightarrow {} {\mathcal {A}}\). The elements in the correlation matrix \({\mathcal {A}}\) describe the probability distributions of translated words in the target language conditioned on the source sequence S, which can be written as:

$$\begin{aligned} \small a_{i j}={P}({\hat{h}}_{j}\vert [w_1, \ldots , w_i,\ldots ])=\frac{\exp \left( e_{i j}\right) }{\sum _{i=1}^{n} \exp \left( e_{i j}\right) }, \end{aligned}$$
(5)

where P denotes probability, and \(e_{i j}\) denotes the feature in the model depicting the matching degree between the predicted word \({\hat{h}}_{j}\) in the target language and the input word \(w_i\) in S. The conditional probabilities reveal the correlation between the input sequence and the predicted sequence in the target language. Given its softmax-normalized distribution, we have \(\sum _{i=1}^n a_{i j}=1, \forall j\), therefore it is intuitive to measure \(w_i\)’s contextual contribution to a translated word \({\hat{h}}_j\) using \(a_{i j}\) straightforwardly. Further, to find the most influential input words in the translation process, for the whole predicted sequence, we define the language-specific word-wise attention by summing the matrix elements by index j, as \({\mathbb {A}}=[{A}'_1, \ldots , {A}'_i, \ldots , {A}'_n]\), where \({A}'_i=\sum _{j=1}^{k'} a_{i j}\).

We can sort the words of the source sequence according to this attention weight, \({\mathbb {A}}\), in the first step, and select the top language-specific influential words as the victim words for substitution in the second step, which will be introduced in Sect. 3.4.
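A hedged sketch of this ranking step is given below, assuming the correlation matrix \({\mathcal {A}}\) has already been extracted from the NMT decoder; the toy sentence and matrix are placeholders.

```python
import numpy as np

def taa_rank(source_words, attn_matrix, top_k=1):
    """attn_matrix[i, j] = a_ij: attention of source word i on target word j,
    with columns normalized over i. Returns the top-k victim word indices."""
    word_weights = attn_matrix.sum(axis=1)        # A'_i = sum_j a_ij
    order = np.argsort(word_weights)[::-1]        # most influential source words first
    return order[:top_k], word_weights

words = ["the", "cat", "sat", "on", "the", "mat"]
A = np.random.rand(len(words), 7)                 # n source words, k' target words
A /= A.sum(axis=0, keepdims=True)                 # enforce sum_i a_ij = 1 for every j
victims, weights = taa_rank(words, A, top_k=2)
print([words[i] for i in victims])
```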

3.3.2 Self-attentive attack

Besides the language-specific attack above, which focuses on the translation task between two languages, the inherent semantics of the input sequence can also be tampered with. Thus we propose a sequence-centered Self-Attentive Attack (SAA) which exploits attention from the input sequence itself. We utilize the transformer model (Vaswani et al., 2017), \({\mathcal {V}}(S):S\rightarrow {{\mathcal {B}}}\), to extract the self-attention matrix \({\mathcal {B}}\), whose elements \(b_{ij}\) indicate the word-wise weights given positional encodings. In particular, since such weights are obtained via softmax activation, they are also naturally normalized (\(\sum _{i=1}^{n}b_{ij}=1, \forall j\)), and thus they are suitable for quantitatively measuring the dependencies among words across the entire input sequence. Therefore, similar to the first step in TAA, we define the sequence-centered self-attention weight as \({\mathbb {B}}=[{B}'_1 \ldots {B}'_i \ldots {B}'_n]\), where \({B}'_i=\sum _{j=1}^n b_{ij}\).
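The following hedged sketch extracts such a self-attention matrix from a BERT-style encoder with the Hugging Face transformers library and computes the word-wise weights \({B}'_i\); averaging the last layer's heads is our own simplifying assumption, since the aggregation of multi-head attention maps is not prescribed above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "the quick brown fox jumps over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Assumption: aggregate the last layer by averaging its heads; the paper's text
# does not specify how multi-head maps are combined into the matrix B.
B = outputs.attentions[-1].mean(dim=1)[0]     # (seq_len, seq_len), rows sum to 1
word_weights = B.sum(dim=0)                   # B'_i = sum_j b_ij (attention received)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
ranked = sorted(zip(tokens, word_weights.tolist()), key=lambda t: -t[1])
print(ranked[:3])
```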

Different from the language-specific attention in TAA, which emphasizes the contextual alignment between source and target sequences, the sequence-centered attention in SAA explores long-range dependencies within the input sequence itself, better indicating each word's influence on the overall language understanding of the sequence.

3.3.3 Hybrid attentive attack

As analyzed above, the translation-attentive attack and the self-attentive attack focus on different aspects of NMTs, i.e., the cross-language context alignment and the overall semantic understanding of the source sequence, respectively. We argue that both aspects are crucial for NMTs, and an ideal attack on NMTs should combine their advantages. Thus we propose a Hybrid Attentive Attack (HAA) scheme which comprehensively considers word influence by combining the attention weights from TAA and SAA:

$$\begin{aligned} {\mathbb {H}}=(1-\lambda ) {\mathbb {A}}+\lambda {\mathbb {B}}, \end{aligned}$$
(6)

where \({\mathbb {H}}=[{H}'_1 \ldots {H}'_i \ldots {H}'_n]\) and \({H}'_i\) is the final influence weight for word \(w_i\) in the input sentence. The optimal parameter \(\lambda\) can be found by a greedy search based on the attack performance measured by BLEU on translated results. The overall workflow of the HAA model is demonstrated in Algorithm 1 with an example shown in Fig. 2.
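The combination in Eq. (6) reduces to a few lines of code; the sketch below is illustrative and assumes the TAA and SAA weights have already been aligned to the same word segmentation of the source sentence.

```python
import numpy as np

def haa_victims(A_weights, B_weights, lam, top_k=1):
    """Eq. (6): H = (1 - lambda) * A + lambda * B, then take the top-k indices."""
    H = (1.0 - lam) * np.asarray(A_weights) + lam * np.asarray(B_weights)
    return np.argsort(H)[::-1][:top_k], H

# Toy word-wise weights; in practice A and B come from TAA and SAA and mapping
# subword attention back to whole words is glossed over here.
A = np.array([0.9, 1.4, 0.3, 0.8, 0.6])   # language-specific (TAA)
B = np.array([1.1, 0.7, 0.5, 1.3, 0.4])   # sequence-centered (SAA)
victims, H = haa_victims(A, B, lam=0.47, top_k=2)
print(victims, H.round(2))
```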

Fig. 2

An illustrated example of our HAA model. In this example, HAA generates an adversarial example with one word perturbed to attack an English-Chinese translation. The arrows inside the TAA box, and those in the SAA box, respectively represent the utilisation of translation and self-attention weights. The numbers inside the semantic-aware substitution box represent the sentence-level semantic similarity. The TAA, SAA, HAA and Semantic-aware Substitution workflows are reflected in lines 2–3, lines 4–5, line 6, and lines 7–15 in Algorithm 1, respectively

Algorithm 1

3.4 Semantic-aware word substitution

In the above subsection, we located the most influential words in the input sequence to be attacked. An ideal attack should guarantee sufficient concealment in addition to attack effectiveness, enabling the adversarial example to avoid being noticed by the NMT model. Therefore, we further argue that a qualified adversarial example \(S_{adv}\) should preserve semantics and be grammatically correct, constraining it to reasonable deviations from the original input sequence.

We design such a semantic-aware word substitution approach based on the semantic feature similarity between the tampered sequence and the original one. We mask one victim word at a time, in descending order of the attention score, to obtain \(S_{mask}\), and utilise an MLM model \({\mathcal {M}}(S_{mask}): S_{mask}\rightarrow {S'_{can}}\), where \(S'_{can}\) is a mask-filled sentence. At each iteration, we utilize \({\mathcal {M}}\) to generate the \(n^*\) best adversarial example candidates, \({\mathbb {S}}_{can}=[S'_{(can,1)},\ldots , S'_{(can,p)}, \ldots , S'_{(can,n^*)}]\), according to the corresponding logits from \({\mathcal {M}}\), and we use a pre-trained semantic retrieval model, the universal sentence encoder (USE) (Yang et al., 2019), to calculate the cosine feature distance between each candidate \(S'_{(can,p)}\) and the original sequence S. We then select the candidate with the highest similarity to the original as the adversarial example \(S_{adv}\). With this semantic-aware word substitution, we complete the NMT adversarial attack process and strike a balance between influencing the translation result and concealing the perturbations with similar semantics.
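A hedged sketch of this substitution loop is shown below. The fill-mask pipeline stands in for the MLM \({\mathcal {M}}\), and a sentence-transformers encoder is used here merely as a stand-in for USE; both model choices and all names are illustrative.

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

fill_mask = pipeline("fill-mask", model="roberta-large")   # stands in for the MLM M
encoder = SentenceTransformer("all-MiniLM-L6-v2")          # stand-in for USE

def substitute(words, victim_idx, n_candidates=10):
    original = " ".join(words)
    masked = " ".join(fill_mask.tokenizer.mask_token if i == victim_idx else w
                      for i, w in enumerate(words))
    # n* candidate sentences ranked by the MLM logits
    candidates = [c["sequence"] for c in fill_mask(masked, top_k=n_candidates)]
    emb = encoder.encode([original] + candidates, convert_to_tensor=True)
    sims = util.cos_sim(emb[:1], emb[1:])[0]               # similarity to the original
    best = int(sims.argmax())
    return candidates[best], float(sims[best])

adv, sim = substitute("the quick brown fox jumps over the lazy dog".split(), 1)
print(adv, round(sim, 3))
```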

4 Experiments

We empirically evaluated our proposed attacking strategies (TAA, SAA and HAA) on an English-to-Chinese translation task against three well-performing, world-leading NMTs: Google Cloud Translation, Baidu Cloud Translation and Helsinki NMT (Tiedemann, 2020). To explore the attacking performance more deeply, we not only attack the victim model directly but also perform transfer attacks, which utilize the adversarial examples generated on one victim NMT to attack other NMTs.

4.1 Datasets

To obtain sufficient training data, we utilized 4 datasets as our training set for training the language-specific NMT and the sequence-centered transformer models used by TAA and SAA, as well as the MLM for semantic-aware word substitution. Three of the training sets, Commentary (Tiedemann, 2012), Infopankki (Tiedemann, 2020) and Openoffice (Tiedemann, 2020), are publicly available, while the other, YYeTs subs (Footnote 1), was scraped by us from the YYeTs website (provided in the supplementary material), which provides human-translated movie and drama subtitles. The details of the training set can be found in Table 1.

To obtain reliable experimental results, we test the attacking strategies on 3 other public datasets: WMT20 T1, WMT20 T2 (Tiedemann, 2020) and ALT-P(test) (Riza et al., 2016). WMT is the main venue for machine translation and machine translation research, and it provides reliable multilingual datasets derived from Wikipedia. To diversify the sources of the test set, we also include the ALT-P dataset, which covers news. The details of the test set can be found in Table 1.

Table 1 Details of the datasets used in the experiments

4.2 Victim models

We test the proposed attacking strategies on three well-performing NMTs: Google Cloud Translation (Footnote 2) (Google.T), Baidu Cloud Translation (Footnote 3) (Baidu.T), and Helsinki NMT (Hel.T) (Tiedemann, 2020). The first two NMTs are cloud translation platforms used for commercial purposes, while the third, Helsinki NMT, is based on MarianNMT (Junczys-Dowmunt et al., 2018) from Microsoft and is intended for academic purposes.

4.3 Baselines

We compare our proposed strategies with 5 word-level attack strategies below:

  • RAND: randomly selects victim words in the sentences to be attacked and utilizes the proposed semantic-aware substitution strategy to construct the adversarial examples.

  • Morpheus-Attack (Morph) (Tan et al., 2020): greedily searches, among words with noun, verb, or adjective tags on the source language side, for those that maximally decrease BLEU, and substitutes them with synonyms.

  • BERT-ATTACK (BERT.A) (Li et al., 2020): utilizes BERT to locate the victim words by ranking the differences between the logits of the original words and the BERT-predicted words, and then makes substitutions with BERT.

  • Seq2sick (Cheng et al., 2020): crafts the adversarial example by degrading the targeted logits of the victim NMT, with a regularization term for preserving semantic similarity.

  • PSO (Zang et al., 2020): selects word candidates from HowNet and employs particle swarm optimization (PSO) to find adversarial text for classifiers. We adjust the objective from classification logits to BLEU.

4.4 Evaluation metrics

We use metrics based on BLEU and USE (Yang et al., 2019) to evaluate the attacking performance on the target language side and the semantic preservation on the source language side. BLEU evaluates sentence pairs in terms of word alignment, while USE is a multilingual pre-trained language model used to evaluate semantic similarity.

Since changes to the original input will always lead to changes in the translated output, we examine how much more an attacked output changes compared to the unattacked translation. Thus, instead of directly using BLEU and USE on the translated outputs, we define the BLEU drop ratio (BDR) and the USE drop ratio (UDR) to evaluate attacks:

$$\begin{aligned} \small \mathrm {BDR}&=\frac{\mathrm {BLEU}(Y,f(S))-\mathrm {BLEU}(Y,f(S_{adv}))}{\mathrm {BLEU}(Y,f(S))} \end{aligned}$$
(7)
$$\begin{aligned} \mathrm {UDR}&=\frac{\mathrm {USE}(Y,f(S))-\mathrm {USE}(Y,f(S_{adv}))}{\mathrm {USE}(Y,f(S))} \end{aligned}$$
(8)

where S and Y denote the input sentence and the translation reference, and \(f(\cdot )\) is the victim NMT model.
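For illustration, the following sketch computes BDR with sacrebleu's sentence-level BLEU and leaves the semantic similarity used in UDR as a pluggable function; both choices are stand-ins rather than the exact implementation used in our experiments.

```python
from sacrebleu import sentence_bleu

def drop_ratio(clean_score, adv_score):
    return (clean_score - adv_score) / clean_score

def bdr(reference, clean_translation, adv_translation):
    """Eq. (7): relative BLEU drop of the attacked translation."""
    clean = sentence_bleu(clean_translation, [reference]).score
    adv = sentence_bleu(adv_translation, [reference]).score
    return drop_ratio(clean, adv)

def udr(reference, clean_translation, adv_translation, semantic_sim):
    """Eq. (8): same pattern with a sentence-level semantic similarity plugged in."""
    return drop_ratio(semantic_sim(reference, clean_translation),
                      semantic_sim(reference, adv_translation))
```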

In addition, we evaluate the degree of word perturbation made to the original inputs by computing BLEU and USE on the attacked source language. To distinguish these from the metrics used on the target language side, we use S-BLEU and S-USE to denote changes made on the source language side.

4.5 Experimental settings

In this section, we introduce the models used for HAA and the results of the greedy search for \(\lambda\). Since the number of attacked words intuitively affects the attacking performance, the number of perturbed words per sentence ranges from 1 to 5 in our experimental comparisons.

4.5.1 Model structures

In this subsection, we introduce the structure of the language-specific NMT for TAA, the transformer for SAA, and the MLM for semantic-aware word substitution. All three models are trained and fine-tuned on the same training datasets listed in Table 1.

  • TAA: The architecture of TAA consists of a 2-layer stacked LSTM plus Luong's (2015) translation attention layer to process the output of the LSTM. More specifically, the encoder maps a list of subtoken IDs to an embedding vector for each subtoken via an embedding layer, and then processes the embeddings into a new sequence with an LSTM. After encoding, the features of the input sentence are passed to a decoder, whose job is to generate predictions for the next output token. The decoder receives the complete encoder output and uses an LSTM to keep track of what it has generated so far. To obtain the translation attention, the decoder uses its LSTM output as the query for attention over the encoder output, producing the context vector. After the LSTM in the decoder, we adopt Luong's translation attention to combine the LSTM output and the context vector and generate the translation attention matrix. In the final step, the decoder generates logit predictions for the next tokens based on the attention matrix. For the hyper-parameters, we use 1024 hidden units, 256 embedding dimensions, and a batch size of 64, with the Adam optimizer.

  • SAA: SAA is designed to obtain sequence-centered attention weights on the source language, so it is trained with data in the source language only. Since the data is unlabeled and sequential, we utilize BERT-base-uncased (Devlin et al., 2019), one of the best unsupervised language models, as the transformer from which we extract the sequence-centered attention weights. The hyper-parameters of this model are publicly available. To adapt the model to our data, it is fine-tuned on our dataset with the Adam optimizer, a learning rate of 0.001, and a batch size of 128.

  • MLM for semantic-aware substitution: MLMs mask words in the training set and are trained to fill these masks; utilizing such models therefore helps find grammatically correct substitutions for the proposed methods. We utilize a public pre-trained model, RoBERTa-large (Liu et al., 2019), to generate grammatically correct and semantic-preserving adversarial examples.

4.5.2 Optimization of \(\lambda\)

In the experiments, our proposed method, HAA, uses a greedy search for the best hyper-parameter \(\lambda\) to combine the language-specific and sequence-centered attentions. The objective used for the search is BLEU, and the search is performed on a validation set of 1000 samples held out from the training set. We greedily search for the optimal \(\lambda\) within \([0,0.01,\ldots ,1]\) with a step size of 0.01 for each victim model, and the results for the three victim NMTs (Google, Baidu and Helsinki translations) are shown in Fig. 3.
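The search itself is a simple grid scan; the sketch below is illustrative, with evaluate_attack standing in for the full pipeline of crafting adversarial examples with a given \(\lambda\), querying the victim NMT, and scoring the attacked translations with BLEU.

```python
import numpy as np

def search_lambda(evaluate_attack, step=0.01):
    """Grid-search lambda in [0, 1]; evaluate_attack(lam) should return the average
    BLEU of the attacked translations on the validation set (lower = stronger attack)."""
    grid = np.arange(0.0, 1.0 + step, step)
    scores = [evaluate_attack(lam) for lam in grid]
    best = int(np.argmin(scores))               # lambda causing the largest BLEU drop
    return float(grid[best]), scores[best]

# Dummy objective standing in for the real attack pipeline against a victim NMT.
best_lam, best_score = search_lambda(lambda lam: (lam - 0.47) ** 2 + 10.0)
print(round(best_lam, 2), round(best_score, 3))
```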

Fig. 3

The process of searching for the best \(\lambda\) for Google, Baidu and Helsinki NMT. The discovered optimal \(\lambda\) values are highlighted in red (Color figure online)

From the results in Fig. 3, we find that the optimal \(\lambda\) values for the three victim models are \(\lambda _\text {Google}=0.68\), \(\lambda _\text {Baidu}=0.47\) and \(\lambda _\text {Hel.T}=0.41\). Hence, \(\lambda\) differs across victim NMTs in our experimental settings. Since \(\lambda\) controls the relative weights of SAA and TAA, it reveals the preference between them: for Google Translation the proposed HAA prefers TAA, while for Baidu Translation and Helsinki Translation it prefers SAA. Moreover, as \(\lambda\) is searched based on the NMTs' performance, it can naturally differ across victim NMTs and datasets, and so can this preference.

4.6 Main results and analysis

We show the results of the greedy search process in Fig. 3. The main results on attacking performance and semantic-preserving performance on the different test datasets are shown in Tables 2, 3, 4, and Figs. 4, 5, and 6. In addition to these statistics, an example of the attentions learned by the proposed methods is shown in Table 5, and an adversarial example is shown in Table 6 to illustrate the differences among attacks. We validate the advantages of our proposed methods (i.e., TAA, SAA and HAA) from the following three aspects:

Table 2 Comparisons of performance on WMT20 T1 dataset in terms of semantic preservation (S-BLEU, S-USE) and attacking performance (BDR, UDR) averaged across different number of perturbed words for victim models
Table 3 Comparisons of performance on WMT20 T2 dataset in terms of semantic preservation (S-BLEU, S-USE) and attacking performance (BDR, UDR) averaged across different number of perturbed words for victim models
Table 4 Comparisons of performance on ALT.P dataset in terms of semantic preservation (S-BLEU, S-USE) and attacking performance (BDR, UDR) averaged across different number of perturbed words for victim models
Fig. 4

Attacking performance (BLEU, USE) on the WMT20 T1 dataset with the number of perturbed words ranging from 1 to 5, for the three victim NMTs: Google.T, Baidu.T and Helsinki.T

Fig. 5

Attacking performance (BLEU, USE) on the WMT20 T2 dataset with the number of perturbed words ranging from 1 to 5, for the three victim NMTs: Google.T, Baidu.T and Helsinki.T

Fig. 6

Attacking performance (BLEU, USE) on the ALT.P dataset with the number of perturbed words ranging from 1 to 5, for the three victim NMTs: Google.T, Baidu.T and Helsinki.T. The horizontal red dashed lines indicate the numbers of words needed to achieve identical drops in metric scores

Table 5 Examples of attentions learned by the proposed methods (TAA, SAA and HAA). The examples are shown in red, blue and green for TAA, SAA, and HAA, respectively
Table 6 Adversarial examples (adv.) crafted by proposed methods and baselines, and their corresponding translated results (Tran.)
Table 7 Comparisons among different word substitution methods

4.6.1 Does HAA have superior attack performance compared to baselines?

We compare the attacking performance of the proposed attentive methods (TAA, SAA, and HAA) and the non-attentive baselines in Figs. 4, 5 and 6, reflected by the decreases in BLEU and USE between the original and the attacked translation results. It can be concluded that the proposed method HAA achieves the best attacking performance, with the largest metric score drops for both word alignment (BLEU) and semantic understanding (USE). In particular, HAA consistently outperforms the other competing methods across different data domains, regardless of the number of perturbed words. Apart from HAA itself, its attentive components TAA and SAA also surpass the non-attentive baselines in most cases.

4.6.2 Balance between attack performance and the number of perturbed words

Concerning the trade-off between effectiveness and imperceptibility, we evaluate the attack's imperceptibility from both the appearance and the semantic modification perspectives, the first of which is the number of perturbed words. As shown in Figs. 4, 5 and 6, comparing the numbers of words needed to achieve identical drops in metric scores (marked by the horizontal red dashed lines), we find that HAA perturbs the fewest words, as it focuses on the most influential words using both language-specific and sequence-centered attentions. Thus we conclude that the proposed HAA best balances attacking performance and the appearance modifications to the sequence.

4.6.3 How well does HAA preserve the semantic meaning of the original input sentences?

To further investigate the attack's imperceptibility, we evaluate the semantic similarities between the original input sentence and its derived adversarial example (i.e., S-BLEU and S-USE), shown in Tables 2, 3 and 4 for the different datasets. All of the tables demonstrate that the attacking methods based on our semantic-aware substitution (SAA, TAA, HAA and RAND) are the best in most cases in terms of semantic preservation. In the remaining cases, our methods are still comparable, by a close margin, to the best method, PSO, in semantic preservation. However, PSO's preservation comes at the price of much weaker attacking performance, as shown by its BDR and UDR. Thus we conclude that the proposed HAA provides one of the best balances between attack performance and semantic preservation.

Fig. 7

Attacking performance (BDR, UDR) of transferred attacks from mBART to Google, Baidu and Helsinki NMT models

To further validate the effectiveness of our word replacement strategy, we conduct an additional experiment on the semantic-preserving performance by substituting the same victim words, located by our hybrid attention, with different strategies. We compare against 3 common substitution baselines:

  • Default masked-word filling (HA.Def): utilizes the MLM to fill the mask without considering semantic preservation

  • Synonyms (HA.Syn): replaces the victim words with synonyms from WordNet (Miller, 1998)

  • Word embedding distance ranking (HA.Rank): searches the GloVe word embedding space (Pennington et al., 2014) and selects the word with the smallest \(l_2\) distance to the victim word as the replacement (a minimal sketch of this baseline is given after this list).
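As referenced in the HA.Rank item above, the following minimal sketch illustrates the embedding-distance baseline; gensim's downloader and the particular GloVe variant are arbitrary convenience choices.

```python
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # any GloVe vectors would do

def nearest_by_l2(word, top_k=3):
    """Return the top-k vocabulary words with the smallest l2 distance to `word`."""
    if word not in glove:
        return []
    dists = np.linalg.norm(glove.vectors - glove[word], axis=1)
    order = np.argsort(dists)
    neighbours = [glove.index_to_key[i] for i in order if glove.index_to_key[i] != word]
    return neighbours[:top_k]

print(nearest_by_l2("happy"))
```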

The results in Table 7 show that HAA (with semantic-aware substitution) achieves the best semantic-preserving performance when attacking the same positions. Clearly, HAA provides more grammatically correct and semantic-preserving adversarial examples than the other methods.

4.7 Transferability

The transferability of adversarial examples refers to whether adversarial examples targeting a specific model f can also mislead another model \(f^{'}\). To evaluate transferability, we apply one-word-perturbation adversarial examples generated by different methods on mBART-large-cc25 (Tang et al., 2020), a sequence-to-sequence transformer from Facebook, to attack the Google, Baidu and Helsinki translation models. Figure 7 shows the results on the original mBART NMT and the other transferred models. It can be concluded from this figure that our attentive methods (TAA, SAA, and HAA) achieve the best attack performance on the three transferred NMT models, demonstrating the effectiveness of our methods in terms of attack transferability.

4.8 Attacking preference

Given the superiority of the proposed method in terms of attacking performance, we collect statistics to investigate the attacking preference, described by part-of-speech (POS) tags, of different attacking strategies. In this subsection, we analyze the POS statistics shown in Table 8 and aim to identify the more vulnerable POS tags through a comparison between the proposed methods and the baselines.

Since words assigned the same part-of-speech (POS) tag generally present similar syntactic importance, we investigate the attacking strategies' preferences over POS tags for further linguistic analysis. We apply the Stanford POS tagger (Toutanova et al., 2003) to annotate the victim words with POS tags, including noun, verb, adjective (Adj.), adverb (Adv.) and others (i.e., pronoun, preposition, conjunction, etc.). The statistical results in Table 8 demonstrate that all the attacking methods generally tend to focus on nouns, which we can suppose is the most sensitive POS category for translation. However, the proposed attacking strategies (TAA, SAA and HAA) select a larger proportion of verbs than any other method, so we may conclude that verbs are the second most adversarially vulnerable POS category.

Table 8 Distributions of POS tags for different attack strategies. The percentages are calculated row-wise.

5 Discussion and Conclusions

In recent years, the safety and fairness of NLP models have been greatly threatened by adversarial attacks. Most existing research focuses on NLP classifiers, such as fake news detection, sentiment analysis, and email spam filtering, while few researchers have raised concerns about the robustness of sequence-to-sequence neural machine translation (NMT) models. Unlike classifiers, an NMT outputs a sequence of dependent discrete classes or token IDs rather than a single class. Consequently, an attack on an NMT would perform poorly if the perturbation only affected the translation of the victim words while the translations of the other words were preserved. Thus, to mount threat-level attacks on NMTs, attackers should perturb not only the victim words but also their contextual environment.

In this research, we have proposed HAA, which selects influential words using both language-specific and sequence-centered attentions and substitutes them with semantics-preserving word perturbations. Adversarial examples generated by our proposed method affect not only the translations of the victim words but also the translations of other words. Experiments demonstrate that HAA delivers the best balance between the number of perturbed words and attacking performance among the competing methods.

Although the generated adversarial examples can threaten NMTs, adversarial examples are not bugs but features (Ilyas et al., 2019). To protect NMTs from the proposed attack, we believe one possible defence strategy is adversarial retraining, which is usually done by adding the adversarial examples to the training set and then retraining the models on the newly constructed training set. We did not perform adversarial retraining in our experiments because we lack access to the victim models' structures: Google and Baidu translations are online services, and Helsinki NLP does not specify its model structure. Nevertheless, by incorporating the adversarial features into model training, a model can in theory become more robust against adversarial attacks.

Since adversarial attacks are among the most effective methods for testing the robustness of a model, the proposed attentive attacks raise some concerns about the attention mechanism. As transformers with attention mechanisms have achieved great success, most existing well-performing NLP models are based on such a mechanism. This popularity of attention could put NMTs at high risk, because attackers can make effective attacks by exploiting the attention mechanism itself. Thus, a safer way of applying attention is a promising direction for future research. At the same time, we also plan to study and design targeted defence strategies to further improve the robustness of NMT models against future adversarial attacks.