1 Introduction

NLP models are known to be vulnerable in various applications, including machine translation (Ni et al., 2022; Cheng et al., 2020; Tan et al., 2020), sentiment analysis (Zang et al., 2020; Yang et al., 2021), and text summarization (Cheng et al., 2020). Attackers can exploit these weaknesses, creating adversarial examples that compromise the performance of targeted NLP systems. This growing susceptibility presents significant security challenges for AI models.

Textual attacks on NLP models are classified into character-level (Iyyer et al., 2018b; Ribeiro et al., 2018), word-level (Alzantot et al., 2018; Jia et al., 2019), and sentence-level (Jia & Liang, 2017) attacks. Character-level attacks are easily countered due to noticeable misspellings (Ebrahimi et al., 2018), while sentence-level attacks often yield complex, hard-to-read text (Gan & Ng, 2019). Word-level attacks are gaining preference for their effectiveness and subtlety, as they involve replacing words with carefully chosen substitutes (Zhang et al., 2020; Garg & Ramakrishnan, 2020; Li et al., 2020). Consequently, our focus is on conducting word-level adversarial attacks.

Crafting optimal adversarial examples involves navigating the interplay between attack success and controlled imperceptibility. The predominant strategies for this can be classified into optimization algorithms and hierarchical search methods. Within the realm of optimization, Genetic Attack (GA) (Alzantot et al., 2018; Jia et al., 2019) and Particle Swarm Optimization (PSO) (Zang et al., 2020) stand out as evolutionary approaches, focusing on optimizing attack effectiveness within embedding spaces and sememe-based thesauri, respectively. However, these methods face two primary challenges: 1) low efficiency in the optimization process due to the expansive search space, such as GloVe (Pennington et al., 2014), and 2) compromised semantic integrity, as even synonym-based word substitutions can cause sentence-level semantic inconsistency. On the other hand, hierarchical search crafts adversarial examples by orderly substituting words based on the word saliency rank (WSR) (Ren et al., 2019; Li et al., 2021; Yang et al., 2021). It first identifies target words using WSR, then employs a Masked Language Model or thesaurus for substitutions. These hierarchical attacking methods have several drawbacks: 1) it is difficult to preset the number of perturbed words (NPW) for large datasets with many tokens, since the optimal NPW varies with different target texts (Michel et al., 2019); and 2) WSR-based methods significantly reduce the search domain by only attacking combinations of victim words ordered by the WSR. For a clear illustration, Fig. 1 showcases the drawbacks of the optimization-based GA and the hierarchical PWWS attack. GA’s replacement of ‘thriller’ with ‘science’ sacrifices semantic quality, while PWWS, despite altering three words, fails to fool the classifier.

Fig. 1

An illustrative example showing the attack performance of an optimizing attack (genetic attack), the PWWS attack, and the proposed method RJA-MMR, where label “0” represents negative sentiment and “1” represents positive sentiment. The substitutions for the different attack methods are in bold. The genetic attack sacrifices too much semantics by changing “thrillers” to “science”, while PWWS fails to fool the model and makes many ineffective modifications. The proposed method, RJA-MMR, makes a successful attack with only one word changed

To address the above problems, we propose two novel black-box, word-level antagonistic algorithms: Reversible Jump Attack (RJA) and MH Modification Reduction (MMR). For RJA, we employ the Reversible Jump sampler (RJS) to sample three variables from a target distribution: the number of perturbed words (NPW), the victim words, and their substitutions from Masked Language Models (MLMs) and HowNet (Dong & Dong, 2003). The target distribution of RJS, which evaluates the quality of the adversarial candidates, is regularized by a strong penalty on semantic dissimilarity. The NPW can be searched across dimensions via RJS to adapt to different textual inputs according to their word saliency and overall performance. Given these three factors, adversarial candidates are only accepted based on an acceptance probability from RJS. By running this process iteratively, we obtain the successful candidates with the highest semantic similarity. Therefore, RJA efficiently searches for threatening attacks inside a domain larger than that of WSR-based methods, without presetting an NPW or sacrificing much semantics for imperceptibility.

The other algorithm is Metropolis–Hasting Modification Reduction (MMR), which restores some of the manipulations from RJA (i.e., reverses them back to the original words) and then updates the existing substitutions to maintain the attacking performance. Specifically, given an adversarial candidate, MMR first stochastically proposes a new candidate by restoring attacked words. It applies a customized acceptance probability, calculated by comparing the overall performance of the new and current candidates, to determine whether the new candidate is accepted. After restoring some attacked words, MMR uses the MH algorithm to update the substitutions of the remaining attacked words to preserve the attacking performance. By combining RJA and MMR, we propose the integrated RJA-MMR as our final model. Specifically, RJA utilizes a Reversible Jump sampler (Green, 1995b), a member of the Markov Chain Monte Carlo (MCMC) family, to sample dimension-jumping vectors and perform a cross-dimensional search for the optimal attacking performance constrained by semantic similarity. Intuitively, RJA and MMR agree on improving attacking performance but disagree on the NPW. By iteratively running these two antagonistic algorithms, attackers can boost the attack performance with only a small number of perturbations. The attack performance is illustrated by an example in Fig. 1, where RJA-MMR outperforms the optimizing attack (genetic attack) and the hierarchical attack (PWWS).

Our main contributions from this work are as follows:

  • We design a highly effective adversarial attack method, Reversible Jump Attack (RJA), which utilizes the Reversible Jump algorithm to generate adversarial examples with an adaptive number of perturbed words. The algorithm enables our attack method to have an enlarged search domain by jumping across the dimensions.

  • We propose Metropolis–Hasting Modification Reduction (MMR), which applies the Metropolis–Hasting (MH) algorithm to construct an acceptance probability and uses it to restore attacked victim words, improving imperceptibility while preserving attacking performance. MMR works with RJA and is empirically proven effective on adversarial examples generated by other attacking algorithms.

  • We evaluate our attack method on real-world public datasets. Our results show that the proposed methods achieve the best performance in terms of attack success, imperceptibility and example fluency.

The rest of this paper is structured as follows. We first review adversarial attacks for NLP models and the Markov Chain Monte Carlo methods in NLP in Sect. 2. Then we detail our proposed method in Sect. 3. We evaluate the performance of the proposed method through empirical analysis in Sect. 4. We conclude the paper with suggestions for future work in Sect. 5.

2 Related work

This section reviews the literature on word-level textual attacks and MCMC sampling in NLP.

2.1 Word-level attacks to classifiers

An increasing amount of effort is devoted to generating better textual adversarial examples with various attack models. Character-level attacks (Liang et al., 2018; Ebrahimi et al., 2018) use misspellings to attack the victim classifiers; however, these attacks can often be defended by a spell checker. Meanwhile, sentence-level attacks (Iyyer et al., 2018b; Zou et al., 2020) pose threats to the classifier by inserting, removing, or paraphrasing sentences or pieces of sentences in the original input, while it is difficult for the generated text to maintain imperceptibility (Li et al., 2021). Word-level attacks pose non-trivial threats to NLP models by locating important words and manipulating them for targeted or untargeted purposes. Such attacks are broadly regarded as the optimal unit of attack (Jia & Liang, 2017).

2.1.1 Gradient-based word-level attacks

With the help of an adapted fast gradient sign method (FGSM) (Goodfellow et al., 2015), Papernot et al. (2016) were the first to generate word-level adversarial examples against classifiers. While their attack was able to fool the classifiers, their word-level manipulations significantly affected the original meaning. In Liang et al. (2018), the authors proposed to attack the target model by inserting Hot Training Phrases (HTPs) and modifying or removing the Hot Sample Phrases (HSPs), where HTPs and HSPs are calculated based on the gradient with respect to words from the input. Similar to Liang et al. (2018), Samanta and Mehta (2018) utilize the embedding gradient to determine the important words and then design hierarchical rules together with hand-crafted word-level synonyms and character-level typos. Notably, since textual data is naturally discrete and perturbations are more perceptible than in image data, many gradient-based textual attacking methods inherited from computer vision are not effective enough, which leaves textual attack a challenging problem.

2.1.2 Non-gradient-based word-level attacks

Alzantot et al. (2018) transferred the domain of adversarial attacks to an optimization problem by formulating a customized objective function. With genetic optimization, they generate adversarial examples by sampling the qualified genetic ‘offspring’ generations that break out of the encirclement of the semantic threshold. However, the genetic algorithm can be inefficient: since the word embedding space is sparse, performing natural selection for languages in such a space can be computationally expensive. Jia et al. (2019) proposed a faster version of Alzantot’s adversarial attacks by shrinking the search space, which accelerates the process of evolving in genetic optimization. Although this greatly reduced the computational expense of genetic-based optimization algorithms, optimizing inside word embedding spaces, such as GloVe (Pennington et al., 2014) and Word2Vec (Mikolov et al., 2013), is still not efficient enough. To ease the searching process, embedding-based algorithms have to use a counter-fitting method to post-process the word vectors and accelerate the search (Mrksic et al., 2016). Compared with word embedding methods, utilizing a well-organized linguistic thesaurus, e.g., the synonym-based WordNet (Miller et al., 1990) or the sememe-based HowNet (Dong & Dong, 2003), is simpler and easier to implement. Ren et al. (2019) sought synonyms based on WordNet synsets and ranked the word replacement order via probability-weighted word saliency (PWWS). Zang et al. (2020) and Yang et al. (2021) both showed that the sememe-based HowNet can provide more substitute words, using Particle Swarm Optimization (PSO) and an adaptive monotonic heuristic search, respectively, to determine which group of words should be attacked. In addition, some recent studies utilized masked language models (MLMs), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), to generate contextual perturbations (Li et al., 2020; Garg & Ramakrishnan, 2020). The pre-trained MLMs can ensure that the predicted token fits the sentence grammar, but they cannot preserve semantics.

2.2 Markov chain Monte Carlo in NLP

Markov chain Monte Carlo (MCMC) (Metropolis et al., 1953a), a generic statistical method for approximate sampling from an arbitrary distribution, can be applied in a variety of fields, such as optimization (Rubinstein, 1999), machine learning (Fan et al., 2018), quantum simulation (Haase et al., 2021) and Ising models (Herrmann, 1986). The main idea is to generate a Markov chain whose equilibrium distribution is equal to the target distribution (Kroese et al., 2011). There exist various algorithms for constructing chains, including the Gibbs sampler, the Reversible Jump sampler (Green, 1995b), and the Metropolis–Hasting (MH) algorithm (Metropolis et al., 1953a). To obtain models capable of reading, deciphering, and making sense of human languages, NLP researchers apply MCMC to many downstream tasks, such as text generation and sentiment analysis. For text generation, Kumagai et al. (2016) propose a probabilistic text generation model which generates human-like text from semantic syntax and situational content. Since human-like text requires grammatically correct word alignment, they employ Monte Carlo Tree Search to optimize the structure of the generated text. In addition, Harrison et al. (2017) present an application of MCMC to story generation, in which a summary of movies is produced by applying recurrent neural networks (RNNs) to summarize events and directing the MCMC search toward creating stories that satisfy genre expectations. For sentiment analysis, Kang and Ren (2011) apply the Gibbs sampler to a Bayesian network, a network of connected hidden neurons under prior beliefs, to extract latent emotions. Specifically, they apply Hidden Markov models within a hierarchical Bayesian network and embed the emotional variables as the latent variables of the Hidden Markov model.

2.2.1 Metropolis–Hasting and reversible jump samplers

The Metropolis–Hasting (MH) (Metropolis et al., 1953a) algorithm is a classical Markov chain Monte Carlo sampling approach. Given the stationary distribution \(f({\textbf{z}})\) and transition proposal \(q({\textbf{z}}'|{\textbf{z}})\), the MH algorithm can generate desirable examples from \(f({\textbf{z}})\). Specifically, at each iteration, a new state \(\mathbf {z'}\) is proposed given the current state \({\textbf{z}}\) based on a transition function \(q({\textbf{z}}'|{\textbf{z}})\). The MH algorithm follows a “trial-and-error” strategy by defining an acceptance probability \(\alpha (\mathbf {z'}\vert {\textbf{z}})\) as follows:

$$\begin{aligned} \alpha (\mathbf {z'}\vert {\textbf{z}})=\min \left\{ \frac{f(\mathbf {z'}) q({\textbf{z}} \mid \mathbf {z'})}{f({\textbf{z}}) q(\mathbf {z'} \mid {\textbf{z}})}, 1\right\} \end{aligned}$$
(1)

to decide whether the new state \({\textbf{z}}'\) is accepted or rejected.
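In code, one MH iteration is just this accept/reject test. The sketch below assumes the unnormalized target density, the proposal sampler, and the proposal density are available as callables; the names are placeholders rather than any particular implementation.

```python
import random

def mh_step(z, f, propose, q_density):
    """One Metropolis-Hasting step for an unnormalized target density f.

    f         : callable, unnormalized target density f(z)
    propose   : callable, draws a candidate z' given the current state z
    q_density : callable, proposal density q(z' | z); needed when the proposal is asymmetric
    """
    z_new = propose(z)
    # Acceptance probability alpha(z'|z) = min{ f(z') q(z|z') / (f(z) q(z'|z)), 1 }  -- Eq. (1)
    alpha = min(1.0, (f(z_new) * q_density(z, z_new)) / (f(z) * q_density(z_new, z)))
    return z_new if random.random() < alpha else z
```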

MCMC can also be applied to sampling across varying dimensions. The Reversible Jump sampler (RJS) (Green, 1995b) is a variant of MCMC specifically designed to sample from target distributions over vectors of different dimensions. Due to this property, RJS can be applied to variable selection (Fan & Sisson, 2011), dimension reduction (Rincent et al., 2017), and cross-dimensional optimization (Kroese et al., 2011). Unlike the MH algorithm, RJS requires an additional transition term for proposing the new dimension. The acceptance probability of RJS is formulated as follows:

$$\begin{aligned} \displaystyle \alpha (\mathbf {z'}_{(m')}|{\textbf{z}}_{(m)})=\min \left\{ \frac{f(\mathbf {z'}_{(m')}) q({\textbf{z}}_{(m)} \mid \mathbf {z'}_{(m')})}{f({\textbf{z}}_{(m)}) q(\mathbf {z'}_{(m')} \mid {\textbf{z}}_{(m)})}, 1\right\} \end{aligned}$$
(2)
$$\begin{aligned} q\left( \mathbf {z'}_{(m')}|{\textbf{z}}_{(m)}\right) =p\left( \mathbf {z'}_{(m')}|m', {\textbf{z}}_{(m)}\right) p\left( m'|{\textbf{z}}_{(m)}\right) , \end{aligned}$$
(3)

where m denotes the dimension of the vector \({\textbf{z}}_{(m)}\), \(q\left( \mathbf {z'}_{(m')}|{\textbf{z}}_{(m)}\right)\) in Eq. (3) is the new transition function and \(p\left( m'|{\textbf{z}}_{(m)}\right)\) is the dimensional transition term. Comparing the acceptance probabilities of MH (Eq. 1) and RJS (Eq. 2) reveals that RJS is more effective than MH in handling dimensional variations and sampling parameters of unknown dimensions. Since crafting adversarial examples is a typical case of dimension variation due to the varying number of perturbed words (NPW), we expect attacks based on RJS to achieve better performance than MH-based attacks in the literature (Zhang et al., 2019).

2.2.2 Adversarial attack via MCMC

Beyond these applications in NLP, MCMC can also be applied to adversarial attacks on NLP models. Zhang et al. (2019) successfully applied MH sampling to generate fluent adversarial examples for natural language by proposing gradient-guided word candidates. Specifically, they proposed both black-box and white-box attacks; for black-box attacks, they perform removal, insertion and replacement with words chosen from a pre-selected candidate set, but empirical studies indicate these candidates are neither efficient nor effective for attacking. As for the white-box attacks, the gradient of the victim model is introduced to score the pre-selected candidate set, which successfully improves the attacking performance. However, the white-box setting is not practical in the real world, as attackers do not have access to the gradient and structure of the victim models. In addition, MHA successfully improves language quality in terms of fluency, but it does not optimize the imperceptibility of the generated examples, especially the modification rate.

3 Imperceptible adversarial attack via Markov chain Monte Carlo

In this section, we detail our proposed method, RJA-MMR, the Reversible Jump Attack (RJA) with Metropolis–Hasting Modification Reduction (MMR).

3.1 Problem formulation and notation

Given a pre-trained text classification model \(F\), which maps from feature space \({\mathcal {X}}\) to a set of classes \({\mathcal {Y}}\), an adversary aims to generate an adversarial document \(\mathbf {x^*}\) from a legitimate document \(x\in {\mathcal {X}}\) whose ground truth label is \(y\in {\mathcal {Y}}\), so that \(F(\mathbf {x^*})\ne y\). The adversary also requires \(Sem(x,\mathbf {x^*}) \ge \epsilon\) for a domain-specific semantic similarity function \(Sem(\cdot ): {\mathcal {X}}\times {\mathcal {X}}\rightarrow (0,1)\), where the bound \(\epsilon \in {\mathbb {R}}\) helps to ensure imperceptibility. In other words, in the context of text classification tasks, we use \(Sem(x,\mathbf {x^*})\) to capture the semantic similarity between x and \(\mathbf {x^*}\). More details of the notation are illustrated in Table 1.

Table 1 List of notations used in this research
Fig. 2

The workflow of our RJA-MMR. In this example, RJA-MMR generates an adversarial example with one word perturbed to attack a sentiment classifier with two labels (positive and negative). Block \(\textcircled {1}\) shows the calculation of word saliency. After obtaining the word saliency, we perform RJA in block \(\textcircled {2}\), which reflects lines 4–15 in Algorithm 1. After RJA, we perform the two MMR steps, restoring and updating, in blocks \(\textcircled {3}\) and \(\textcircled {4}\), respectively. Blocks \(\textcircled {3}\) and \(\textcircled {4}\) are illustrated in lines 4–10 and lines 11–18 of Algorithm 2, respectively

3.2 Reversible jump attack

This section details our proposed Reversible Jump Attack (RJA) which generates adversarial examples under semantic regularisation. Let \(D=\{(x_1,y_1),(x_2,y_2),\ldots ,(x_N,y_N)\}\) denote a dataset with N data samples, where x and y are the input text and its corresponding class. Given the input text \(x=[w_1,\ldots ,w_i, \ldots , w_{n}]\) with n words, we denote an adversarial candidate of RJA as \({\textbf{x}}\) and denote the final chosen adversarial example as \({\textbf{x}}^*\).

RJA, unlike traditional methods, treats the number of perturbed words (NPW) as a variable in the sampling process, not a preset value. Utilizing the Reversible Jump Sampler, RJA conditionally samples NPW, victim words, and their substitutions. The approach involves a transition function that proposes adversarial candidates, evaluated against a target distribution focusing on attack effectiveness and semantic similarity (Eq. 2). This process iteratively refines the adversarial examples, guided by an acceptance probability mechanism.

This section first presents the transition function (Sect. 3.2.1) and then elaborates on the acceptance probability (Sect. 3.2.2), which builds upon the transition function.

3.2.1 Transition function

To propose the adversarial candidates, we construct our transition function to sequentially propose the three compulsory factors of crafting a new adversarial candidate \({\textbf{x}}_{t+1}\) given the current one \({\textbf{x}}_t\): the NPW m, the victim words \({\textbf{v}}=[v_1, \ldots ,v_m]\), and the corresponding substitutions \({\textbf{s}}=[s_1, \ldots ,s_m]\), where the dimension of \({\textbf{v}}\) and \({\textbf{s}}\) is m. Before detailing the process of proposing these factors, we first introduce the concept of word saliency. In this context, word saliency refers to the impact of the word \(w_i\) on the output of the classifier if this word is deleted from the sentence. A word with high saliency has a high impact on the classifier. Thus, assigning more importance to high-saliency words helps the transition function efficiently propose a high-quality adversarial candidate. To calculate the word saliency, we use the change of the victim classifier’s logit before and after deleting word \(w_i\) to represent the saliency \(I( w_i)\):

$$\begin{aligned} I( w_i)=F_{logit}(x)-F_{logit}(x\backslash w_i), \end{aligned}$$
(4)

where \(F_{logit}(\cdot )\) is the classifier returning the logit of the correct class, and \(x\backslash w_i=[w_1,\ldots ,w_{i-1},w_{i+1},\ldots , w_{n}]\) is the text with \(w_i\) removed. We calculate the word saliency \(I(w_i)\) for all \(w_i \in x\) to obtain word saliency \({\textbf{I}}(x)\). Calculating the word saliency is illustrated in Block \(\textcircled {1}\) of Fig. 2.
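As a concrete illustration of Eq. (4), a leave-one-word-out saliency computation might look as follows; `logit_fn` is a hypothetical black-box callable returning the victim classifier's logit for the ground-truth class.

```python
def word_saliency(words, logit_fn):
    """Compute I(w_i) = F_logit(x) - F_logit(x without w_i) for every word (Eq. 4).

    words    : list of tokens of the input text x
    logit_fn : callable returning the victim classifier's logit of the correct class
               for a list of tokens (hypothetical black-box interface)
    """
    base = logit_fn(words)
    saliency = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]        # the text with w_i removed
        saliency.append(base - logit_fn(reduced))  # drop in the correct-class logit
    return saliency
```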

Among the iterations of searching for victim words, assume the RJA adversarial candidate at iteration t is \({\textbf{x}}_t=(m_t,{\textbf{v}}_t,{\textbf{s}}_t)\) and the new adversarial candidate to be crafted is \({\textbf{x}}_{t+1}=(m_{t+1},{\textbf{v}}_{t+1},{\textbf{s}}_{t+1})\). We propose the first factor, the NPW value \(m_{t+1}\), by either adding or subtracting 1, i.e., \(m_{t+1}\in \{m_{t}+1, m_{t}-1\}\). This set \(\{m_{t}+1, m_{t}-1\}\) does not need to include \(m_{t}\) because if the proposed state is rejected, \(m_{t+1}\) will be retained as \(m_{t}\), which means \(m_{t}\) still remains a possible state. Thus the transition function for the new NPW value \(m_{t+1}\) can be formulated as a probability mass function as below:

$$\begin{aligned} p&(m_{t+1}| {\textbf{x}}_{t})= {\left\{ \begin{array}{ll} \displaystyle \frac{\exp (l_1)}{\exp (l_1)+\exp (l_2)}&{} m_{t+1}=m_{t}-1,\\ \displaystyle \frac{\exp (l_2)}{\exp (l_1)+\exp (l_2)} &{} m_{t+1}=m_{t}+1,\\ \end{array}\right. } \nonumber \\&\text {where} \quad l_1 = \sum _{w_i \in {\textbf{v}}_{t}} I(w_i), \quad l_2 = \sum _{w_i \notin {\textbf{v}}_{t}} I(w_i). \end{aligned}$$
(5)

Such a transition function proposes the new state \(m_{t+1}\in \{m_{t}-1,m_{t}+1\}\) by comparing the exponential of the attacked-word saliency \(l_1\) with that of the unattacked-word saliency \(l_2\). Intuitively, if the saliency values of all attacked words are high, the probability of proposing to reduce one attacked word, \(m_{t+1}=m_{t}-1\), is high, and vice versa. Concretely, to sample \(m_{t+1}\) from such a transition function, we first draw a random number, \(\eta \sim Unif(0,1)\); if \(\eta\) is less than the probability of sampling \(m_{t+1}=m_{t}-1\), i.e., \(\eta <\frac{\exp (l_1)}{\exp (l_1)+\exp (l_2)}\), then \(m_{t+1}=m_{t}-1\), otherwise \(m_{t+1}=m_{t}+1\). Unlike hierarchical attacks, which deterministically perturb the words in the descending order of word saliency, randomization is applied because of its two merits: 1) it overcomes the imprecision problem with the word saliency rank (WSR) mentioned in the Introduction, and 2) it enlarges the search domain by proposing more combinations of attacked words than those in hierarchical searching.
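A minimal sketch of this proposal step is given below; it draws \(m_{t+1}\) from Eq. (5) given the word saliencies, and the numerically stable rewrite of the softmax ratio is an implementation choice rather than part of the method.

```python
import math
import random

def propose_npw(saliency, attacked_idx):
    """Propose m_{t+1} in {m_t - 1, m_t + 1} following Eq. (5).

    saliency     : list of saliencies I(w_i) for all words in the input
    attacked_idx : set of indices of the currently attacked (victim) words
    Returns the proposed NPW and the probability of that proposal (needed for Eq. 10).
    """
    l1 = sum(saliency[i] for i in attacked_idx)                            # attacked-word saliency
    l2 = sum(s for i, s in enumerate(saliency) if i not in attacked_idx)   # unattacked-word saliency
    p_minus = 1.0 / (1.0 + math.exp(l2 - l1))   # = exp(l1) / (exp(l1) + exp(l2)), written stably
    m = len(attacked_idx)
    if random.random() < p_minus:
        return m - 1, p_minus
    return m + 1, 1.0 - p_minus
```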

After determining the number of perturbed words, we sample one target victim word \(v_{tgt}\) (where “tgt” refers to “target”) to be manipulated according to the newly sampled \(m_{t+1}\). Specifically, for \(m_{t+1}=m_{t}+1\), the target word \(v_{tgt}\) is uniformly sampled from the unattacked word set \(x \backslash {\textbf{v}}_{t}\), while for \(m_{t+1}=m_{t}-1\) the target word \(v_{tgt}\) is uniformly drawn from the attacked word set \({\textbf{v}}_{t}\) and the selected word is then restored to its original form. The transition function of sampling the target victim word \(v_{tgt}\) is thus formulated as:

$$\begin{aligned} \displaystyle p(v_{tgt}|{\textbf{x}}_{t},m_{t+1})&= {\left\{ \begin{array}{ll} \frac{1}{m_{t}}\quad v_{tgt}\in {\textbf{v}}_{t}&{} \text { if } m_{t+1}=m_{t}-1,\\ \frac{1}{n-m_{t}}\quad v_{tgt}\in {\textbf{x}}\backslash {\textbf{v}}_{t}&{} \text { if }m_{t+1}=m_{t}+1.\\ \end{array}\right. } \end{aligned}$$
(6)

After the target word \(v_{tgt} \in {\textbf{x}}_{t}\) is selected, we search for a parsing-fluent and semantic-preserving substitution for \(v_{tgt}\). Therefore, we uniformly draw a substitution \(s_{tgt}\) for \(v_{tgt}\) from the candidate set, which is the intersection (consensus) of candidates provided by Masked Language Models (MLMs) and synonyms. Specifically, let \({\mathcal {M}}\) denote the MLM; we mask \(v_{tgt}\) in \({\textbf{x}}\) to construct a masked \({\textbf{x}}_{mask}\) and feed the masked text into \({\mathcal {M}}\) to search for parsing-fluent candidates. Instead of using the argmax prediction, we take the K most probable words, which are the top K words suggested by the logits from \({\mathcal {M}}\), to construct the MLM candidate set \({\mathbb {G}}_{{\mathcal {M}}}=\{w_{{\mathcal {M}}}^{1}, \ldots ,w_{{\mathcal {M}}}^{K}\}\). To keep the text semantically similar, we form a synonym set \({\mathbb {G}}_{syn}=\{w_{syn}^1, \ldots , w_{syn}^K\}\) from HowNet (Dong et al., 2010) based thesauri such as OpenHowNet (Qi et al., 2019) and BabelNet (Qi et al., 2020). These thesauri are context-aware and can provide more synonyms than common thesauri such as WordNet (Miller, 1992). Since our objective is that the generated adversarial examples should be parsing-fluent and semantic-preserving, the substitution \(s_{tgt}\) is uniformly sampled from the intersection \({\mathbb {G}}={\mathbb {G}}_{{\mathcal {M}}}\cap {\mathbb {G}}_{syn}\), as illustrated in Eq. (7).

$$\begin{aligned} p(s_{tgt}|w_{tgt},m_{t+1},{\textbf{x}}_{t})=\frac{1}{[{\mathbb {G}}]} \end{aligned}$$
(7)

where \({\mathbb {G}}={\mathbb {G}}_{{\mathcal {M}}}\cap {\mathbb {G}}_{syn}\) and \([{\mathbb {G}}]\) is the cardinality of the set \({\mathbb {G}}\).
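One possible realisation of this candidate construction is sketched below, using the Hugging Face fill-mask pipeline for \({\mathbb {G}}_{{\mathcal {M}}}\); the `synonym_fn` argument stands in for an OpenHowNet/BabelNet lookup whose exact API is not shown, and the space-joined masking and the empty-intersection fallback are simplifying assumptions rather than part of the paper's implementation.

```python
import random
from transformers import pipeline

# Masked language model for contextual candidates; the model name is an assumption.
mlm = pipeline("fill-mask", model="roberta-large")

def propose_substitution(words, tgt_idx, synonym_fn, k=30):
    """Uniformly sample s_tgt from G = G_MLM intersect G_syn (Eq. 7).

    words      : token list of the current adversarial candidate
    tgt_idx    : index of the target victim word v_tgt
    synonym_fn : stand-in for a HowNet/BabelNet synonym lookup (assumed interface)
    """
    # Joining tokens with spaces is a simplification; real text may need detokenization.
    masked = words[:tgt_idx] + [mlm.tokenizer.mask_token] + words[tgt_idx + 1:]
    predictions = mlm(" ".join(masked), top_k=k)
    mlm_cands = {p["token_str"].strip() for p in predictions}   # G_MLM: top-K MLM suggestions
    syn_cands = set(synonym_fn(words[tgt_idx]))                 # G_syn: thesaurus synonyms
    candidates = mlm_cands & syn_cands
    if not candidates:                                          # assumption: keep the word if G is empty
        return words[tgt_idx]
    return random.choice(sorted(candidates))
```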

By applying the chain rule of probability to Eqs. (5), (6) and (7), the final transition function is:

$$\begin{aligned} p_{_{RJA}}\left( {\textbf{x}}_{t+1}|{\textbf{x}}_{t}\right) =p\left( m_{t+1}|{\textbf{x}}_{t}\right) p\left( w_{tgt}|m_{t+1},{\textbf{x}}_{t}\right) p\left( s_{tgt}|w_{tgt},m_{t+1},{\textbf{x}}_{t}\right) \end{aligned}$$
(8)

3.2.2 Acceptance probability for RJA

Before calculating the acceptance probability, we need to construct the target distribution for evaluating the performance. Specifically, we argue that a good adversarial example should achieve a successful attack while being kept semantically similar to the input text x. Therefore, we formulate the following equation as our target distribution:

$$\begin{aligned} \pi ({\textbf{x}})=\frac{\left( 1-F_{p}({\textbf{x}})\right) Sem\left( x,{\textbf{x}}\right) }{C}, \end{aligned}$$
(9)

where \(Sem(x,{\textbf{x}})\) represents the semantic similarity, which generally is implemented with the cosine similarity between sentence encodings from a pre-trained sentence encoder, such as USE (Cer et al., 2018). \(C=\sum _{{\textbf{x}}\in {\mathcal {X}}}\left( 1-F_{p}({\textbf{x}})\right) Sem\left( x,{\textbf{x}}\right)\) is a positive normalizing factor to make \(\sum _{{\textbf{x}}\in {\mathcal {X}}}\pi ({\textbf{x}})=1\) and \(F_{p}(\cdot ): {\mathcal {X}}\rightarrow (0,1)\) denotes the confidence of making right predictions where \({\mathcal {X}}\) represents text space. From Eq. (9), we can easily observe that the value from target distribution \(\pi ({\textbf{x}})\) will increase with the increase of the attacking performance measured by the confidence of making a wrong prediction \(1-F_{p}({\textbf{x}})\), and semantic similarity \(Sem(x,{\textbf{x}})\).
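Because the normalizing constant C cancels in the acceptance ratio used later, an implementation only needs the unnormalized score. A minimal sketch, with hypothetical callables for the classifier confidence and the USE similarity, is:

```python
def target_score(x_orig, x_adv, correct_prob_fn, sem_fn):
    """Unnormalized target distribution of Eq. (9); the constant C cancels in the MH ratio.

    correct_prob_fn : callable, victim classifier's confidence F_p(x_adv) in the true label
    sem_fn          : callable, semantic similarity Sem(x, x_adv), e.g. USE cosine similarity
    """
    return (1.0 - correct_prob_fn(x_adv)) * sem_fn(x_orig, x_adv)
```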

Given the target distribution in Eq. (9) and the transition function in Eq. (8), we formulate the acceptance probability for RJA, \(\alpha _{_{RJA}}({\textbf{x}}_{t+1}|{\textbf{x}}_{t})\), as follows:

$$\begin{aligned} \displaystyle \alpha _{_{RJA}}({\textbf{x}}_{t+1}|{\textbf{x}}_{t})=\min \left\{ \frac{\pi ({\textbf{x}}_{t+1}) p_{_{RJA}}({\textbf{x}}_{t}|{\textbf{x}}_{t+1})}{\pi ({\textbf{x}}_{t}) p_{_{RJA}}({\textbf{x}}_{t+1}|{\textbf{x}}_{t})}, 1\right\} \end{aligned}$$
(10)

After calculating \(\alpha ({\textbf{x}}_{t+1}|{\textbf{x}}_{t})\), we sample a random number \(\epsilon\) from a uniform distribution, \(\epsilon \sim Uniform(0,1)\); if \(\epsilon <\alpha ({\textbf{x}}_{t+1}|{\textbf{x}}_{t})\), we accept \({\textbf{x}}_{t+1}\) as the new state, otherwise the state remains \({\textbf{x}}_{t}\). By running T iterations, we obtain a set of adversarial candidates \(\{{\textbf{x}}_1,{\textbf{x}}_2,\ldots {\textbf{x}}_T\}\). We then choose the candidate that not only successfully fools the classifier but also preserves the most semantics as the final adversarial candidate \({\textbf{x}}\). The process of RJA is illustrated in Algorithm 1 and block \(\textcircled {2}\) in Fig. 2.

Algorithm 1

Reversible Jump Attack (RJA)
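A compact sketch of this loop is given below; the proposal, proposal density, target score, and success/similarity checks are passed in as callables, and all names are placeholders rather than the exact implementation of Algorithm 1.

```python
import random

def rja_attack(x, T, propose_fn, trans_density, target_fn, success_fn, sim_fn):
    """Sketch of the RJA loop (Algorithm 1): propose, accept with Eq. (10), keep the best.

    propose_fn    : draws a new candidate x' from the transition function (Eq. 8)
    trans_density : evaluates p_RJA(x' | x), needed because the proposal is asymmetric
    target_fn     : unnormalized target distribution pi(.) (Eq. 9)
    success_fn    : returns True if a candidate fools the victim classifier
    sim_fn        : semantic similarity of a candidate to the original input x
    """
    current, best = x, None
    for _ in range(T):
        cand = propose_fn(current)
        alpha = min(1.0, (target_fn(cand) * trans_density(current, cand))
                         / (target_fn(current) * trans_density(cand, current)))
        if random.random() < alpha:
            current = cand
        if success_fn(current) and (best is None or sim_fn(current) > sim_fn(best)):
            best = current        # keep the successful candidate with the highest similarity
    return best
```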

3.3 Modification reduction with metropolis–hasting algorithm

Besides the success of tampering with the classifier and semantic preservation, the modification rate is also an important factor in evaluating the imperceptibility of adversarial examples. Generally, methods in the literature can generate effective adversarial examples; however, it is hard to guarantee that the modification rate is the lowest possible. To address this, we introduce Metropolis–Hasting Modification Reduction (MMR), leveraging the Metropolis–Hasting (MH) algorithm to optimize the modification rate by exploring efficient yet minimal substitution combinations for a given adversarial candidate. MMR involves two steps, each employing the MH algorithm: 1) stochastically restoring some attacked words to create a less modified candidate, and 2) updating all substitutions without altering the NPW, \(m\). These steps are detailed in Sects. 3.3.1 and 3.3.2, respectively.

3.3.1 Restoring attacked words with MMR

The first step of MMR is probabilistically restoring some attacked words with the MH algorithm to test the necessity of the current substitutions. Given an adversarial candidate \({\textbf{x}}_t=(m_t, {\textbf{v}}_t,{\textbf{s}}_t)\) from iteration t of RJA, we aim to generate an adversarial candidate \({\textbf{x}}_t^{r}\) constructed by restoring some attacked words in \({\textbf{x}}_t\). To sample the restored substitutions, we propose the probability mass function of selecting substitutions \(s^r \in \{s_i,w_i\}\) at iteration t as follows:

$$\begin{aligned} p(s^r|{\textbf{x}}_t)&= {\left\{ \begin{array}{ll} \displaystyle \frac{\exp (I(w_i))}{1+\exp (I(w_i))}&{} \text { if } s^r = s_{i}\; (\text {continue to attack}),\\ \displaystyle \frac{1}{1+\exp (I(w_i))} &{} \text { if } s^r = w_{i} \;(\text {attack cancelled}),\\ \end{array}\right. } \end{aligned}$$
(11)
$$\begin{aligned}&\quad p_{restore}({\textbf{x}}^{r}_{t}|{\textbf{x}}_t)=\prod _{s^r \in {\textbf{s}}_t}p(s^r|{\textbf{x}}_{t}) \end{aligned}$$
(12)

where \(s^r=s_i\) denotes continuing the attack and \(s^r=w_i\) denotes restoring the substitution to the original word \(w_i\). Here \({\textbf{x}}^{r}_{t}\) is the proposed adversarial candidate with the selected substitutions restored from \({\textbf{x}}_t\). With such a probability mass function, \(s^r\) can be sampled with the same strategy as in Eq. (5). To further investigate the quality of such a candidate, we apply the target distribution, \(\pi ({\textbf{x}})\), in Eq. (9) to construct the following acceptance probability:

$$\begin{aligned} \displaystyle \alpha _{restore}({\textbf{x}}^{r}_{t}|{\textbf{x}}_t)=\min \left( \frac{\pi ({\textbf{x}}^{r}_{t}) p_{restore}({\textbf{x}}_t|{\textbf{x}}^{r}_{t})}{\pi ({\textbf{x}}_t) p_{restore}({\textbf{x}}^{r}_{t}|{\textbf{x}}_t)}, 1\right) \end{aligned}$$
(13)

to decide whether the proposed adversarial candidate \({\textbf{x}}^{r}_t\) should be accepted as the true candidate.
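A sketch of one restoring step is given below. Substitutions are represented as a position-to-word mapping, and `restore_density` is a placeholder for evaluating \(p_{restore}(\cdot \mid \cdot )\) from Eq. (12), so the snippet illustrates the flow of Eqs. (11)–(13) rather than the exact implementation.

```python
import math
import random

def mmr_restore_step(cand_subs, saliency, target_fn, restore_density):
    """One MMR restoring step following Eqs. (11)-(13).

    cand_subs       : dict {position: substitute word} of the current candidate's substitutions
    saliency        : word saliencies I(w_i) of the original input
    target_fn       : unnormalized target distribution pi(.) (Eq. 9)
    restore_density : callable p_restore(a | b) between two substitution dicts (Eq. 12)
    """
    proposal = {}
    for i, sub in cand_subs.items():
        p_keep = math.exp(saliency[i]) / (1.0 + math.exp(saliency[i]))  # Eq. (11): keep attacking w_i
        if random.random() < p_keep:
            proposal[i] = sub                                           # otherwise w_i is restored
    ratio = (target_fn(proposal) * restore_density(cand_subs, proposal)) / \
            (target_fn(cand_subs) * restore_density(proposal, cand_subs))
    return proposal if random.random() < min(1.0, ratio) else cand_subs  # Eq. (13)
```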

3.3.2 Updating the combination of substitutions with MMR

Having restored selected substitutions to obtain the adversarial candidate \({\textbf{x}}^{r}_{t}\) at the \(t\)-th iteration, we proceed to the second step: MMR updating. This step is designed to refine attack performance by altering substitution combinations without affecting the NPW, \(m_t\). We apply a methodology similar to the one in Eq. (7) for sampling substitution combinations. In essence, the MMR updating utilizes the candidate proposing function (Eq. 7) to explore alternative substitutions for each attacked word, aiming for enhanced attack efficacy. The formulation for this update, leading to the next adversarial candidate \({\textbf{x}}^{u}_{t}\), is governed by the subsequent acceptance probability:

$$\begin{aligned} \alpha _{update}({\textbf{x}}^{u}_{t}|{\textbf{x}}^{r}_{t})&=\min \left( \frac{\pi ({\textbf{x}}^{u}_{t}) p_{update}({\textbf{x}}^{r}_{t}|{\textbf{x}}^{u}_{t})}{\pi ({\textbf{x}}^{r}_{t}) p_{update}({\textbf{x}}^{u}_{t}|{\textbf{x}}^{r}_{t})}, 1\right) ,\end{aligned}$$
(14)
$$\begin{aligned} p_{update}({\textbf{x}}^{u}_{t}|{\textbf{x}}^{r}_{t})&=\prod _{s_i \in {\textbf{s}}^{r}_t}p(s_i|w_{i}, m^{r}_{t},{\textbf{x}}^{r}_{t}) , \end{aligned}$$
(15)

where \(p(s_i|w_{i},m^{r}_{t},{\textbf{x}}^{r}_{t})\) is identical to that in Eq. (7).

By iteratively running the MH algorithms for substitution restoring and updating for T iterations with the acceptance probabilities in Eqs. (13) and (14), respectively, we construct the adversarial set \({\mathbb {X}}'=\{{\textbf{x}}^u_{t}\}^{T}_{t=1}\) and select, among the successful candidates that fool the classifier, the one with the highest semantic similarity as the final adversarial example \({\textbf{x}}^{*}\). The proposed MMR algorithm is not only applied to our RJA algorithm but can also help other attack methods reduce their modifications. The whole process of MMR is illustrated in Algorithm 2 and blocks \(\textcircled {3}\) and \(\textcircled {4}\) in Fig. 2.

Algorithm 2

Metropolis–Hasting Modification Reduction (MMR)

4 Experiments and analysis

In this section, we comprehensively evaluate the performance of our method against the current state of the art. Besides the main results (Sect. 4.4) on attacking performance and imperceptibility, we also conduct experiments on ablation studies (Sect. 4.5), efficiency analysis (Sect. 4.6), transferability (Sect. 4.7), targeted attacks (Sect. 4.8), performance against defense mechanisms (Sect. 4.9), adversarial retraining (Sect. 4.10), part-of-speech (POS) preference (Sect. 4.11) and the relationship between model scale and robustness (Sect. 4.12).

We evaluate the effectiveness of our methods on four widely-used and publicly available benchmark datasets: AG’s News (Zhang et al., 2015), Emotion (Saravia et al., 2018), SST2 (Socher et al., 2013) and IMDB (Maas et al., 2011). Specifically, AG’s News is a news classification dataset with 127,600 samples belonging to 4 topic classes: World, Sports, Business and Sci/Tech. Emotion (Saravia et al., 2018) is a dataset with 20,000 samples and 6 classes: sadness, joy, love, anger, fear and surprise. SST2 (Socher et al., 2013) is a binary-class (positive and negative) sentiment dataset with 9613 samples. The IMDB dataset (Maas et al., 2011), comprising movie reviews from the Internet Movie Database, is predominantly utilized for binary sentiment classification, categorizing reviews into ‘positive’ or ‘negative’ sentiments. The details of these datasets can be found in Table 2.

To ensure reproducibility, we provide the code and data used in our experiments in a GitHub repository.Footnote 1

Table 2 Datasets and accuracy of victim models before attacks

4.1 Victim models

We apply our attack algorithm to two types of popular and well-performing victim models. The details of the models are given below.

4.1.1 BERT-based classifiers

To conduct convincing experiments, we choose well-performing and popular BERT-based models, which we call BERT-C models (where the letter “C” represents “classifier”), pre-trained by Huggingface.Footnote 2 Due to the different sizes of the datasets, the structures of the BERT-based classifiers are adjusted accordingly. The BERT classifier for AG’s News is structured as Distil-RoBERTa-base (Sanh et al., 2019) connected with two fully connected layers, and it is trained for 10 epochs with a learning rate of 0.0001. For the Emotion dataset, its BERT-C adopts another version of BERT, Distil-BERT-base-uncased (Sanh et al., 2019), and the training hyper-parameters remain the same as the BERT-C for AG’s News. Since the SST2 dataset is relatively small compared with the other datasets, the corresponding BERT classifier utilizes a small-size version of BERT, BERT-base-uncased (Devlin et al., 2019). As for IMDB, we employ Distil-BERT-base-uncased for the classification task. The test accuracy of these BERT-based classifiers before being attacked is listed in Table 2, and these models are publicly accessible (Footnotes 3 to 6).

4.1.2 TextCNN-based models

The other type of victim model is TextCNN (Kim, 2014), structured with a 100-dimensional embedding layer followed by a 128-unit long short-term memory layer. This classifier is trained for 10 epochs with the Adam optimizer using a learning rate of \(lr=0.005\), coefficients \(\beta _1=0.9\) and \(\beta _2=0.999\) for computing the running averages of the gradient and its square, and \(\sigma =10^{-5}\) added to the denominator to improve numerical stability. The accuracy of these TextCNN-based models is also shown in Table 2.

4.2 Baselines

To evaluate the attacking performance, we use the TextAttack (Morris et al., 2020) framework to deploy the following baselines:

  • AGA (Alzantot et al., 2018): it uses the combination of restrictions on word embedding distance and language model prediction scores to reduce search space. As for the searching algorithm, it adopts a genetic algorithm, a popular population-based evolutionary algorithm.

  • Faster Alzantot Genetic Algorithm (FAGA) (Jia et al., 2019): it accelerates AGA by bounding the searching domain of genetic optimization.

  • BERT-Base Adversarial Examples (BAE) (Garg & Ramakrishnan, 2020): it replaces and inserts tokens in the original text by masking a portion of the text and leveraging the BERT-MLM.

  • Metropolis–Hasting Attack (MHA) (Zhang et al., 2019): it performs Metropolis–Hasting sampling, which is designed with the guidance of gradients, to sample the examples from a pre-selector that generates candidates by using MLM.

  • BERT-Attack (BA) (Li et al., 2020): it takes advantage of BERT-MLM to generate candidates and attacks words in the static descending order of WSR.

  • Probability Weighted Word Saliency (PWWS) (Ren et al., 2019): it chooses candidate words from WordNet (Miller et al., 1990) and sorts word attack order by multiplying the word saliency and probability variation.

  • TextFooler (TF) (Jin et al., 2020): it ranks the important words with a strategy similar to Eq. (4). Based on this ranking, the attacker prioritizes replacing them with the most semantically similar and grammatically correct words until the prediction is altered.

  • Particle Swarm Optimization (PSO) (Zang et al., 2020): it selects word candidates from HowNet and employs PSO to find adversarial text. This method treats every sample as a particle whose location in the search space needs to be optimized.

4.3 Experimental settings and evaluation metrics

For our RJA and RJA-MMR, we use the Universal Sentence Encoder (USE) (Cer et al., 2018) to measure the sentence semantic similarity for the target distribution in Eq. (9). We set the number of substitution candidates to \(K=30\); to generate these candidates, we use RoBERTa-large (Liu et al., 2019) as the MLM with a WordPiece (Wu et al., 2016) tokenizer for contextual infilling, and utilize OpenHowNet (Qi et al., 2019) with the NLTK (Bird et al., 2009) tokenizer as the synonym thesaurus. For the sampling-based algorithms, MHA and the proposed methods (RJA, RJA-MMR), we set the maximum number of iterations T to 1000.

We argue that the quality of adversarial examples is appraised with regard to three key facets: attacking performance, imperceptibility, and fluency. To measure these facets, we use the following five metrics to measure the performance of adversarial attacks:

  • Successful attack rate (SAR) is defined as the percentage of attacks where the adversarial examples make the victim models predict a wrong label.

  • Modification Rate (Mod) is the percentage of modified tokens. Each replacement, insertion or removal action accounts for one modified token.

  • Grammar Error (GErr) is measured by the absolute rate of increased grammatical errors in the successful adversarial examples, compared to the original text, where we use LanguageTool (Naber et al., 2003) to obtain the number of grammatical errors.

  • Perplexity (PPL) denotes a metric used to evaluate the fluency of adversarial examples (Kann et al., 2018; Zang et al., 2020). The perplexity is calculated using small-sized GPT-2 with a 50k-sized vocabulary (Radford et al., 2019).

  • Textual similarity (Sim) is measured by the cosine similarity between the sentence embeddings of the input and that of the adversarial sample. We encoded the two sentences with the universal sentence encoder (USE) (Cer et al., 2018).

SAR evaluates attack performance, while Mod and Sim measure imperceptibility. GErr and PPL assess language fluency.
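As an illustration of how the PPL metric above can be computed, the sketch below scores a sentence with small GPT-2 via Hugging Face Transformers; it mirrors the metric's definition, though it is not necessarily the exact evaluation script used in our experiments.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")        # small GPT-2 with a ~50k vocabulary
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    """Perplexity of a sentence under GPT-2 small, as used for the PPL fluency metric."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss                # mean token-level cross-entropy
    return math.exp(loss.item())

# A fluent sentence should score a lower PPL than a disfluent adversarial one.
print(perplexity("The movie was surprisingly good."))
```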

Table 3 Results on SAR, Mod, and Sim metrics among the baselines and proposed methods on different datasets
Table 4 Results on PPL and GErr metrics among the baselines and proposed methods on different datasets

4.4 Experimental results and analysis

The main experimental results on attacking performance (SAR), imperceptibility (Sim, Mod) and the fluency of adversarial examples (PPL, GErr) are listed in Tables 3 and 4. Moreover, we present adversarial examples crafted by various methods in Table 5. We demonstrate the three contributions mentioned in the Introduction by answering three research questions:

4.4.1 Does our method make more threatening attacks than the baselines?

We compare the attacking performance of the proposed method RJA-MMR and the baselines in Table 3. This table demonstrates that RJA-MMR consistently outperforms the competing methods across different data domains, regardless of the structure of the classifiers. Further, even RJA by itself, without MMR, can craft more menacing adversarial examples than most baselines. We attribute this outstanding attacking performance to two prevailing aspects of RJA. Firstly, RJA optimizes the performance by stochastically searching the domain. Most of the baselines perform a deterministic search algorithm, which can get stuck in local optima. In contrast, the stochastic mechanism helps escape local optima and further maximizes the attacking performance.

Secondly, some of the baselines strictly attack the victim words in the order of the word saliency rank (WSR), where the domain of the hierarchical search is limited to combinations of neighboring victim words from the WSR, which can miss the potentially optimal combination of victim words. Unlike these methods, RJA enlarges the search domain by testing more combinations of substitutions that do not follow the WSR order. Thus, the proposed method RJA achieves the best attacking performance, with the highest successful attack rate (SAR).

Table 5 Adversarial examples of the Emotion dataset for victim classifier BERT-C

4.4.2 Is RJA-MMR superior to the baselines in terms of imperceptibility?

We evaluate the imperceptibility of different attack strategies in terms of semantic similarity (Sim) and modification rate (Mod) between the original input text and its derived adversarial examples, as shown in Table 3. It can be seen that the proposed RJA-MMR attains the best performance among the baselines. The outstanding performance of the proposed method is attributed to the mechanisms of RJA and MMR. For semantic preservation, we design the target distribution (Eq. 9) with a strong regularization of semantic similarity in each iteration. Moreover, HowNet is a knowledge-graph-based thesaurus that provides part-of-speech (POS) aware substitutions. Compared with the candidates supplied by the baselines, the synonyms from HowNet can be more semantically similar to the original words. As for the modification rate, the proposed MMR is mainly designed for restoring attacked words in successful adversarial examples, so that the proposed RJA-MMR perturbs fewer words without sacrificing the attacking performance. Thus we conclude that the proposed RJA-MMR provides the best imperceptibility among the baselines.

4.4.3 Is the quality of adversarial examples generated by the proposed methods better than that crafted by the baselines?

We maintain that qualified adversarial examples should be parsing-fluent and grammatically correct. From Table 4, we find that RJA-MMR yields the lowest perplexity (PPL), which means the examples generated by RJA-MMR are more likely to appear in the evaluation corpus. As the corpus is large enough and the evaluation model is broadly used, this indicates these examples are more likely to appear in the natural language space, eventually leading to better fluency. For grammar errors, the proposed method RJA-MMR is substantially better than the other baselines, which indicates a better quality of the adversarial examples. We attribute this performance to our method of finding word substitutions, which constructs the candidate set by intersecting the candidates from HowNet and the MLM.

4.5 Ablation study

To rigorously validate the efficacy of the proposed RJA-MMR method, this section conducts a detailed ablation study, dissecting each component to assess its individual impact and overall contribution to the method’s performance.

4.5.1 Effectiveness of RJA

We compare the attacking performance of our Reversible Jump Attack methods (RJA, RJA-MMR) and the baselines in Table 3, reflected by SAR. RJA helps attackers achieve the best attacking performance, with the highest SAR across the different downstream tasks. Apart from RJA-MMR, its ablation RJA also surpasses the strong baselines in most cases. Therefore, RJA is effective in terms of attacking performance.

4.5.2 Effectiveness of MMR

MMR is a stochastic mechanism to reduce the modifications of adversarial examples with attacking performance preserved. Besides RJA-MMR, we also apply MMR to different attacking algorithms, including PSO, TF, PWWS, BA and MHA, aiming to demonstrate the advantages of MMR in general.

From Table 3, we find that RJA-MMR outperforms RJA with lower modification rates. Moreover, the analysis results for the other baselines are shown in Fig. 3. They show that the attacking algorithms with MMR consistently have a lower modification rate than those without MMR. This means that attacking strategies can generally benefit from MMR by making fewer modifications.

Fig. 3

Comparisons on modification rates among attacking strategies (PSO, TF, PWWS, BA, MHA) with MMR and without MMR to attack the BERT-C on AG News dataset

Table 6 Performance metrics for RJA-MMR against the TextCNN model on the AG News dataset using varied word candidate selection methods
Table 7 Assessment of attack algorithms’ efficiency on the Emotion dataset, utilizing empirical complexity (EC) in seconds per example for practical evaluation and total variance (TV) distance for theoretical convergence speed analysis

4.5.3 Performance versus the number of iterations

The performance of the proposed methods is influenced by the number of iterations, denoted as \(T\). To delve deeper into this relationship, we conducted an extensive ablation study examining the correlation between performance and \(T\). Insights drawn from Fig. 4 reveal a positive trend where performance improves with the number of iterations. Notably, performance begins to plateau, indicating convergence, at \(T=100\).

Fig. 4

The progression of SAR, SIM, Mod, GErr, and PPL metrics for SST2 BERT over increased iterations (T). Performance trends and convergence points are visually represented

4.5.4 Effectiveness of the word candidates

In our ablation study, detailed in Table 6, we explored the effectiveness of various word candidate selection methods on the performance of RJA-MMR against the TextCNN model, utilizing the AG News dataset. Our evaluation included three strategies: using HowNet, MLMs with BERT-base (Devlin et al., 2019), RoBERTa-large (Liu et al., 2019), and a synergistic approach combining HowNet and MLMs. Individually, HowNet and the MLM approaches showed notable performance, with RoBERTa-large slightly outperforming BERT-base. However, the combination of HowNet and MLMs produced superior results, surpassing the individual methods in all evaluated metrics, highlighting the significant advantage of integrating HowNet with MLMs to enhance the effectiveness of adversarial attacks.

Furthermore, our analysis of combination strategies for generating word candidates revealed that the more sophisticated MLM, RoBERTa-large, yielded a more effective attack performance than its less advanced counterpart, BERT-base. This finding suggests a positive correlation between advancements in MLM technology and enhancements in attack efficacy. We attribute this trend to the ability of more advanced MLMs to generate more relevant and suitable word candidates for use in attack methodologies, thereby increasing the precision and effectiveness of adversarial strategies.

4.6 Platform and efficiency analysis

In this section, we evaluate efficiency from both empirical and theoretical perspectives. To perform the empirical complexity (EC) evaluation, we carry out all experiments on RHEL 7.9 with the following specification: Intel(R) Xeon(R) Gold 6238R 2.2GHz 28 cores (26 cores enabled) 38.5MB L3 Cache (Max Turbo Freq. 4.0GHz, Min 3.0GHz) CPU, NVIDIA Quadro RTX 5000 (3072 Cores, 384 Tensor Cores, 16GB Memory) GPU, and 88GB RAM. Table 7 lists the time consumed for attacking the BERT and TextCNN classifiers on the Emotion dataset. The metric of time efficiency is seconds per example; a lower value indicates better efficiency. Results in Table 7 show that our RJA and RJA-MMR run longer than some static counterparts (PWWS, BAE, TF) but are more efficient than the others, such as PSO, FAGA, MHA and BA. Nonetheless, the longer running time of our methods reflects the time genuinely needed to search for more optimal adversarial examples.

To theoretically gauge convergence speed, researchers employ the probabilistic concept of mixing time (MT), which denotes the duration for a Markov chain to approach its steady-state distribution closely (Kroese et al., 2011). Given that MT is constrained by the total variation (TV) distance between the proposed and target distributions, TV is frequently used as a metric to quantify both the mixing time and the speed of convergence (Metropolis et al., 1953a; Green, 1995b). Analysis of Table 7 reveals that the proposed RJA-MMR method registers the lowest TV distance, indicating superior theoretical performance in terms of convergence speed compared to other methods.

4.7 Transferability

The transferability of adversarial examples refers to their ability to degrade the performance of other models to a certain extent when the examples are generated on a specific classifier (Goodfellow et al., 2015). To evaluate transferability, we exchange the adversarial examples generated on BERT-C and TextCNN, and the results are shown in Fig. 5.

When the adversarial examples generated by our methods are transferred to attack BERT-C and TextCNN, the attacking performance of RJA-MMR still achieves more than an 80% success rate, which is the best among the baselines, as illustrated in Fig. 5. Apart from RJA-MMR, its ablated component RJA also surpasses most baselines. This suggests that the transfer attack performance of the proposed methods consistently outperforms the baselines.

Fig. 5

Performance of transfer attacks to victim models (BERT-C and TextCNN) on Emotion. A lower accuracy of the victim models indicates a higher transfer ability (i.e., the lower, the better)

4.8 Targeted attacks

A targeted attack perturbs a data sample with class y so that the sample is misclassified as a specified target class \(y^{\prime }\), and not as any other class, by the victim classifier. RJA and MMR can be easily adapted to targeted attacks by modifying \(1-F_{y}({\textbf{x}})\) to \(F_{y^{\prime }}({\textbf{x}})\) in Eq. (9). The targeted attack experiments are conducted on the Emotion dataset. The results demonstrate that the proposed RJA-MMR achieves better performance than PWWS in terms of attacking performance (SAR), imperceptibility (Mod, Sim) and sentence fluency (GErr, PPL) (Table 8).
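This change amounts to a one-line modification of the target score from Sect. 3.2.2; a sketch, reusing the hypothetical callables introduced earlier, is:

```python
def targeted_score(x_orig, x_adv, prob_fn, sem_fn, target_label):
    """Targeted variant of Eq. (9): reward confidence F_{y'}(x) in the chosen target class y'.

    prob_fn : callable, victim classifier's confidence in a given label for x_adv
    sem_fn  : callable, semantic similarity Sem(x, x_adv)
    """
    return prob_fn(x_adv, target_label) * sem_fn(x_orig, x_adv)
```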

Table 8 Targeted attack and imperceptibility-preserving performance on the Emotion dataset

4.9 Attacking models with defense mechanism

Defending against textual adversarial attacks is paramount in ensuring the integrity and security of machine learning models used in natural language processing applications. Effective defense mechanisms encompass two approaches: 1) robust model training, which utilizes adversarial training techniques to increase models’ resilience against malicious inputs, and 2) malicious input detection, which aims to identify and mitigate adversarial examples without actively altering the machine learning model’s structure or training process.

To ensure a thorough evaluation of our proposed attack methods, we have integrated two distinct defense mechanisms into our assessment. For passive defense, we adopt the Frequency-Guided Word Substitutions (FGWS) (Mozes et al., 2021) approach, which excels at identifying adversarial examples. For active defense, we incorporate Random Masking Training (RanMASK) (Zeng et al., 2023), a technique that bolsters model resilience via specialized training routines. We perform adversarial attacks on BERT-C on the IMDB and SST2 datasets, and the results are presented in Table 9. The results show that our method outperforms the baselines.

Table 9 A comparative analysis of attack performance (SAR) against BERT-C when subjected to two defense mechanisms, FGWS and RanMASK, across IMDB and SST2 datasets

4.10 Adversarial retraining

This section explores RJA-MMR’s potential in improving downstream models’ accuracy and robustness. Following (Li et al., 2021), we use RJA-MMR to generate adversarial examples from AG’s News training instances and include them as additional training data. We inject different proportions of adversarial examples into the training data for the settings of a BERT-based MLP classifier and a TextCNN classifier without any pre-trained embedding. We provide adversarial retraining analysis by answering the following two questions:

Fig. 6

Results of adversarially trained BERT and TextCNN obtained by inserting different numbers of adversarial examples into the training set. The accuracy is based on the performance on the SST2 test set

4.10.1 Can adversarial retraining help achieve better test accuracy?

As shown in Fig. 6, when the training data is accessible, adversarial training gradually increases the test accuracy while the proportions of adversarial data are smaller than roughly 30%. Based on our results, we can see that a certain amount of adversarial data can help improve the models’ accuracy, but too much such data will degrade the performance. This means that the right amount of adversarial data will need to be determined empirically, which matches the conclusions made from previous research (Jia et al., 2019; Yang et al., 2021).

Fig. 7

The success attack rate (SAR) of adversarially retrained models with different numbers of adversarial examples. A lower SAR indicates a victim classifier is more robust to adversarial attacks

4.10.2 Does adversarial retraining help the models defend against adversarial attacks?

To evaluate this, we use RJA-MMR to attack the classifiers trained with different proportions (\(0\%,10\%,20\%,30\%,40\%\)) of adversarial examples. A higher successful attack rate (SAR) indicates a victim classifier is more vulnerable to adversarial attacks. As shown in Fig. 7, adversarial training helps to decrease the attack success rate by more than 10% for the BERT classifier (BERT-C) and 5% for TextCNN. These results suggest that the proposed RJA-MMR can be used to improve downstream models’ robustness by adding its generated adversarial examples to the training set.

4.11 Parts of speech preference

Given the superiority of the proposed method in attacking performance, we investigate its attacking preference, described by parts of speech (POS), for further linguistic analysis. In this subsection, we break down the attacked words in the AG’s News dataset by part-of-speech tags with the Stanford POS tagger (Toutanova et al., 2003), and the collected statistics are shown in Table 10. By analyzing the results, we expect to identify the more vulnerable POS categories by comparing the proposed methods and the baselines.

We apply the POS tagger to annotate the attacked words with POS tags, including noun, verb, adjective (Adj.), adverb (Adv.) and others (i.e., pronoun, preposition, conjunction, etc.). Statistical results in Table 10 demonstrate that all the attacking methods heavily focus on nouns. Presumably, in the topic classification task, the prediction heavily depends on nouns. However, the proposed attacking strategies (RJA and RJA-MMR) tend to attack a more significant proportion of Others than any other method; thus we might conclude that Others (pronouns, prepositions and conjunctions) are the second most adversarially vulnerable category. Since these tags do not carry much semantics, we think they will not linguistically and semantically affect the prediction but may impact the sequential dependencies, which could contaminate the contextual understanding of the classifiers and subsequently cause wrong predictions.

Table 10 POS preference with respect to choices of victim words among attacking methods
Table 11 Robustness of BERT Models of Different Sizes on the Emotion Dataset

4.12 Robustness versus the scale of pre-trained models

Examining Tables 3 and 4, a question arises: Does increasing the scale of a model enhance its robustness? To explore this, we conducted a study applying our proposed attack methods to victim models of varying sizes on the Emotion dataset.

To provide a more nuanced analysis, we recognize that limiting our comparison to the two initial versions of BERT (base and large) introduced by Devlin et al. (2019) does not sufficiently support robust experimental outcomes. Hence, we have incorporated several widely recognized versions published subsequent to the original BERT paper. Specifically, we analyzed four versions of BERT as documented in Turc et al. (2019): BERT Tiny (Footnote 7), BERT Mini (Footnote 8), BERT Small (Footnote 9), and BERT Medium (Footnote 10). Notably, the most downloaded version among these has reached up to 6,559,486 monthly downloads on Huggingface alone. Our findings, detailed in Table 11, demonstrate a positive correlation between model size and robustness, confirming the value of incorporating a diverse range of model sizes into our analysis.

5 Conclusion and future work

In recent years, the safety and fairness of NLP models have been greatly threatened by adversarial attacks. Many researchers have raised concerns about the robustness of NLP classifiers because of their broad downstream tasks, such as fake news detection, sentiment analysis, and email spam detection. To improve classifiers’ robustness, we have presented RJA-MMR, which consists of two algorithms, Reversible Jump Attack (RJA) and Metropolis–Hasting Modification Reduction (MMR). RJA poses threatening attacks to NLP classifiers by applying the Reversible Jump algorithm to adaptively sample the number of perturbed words, the victim words and their substitutions for each textual input. MMR is a customized algorithm that improves imperceptibility, especially by lowering the modification rate, by utilizing the Metropolis–Hasting algorithm to restore attacked words without affecting attacking performance. Experiments demonstrate that RJA-MMR delivers the best attack success, imperceptibility and sentence fluency among strong baselines.

Although adversarial examples can threaten NLP models, these examples are not bugs but features (Ilyas et al., 2019). To protect the models from the attacks, we conduct extensive experiments with a defense strategy, adversarial retraining, which is done by adding the adversarial examples to the training set and then retraining the models on the newly constructed training set. Unsurprisingly, in our experiments, the robustness of the classifiers is greatly improved, while the accuracy of these models on clean data drops when an excessive amount of adversarial examples is injected.

Since the adversarial attack is one of the most effective methods to test the robustness of a model, the proposed attacks raise some concerns about deep neural networks (DNNs) and large pre-trained models. As DNNs and pre-trained language models achieved great success, most existing well-performed NLP classifiers are based on these techniques. Such popularity of these techniques could put textual classifiers at high risk because attackers can make effective attacks by utilizing DNNs and large pre-trained models. Thus a safer way of applying these techniques is a promising future research direction. At the same time, we also plan to pertinently study and design defense strategies to further improve the robustness of NLP classifiers under future adversarial attacks.