1 Introduction

Deep neural networks (DNNs) have achieved great breakthroughs in various tasks, e.g., pattern recognition [1, 2], sentiment analysis [3] and autopilot [4], owing to the rapid improvement of computational power in the era of big data. However, despite DNNs' effectiveness and powerful ability to solve complex problems, their security issues have become increasingly prominent. In past years, many studies [5,6,7] have shown that adversarial examples, crafted by maliciously adding tiny perturbations to original examples, cause DNNs' decisions to fail blatantly, even though the adversarial examples are indistinguishable from the original examples to human cognition. Such a security issue, known as the adversarial attack, poses severe threats to applications based on DNNs and results in a crisis of confidence among the people who use them.

For text classification, textual adversarial attacks fall into three categories, i.e., character-level, word-level and sentence-level attacks. Character-level attacks [8, 9] destroy the syntax of the text and can therefore be easily detected and defended by a spell checker [10], while sentence-level attacks struggle to preserve the original semantics when rephrasing one sentence into another. In contrast, word-level attacks have become the most effective approach for crafting textual adversarial examples, because such examples fool the victim models with a high success rate while maintaining grammatical correctness and semantic consistency, which makes them more challenging to defend against. For example, Ren et al. [11] proposed a greedy algorithm called probability weighted word saliency that generates synonym substitution-based adversarial examples with a very low word substitution rate. Zhang et al. [12] proposed the Metropolis–Hastings attack algorithm, which guarantees the fluency of perturbed text by designing a stationary distribution based on a pre-trained language model. Zang et al. [13] proposed a sememe-based word substitution strategy, which regards words sharing a sememe as mutually replaceable candidates, and searched for optimal adversarial examples with particle swarm optimization. These attacks pose a security threat to DNNs but also offer potential for promoting related text processing tasks, such as text-based steganography [14] and information hiding [15]. For example, adversarial example techniques can be used to counteract steganalysis, the process of detecting and identifying hidden information within a carrier.

To address the security issues raised by textual adversarial attacks, researchers have explored various approaches for improving DNNs' robustness. For example, Li et al. [16] and Ren et al. [11] conducted adversarial training by crafting adversarial examples with their proposed attacks and adding them to the training dataset. However, generating sufficient textual adversarial examples and retraining models on the augmented datasets is time-consuming. Another line of work, called certified defense, attempts to provide DNNs with provable robustness; e.g., Jia et al. [17] trained models that are provably robust to synonym substitution attacks using Interval Bound Propagation to minimize an upper bound on the worst-case loss. However, existing studies of certified defense focus only on synonym substitution-based attacks; moreover, certified defenses are difficult to generalize to large-scale datasets and models with complex structures due to their heavy computational cost and strict constraints.

The motivation of this paper stems from considering how humans understand textual adversarial examples and how that process could guide us in improving the robustness of DNNs. We believe that there must be a relation between the original word and the substituted word (or token) in word-level adversarial attacks, and such a relation enables humans, who have the ability of association, to infer the original words. For example, a word and its synonym or near-synonym share similar semantic meanings.

The next question, following the discussion above, is how to improve DNNs' robustness according to the relations between words. To this end, we introduce the concept of the semantic associative field to guide the construction of a word embedding that is robust to word-level adversarial examples. Specifically, we calculate each word vector by combining the vectors of related words using a potential function and weighted embedding sampling, thereby simulating the semantic influence between words within the same semantic field.

The proposed method is simple, efficient and highly scalable. To demonstrate its effectiveness, we conduct extensive experiments across different datasets and adversarial attack approaches. The results show that our textual embedding based on the semantic associative field is robust to both word-level and character-level adversarial attacks.

We summarize our major contributions as follows:

  • We observe that, under the setting of word-level adversarial attacks, there must be a relation between the original word and its substitution. Such a relation enables humans, who have the ability of association, to infer the original word.

  • Motivated by the analysis above, we introduce the concept of the semantic associative field and propose a new defense method that builds a robust word embedding. We calculate each word vector by adding to it the contributions of related word vectors, weighted by a potential function and combined via weighted embedding sampling, to simulate the semantic influence between words in the same semantic field.

  • Experiments demonstrate that models using the proposed defense method based on semantic associative field theory achieve higher accuracy than baselines both under various adversarial attacks and on the original testing sets. Moreover, our method is more universal, as it is independent of the model structure and does not affect training efficiency.

The rest of this paper is organized as follows. In Sect. 2, we review the literature most related to this paper. In Sect. 3, we explain the concept of the semantic associative field in detail. In Sect. 4, we describe our methodology for defending against textual adversarial examples via semantic associative field-based word embedding. In Sect. 5, we present our experiments and results to demonstrate the method's effectiveness. Finally, we conclude and discuss future work in Sect. 6.

2 Related work

2.1 Textual adversarial attack

Although DNNs have achieved great success in natural language processing (NLP), maliciously perturbed examples can cause DNNs' predictions to fail blatantly. Various textual adversarial attack approaches have been proposed to explore the weaknesses of NLP models. According to the level of perturbation, textual adversarial attacks can be divided into three categories, i.e., character-level, word-level and sentence-level attacks. Character-level attacks deliberately craft typos [8, 9, 18] or visually similar characters [19], which NLP models fail to comprehend even though humans still recognize them. Word-level attacks mainly replace words in the original sentence with others according to some strategy, e.g., synonym-based [11], word embedding-based [20], sememe-based [13] or language model-based [12] substitution, while keeping the true label unchanged. Sentence-level attacks mainly insert a distracting sentence [21] into the original input or paraphrase the original input [22, 23] to change the predictions, but such large perturbations make it difficult to preserve the original semantics and true labels.

2.2 Textual adversarial defense

The wide study of adversarial attacks has made researchers aware of the potential threats of using DNNs, and various methods for enhancing the robustness of DNNs against adversarial attacks have been proposed. Generally, these methods can be categorized into three types, i.e., adversarial training, input transformation and certified defense methods.

Adversarial training-based defense methods enhance the DNNs' robustness by adding adversarial examples to the training dataset. For example, several works [11, 16, 24] conducted adversarial training with adversarial examples generated by their proposed attack methods, and Dinan et al. [25] proposed an iterative build it–break it–fix it strategy with humans and models in the loop based on crowd-sourcing.

Input transformation-based methods eliminate the adversarial perturbations in the input space. Wang et al. [26] proposed a synonym encoding method that maps synonyms to the same code by inserting an encoder before the input layer.

Certified defense methods provide models with provable robustness. Jia et al. [17] trained models that are provably robust to synonym substitution attacks using Interval Bound Propagation to minimize an upper bound on the worst-case loss. Ye et al. [27] proposed a structure-free method for certified robustness against synonym substitution attacks based on randomized smoothing, which smooths the model with random word substitutions drawn from a synonym network. Zhou et al. [28] proposed a certified defense against synonym substitution attacks that samples points from the convex hull formed by a word and its synonyms using the Dirichlet distribution, ensuring robustness within that region.

2.3 Word embedding

In NLP tasks, texts must be converted into numerical data that DNNs can process, and the technology that represents text as numerical vectors is called word embedding. One-hot encoding is an early technique that converts each word into a vector in which one dimension is set to 1 to indicate the word and all other dimensions are set to 0. Although one-hot encoding is simple and applicable to any text data, the resulting word vectors are mutually orthogonal and thus carry no similarity information, and the dimensionality grows with the vocabulary size. Since the distributed representation, i.e., mapping each word into a dense vector of lower dimension, was proposed by Hinton [29], various word embedding technologies have been created; they can be divided into two categories: (1) static word embedding and (2) dynamic word embedding.
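To make the orthogonality problem concrete, the following minimal sketch (using a hypothetical three-word vocabulary) shows that any two distinct one-hot vectors have zero inner product, so the encoding carries no similarity information:

```python
import numpy as np

# A toy vocabulary, made up purely for illustration.
vocab = {"wonderful": 0, "topping": 1, "boring": 2}

def one_hot(word: str, vocab: dict) -> np.ndarray:
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# Distinct one-hot vectors are orthogonal, and the dimension grows with |V|.
print(one_hot("wonderful", vocab))                               # [1. 0. 0.]
print(one_hot("wonderful", vocab) @ one_hot("topping", vocab))   # 0.0
```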

Static word embedding means that a word is represented by a single fixed vector no matter how the context changes. For example, word2vec [30] is a two-layer neural network that creates distributed numerical representations of word features, such as the context of individual words, using skip-gram with negative sampling and the continuous bag-of-words model. GloVe [31] obtains vector representations for words by training on aggregated global word–word co-occurrence statistics from a corpus, and the resulting representations exhibit interesting linear substructures of the word vector space.

Dynamic word embedding means that word vectors are adjusted dynamically according to the context. For example, ELMo [32] (Embeddings from Language Models) uses a two-layer bidirectional recurrent neural language model and obtains context-dependent word vectors by weighting the internal representations of its layers. BERT [33] (Bidirectional Encoder Representations from Transformers) is an autoencoding language model based on the Transformer, in which every output element is connected to every input element and the weightings between them are calculated dynamically. Dynamic word embedding can alleviate the problem of word polysemy to a certain extent but often incurs a large computational cost.

2.4 Princeton WordNet

Princeton WordNet [34] is a large English lexical database that provides a semantic network of general-domain concepts linked by a few relations. It groups English words into sets of synonyms and defines relationships between words and their meanings, which organize words into a hierarchy, known as a taxonomy, and indicate semantic relations between words. WordNet has been widely used in various NLP tasks, such as entity recognition, information retrieval and word sense disambiguation; e.g., AlMousa et al. [35] proposed a sequential contextual similarity matrix multiplication algorithm based on WordNet knowledge for word sense disambiguation, Butt et al. [36] proposed automatic food item detection from unstructured text using WordNet-based semantic sense modeling, and Aminu et al. [37] designed a rule-based web ontology language information retrieval system with an enhanced WordNet for query expansion. In our research, WordNet is used to build the semantic associative field, as it provides structured lexical semantic networks containing various relations between concepts, such as synonymy, hyponymy and hypernymy.
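As a minimal illustration of the WordNet queries involved, the following sketch uses nltk to list the synonym, hypernym and hyponym relations of a word (it assumes the WordNet corpus has been downloaded via nltk.download('wordnet'); a noun such as "car" is chosen here because nouns have a rich taxonomy):

```python
from nltk.corpus import wordnet as wn

word = "car"
for synset in wn.synsets(word):
    print(synset.name(), "->", synset.lemma_names())            # synonyms
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
    print("  hyponyms: ", [h.name() for h in synset.hyponyms()])
```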

3 Semantic associative field

3.1 Motivation

We studied the literature on textual adversarial attacks and observed that various word-level attacks can be seen as mappings obeying certain principles; that is, a word-level attack maps words to others with similar lexical meanings or sememes. The question is why DNNs are vulnerable to such adversarial examples while humans are not. We believe that there is a relation between the original and perturbed tokens, and this relation enables humans to make a semantic association while reading the perturbed text, especially with context. Thus, perturbed text can easily be comprehended by humans but not by DNNs, because DNNs lack this ability of semantic association between words. Specifically, word embeddings trained on corpora with algorithms such as GloVe and word2vec are inconsistent with human cognition.

As a simple example, we denote the words "wonderful", "topping" and "boring" as \(w_1\), \(w_2\) and \(w_3\), respectively, and calculate the pairwise distances between them. We obtain \(distance(w_1,w_2) = 0.856\) and \(distance(w_1,w_3) = 0.503\). That is, the distance in embedding space between \(w_1\) and \(w_3\) is smaller than that between \(w_1\) and \(w_2\), even though the semantics of \(w_1\) and \(w_2\) ought to be closer. To solve this problem, we aim to improve word embeddings by introducing the concept of the semantic associative field.
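For reference, the following sketch reproduces this kind of comparison using the gensim-distributed GloVe vectors; the exact embedding variant and distance metric behind the numbers above are not specified in the text, so the values it prints need not match 0.856 and 0.503 (gensim's distance() is cosine distance, i.e., 1 minus cosine similarity):

```python
import gensim.downloader as api

# Load 100-dimensional GloVe vectors (an assumption; cf. Sect. 5.1.2).
glove = api.load("glove-wiki-gigaword-100")

w1, w2, w3 = "wonderful", "topping", "boring"
print(glove.distance(w1, w2))  # distance(wonderful, topping)
print(glove.distance(w1, w3))  # distance(wonderful, boring)
```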

3.2 Semantic associative field

The concept of a field originated in physics, where it describes non-contact interactions between material particles, e.g., the gravitational field, the magnetic field, etc. In linguistics, researchers introduced the idea of field theory into semantic analysis and proposed the semantic associative field to represent the semantic connections and differences among words [38]. Semantic field theory holds that the semantics of a concept (i.e., a word) is affected by the other concepts in the same cluster, namely the semantic associative field, which is formed by semantically related words as shown in Fig. 1. Lexical semantics defines various relations between concepts, such as hypernymy, hyponymy and synonymy, and these complex and diverse relations form a variety of interconnected word aggregates that constitute the lexical network of the human mind.

Fig. 1
figure 1

Relations between concepts in lexical semantics

Similar to fields in physics, the semantic field is a field with sources; that is, words in a semantic field are regarded as sources exerting influence on one another within the field. The influence between field sources is described by a potential function, which is detailed in the next section. In this work, we improve the existing word embedding representation via semantic field theory, making semantically related words closer in embedding space in order to defend against textual adversarial attacks.

4 Methodology

The overall framework of our defense method is shown in Fig. 2. Generally, we first build a semantic associative field from WordNet and then enhance the word embedding based on it. Below, we detail the two parts of our defense method, namely building the semantic associative field and enhancing the word embeddings based on it.

Fig. 2
figure 2

The overall framework of our defense method. “Emb Layer” refers to “Embedding Layer”

4.1 Modeling semantic associative field

As stated in Sect. 3.2, concepts included in the same semantic associative field are connected by various relations, such as hypernymy, hyponymy and synonymy. WordNet, a large lexical database of English, provides structured lexical semantic networks containing exactly these relations. Therefore, we can easily query the concepts semantically related to a given concept, and obtain the structural relationships among them in a semantic associative field, by using the Python third-party library nltk, which integrates WordNet. In practice, we set the maximum number of relationship layers to 2 to keep the capacity of the semantic associative field bounded.
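A plausible sketch of this construction is given below; the exact relation set and traversal order are not fully specified here, so build_semantic_field should be read as an illustrative implementation that expands synonym, hypernym and hyponym links up to two layers:

```python
from collections import deque
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def build_semantic_field(word, max_layers=2):
    """Collect words related to `word` via WordNet synonym/hypernym/hyponym
    links, expanding at most `max_layers` hops. Returns a dict mapping each
    related word to its hop count, later usable as the path length L(w, w_i)."""
    dist = {word: 0}
    queue = deque([word])
    while queue:
        cur = queue.popleft()
        if dist[cur] >= max_layers:
            continue
        neighbors = set()
        for syn in wn.synsets(cur):
            neighbors.update(lem.name() for lem in syn.lemmas())   # synonyms
            for rel in syn.hypernyms() + syn.hyponyms():           # taxonomy
                neighbors.update(lem.name() for lem in rel.lemmas())
        for n in neighbors:
            if n not in dist:
                dist[n] = dist[cur] + 1
                queue.append(n)
    dist.pop(word)
    return dist

field = build_semantic_field("wonderful")  # two-layer field around the word
```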

After building the semantic associative field, we describe how to model it mathematically. Formally, for a word w, we assume there are n words related to w semantically, denoted as \(S=(w_1,\ldots ,w_i,\ldots ,w_n)\), which together with w form a semantic network. The position coordinate of w in the semantic network is defined as follows:

$$\begin{aligned} p=(s_1,\ldots ,s_i,\ldots ,s_n), \end{aligned}$$
(1)

where \(s_i\) is the similarity between w and \(w_i\) in the semantic network, and the similarity is calculated by the following formula:

$$\begin{aligned} s_i=\frac{\delta }{L(w,w_i)+\delta }, \end{aligned}$$
(2)

where \(L(w,w_i)\) is the shortest path between w and \(w_i\) in the semantic network, and \(\delta \) is a tuning parameter that specifies the path length at which the similarity equals 0.5. With the above formulas, we can calculate the position of each word in the semantic field. Clearly, the farther apart the positional coordinates of two words are, the greater the semantic distance between them in the semantic field, and vice versa.
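The following sketch implements Eqs. (1) and (2); the default \(\delta = 2\) anticipates the hyper-parameter choice of Sect. 5.5, and the path lengths are illustrative:

```python
def similarity(path_len, delta=2.0):
    """Eq. (2): s_i = delta / (L(w, w_i) + delta); a word at path length
    delta from w gets similarity 0.5."""
    return delta / (path_len + delta)

def position(path_lens):
    """Eq. (1): the coordinate of w is the tuple of similarities to its
    semantically related words."""
    return tuple(similarity(L) for L in path_lens)

# Related words at path lengths 1, 2 and 4 from w:
print(position([1, 2, 4]))  # approx. (0.667, 0.5, 0.333) with delta = 2
```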

The field formed by the interactions among all words in the semantic network is called a semantic associative field; it is a field with sources, in which every word influences the others. To better describe the interaction between words in the semantic associative field, we introduce a potential function representing the strength of this interaction:

$$\begin{aligned} \phi (w,w_i)=me^{-\left( \frac{d(w,w_i)}{\sigma }\right) ^2}, \end{aligned}$$
(3)

where \(\phi (w,w_i)\) is the potential function representing the strength of the interaction exerted by \(w_i\) on w, m is the mass of \(w_i\), which represents the strength of the field source, \(d(w,w_i)\) is the Euclidean distance between w and \(w_i\) in the semantic field, and \(\sigma \) is a tuning parameter in \((0,\infty )\) that controls the interaction range of a field source. In this paper, m is set to 1 for all words for simplification, i.e., each word in a semantic associative field has the same strength as a field source. Clearly, the greater the distance between two words, the smaller the potential energy between them in the semantic associative field, and vice versa. In other words, the strength of the interaction between words decreases rapidly as the semantic distance increases, until it approaches zero.
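A direct implementation of Eq. (3) is sketched below, with m = 1 as in this paper and \(\sigma = 4\) following Sect. 5.5; the printed values illustrate how rapidly the interaction decays with distance:

```python
import math

def potential(d, sigma=4.0, m=1.0):
    """Eq. (3): phi(w, w_i) = m * exp(-(d / sigma)^2)."""
    return m * math.exp(-(d / sigma) ** 2)

for d in (0.0, 2.0, 4.0, 8.0):
    print(d, round(potential(d), 3))  # 1.0, 0.779, 0.368, 0.018
```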

4.2 Improving word embedding with weighted sampling

After modeling the semantic associative field, we describe how to improve the word embeddings based on it. According to the superposition principle of fields, the field strength at any point in the semantic field equals the vector sum of the field strengths generated independently by all field sources. Therefore, the key idea of our method is to calculate each word vector by adding to it the contributions of the related word vectors, weighted by the potential function and combined via weighted embedding sampling, to simulate the semantic influence between words in the same semantic field. Formally, we assume access to well-trained word embeddings denoted as E, sample from E the embeddings of the words \(w_i\) related to w, and compute their weighted average, where the weights are given by the potential function of Sect. 4.1. We then add this weighted average to the embedding of w to reflect the semantic influence of each \(w_i\) on w. Thus, the embedding of w is updated by the following formula:

$$\begin{aligned} E'(w)=E(w)+\frac{\sum _{i=1}^{n}{\phi (w,w_i) E(w_i)}}{\sum _{i=1}^{n}{\phi (w,w_i)}}, \end{aligned}$$
(4)

where E(.) is the vector representation of a word in the original word embeddings, \(\phi (w,w_i)\) is the weight of \(E(w_i)\) calculated by Eq. (3), and n is the number of words semantically related to w. Figure 3 illustrates the calculation for a word w with two related words (\(w_1\) and \(w_2\)) in the same semantic field, where \(\phi (w,w_1)\) and \(\phi (w,w_2)\) are assumed to be 0.4 and 0.8, respectively.
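A minimal sketch of the update in Eq. (4) is shown below, using the toy potentials of Fig. 3; the 3-dimensional vectors are made up purely for illustration:

```python
import numpy as np

def update_embedding(E_w, related_vecs, potentials):
    """Eq. (4): add the potential-weighted average of the related words'
    vectors to the original embedding E(w)."""
    phis = np.asarray(potentials, dtype=float)
    weighted_avg = (phis[:, None] * np.asarray(related_vecs)).sum(0) / phis.sum()
    return E_w + weighted_avg

E_w  = np.array([1.0, 0.0, 0.0])   # original embedding of w
E_w1 = np.array([0.5, 0.5, 0.0])   # related word w1, phi = 0.4
E_w2 = np.array([0.0, 1.0, 0.5])   # related word w2, phi = 0.8
print(update_embedding(E_w, [E_w1, E_w2], [0.4, 0.8]))
# -> approx. [1.167, 0.833, 0.333]
```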

Fig. 3
figure 3

An example of improving word embedding via semantic field

5 Experiments

In this section, we conduct comprehensive experiments to evaluate our defense approach on different DNNs and datasets and to reveal its advantages.

5.1 Experiments setup

5.1.1 Datasets

We evaluate our defense approach and compare it with others on three benchmark datasets, i.e., IMDB [39], AG's News [40] and SNLI [41]. IMDB is a binary sentiment analysis (SA) dataset labeled as positive or negative; AG's News is a collection of more than one million news articles categorized into four classes: World, Sports, Business and Sci/Tech; and SNLI is a natural language inference (NLI) dataset in which each instance comprises a premise–hypothesis sentence pair labeled with one of three relations: entailment, contradiction or neutral. Details of these datasets are shown in Table 1.

Table 1 Statistics of all datasets

5.1.2 Details of network architectures

We replicate three popular sentence encoding models, i.e., TextCNN [42], bidirectional LSTM (BiLSTM) [43] and attention-based bidirectional LSTM (BiLSTM + ATT) [44]. TextCNN has three convolutional filters with kernel sizes 3, 4 and 5, whose outputs are concatenated, pooled and fed to a fully connected layer followed by an output layer. BiLSTM is composed of a 128-dimensional and a 64-dimensional bidirectional LSTM layer followed by a dropout layer with a drop rate of 0.5, and the output is pooled and fed to an output layer. BiLSTM + ATT adds an attention layer to the BiLSTM described above. In all models, we initialize the embedding layer with the 100-dimensional pre-trained GloVe word vectors.
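A Keras sketch of the TextCNN branch is given below, under stated assumptions: the vocabulary size, sequence length and filter count are our choices, not taken from the paper, and we pool each convolution before concatenating (the standard TextCNN ordering, since convolutions with different kernel sizes yield different sequence lengths):

```python
from tensorflow.keras import layers, Model

def text_cnn(vocab_size=20000, max_len=400, emb_dim=100, n_classes=2,
             n_filters=128):
    inp = layers.Input(shape=(max_len,))
    # The embedding layer would be initialized with GloVe vectors
    # enhanced by Eq. (4).
    emb = layers.Embedding(vocab_size, emb_dim)(inp)
    pooled = [layers.GlobalMaxPooling1D()(
                  layers.Conv1D(n_filters, k, activation="relu")(emb))
              for k in (3, 4, 5)]                       # kernel sizes 3/4/5
    x = layers.Concatenate()(pooled)
    x = layers.Dense(64, activation="relu")(x)          # fully connected layer
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inp, out)
```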

5.1.3 Training details

In our experiments, all models are trained using the Adam optimizer [45] with the default Keras settings, that is, the learning rate is \(1\times 10^{-3}\), the epsilon fuzz factor is \(1\times 10^{-7}\), and the AMSGrad variant [46] of Adam is not applied. We hold out 20% of the training examples as the validation set and use early stopping to avoid overfitting, i.e., training finishes early when the validation loss stops improving. The maximum number of training epochs is set to 5, and the batch size is set to 128.
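The corresponding Keras training call might look as follows; x_train and y_train are placeholders for the prepared data, and the loss function is our assumption:

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

model = text_cnn()  # from the sketch in Sect. 5.1.2
model.compile(optimizer=Adam(learning_rate=1e-3, epsilon=1e-7, amsgrad=False),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train,
          validation_split=0.2,                 # 20% held out for validation
          epochs=5, batch_size=128,             # maximum 5 epochs, batch 128
          callbacks=[EarlyStopping(monitor="val_loss")])
```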

5.1.4 Attack methods

To comprehensively evaluate the efficacy of our defense method, we replicate three advanced word-level textual adversarial attacks, i.e., PWWS [47], PSO [48] and HLA [49]. The hyperparameters of all attack methods are consistent with the experimental setups in the original papers. Considering the inefficiency of generating textual adversarial examples, we attack each model with 1000 examples sampled from the test data.

5.1.5 Defense baselines

We choose three state-of-the-art adversarial defense methods as baselines and compare them with ours. The first baseline [27], denoted as SAFER, is a certified defense against word-level substitution-based adversarial attacks based on a new randomized smoothing technique, which constructs a stochastic ensemble by applying random word substitutions to the input sentence. The second baseline [26], denoted as SEM, inserts an encoder before the input layer of the target model to map each cluster of synonyms to a unique encoding. The third baseline [50], denoted as ASCC, generates worst-case perturbations for adversarial training via an adversarial sparse convex combination method.

5.2 Evaluation on defense efficacy

We first evaluate the efficacy of our defense methodology, i.e., the accuracy of models under adversarial attacks while using the defense method. For a comprehensive evaluation, the defense efficacy of our methodology is studied on different datasets and models and compared with the three baselines, i.e., SAFER, SEM and ASCC, described above. For a fair comparison, the testing sets used on the same dataset are identical across all models, attack methods and defense methods.

Table 2 shows the performance of the baseline defense methods and ours on clean and perturbed examples. The more robust a defense method is, the less the accuracy decreases when the model deals with textual adversarial examples. Meanwhile, the performance of a robust model on clean examples should remain as close as possible to that of the original model. The best defense method for each model can be identified by checking the maximum accuracy in each column under the different settings, and the results show that the model using our defense method achieves dominant accuracy both under various adversarial attacks and on the original testing sets across all datasets. For example, the accuracy of TextCNN using SAFER is only 81.1% under the PWWS attack on IMDB, whereas our defense method raises the accuracy of TextCNN to 93.9% under the same setting. Although the accuracy of TextCNN using our defense method decreases slightly to 94.5% on the clean examples, it is still much higher than that of the baselines, whose accuracies are 81.1%, 72.3% and 74.4% for SAFER, SEM and ASCC, respectively. This shows that our method achieves a good trade-off between robustness and accuracy, benefiting from an appropriate strength of semantic interaction between related words in the same semantic associative field, which avoids both semantic shifts and insufficient interaction when the hyperparameters are chosen appropriately. We discuss the selection of hyperparameters in Sect. 5.5.

Table 2 The classification accuracy (%) against different textual adversarial attacks on three datasets for TextCNN, BiLSTM and BiLSTM with attention

5.3 Defense against transferability

Transferability is a property of adversarial examples whereby examples generated for a specific model also mislead other models, even when those models differ in structure and parameters [5]. Owing to this transferability, attackers are able to mount adversarial attacks on a model whose internal structure and parameters are unknown [51]. Transfer-based attacks are thus a more realistic threat, and an effective defense method ought to protect the model against transferred adversarial examples.

Without loss of generality, we craft adversarial examples via PWWS on each model across all datasets under different defense methods to evaluate their performance against transfer-based adversarial attacks. Table 3 reports the performance on IMDB. The best defense method for each model can be identified by checking the maximum accuracy in each row under the different settings, and the experimental results show that our method defends against transfer-based attacks better than the baselines in most cases. For example, the accuracy of TextCNN is 95.6% when dealing with adversarial examples generated from BiLSTM under the PWWS attack on the IMDB dataset, whereas the accuracies of the baselines are 82.4%, 75.6% and 76.2% under the same setting.

Table 3 The classification accuracy (%) of target models against adversarial examples generated via PWWS from various source models on IMDB for evaluating the transferability

5.4 Evaluation on training efficiency

In addition to improving the accuracy of models on adversarial examples, high training efficiency is also vital for a defense method, especially when it is applied to large-scale datasets. As shown in Table 4, we evaluate the training time per epoch for models with various defense methods on the IMDB dataset. To avoid the impact of runtime environment fluctuation, we conduct ten repeated experiments and report the mean and standard deviation of the training time in seconds. The results show that SEM and our defense method cost the least time per training epoch, since they only need to apply a transformation to the input text based on some strategy. The adversarial training-based defense ASCC costs somewhat more time per epoch, because it performs a white-box adversarial attack at each epoch, and the certified defense SAFER costs much more time, because at each epoch it samples a mini-batch of sentences and randomly perturbs them using a perturbation distribution.

Table 4 The training time per epoch (in seconds) for the models with various defense methods on IMDB

5.5 Hyper-parameter study

Furthermore, we study how the hyper-parameters \(\delta \) and \(\sigma \) in the similarity and potential functions influence the performance of our method. We try \(\delta \) ranging from 0.5 to 5 and \(\sigma \) ranging from 1 to 10, with and without adversarial attacks. Specifically, \(\sigma \) is fixed to 1 while varying \(\delta \), and \(\delta \) is fixed to 0.5 while varying \(\sigma \). The results on the IMDB dataset using TextCNN, BiLSTM and BiLSTM with attention are illustrated in Figs. 4 and 5.

First, as shown in Fig. 4a–c, we empirically study how \(\delta \), the tuning parameter in Eq. 2, influences the accuracy of our method on the three models with \(\sigma \) fixed to 1. As \(\delta \) increases, the accuracy of the models improves markedly, peaks when \(\delta \) is about 2–3.5, and then starts to drop, because a too-large \(\delta \) leads to semantic drift. Specifically, the similarity in Eq. 2 approaches 1 as \(\delta \) increases, so the position coordinates of words in the semantic field move closer together; in other words, the distances between words shrink. Since the value of the potential function in Eq. 3 is negatively correlated with the distance between words, a too-large \(\delta \) makes the interaction between words in the semantic field too strong, i.e., it causes semantic drift. Conversely, a too-small \(\delta \) causes insufficient interaction, i.e., the semantic influence between words in the same semantic field is insufficiently stimulated.

Similarly, as shown in Fig. 5a–c, we study the influence of \(\sigma \), the tuning parameter in Eq. 3, on the three models with \(\delta \) fixed to 0.5. As \(\sigma \) increases, the value of the potential function in Eq. 3 rises, i.e., the strength of the interaction between words in the semantic field grows, and thus the accuracy of the models on adversarial examples increases. As in the discussion of \(\delta \), an overly strong interaction between words in a semantic field leads to semantic drift; therefore, after peaking when \(\sigma \) is about 4–5, the accuracy of the models starts to decrease as \(\sigma \) increases further.

In summary, values of \(\delta \) and \(\sigma \) that are either too large or too small degrade the model's performance on both clean and adversarial examples. Therefore, we choose \(\delta = 2\) and \(\sigma = 4\) as a good trade-off.

Fig. 4
figure 4

Classification accuracy of our method for various values of \(\delta \) ranging from 0.5 to 5 for different models on IMDB, with \(\sigma \) fixed to 1

Fig. 5
figure 5

Classification accuracy of our method for various values of \(\sigma \) ranging from 1 to 10 for different models on IMDB, with \(\delta \) fixed to 0.5

6 Conclusion and discussion

In this paper, we first analyze why humans can read and understand textual adversarial examples and make two crucial observations: (1) there must be a relation between the original word and the perturbed word (or token), and (2) such a relation enables humans, who have the ability of association, to infer the original word. Based on these two observations, we introduce the concept of the semantic associative field and propose a new defense method that builds a robust word embedding: we calculate each word vector by adding to it the contributions of related word vectors, weighted by a potential function and combined via weighted embedding sampling, to simulate the semantic influence between words in the same semantic field. Experiments demonstrate that models using the proposed method achieve higher accuracy than the baseline defense methods both under various adversarial attacks and on the original testing sets. Moreover, the proposed method is more universal, as it is independent of the model structure and does not affect training efficiency. However, some limitations remain to be addressed in the future; for example, how to apply our methodology to defending against adversarial perturbations in vision.