1 Introduction

Deep neural networks (DNNs) have achieved great breakthroughs in various tasks, e.g., pattern recognition [1, 2], sentiment analysis [3] and autopilot [4], owing to the rapid improvement of computational power in the era of big data. However, despite DNNs' effectiveness and powerful ability to solve complex problems, their security issues have become increasingly prominent. In past years, many studies [5,6,7] have shown that adversarial examples, crafted by maliciously adding tiny perturbations to original examples, cause DNNs' decisions to fail blatantly, even though the adversarial examples are indistinguishable from the original examples to human cognition. Such a security issue, known as the adversarial attack, poses severe threats to applications based on DNNs and results in a crisis of confidence among the people who use them.

For text classification, textual adversarial attacks fall into three categories, i.e., character-level, word-level and sentence-level attacks. Character-level attacks [8, 9] destroy the syntax of the text and can therefore be easily detected and defended by a spell checker [10], while sentence-level attacks struggle to preserve the original semantics when rephrasing one sentence into another. In contrast, word-level attacks have become the most effective approach for crafting textual adversarial examples, because such examples fool the victim models with a high success rate while maintaining grammatical correctness and semantic consistency, which makes them more challenging to defend against. For example, Ren et al. [11] proposed a greedy algorithm called probability weighted word saliency that generates synonym substitution-based adversarial examples with a very low word substitution rate. Zhang et al. [12] proposed the Metropolis–Hastings attack algorithm, which guarantees the fluency of perturbed text by designing a stationary distribution based on a pre-trained language model. Zang et al. [13] proposed a sememe-based word substitution strategy, which regards words sharing a sememe as mutually replaceable candidates, and searched for optimal adversarial examples with particle swarm optimization. These attacks pose a security threat to DNNs but also offer potential for promoting related text processing tasks, such as text-based steganography [14] and information hiding [15]. For example, adversarial example techniques can be used to counteract steganalysis, the process of detecting and identifying hidden information within a carrier.

To address the security issues raised by textual adversarial attacks, researchers have explored various approaches for improving DNNs' robustness. For example, Li et al. [16] and Ren et al. [11] conducted adversarial training by crafting adversarial examples with their proposed attacks and adding them to the training dataset. However, generating sufficient textual adversarial examples and retraining models on the augmented datasets is time-consuming. Another line of work, called certified defense, attempts to provide DNNs with provable robustness; e.g., Jia et al. [17] trained models that are provably robust to synonym substitution attacks using Interval Bound Propagation to minimize an upper bound on the worst-case loss. However, existing studies of certified defense focus only on synonym substitution-based attacks; moreover, certified defenses are difficult to generalize to large-scale datasets and models with complex structures due to their heavy computational cost and strict constraints.

The motivation of this paper stems from considering how humans understand textual adversarial examples and how that process could guide us in improving the robustness of DNNs. We believe that there must be a relation between the original word and the substituted word (or token) in word-level adversarial attacks, and such a relation enables humans, who have the ability of association, to infer the original words. For example, a word and its synonym or near-synonym share similar semantic meanings.

The next question, following the discussion above, is how to improve DNNs' robustness according to the relations between words. To this end, we introduce the concept of the semantic associative field to guide the construction of a word embedding that is robust to word-level adversarial examples. Specifically, we calculate each word vector by combining the vectors of related words using a potential function and weighted embedding sampling, thereby simulating the semantic influence between words within the same semantic field.

The proposed method is simple, efficient and highly scalable. To demonstrate its effectiveness, we conduct extensive experiments across different datasets and adversarial attack approaches. The results show that our textual embedding based on the semantic associative field is robust to both word-level and character-level adversarial attacks.

We summarize our major contributions as follows:

  • We observe that, under the setting of word-level adversarial attacks, there must be a relation between the original word and its substitution. Such a relation enables humans, who have the ability of association, to infer the original word.

  • Motivated by the analysis above, we introduce the concept of the semantic associative field and propose a new defense method that builds a robust word embedding. We calculate each word vector by adding to it the contributions of related word vectors, weighted by a potential function and combined via weighted embedding sampling, to simulate the semantic influence between words in the same semantic field.

  • Experiments demonstrate that models using the proposed defense method based on semantic associative field theory achieve higher accuracy than baselines both under various adversarial attacks and on the original testing sets. Moreover, our method is more universal, as it is independent of the model structure and does not affect training efficiency.

The rest of this paper is organized as follows. In Sect. 2, we review the literature most related to this paper. In Sect. 3, we explain the concept of the semantic associative field in detail. In Sect. 4, we describe our methodology for defending against textual adversarial examples via semantic associative field-based word embedding. In Sect. 5, we present our experiments and results to demonstrate the method's effectiveness. Finally, we conclude and discuss future work in Sect. 6.

2 Related work

2.1 Textual adversarial attack

Although DNNs have achieved great success in natural language processing (NLP), maliciously perturbed examples can cause DNNs' predictions to fail blatantly. Various textual adversarial attack approaches have been proposed to explore the weaknesses of NLP models. According to the level of perturbation, textual adversarial attacks can be divided into three categories, i.e., character-level, word-level and sentence-level attacks. Character-level attacks deliberately craft typos [8, 9, 18] or visually similar characters [19], which NLP models fail to comprehend even though humans still recognize them. Word-level attacks mainly replace words in the original sentence with others according to some strategy, e.g., synonym-based [11], word embedding-based [20], sememe-based [13] or language model-based [12] substitution, while keeping the true label unchanged. Sentence-level attacks mainly insert a distracting sentence [21] into the original input or paraphrase the original input [22, 23] to change the predictions, but such large perturbations make it difficult to preserve the original semantics and true labels.

2.2 Textual adversarial defense

The wide study of adversarial attacks has made researchers aware of the potential threats of using DNNs, and various methods for enhancing the robustness of DNNs against adversarial attacks have been proposed. Generally, these methods can be categorized into three types, i.e., adversarial training, input transformation and certified defense methods.

Adversarial training-based defense methods enhance the DNNs' robustness by adding adversarial examples to the training dataset. For example, several works [11, 16, 24] conducted adversarial training with adversarial examples generated by their proposed attack methods, and Dinan et al. [25] proposed an iterative build it–break it–fix it strategy with humans and models in the loop based on crowd-sourcing.

Input transformation-based methods eliminate the adversarial perturbations in the input space. Wang et al. [26] proposed a synonym encoding method that maps synonyms to the same code by inserting an encoder before the input layer.

Certified defense methods provide models with provable robustness. Jia et al. [17] trained models that are provably robust to synonym substitution attacks using Interval Bound Propagation to minimize an upper bound on the worst-case loss. Ye et al. [27] proposed a structure-free method for certified robustness against synonym substitution attacks based on randomized smoothing, which smooths the model with random word substitutions drawn from a synonym network. Zhou et al. [28] proposed a certified defense against synonym substitution attacks that samples points from the convex hull formed by a word and its synonyms using the Dirichlet distribution, ensuring robustness within that region.

2.3 Word embedding

In NLP tasks, texts must be converted into numerical data that DNNs can process, and the technology that represents text as numerical vectors is called word embedding. One-hot encoding is an early technique that converts each word into a vector in which one dimension is set to 1 to indicate the word and all other dimensions are set to 0. Although one-hot encoding is simple and applicable to any text data, the resulting word vectors are mutually orthogonal and thus carry no similarity information, and the dimensionality grows with the vocabulary size. Since the distributed representation, i.e., mapping each word into a dense vector of lower dimension, was proposed by Hinton [29], various word embedding technologies have been created; they can be divided into two categories: (1) static word embedding and (2) dynamic word embedding.
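To make the orthogonality problem concrete, the following minimal sketch (using a hypothetical three-word vocabulary) shows that any two distinct one-hot vectors have zero inner product, so the encoding carries no similarity information:

```python
import numpy as np

# A toy vocabulary, made up purely for illustration.
vocab = {"wonderful": 0, "topping": 1, "boring": 2}

def one_hot(word: str, vocab: dict) -> np.ndarray:
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# Distinct one-hot vectors are orthogonal, and the dimension grows with |V|.
print(one_hot("wonderful", vocab))                               # [1. 0. 0.]
print(one_hot("wonderful", vocab) @ one_hot("topping", vocab))   # 0.0
```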

Static word embedding means that a word is represented by a single fixed vector no matter how the context changes. For example, word2vec [30] is a two-layer neural network that creates distributed numerical representations of word features, such as the context of individual words, using skip-gram with negative sampling and the continuous bag-of-words model. GloVe [31] obtains vector representations for words by training on aggregated global word–word co-occurrence statistics from a corpus, and the resulting representations exhibit interesting linear substructures of the word vector space.

Dynamic word embedding means that word vectors are adjusted dynamically according to the context. For example, ELMo [32] (Embeddings from Language Models) uses a two-layer bidirectional recurrent neural language model and obtains context-dependent word vectors by weighting the internal representations of its layers. BERT [33] (Bidirectional Encoder Representations from Transformers) is an autoencoding language model based on the Transformer, in which every output element is connected to every input element and the weightings between them are calculated dynamically. Dynamic word embedding can alleviate the problem of word polysemy to a certain extent but often incurs a large computational cost.

2.4 Princeton WordNet

Princeton WordNet [34] is a large English lexical database that provides a semantic network of general-domain concepts linked by a few relations. It groups English words into sets of synonyms and defines relationships between words and their meanings, which organize words into a hierarchy, known as a taxonomy, and indicate semantic relations between words. WordNet has been widely used in various NLP tasks, such as entity recognition, information retrieval and word sense disambiguation; e.g., AlMousa et al. [35] proposed a sequential contextual similarity matrix multiplication algorithm based on WordNet knowledge for word sense disambiguation, Butt et al. [36] proposed automatic food item detection from unstructured text using WordNet-based semantic sense modeling, and Aminu et al. [37] designed a rule-based web ontology language information retrieval system with an enhanced WordNet for query expansion. In our research, WordNet is used to build the semantic associative field, as it provides structured lexical semantic networks containing various relations between concepts, such as synonymy, hyponymy and hypernymy.
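As a minimal illustration of the WordNet queries involved, the following sketch uses nltk to list the synonym, hypernym and hyponym relations of a word (it assumes the WordNet corpus has been downloaded via nltk.download('wordnet'); a noun such as "car" is chosen here because nouns have a rich taxonomy):

```python
from nltk.corpus import wordnet as wn

word = "car"
for synset in wn.synsets(word):
    print(synset.name(), "->", synset.lemma_names())            # synonyms
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
    print("  hyponyms: ", [h.name() for h in synset.hyponyms()])
```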

3 Semantic associative field

3.1 Motivation

We studied the literature on textual adversarial attacks and observed that various word-level attacks can be seen as mappings obeying certain principles; that is, a word-level attack maps words to others with similar lexical meanings or sememes. The question is why DNNs are vulnerable to such adversarial examples while humans are not. We believe that there is a relation between the original and perturbed tokens, and this relation enables humans to make a semantic association while reading the perturbed text, especially with context. Thus, perturbed text can easily be comprehended by humans but not by DNNs, because DNNs lack this ability of semantic association between words. Specifically, word embeddings trained on corpora with algorithms such as GloVe and word2vec are inconsistent with human cognition.

As a simple example, we denote the words "wonderful", "topping" and "boring" as \(w_1\), \(w_2\) and \(w_3\), respectively, and calculate the pairwise distances between them. We obtain \(distance(w_1,w_2) = 0.856\) and \(distance(w_1,w_3) = 0.503\). That is, the distance in embedding space between \(w_1\) and \(w_3\) is smaller than that between \(w_1\) and \(w_2\), even though the semantics of \(w_1\) and \(w_2\) ought to be closer. To solve this problem, we aim to improve word embeddings by introducing the concept of the semantic associative field.
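For reference, the following sketch reproduces this kind of comparison using the gensim-distributed GloVe vectors; the exact embedding variant and distance metric behind the numbers above are not specified in the text, so the values it prints need not match 0.856 and 0.503 (gensim's distance() is cosine distance, i.e., 1 minus cosine similarity):

```python
import gensim.downloader as api

# Load 100-dimensional GloVe vectors (an assumption; cf. Sect. 5.1.2).
glove = api.load("glove-wiki-gigaword-100")

w1, w2, w3 = "wonderful", "topping", "boring"
print(glove.distance(w1, w2))  # distance(wonderful, topping)
print(glove.distance(w1, w3))  # distance(wonderful, boring)
```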

3.2 Semantic associative field

The concept of a field originated in physics, where it describes non-contact interactions between material particles, e.g., the gravitational field, the magnetic field, etc. In linguistics, researchers introduced the idea of field theory into semantic analysis and proposed the semantic associative field to represent the semantic connections and differences among words [38]. Semantic field theory holds that the semantics of a concept (i.e., a word) is affected by the other concepts in the same cluster, namely the semantic associative field, which is formed by semantically related words as shown in Fig. 1. Lexical semantics defines various relations between concepts, such as hypernymy, hyponymy and synonymy, and these complex and diverse relations form a variety of interconnected word aggregates that constitute the lexical network of the human mind.

Fig. 1
figure 1

Relations between concepts in lexical semantics

Similar to fields in physics, the semantic field is a field with sources; that is, words in a semantic field are regarded as sources exerting influence on one another within the field. The influence between field sources is described by a potential function, which is detailed in the next section. In this work, we improve the existing word embedding representation via semantic field theory, making semantically related words closer in embedding space in order to defend against textual adversarial attacks.

4 Methodology

The overall framework of our defense method is shown in Fig. 2. Generally, we first build a semantic associative field from WordNet and then enhance the word embedding based on it. Below, we detail the two parts of our defense method, namely building the semantic associative field and enhancing the word embeddings based on it.

Fig. 2
figure 2

The overall framework of our defense method. “Emb Layer” refers to “Embedding Layer”

4.1 Modeling semantic associative field

As stated in Sect. 3.2, concepts included in the same semantic associative field are connected by various relations, such as hypernymy, hyponymy and synonymy. WordNet, a large lexical database of English, provides structured lexical semantic networks containing exactly these relations. Therefore, we can easily query the concepts semantically related to a given concept, and obtain the structural relationships among them in a semantic associative field, by using the Python third-party library nltk, which integrates WordNet. In practice, we set the maximum number of relationship layers to 2 to keep the capacity of the semantic associative field bounded.
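A plausible sketch of this construction is given below; the exact relation set and traversal order are not fully specified here, so build_semantic_field should be read as an illustrative implementation that expands synonym, hypernym and hyponym links up to two layers:

```python
from collections import deque
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def build_semantic_field(word, max_layers=2):
    """Collect words related to `word` via WordNet synonym/hypernym/hyponym
    links, expanding at most `max_layers` hops. Returns a dict mapping each
    related word to its hop count, later usable as the path length L(w, w_i)."""
    dist = {word: 0}
    queue = deque([word])
    while queue:
        cur = queue.popleft()
        if dist[cur] >= max_layers:
            continue
        neighbors = set()
        for syn in wn.synsets(cur):
            neighbors.update(lem.name() for lem in syn.lemmas())   # synonyms
            for rel in syn.hypernyms() + syn.hyponyms():           # taxonomy
                neighbors.update(lem.name() for lem in rel.lemmas())
        for n in neighbors:
            if n not in dist:
                dist[n] = dist[cur] + 1
                queue.append(n)
    dist.pop(word)
    return dist

field = build_semantic_field("wonderful")  # two-layer field around the word
```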

After building the semantic associative field, we describe how to model it mathematically. Formally, for a word w, we assume there are n words related to w semantically, denoted as \(S=(w_1,\ldots ,w_i,\ldots ,w_n)\), which together with w form a semantic network. The position coordinate of w in the semantic network is defined as follows:

$$\begin{aligned} p=(s_1,\ldots ,s_i,\ldots ,s_n), \end{aligned}$$
(1)

where \(s_i\) is the similarity between w and \(w_i\) in the semantic network, and the similarity is calculated by the following formula:

$$\begin{aligned} s_i=\frac{\delta }{L(w,w_i)+\delta }, \end{aligned}$$
(2)

where \(L(w,w_i)\) is the shortest path between w and \(w_i\) in the semantic network, and \(\delta \) is a tuning parameter that specifies the path length at which the similarity equals 0.5. With the above formulas, we can calculate the position of each word in the semantic field. Clearly, the farther apart the positional coordinates of two words are, the greater the semantic distance between them in the semantic field, and vice versa.
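The following sketch implements Eqs. (1) and (2); the default \(\delta = 2\) anticipates the hyper-parameter choice of Sect. 5.5, and the path lengths are illustrative:

```python
def similarity(path_len, delta=2.0):
    """Eq. (2): s_i = delta / (L(w, w_i) + delta); a word at path length
    delta from w gets similarity 0.5."""
    return delta / (path_len + delta)

def position(path_lens):
    """Eq. (1): the coordinate of w is the tuple of similarities to its
    semantically related words."""
    return tuple(similarity(L) for L in path_lens)

# Related words at path lengths 1, 2 and 4 from w:
print(position([1, 2, 4]))  # approx. (0.667, 0.5, 0.333) with delta = 2
```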

The field formed by the interactions among all words in the semantic network is called a semantic associative field; it is a field with sources, in which every word influences the others. To better describe the interaction between words in the semantic associative field, we introduce a potential function representing the strength of this interaction:

$$\begin{aligned} \phi (w,w_i)=me^{-\left( \frac{d(w,w_i)}{\sigma }\right) ^2}, \end{aligned}$$
(3)

where \(\phi (w,w_i)\) is the potential function representing the strength of the interaction exerted by \(w_i\) on w, m is the mass of \(w_i\), which represents the strength of the field source, \(d(w,w_i)\) is the Euclidean distance between w and \(w_i\) in the semantic field, and \(\sigma \) is a tuning parameter in \((0,\infty )\) that controls the interaction range of a field source. In this paper, m is set to 1 for all words for simplification, i.e., each word in a semantic associative field has the same strength as a field source. Clearly, the greater the distance between two words, the smaller the potential energy between them in the semantic associative field, and vice versa. In other words, the strength of the interaction between words decreases rapidly as the semantic distance increases, until it approaches zero.
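A direct implementation of Eq. (3) is sketched below, with m = 1 as in this paper and \(\sigma = 4\) following Sect. 5.5; the printed values illustrate how rapidly the interaction decays with distance:

```python
import math

def potential(d, sigma=4.0, m=1.0):
    """Eq. (3): phi(w, w_i) = m * exp(-(d / sigma)^2)."""
    return m * math.exp(-(d / sigma) ** 2)

for d in (0.0, 2.0, 4.0, 8.0):
    print(d, round(potential(d), 3))  # 1.0, 0.779, 0.368, 0.018
```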

4.2 Improving word embedding with weighted sampling

After modeling the semantic associative field, we describe how to improve the word embeddings based on it. According to the superposition principle of fields, the field strength at any point in the semantic field equals the vector sum of the field strengths generated independently by all field sources. Therefore, the key idea of our method is to calculate each word vector by adding to it the contributions of the related word vectors, weighted by the potential function and combined via weighted embedding sampling, to simulate the semantic influence between words in the same semantic field. Formally, we assume access to well-trained word embeddings denoted as E, sample from E the embeddings of the words \(w_i\) related to w, and compute their weighted average, where the weights are given by the potential function of Sect. 4.1. We then add this weighted average to the embedding of w to reflect the semantic influence of each \(w_i\) on w. Thus, the embedding of w is updated by the following formula:

$$\begin{aligned} E'(w)=E(w)+\frac{\sum _{i=1}^{n}{\phi (w,w_i) E(w_i)}}{\sum _{i=1}^{n}{\phi (w,w_i)}}, \end{aligned}$$
(4)

where E(.) is the vector representation of a word in the original word embeddings, \(\phi (w,w_i)\) is the weight of \(E(w_i)\) calculated by Eq. (3), and n is the number of words semantically related to w. Figure 3 illustrates the calculation for a word w with two related words (\(w_1\) and \(w_2\)) in the same semantic field, where \(\phi (w,w_1)\) and \(\phi (w,w_2)\) are assumed to be 0.4 and 0.8, respectively.
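A minimal sketch of the update in Eq. (4) is shown below, using the toy potentials of Fig. 3; the 3-dimensional vectors are made up purely for illustration:

```python
import numpy as np

def update_embedding(E_w, related_vecs, potentials):
    """Eq. (4): add the potential-weighted average of the related words'
    vectors to the original embedding E(w)."""
    phis = np.asarray(potentials, dtype=float)
    weighted_avg = (phis[:, None] * np.asarray(related_vecs)).sum(0) / phis.sum()
    return E_w + weighted_avg

E_w  = np.array([1.0, 0.0, 0.0])   # original embedding of w
E_w1 = np.array([0.5, 0.5, 0.0])   # related word w1, phi = 0.4
E_w2 = np.array([0.0, 1.0, 0.5])   # related word w2, phi = 0.8
print(update_embedding(E_w, [E_w1, E_w2], [0.4, 0.8]))
# -> approx. [1.167, 0.833, 0.333]
```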

Fig. 3
figure 3

An example of improving word embedding via semantic field

5 Experiments

In this section, we conduct comprehensive experiments to evaluate our defense approach on different DNNs and datasets and to reveal its advantages.

5.1 Experiments setup

5.1.1 Datasets

We evaluate our defense approach and compare it with others on three benchmark datasets, i.e., IMDB [39], AG's News [40] and SNLI [41]. IMDB is a binary sentiment analysis (SA) dataset labeled as positive or negative; AG's News is a collection of more than one million news articles categorized into four classes: World, Sports, Business and Sci/Tech; and SNLI is a natural language inference (NLI) dataset in which each instance comprises a premise–hypothesis sentence pair labeled with one of three relations: entailment, contradiction or neutral. Details of these datasets are shown in Table 1.

Table 1 Statistics of all datasets

5.1.2 Details of network architectures

We replicate three popular sentence encoding models, i.e., TextCNN [42], bidirectional LSTM (BiLSTM) [43] and attention-based bidirectional LSTM (BiLSTM + ATT) [44]. TextCNN has three convolutional filters with kernel sizes 3, 4 and 5, whose outputs are concatenated, pooled and fed to a fully connected layer followed by an output layer. BiLSTM is composed of a 128-dimensional and a 64-dimensional bidirectional LSTM layer followed by a dropout layer with a drop rate of 0.5, and the output is pooled and fed to an output layer. BiLSTM + ATT adds an attention layer to the BiLSTM described above. In all models, we initialize the embedding layer with the 100-dimensional pre-trained GloVe word vectors.
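A Keras sketch of the TextCNN branch is given below, under stated assumptions: the vocabulary size, sequence length and filter count are our choices, not taken from the paper, and we pool each convolution before concatenating (the standard TextCNN ordering, since convolutions with different kernel sizes yield different sequence lengths):

```python
from tensorflow.keras import layers, Model

def text_cnn(vocab_size=20000, max_len=400, emb_dim=100, n_classes=2,
             n_filters=128):
    inp = layers.Input(shape=(max_len,))
    # The embedding layer would be initialized with GloVe vectors
    # enhanced by Eq. (4).
    emb = layers.Embedding(vocab_size, emb_dim)(inp)
    pooled = [layers.GlobalMaxPooling1D()(
                  layers.Conv1D(n_filters, k, activation="relu")(emb))
              for k in (3, 4, 5)]                       # kernel sizes 3/4/5
    x = layers.Concatenate()(pooled)
    x = layers.Dense(64, activation="relu")(x)          # fully connected layer
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inp, out)
```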

5.1.3 Training details

In our experiments, all models are trained using the Adam optimizer [45] with the default Keras settings, that is, the learning rate is \(1\times 10^{-3}\), the epsilon fuzz factor is \(1\times 10^{-7}\), and the AMSGrad variant [46] of Adam is not applied. We hold out 20% of the training examples as the validation set and use early stopping to avoid overfitting, i.e., training finishes early when the validation loss stops improving. The maximum number of training epochs is set to 5, and the batch size is set to 128.
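The corresponding Keras training call might look as follows; x_train and y_train are placeholders for the prepared data, and the loss function is our assumption:

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

model = text_cnn()  # from the sketch in Sect. 5.1.2
model.compile(optimizer=Adam(learning_rate=1e-3, epsilon=1e-7, amsgrad=False),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train,
          validation_split=0.2,                 # 20% held out for validation
          epochs=5, batch_size=128,             # maximum 5 epochs, batch 128
          callbacks=[EarlyStopping(monitor="val_loss")])
```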

5.1.4 Attack methods

To comprehensively evaluate the efficacy of our defense method, we replicate three advanced word-level textual adversarial attacks, i.e., PWWS [47], PSO [48] and HLA [49]. The hyperparameters of all attack methods are consistent with the experimental setups in the original papers. Considering the inefficiency of generating textual adversarial examples, we attack each model with 1000 examples sampled from the test data.

5.1.5 Defense baselines

We choose three state-of-the-art adversarial defense methods as baselines and compare them with ours. The first baseline [27], denoted as SAFER, is a certified defense against word-level substitution-based adversarial attacks based on a new randomized smoothing technique, which constructs a stochastic ensemble by applying random word substitutions to the input sentence. The second baseline [26], denoted as SEM, inserts an encoder before the input layer of the target model to map each cluster of synonyms to a unique encoding. The third baseline [50], denoted as ASCC, generates worst-case perturbations for adversarial training via an adversarial sparse convex combination method.

5.2 Evaluation on defense efficacy

We first evaluate the efficacy of our defense methodology, i.e., the accuracy of models under adversarial attacks while using the defense method. For a comprehensive evaluation, the defense efficacy of our methodology is studied on different datasets and models and compared with the three baselines, i.e., SAFER, SEM and ASCC, described above. For a fair comparison, the testing sets used on the same dataset are identical across all models, attack methods and defense methods.

Table 2 shows the performance of the baseline defense methods and ours on clean and perturbed examples. The more robust a defense method is, the less the accuracy decreases when the model deals with textual adversarial examples. Meanwhile, the performance of a robust model on clean examples should remain as close as possible to that of the original model. The best defense method for each model can be identified by checking the maximum accuracy in each column under the different settings, and the results show that the model using our defense method achieves dominant accuracy both under various adversarial attacks and on the original testing sets across all datasets. For example, the accuracy of TextCNN using SAFER is only 81.1% under the PWWS attack on IMDB, whereas our defense method raises the accuracy of TextCNN to 93.9% under the same setting. Although the accuracy of TextCNN using our defense method decreases slightly to 94.5% on the clean examples, it is still much higher than that of the baselines, whose accuracies are 81.1%, 72.3% and 74.4% for SAFER, SEM and ASCC, respectively. This shows that our method achieves a good trade-off between robustness and accuracy, benefiting from an appropriate strength of semantic interaction between related words in the same semantic associative field, which avoids both semantic shifts and insufficient interaction when the hyperparameters are chosen appropriately. We discuss the selection of hyperparameters in Sect. 5.5.

Table 2 The classification accuracy (%) against different textual adversarial attacks on three datasets for TextCNN, BiLSTM and BiLSTM with attention

5.3 Defense against transferability

Transferability is a property of adversarial examples whereby examples generated for a specific model also mislead other models, even when those models differ in structure and parameters [5]. Owing to this transferability, attackers are able to mount adversarial attacks on a model whose internal structure and parameters are unknown [51]. Transfer-based attacks are thus a more realistic threat, and an effective defense method ought to protect the model against transferred adversarial examples.

Without loss of generality, we craft adversarial examples via PWWS on each model across all datasets under different defense methods to evaluate their performance against transfer-based adversarial attacks. Table 3 reports the performance on IMDB. The best defense method for each model can be identified by checking the maximum accuracy in each row under the different settings, and the experimental results show that our method defends against transfer-based attacks better than the baselines in most cases. For example, the accuracy of TextCNN is 95.6% when dealing with adversarial examples generated from BiLSTM under the PWWS attack on the IMDB dataset, whereas the accuracies of the baselines are 82.4%, 75.6% and 76.2% under the same setting.

Table 3 The classification accuracy (%) of target models against adversarial examples generated via PWWS from various source models on IMDB for evaluating the transferability

5.4 Evaluation on training efficiency

In addition to improving the accuracy of models on adversarial examples, high training efficiency is also vital for a defense method, especially when it is applied to large-scale datasets. As shown in Table 4, we evaluate the training time per epoch for models with various defense methods on the IMDB dataset. To avoid the impact of runtime environment fluctuation, we conduct ten repeated experiments and report the mean and standard deviation of the training time in seconds. The results show that SEM and our defense method cost the least time per training epoch, since they only need to apply a transformation to the input text based on some strategy. The adversarial training-based defense ASCC costs somewhat more time per epoch, because it performs a white-box adversarial attack at each epoch, and the certified defense SAFER costs much more time, because at each epoch it samples a mini-batch of sentences and randomly perturbs them using a perturbation distribution.

Table 4 The training time per epoch (in seconds) for the models with various defense methods on IMDB

5.5 Hyper-parameter study

Furthermore, we study how the hyper-parameters \(\delta \) and \(\sigma \) in the similarity and potential functions influence the performance of our method. We try \(\delta \) ranging from 0.5 to 5 and \(\sigma \) ranging from 1 to 10, with and without adversarial attacks. Specifically, \(\sigma \) is fixed to 1 while varying \(\delta \), and \(\delta \) is fixed to 0.5 while varying \(\sigma \). The results on the IMDB dataset using TextCNN, BiLSTM and BiLSTM with attention are illustrated in Figs. 4 and 5.

First, as shown in Fig. 4a–c, we empirically study how \(\delta \), the tuning parameter in Eq. 2, influences the accuracy of our method on the three models with \(\sigma \) fixed to 1. As \(\delta \) increases, the accuracy of the models improves markedly, peaks when \(\delta \) is about 2–3.5, and then starts to drop, because a too-large \(\delta \) leads to semantic drift. Specifically, the similarity in Eq. 2 approaches 1 as \(\delta \) increases, so the position coordinates of words in the semantic field move closer together; in other words, the distances between words shrink. Since the value of the potential function in Eq. 3 is negatively correlated with the distance between words, a too-large \(\delta \) makes the interaction between words in the semantic field too strong, i.e., it causes semantic drift. Conversely, a too-small \(\delta \) causes insufficient interaction, i.e., the semantic influence between words in the same semantic field is insufficiently stimulated.

Similarly, as shown in Fig. 5a–c, we study the influence of \(\sigma \), the tuning parameter in Eq. 3, on the three models with \(\delta \) fixed to 0.5. As \(\sigma \) increases, the value of the potential function in Eq. 3 rises, i.e., the strength of the interaction between words in the semantic field grows, and thus the accuracy of the models on adversarial examples increases. As in the discussion of \(\delta \), an overly strong interaction between words in a semantic field leads to semantic drift; therefore, after peaking when \(\sigma \) is about 4–5, the accuracy of the models starts to decrease as \(\sigma \) increases further.

In summary, values of \(\delta \) and \(\sigma \) that are either too large or too small degrade the model's performance on both clean and adversarial examples. Therefore, we choose \(\delta = 2\) and \(\sigma = 4\) as a good trade-off.

Fig. 4
figure 4

Classification accuracy of our method for various values of \(\delta \) ranging from 0.5 to 5 for different models on IMDB, with \(\sigma \) fixed to 1

Fig. 5
figure 5

Classification accuracy of our method for various values of \(\sigma \) ranging from 1 to 10 for different models on IMDB, with \(\delta \) fixed to 0.5

6 Conclusion and discussion

In this paper, we first analyze why humans can read and understand textual adversarial examples and make two crucial observations: (1) there must be a relation between the original word and the perturbed word (or token), and (2) such a relation enables humans, who have the ability of association, to infer the original word. Based on these two observations, we introduce the concept of the semantic associative field and propose a new defense method that builds a robust word embedding: we calculate each word vector by adding to it the contributions of related word vectors, weighted by a potential function and combined via weighted embedding sampling, to simulate the semantic influence between words in the same semantic field. Experiments demonstrate that models using the proposed method achieve higher accuracy than the baseline defense methods both under various adversarial attacks and on the original testing sets. Moreover, the proposed method is more universal, as it is independent of the model structure and does not affect training efficiency. However, some limitations remain to be addressed in the future; for example, how to apply our methodology to defending against adversarial perturbations in vision.