1 Introduction

In recent years, with the development of large-scale pre-trained language models based on Transformers, fine-tuning pre-trained models has gradually become the mainstream approach for natural language processing tasks. Machine Reading Comprehension (MRC) methods based on pre-trained models have achieved good results on popular MRC datasets such as SQuAD [1] and RACE [2], but several studies [3,4,5] have shown that MRC models are vulnerable to adversarial samples, which is called the robustness problem of MRC models. As shown in Table 1, adding sentences that are semantically similar to the answer sentence can mislead the model into outputting wrong answers. Thus, the robustness of the model still needs to be further improved.

Table 1 An example from the adversarial dataset. Prediction of the model changes from “1191” to “1192” after adding misleading sentence

MRC models go wrong because they fail to distinguish between misleading sentences and answer sentences: both are recognized by the model as correct answers to the question. Human beings understand the question well, discover the subtle differences among multiple candidate answer sentences from the question's perspective, and then find the correct answer. As shown in Table 1, people can judge from the content of the answer sentence that the answer type is "time", with "Richard's fleet" as the important clue. They can then find the semantic differences between the two sentences from the perspective of the participants, and finally choose the correct answer according to the content described in the question. In the MRC model, however, the encoding of the answer sentence vector is close to that of the misleading sentence vector. The model pays more attention to the misleading sentence because it contains some non-key words that appear in the question, and ignores the decisive role that the substantive words in the sentence play in determining the answer.

To deal with the robustness issue mentioned above, Welbl et al. [6] used data augmentation and adversarial training, and Jia and Liang [3], Wang and Bansal [7], and Liu et al. [8] enriched the training set by generating adversarial examples. However, since the types of adversarial examples are innumerable, all methods that augment the training dataset have some limitations. Majumder et al. [9] used an answer candidate reranking mechanism to avoid model errors on adversarial examples, but it sacrifices accuracy on non-adversarial datasets and requires an additional complex structure to support the algorithm. Nevertheless, all of the above methods have improved the robustness of MRC models.

We find that model representation has a great influence on model robustness. The representation space of existing MRC models is dense; in particular, the distance between answer sentence vectors and misleading sentence vectors is too small. Therefore, we propose a Multi-task machine Reading Comprehension learning framework via Contrastive Learning (MRCCL), which introduces a Contrastive Learning (CL) task into the MRC model through multi-task joint training. Specifically, we use dropout to generate positive samples, select other sentences in the passage as negative samples for contrastive learning, and jointly train the model on both tasks. While expanding the model representation space, the representation consistency within that space is maintained. Our contributions are summarized as follows:

  1. We propose a new contrastive learning algorithm to improve the robustness of machine reading comprehension. By selecting each sentence in the context as positive and negative samples, the context representation space of the pre-trained language model is improved, and the robustness of the machine reading comprehension model is further improved.

  2. Experimental results show that our algorithm effectively alleviates the problem of poor robustness of the MRC model and achieves state-of-the-art results on the adversarial dataset.

2 Related work

Adversarial Attacks on Machine Reading Comprehension Models

Research on the robustness of MRC models has emerged only in recent years, and a variety of attack methods have been proposed. Jia and Liang [3] successfully attacked the MRC model by adding a misleading sentence at the end of the text. Wang and Bansal [7] inserted a misleading segment at a random position in the context. Liu et al. [8] proposed a method that generates adversarial examples automatically, refining the methods mentioned above. Welbl et al. [6] attacked the MRC model by using part-of-speech-based and entity-word-based replacement methods. Schlegel et al. [10] automatically generated an adversarial set that corresponds to the original by adding negative words. In our research, we focus on the problem where the context contains misleading sentences similar to the answer sentence. The semantics of the two sentences are naturally similar, but there are substantial differences from the perspective of the question, a situation that is very common in practical applications.

Data augmentation and adversarial training have been the most widely used adversarial defenses. Jia and Liang [3], Wang and Bansal [7], and Liu et al. [8] improved the accuracy of MRC models on oversensitivity problems by automatically generating adversarial samples. However, these methods rely on adversarial training sets generated by rules and show poor robustness on out-of-distribution data. Wang and Jiang [11] combined general knowledge with neural networks through data augmentation. Yang et al. [12, 13] used adversarial training that maximizes the adversarial loss by adding perturbations in the embedding layer. In addition, some studies have attempted to change the process of model inference. Chen et al. [14] decompose both the question and the context into small units and construct a graph, converting question answering into an alignment problem. Majumder et al. [9] re-rank the candidate answers according to the degree of overlap between the candidate sentence and the question. Zhou et al. [15] introduce rules based on external knowledge to regularize the model and adjust its output distribution. Yeh and Chen [16] trained the model by maximizing the mutual information between passages, questions, and answers, avoiding the effect of superficial biases in the data on the robustness of the model. Although these methods improve the robustness of MRC models to varying degrees, they do not consider the influence of model representation on robustness.

Contrastive Learning

In recent years, contrastive learning has become a popular self-supervised representation learning technique and has been used extensively in computer vision. The main idea of contrastive learning is to shorten the distance between positive sample pairs in the representation space. Chen et al. [17] proposed SimCLR, which constructs positive samples by data augmentation and negative samples by random sampling within the same batch.

Contrastive learning has also been applied to learn better sentence representations in NLP. Gao et al. [18] proposed SimCSE, which uses dropout as a means of data augmentation and achieves good performance on natural language inference tasks. Wang et al. [19] proposed a method that constructs semantic negative examples for contrastive learning to improve the robustness of pre-trained language models. Yan et al. [20] and Zhang et al. [21] used contrastive learning to solve the collapse problem of the model representation space and achieved good results in short text clustering and natural language inference tasks. However, all of the above contrastive learning methods take the input context as a single unit to improve the model's representation, without considering a fine-grained representation of the context.

3 Method

In this section, we introduce the multi-task machine reading comprehension learning framework via contrastive learning. The framework of MRCCL is illustrated in Fig. 1. The model can be divided into the MRC module, the contrastive learning module, and the multi-task joint training strategy. The MRC module extracts the correct answers, the contrastive learning module improves the representation ability of the model, and the two modules are jointly trained with the multi-task joint training strategy. The MRC module shares the same encoding layer parameters as the contrastive learning module. Each module has its own loss function, but we combine the two into a joint loss function to adjust the model's parameters. Besides, the contrastive learning module only works in the training stage. In the following subsections, we illustrate each module of MRCCL in detail.

Fig. 1

Overview of the multi-task machine reading comprehension learning framework via contrastive learning

3.1 MRC Model Architecture

In the MRC module, we adopt the most common extractive MRC model. The structure of the MRC module is shown on the left of Fig. 1. It is composed of an encoder and a downstream multi-grain classifier. In the extractive MRC task, given a dataset \(D=\left \{{C_{i},Q_{i},A_{i}} \right \}_{i=1}^{n}\), Ci denotes the context that needs to be understood by the model, Qi denotes the question, Ai denotes the answer label corresponding to the question, and n denotes the size of the dataset. In the training stage, each input is composed of such a triple. In the extractive MRC task, the answer Ai consists of the starting position \({A_{i}^{s}}\) and the ending position \({A_{i}^{e}}\). The model needs to find the starting and ending positions of the answer in Ci according to the input Qi.

Given input data \(\left \{C_{i},Q_{i},A_{i} \right \}\), the input, denoted as encoder_input, is the concatenation of Ci and Qi with the special tokens [CLS] and [SEP], i.e., [CLS] Qi [SEP] Ci [SEP]. The encoder_input is encoded by the encoder to produce the encoding result \(encoder\_output\in \mathbb {R}^{m\ast d}\), where d denotes the maximum input sequence length and m denotes the dimension of the hidden layer. The MRC model calculates the answer position through a linear layer of dimension 2 ∗ m. It compresses the encoded tokens in the encoder_output by weighting, and obtains the answer position scores \(logits\in \mathbb {R}^{2\ast d}\). Then, the model calculates the sentence-level score \(sentence\_logits\in \mathbb {R}^{2\ast d}\) and the word-level scores \(start\_logits\in \mathbb {R}^{d}\) and \(end\_logits\in \mathbb {R}^{d}\) separately. start_logits and end_logits can be obtained directly from logits. For each sentence in the context, the model computes the mean of the answer position scores of the words in that sentence to get the sentence-level score. The sentence-level score is not a single value; it has the same dimension as the sentence length. Splicing the sentence-level scores of all sentences in the context yields sentence_logits. Finally, we add the word-level scores and the sentence-level scores to obtain \(fstart\_logits\in \mathbb {R}^{d}\) and \(fend\_logits\in \mathbb {R}^{d}\). fstart_logits and fend_logits are the start and end position scores used to generate the answer fragment.
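To make the multi-grain scoring concrete, the following is a minimal PyTorch sketch of how word-level and sentence-level scores could be combined for one example; the names (span_head, sentence_spans) and the per-example layout are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

def multi_grain_logits(encoder_output, span_head, sentence_spans):
    """Sketch of the multi-grain classifier for one example.

    encoder_output: (d, m) token encodings (sequence length d, hidden size m)
    span_head:      nn.Linear(m, 2) mapping each token to (start, end) scores
    sentence_spans: list of (lo, hi) token ranges, one per context sentence
    """
    logits = span_head(encoder_output)                  # (d, 2) word-level scores
    start_logits, end_logits = logits[:, 0], logits[:, 1]

    # Sentence-level score: mean of the word-level scores inside each sentence,
    # broadcast back over the tokens of that sentence and spliced together.
    sentence_logits = torch.zeros_like(logits)
    for lo, hi in sentence_spans:
        sentence_logits[lo:hi] = logits[lo:hi].mean(dim=0, keepdim=True)

    # Final scores add word-level and sentence-level evidence.
    fstart_logits = start_logits + sentence_logits[:, 0]
    fend_logits = end_logits + sentence_logits[:, 1]
    return fstart_logits, fend_logits
```

Tokens outside any context sentence (question and special tokens) simply keep a zero sentence-level score in this sketch.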

The encoder_output directly affects the calculation of fstart_logits and fend_logits in the MRC model and thus affects the extraction of answer fragments. In other words, if two vectors lie at nearby positions in the context representation space, their answer scores will be close. Therefore, a dense context representation space leads to poor robustness, and the model easily produces wrong answers under adversarial attacks. Our MRC module uses the cross-entropy function to calculate the loss from these scores. The loss function can be expressed as follows:

$$ L=\frac{1}{2}(f_{CE}(fstart\_logits,{A_{i}^{s}})+f_{CE}(fend\_logits,{A_{i}^{e}})) $$
(1)

where fCE denotes the cross-entropy function, and \({A_{i}^{s}}\) and \({A_{i}^{e}}\) denote the starting and ending position labels of the answer, respectively. The encoding result and the linear layer weights are updated by back propagation.

3.2 Contrastive learning in context representation space

In order to solve the problem of a dense representation space in traditional MRC models, we introduce contrastive learning into the MRC model, corresponding to the contrastive learning module described above. Due to the particularity of the extractive MRC task, we adapt contrastive learning and call the result Contrastive Learning in Context Representation Space (CLCRS). CLCRS is a type of supervised contrastive learning and can be regarded as contrastive learning within the representation space of the context. Common contrastive learning, which compares positive and negative samples taken from the same batch, has little effect on distinguishing answer sentences from misleading sentences. To deal with this, CLCRS uses a sampling strategy tailored to MRC models: it samples sentences containing context information from the context as positive and negative samples, expanding the distance between the answer sentence and the other sentences in the context. By enlarging the distance between different sentence vectors, CLCRS solves the problem of the dense representation space of the original pre-trained-model-based MRC model.

Specifically, CLCRS is shown on the right of Fig. 1. The input of CLCRS has the same form as that of the MRC module and can be expressed as \(ctx=\left \{{C_{i},Q_{i},A_{i}}\right \}_{i=1}^{n}\), where \(C_{i}=\left \{{c_{i}^{1}},\ldots ,{c_{i}^{n}}\right \}\) denotes the context, \({c_{i}^{j}}\) denotes each sentence in the context, Qi denotes the question sentence, and Ai denotes the sentence where the answer lies. Different from the mainstream contrastive learning strategy of sampling negative samples from the same batch, we sample sentences in the context as negative samples for comparison. Specifically, following Gao et al. [18], we generate the positive sample corresponding to Ai by dropout and select the other sentences in the input ctx, except the question Qi, as negative samples for contrastive learning. The effect of CLCRS on the representation ability of the model is shown in Fig. 2. For the input ctx, the encoding result \(encoder\_output\in \mathbb {R}^{m\ast d}\) is obtained after encoding. We divide the encoder_output into sentence vectors according to the original sentences and use mean pooling to generate the sentence representations \(cl\_output\in \mathbb {R}^{k\ast m}\), where k denotes the number of sentences in the context and m denotes the dimension of the hidden layer. In CLCRS, we use InfoNCE as the loss function, shown as follows:

$$ -\log{\left( \frac{e^{S\left( z_{i},z_{i}^{\prime}\right)/\tau}}{{\sum}_{j=0}^{K}e^{S\left( z_{i},z_{j}\right)/\tau}}\right)} $$
(2)

where \(S\left (\cdot \right )\) denotes the cosine similarity function, τ is a hyperparameter, \(z_{i}^{\prime }\) denotes the positive sample, and zj denotes a negative sample. By optimizing this loss function, the distances between sentences in the context are enlarged, and the context representation space is expanded.
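A minimal sketch of how CLCRS could be implemented on top of a HuggingFace-style encoder is given below; the single-example batching, the names sentence_spans (context sentence token ranges, question excluded) and answer_idx (index of the answer sentence), and the double forward pass are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clcrs_loss(encoder, input_ids, attention_mask, sentence_spans, answer_idx, tau=0.05):
    """Contrastive loss over the sentences of one context.

    Encoding the same input twice with dropout active yields two slightly
    different views; the answer sentence of the second view serves as the
    positive sample, and the other context sentences act as negatives.
    """
    out1 = encoder(input_ids, attention_mask=attention_mask).last_hidden_state
    out2 = encoder(input_ids, attention_mask=attention_mask).last_hidden_state

    def pool(hidden):
        # Mean-pool token vectors into sentence vectors, giving (k, m).
        return torch.stack([hidden[0, lo:hi].mean(dim=0) for lo, hi in sentence_spans])

    z1, z2 = pool(out1), pool(out2)

    anchor = z1[answer_idx]                                        # answer sentence, view 1
    positive = z2[answer_idx]                                      # its dropout-augmented view
    negatives = torch.cat([z1[:answer_idx], z1[answer_idx + 1:]])  # other context sentences

    candidates = torch.cat([positive.unsqueeze(0), negatives])     # positive placed first
    sims = F.cosine_similarity(anchor.unsqueeze(0), candidates) / tau
    # InfoNCE: cross-entropy with the positive (index 0) as the target class.
    target = torch.zeros(1, dtype=torch.long, device=sims.device)
    return F.cross_entropy(sims.unsqueeze(0), target)
```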

Fig. 2

The strategy of contrastive learning in our method. Q denotes question sentences, C denotes context, ci denotes other sentences in the context except for the sentences where the answer lies, A denotes the sentences where the answer lies, and V denotes misleading sentences. Two sentences with similar colors represent a pair of positive samples generated by dropout. When the model encodes adversarial samples, the contrastive learning module can effectively distance the answer sentence from other sentences such as misleading sentences

3.3 Multi-task learning

Contrastive learning can expand the context representation space. In order to expand the representation space of the MRC model, we introduce a multi-task learning strategy. In our method, we combine the loss function of MRC and the loss function of contrastive learning and optimize the two modules simultaneously in the training stage; CLCRS only works during training. Specifically, we share the encoder parameters between the MRC module and CLCRS. Following the work of Liebel and Körner [22], we combine the loss functions of the two modules into a joint loss function as follows:

$$ L_{union} = f(L_{mrc},L_{cl}) = \frac{1}{a^{2}}L_{mrc}+\log\left( 1+a^{2}\right)+\frac{1}{b^{2}}L_{cl}+\log\left( 1+b^{2}\right) $$
(3)

where a and b are parameters that can be learned, Lmrc and Lcl are the loss functions of the MRC module and CLCRS, respectively.
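The joint loss in Eq. (3) can be written as a small module with learnable weights; the sketch below is an illustration of this weighting scheme, and the initialisation of a and b to 1 is an assumption.

```python
import torch
import torch.nn as nn

class JointLoss(nn.Module):
    """Combines the MRC loss and the CLCRS loss with learnable weights (Eq. 3)."""

    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.ones(1))   # weight parameter for the MRC loss
        self.b = nn.Parameter(torch.ones(1))   # weight parameter for the contrastive loss

    def forward(self, l_mrc, l_cl):
        return (l_mrc / self.a.pow(2) + torch.log1p(self.a.pow(2))
                + l_cl / self.b.pow(2) + torch.log1p(self.b.pow(2)))
```

In each training step, the two losses are computed from the shared encoder outputs and combined through this module, so a single backward pass updates the encoder for both tasks while a and b adapt the relative weight of each loss.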

4 Experiments

In order to verify the performance of our algorithm, we carried out several experiments and analyzed the results. First, we introduce the datasets and experiment settings. Second, we evaluate our method on adversarial datasets with two families of baseline pre-trained language models and compare it with other methods. Finally, we conduct an ablation study to verify the effectiveness of each module in MRCCL.

4.1 Datasets

We only use the SQuAD1.1 training set to train our model. For the problem we aim to solve, we generate an adversarial test set, AddCfa, from the SQuAD1.1 dev set to evaluate the robustness of the model. Following Jia and Liang [3], AddCfa is generated as follows: first, similar words from GloVe [23] are used to replace the named entities and numbers in the answer sentence; then, antonyms from WordNet [24] are used to replace the nouns and adjectives in the answer sentence to obtain the misleading sentence; finally, the misleading sentence is inserted after the answer sentence in the context. The example in Table 1 is taken from AddCfa. We choose DEV [1], AddSent [3], AddCfa, and AddSentMod [3] as test sets to evaluate our approach.
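As an illustration of the two substitution rules, the sketch below uses gensim GloVe vectors and NLTK WordNet; it only covers the word replacement step and omits the POS filtering, number handling, and sentence insertion that the full procedure requires, and the vector choice and helper names are assumptions.

```python
import gensim.downloader as api
from nltk.corpus import wordnet as wn

glove = api.load("glove-wiki-gigaword-300")   # pre-trained GloVe vectors (assumed choice)

def glove_neighbour(word):
    """Rule 1: replace a named entity or number with its nearest GloVe neighbour."""
    key = word.lower()
    if key in glove:
        return glove.most_similar(key, topn=1)[0][0]
    return word

def wordnet_antonym(word):
    """Rule 2: replace a noun or adjective with a WordNet antonym, if one exists."""
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            if lemma.antonyms():
                return lemma.antonyms()[0].name()
    return word

# The misleading sentence built from these substitutions is then inserted
# directly after the answer sentence in the context.
```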

4.1.1 Training datasets

  • SQuAD1.1 training set [1]: One of the most authoritative datasets in the field of MRC. It is drawn from Wikipedia articles and contains 87,599 annotated question and answer pairs.

4.1.2 Test datasets

  • AddSent (AS) [3]: An adversarial test set in the field of MRC. The adversarial samples are constructed by converting the question sentence into a misleading sentence with rules and appending it to the end of the context. The dataset contains 2560 adversarial examples and 1000 normal examples.

  • AddCfa (AC): Similar to AddSent, an adversarial test set in which the misleading texts are derived from the answer sentence through rules and crowdsourcing. The construction method of AddCfa is introduced in the previous subsection. The dataset contains 9620 adversarial instances and 10570 normal instances.

  • AddSentMod (ASM) [3]: Same as AddSent, but the misleading text is inserted at the beginning of the context. This dataset has 2225 adversarial instances and 1000 normal instances.

  • DEV [1]: The development set of SQuAD v1.1, which contains 10570 question and answer pairs for evaluation.

4.2 Experiment settings

We selected five pre-trained language models for our experiments: BERT-base [25], BERT-large [25], BERT-large-wwm, RoBERTa-base [26], and RoBERTa-large [26]. For the common settings, the AdamW optimizer is used during the training stage, and all parameters required for multi-task joint training are optimized by AdamW. The maximum input length of our model is set to 384. To deal with long texts, we chunk them into equally spaced segments using a sliding window of size 128. We set the number of training epochs to 3 and use a dropout rate of 0.1 on all layers and in attention. The temperature τ of the InfoNCE loss and the learning rate (lr) are the parameters with the greatest impact on the accuracy of the model. We set τ in {0.05, 0.1, 0.15, 0.2} and lr in {3e-5, 4e-5, 5e-5} to train our model and select the best result on the test set. For the BERT models, τ = 0.15 and lr = 3e-5 yield the best performance. For RoBERTa-base and RoBERTa-large, the optimal parameters are τ = 0.05 with lr = 4e-5 and 3e-5, respectively. All models are implemented in PyTorch 1.7.1.
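For reference, the sliding-window chunking described above corresponds to the standard HuggingFace tokenizer call for extractive QA; the model name and example strings below are placeholders, not the authors' exact preprocessing code.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "When did Richard's fleet arrive?"   # placeholder example
context = "..."                                  # stands in for a long passage

features = tokenizer(
    question,
    context,
    max_length=384,                   # maximum input length used in our experiments
    stride=128,                       # sliding-window overlap for long contexts
    truncation="only_second",         # only the context side is chunked
    return_overflowing_tokens=True,   # one feature per 384-token window
    padding="max_length",
)
```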

4.3 Results and analysis

4.3.1 Experiment on baseline model

We selected five pre-trained models of different sizes as baselines to verify our algorithm: BERT-base, BERT-large, BERT-large-wwm, RoBERTa-base, and RoBERTa-large. We report F1 and EM on the four test sets as evaluation metrics. AVG is calculated from the results of the model on DEV, AddSentMod, AddSent-adv, and AddCfa-adv, and is used as the final metric to evaluate the robustness of the model. All the results are shown in Table 2, with the best results highlighted in bold.

Table 2 Performance of MRCCL in five baseline models. Adv represents the adversarial sample in the test set

In terms of AVG, our model improves robustness across all baseline models. It performs best on the RoBERTa-large model, with an improvement of 2.3. Even the smallest improvements, on BERT-large and RoBERTa-base, are 1.4.

The large models not only perform well on non-adversarial samples but also achieve high accuracy on adversarial samples. Compared with the base models, they show smaller performance gaps between adversarial and non-adversarial samples, stronger robustness, and less vulnerability to attacks from adversarial samples. A larger model structure is more resistant to interference and can effectively improve the robustness of the model.

The RoBERTa models perform better on the adversarial test sets than the BERT models. Our results show that RoBERTa-base and RoBERTa-large outperform BERT-base and BERT-large on the AVG metric.

4.3.2 Algorithm comparison

To further illustrate the advantages of our algorithm, we compare it with the following eleven methods: QAInfoMax [16], MAARS [9], R.M-Reader [27], KAR [11], BERT+Adv [12], ALUM [13], Sub-part Alignment [14], BERT+DGAdv [8], BERT+PR [15], HKAUP [28], and PQAT [13]. All eleven methods aim to improve the robustness of MRC models. The results are shown in Tables 3 and 4, with the best results highlighted in bold. QAInfoMax [16] and MAARS [9] use part of the data in AddSent to verify their effectiveness, so we divide the results into two tables; AddSent-small in Table 3 is a subset of AddSent in Table 4. Since most algorithms build on BERT, we also compare our results on the BERT-base and BERT-large-wwm baseline models with those eleven algorithms.

Table 3 Performance of MRCCL in AS-small
Table 4 Performance of MRCCL in AS

In Table 4, compared with Sub-part Alignment, our algorithm gains 3.3 F1 on AddSent and 15.0 F1 on the adversarial samples. Compared with MAARS (Majumder et al. 2021) in Table 3, which outperforms state-of-the-art defense techniques, the F1 of MRCCL on AddSent is improved by 4.0 points. This shows that our method is better than the other eleven methods on the adversarial test sets. On DEV, BERT-large-wwm+MRCCL performs better than the other methods. On AddSent, BERT-base+MRCCL and BERT-large-wwm+MRCCL are better than the other methods.

4.3.3 Ablation study

We demonstrate through ablation experiments that all modules in MRCCL are indispensable. We performed the experiments on the BERT and RoBERTa models. The ablation experiments yield similar results on models of different sizes, so here we show the results for the base versions. We report the ablation test in terms of 1) w/o CLCRS: we remove the contrastive learning module from our model; 2) w/o sentence logits: in the MRC module, the model no longer calculates sentence logits. The results are shown in Table 5.

Table 5 Ablation experiments on the BERT model and the RoBERTa model

Both the sentence logits and the contrastive learning module have a beneficial effect on the robustness of the BERT model. Sentence logits improve the accuracy of the BERT and RoBERTa models on the adversarial samples, but still fall short of MRCCL. In fact, since the contrastive learning module operates at the sentence level, adding sentence logits allows the contrastive learning task to be better introduced into the MRC model, and the two do not conflict in terms of model improvement.

Finally, we explore which training strategies are suitable for our approach. We compare the F1 of the model trained sequentially with that of the model trained by multi-task learning on the four test sets; only the training strategy is modified, and other settings are unchanged. The results are shown in Table 6. They show that multi-task joint training is more effective than sequential training and plays an important role in our method. Pipeline in the table indicates that we first carry out the contrastive learning task and then train the MRC task.

Table 6 Influence of different training strategies on the model

4.4 Parameter efficiency of MRCCL

MRCCL introduces no additional parameters and has the same number of model parameters as the baseline model. However, since the contrastive learning module requires the construction of positive samples, the amount of training data is doubled, resulting in longer training times. The comparison of model training times is shown in Table 7.

Table 7 Comparison of training time for different models

5 Discussion

Context representation space for reading comprehension models

CLCRS is the core of our algorithm. It is a special kind of contrastive learning suitable for MRC tasks and aims to solve the problem of an overly dense context representation space. In our experiments, we found that the model's ability to represent the context directly influences its robustness: if the distances between the sentences in the encoded context are too small, the accuracy of the model on that example drops greatly. We illustrate this further through experiments. We measured the relationship between the accuracy of the five pre-trained models on adversarial samples and the denseness of the representation space; the results are shown in Fig. 3. The answer sentence is the key sentence for extracting the answer, so we chose the distance between the answer sentence and the other sentences in the context as the measure of whether the representation space is dense, using the cosine similarity between sentence vectors to calculate the distance. We set the threshold t in {0.75, 0.8, 0.85, 0.9} and counted the F1 score of samples where the average distance between the answer sentence and the other sentences exceeded t.
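A sketch of how this density measure could be computed from mean-pooled sentence vectors is shown below; the helper name and input layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def answer_sentence_density(sentence_vectors, answer_idx):
    """Mean cosine similarity between the answer sentence and the other sentences.

    sentence_vectors: (k, m) mean-pooled sentence encodings of one context
    answer_idx:       index of the sentence containing the gold answer
    """
    answer_vec = sentence_vectors[answer_idx].unsqueeze(0)              # (1, m)
    others = torch.cat([sentence_vectors[:answer_idx],
                        sentence_vectors[answer_idx + 1:]])             # (k-1, m)
    return F.cosine_similarity(answer_vec, others).mean().item()

# Samples whose density exceeds a threshold t (e.g. 0.85) are treated as having
# a dense representation space, and F1 is reported separately for them.
```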

Fig. 3

The impact of a dense context representation space on model accuracy

As shown in the figure, the denser the context representation space is, the less accurate the model is on it. Experiments demonstrate that the ability of the MRC model to represent each sentence in the context directly affects the robustness of the model.

Next, we experimentally illustrate the improvement of MRCCL on the model representation. As before, we measured the distance between the answer sentence and the other sentences in the context. The results are shown in Table 8.

Table 8 Impact of MRCCL on model representation

The mean distance of the model is reduced from 0.8 to -0.1, and the range of distances changes from (-0.142, 0.990) to (-0.777, 0.722). MRCCL effectively separates individual sentences in the context and generalises well to adversarial samples.

6 Conclusion and future work

In this paper, we address the problem of poor robustness of extractive MRC models. More specifically, we address the problem that MRC models are prone to errors on instances with additional misleading sentences, known as the oversensitivity problem. We found that the poor robustness of the MRC model is caused by its overly dense context representation space. Therefore, we propose a multi-task machine reading comprehension framework via contrastive learning called MRCCL. By introducing CLCRS into the MRC model, we enhance its representation ability, improve its robustness, and thus alleviate the oversensitivity problem. The experimental results show that our method further improves model robustness and achieves state-of-the-art performance.