Improving the robustness of machine reading comprehension via contrastive learning

Pre-trained language models achieve high performance on the machine reading comprehension (MRC) task, but these models lack robustness and are vulnerable to adversarial samples. Most current methods for improving model robustness are based on data enrichment. However, these methods do not address the poor context representation of the MRC model. We find that context representation plays a key role in the robustness of the MRC model: a dense context representation space results in poor robustness. To deal with this, we propose a Multi-task machine Reading Comprehension learning framework via Contrastive Learning. Its main idea is to improve the context representation space encoded by MRC models through contrastive learning. We call this special form of contrastive learning Contrastive Learning in Context Representation Space (CLCRS). CLCRS samples sentences containing context information from the context as positive and negative samples, expanding the distance between the answer sentence and the other sentences in the context. The context representation space of the MRC model is therefore expanded, and the model can better distinguish between sentences containing the correct answer and misleading sentences. Thus, the robustness of the model is improved. Experimental results on adversarial datasets show that our method exceeds the comparison models and achieves state-of-the-art performance.


Introduction
In recent years, with the development of large-scale pre-trained language models based on Transformers, fine-tuning a pre-trained model has gradually become the mainstream method for natural language processing tasks. Models based on pre-trained language models have achieved good results on popular MRC datasets such as SQuAD [1] and RACE [2], but some studies [3][4][5] have shown that MRC models are vulnerable to adversarial samples; this is called the robustness problem of MRC models. As shown in Table 1, adding an extra sentence whose semantics are similar to those of the answer sentence can mislead the model into outputting a wrong answer. Thus, the robustness of the model still needs to be further improved.
MRC models go wrong because they fail to distinguish misleading sentences from answer sentences: both are recognized by the model as correct answers to the question. Human beings can understand the question well, discover the subtle differences among multiple candidate answer sentences from the question's perspective, and then find the correct answer. As shown in Table 1, a person can judge that the answer type is "time" from the content of the answer sentence, with "Richard's fleet" as the important clue. They can then find the semantic differences between the two sentences from the perspective of the participants, and finally choose the correct answer according to the content described in the question. In an MRC model, however, the encoding of the answer sentence vector is close to that of the misleading sentence vector. The model pays more attention to the misleading sentence because it contains some non-key words that appear in the question, and ignores the decisive role of the substantive words in the sentence in determining the answer.
To deal with the robustness issue mentioned above, Welbl et al. [6] used data augmentation and adversarial training, while Jia and Liang [3], Wang and Bansal [7], and Liu et al. [8] enriched the training set by generating adversarial examples. However, since the types of adversarial examples are innumerable, all of these training-set augmentation methods have limitations. Majumder et al. [9] used an answer-candidate reranking mechanism to avoid model errors on adversarial examples, but it sacrifices accuracy on non-adversarial datasets and requires an additional complex structure to support the algorithm. All of the above methods improve the robustness of MRC models to some degree. We find that model representation has a great influence on model robustness: the representation space of existing MRC models is dense, and in particular the distance between answer sentence vectors and misleading sentence vectors is too small. Therefore, we propose the Multi-task machine Reading Comprehension learning framework via Contrastive Learning (MRCCL), which introduces a Contrastive Learning (CL) task into the MRC model through multi-task joint training. Specifically, we use dropout to generate positive samples, select the other sentences in the passage as negative samples for contrastive learning, and train the model jointly on both tasks. While the model representation space is expanded, the representation consistency within it is maintained. Our contributions are summarized as follows:
1. We propose a new contrastive learning algorithm to improve the robustness of machine reading comprehension. By selecting each sentence in the context as a positive or negative sample, the context representation space of the pre-trained language model is improved, and the robustness of the MRC model is further improved.
2. Experimental results show that our algorithm effectively alleviates the poor robustness of the MRC model and achieves state-of-the-art results on adversarial datasets.

Adversarial Attacks in Machine Reading Comprehension Model
Research on the robustness of MRC models has emerged only in recent years, and various attack methods have been proposed. Jia and Liang [3] successfully attacked MRC models by adding a misleading sentence at the end of the text. Wang and Bansal [7] inserted a misleading segment at a random position in the context. Liu et al. [8] proposed a method that generates adversarial examples automatically, refining the methods above. Welbl et al. [6] attacked MRC models using part-of-speech-based and entity-word-based replacement methods.
Schlegel et al. [10] chose to automatically generate an adversarial set that corresponds to the original by adding negation words. In our research, we focus on the problem where the context contains misleading sentences similar to the answer sentence. The semantics of the two sentences are naturally similar, but there are large differences from the perspective of the question, which is very common in practical applications. Data augmentation and adversarial training have been the most widely used adversarial defenses. Jia and Liang [3], Wang and Bansal [7], and Liu et al. [8] improved the accuracy of MRC models on oversensitivity problems by automatically generating adversarial samples. However, these methods use adversarial training sets generated by rules and show poor robustness on out-of-distribution data. Wang and Jiang [11] combined general knowledge with neural networks through data augmentation. Yang et al. [12,13] used adversarial training, maximizing the adversarial loss by adding perturbations in the embedding layer. In addition, some studies have attempted to change the process of model inference. Chen et al. [14] decompose both the question and the context into small units, construct a graph, and convert question answering into an alignment problem. Majumder et al. [9] re-rank the candidate answers according to the degree of overlap between the candidate sentence and the question. Zhou et al. [15] introduce rules based on external knowledge to regularize the model and adjust its output distribution. Yeh and Chen [16] trained the model by maximizing the mutual information between passages, questions, and answers, avoiding the effect of superficial biases in the data on model robustness. Although these methods improve the robustness of MRC models to varying degrees, they do not consider the influence of model representation on robustness.
Contrastive Learning

In recent years, contrastive learning has become a popular self-supervised representation learning technique and has been used extensively in computer vision. Its main idea is to shorten the distance between positive sample pairs in the representation space. Chen et al. [17] proposed SimCLR, which constructs positive samples by data augmentation and negative samples by random sampling within the same batch.
Contrastive learning has also been applied to learn better sentence representations in NLP. Gao et al. [18] proposed SimCSE, which uses dropout as a means of data augmentation and achieves good performance on natural language inference tasks. Wang et al. [19] proposed a method for constructing semantic negative examples for contrastive learning to improve the robustness of pre-trained language models. Yan et al. [20] and Zhang et al. [21] used contrastive learning to solve the collapse problem of the model representation space and achieved good results in short-text clustering and natural language inference. However, all of the above contrastive learning methods treat the input context as a single unit when improving the model representation, without considering a fine-grained representation of the context.

Method
In this section, we introduce the multi-task machine reading comprehension learning framework via contrastive learning. The framework of MRCCL is illustrated in Fig. 1. The model is divided into the MRC module, the contrastive learning module, and the multi-task joint training strategy. The MRC module extracts the correct answers, the contrastive learning module improves the representation ability of the model, and the two modules are trained jointly under the multi-task strategy. The MRC module shares its encoding-layer parameters with the contrastive learning module. Each module has its own loss function, but we combine the two into a joint loss function to adjust the model's parameters. The contrastive learning module only works in the training stage. In the following sections, we describe each module of MRCCL in detail.

MRC Model Architecture
In the MRC module, we adopt the most common extractive MRC model. The structure of the MRC module is shown on the left of Fig. 1. It is composed of an encoder and a downstream multi-grain classifier. In the extractive MRC task, the dataset is given as triples (C_i, Q_i, A_i), i = 1, ..., n, where C_i denotes the context that needs to be understood by the model, Q_i denotes the question, A_i denotes the answer label corresponding to the question, and n denotes the size of the dataset. In the training stage, each input is always composed of such a triple. The answer A_i is composed of the starting position A_i^s and the ending position A_i^e; the model needs to find the starting and ending positions of the answer in C_i according to the input Q_i.

The encoder output is projected to an answer position score logits ∈ R^(2×d). The model then calculates the sentence-level score sentence_logits ∈ R^(2×d) and the word-level scores start_logits ∈ R^d and end_logits ∈ R^d separately. start_logits and end_logits can be obtained directly from logits. For each sentence in the context, the model computes the mean of the answer position scores of the words in that sentence to obtain the sentence-level score; this score is not a single value, but has the same dimension as the sentence length. Splicing the sentence-level scores of all sentences in the context gives sentence_logits. Finally, we add the word-level scores and the sentence-level scores to obtain f_start_logits ∈ R^d and f_end_logits ∈ R^d, the start and end position scores used to extract the answer span.

The encoder output directly affects the calculation of f_start_logits and f_end_logits in the MRC model and thus the extraction of answer spans. In other words, if two vectors lie at nearby positions in the context representation space, their answer scores will be close. Therefore, a dense context representation space leads to poor model robustness, and the model easily produces wrong answers under adversarial attacks.
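The fusion of word-level and sentence-level scores described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `fuse_logits`, the plain-list representation of logits, and the `(start, end)` span format are all assumptions.

```python
def fuse_logits(start_logits, end_logits, sentence_spans):
    """Fuse word-level and sentence-level answer-position scores.

    start_logits, end_logits: per-token scores over the context (length d)
    sentence_spans: list of (start, end) token-index pairs, one per sentence
    """
    def sentence_scores(logits):
        out = list(logits)
        for s, e in sentence_spans:
            # each token in a sentence receives that sentence's mean score,
            # so the sentence-level score has the same length as the sentence
            mean = sum(logits[s:e]) / (e - s)
            for i in range(s, e):
                out[i] = mean
        return out

    sent_start = sentence_scores(start_logits)
    sent_end = sentence_scores(end_logits)
    # f_start_logits / f_end_logits: word-level plus sentence-level scores
    f_start = [w + s for w, s in zip(start_logits, sent_start)]
    f_end = [w + s for w, s in zip(end_logits, sent_end)]
    return f_start, f_end
```

A token inside a high-scoring sentence is thus boosted even if its own word-level score is modest, which is what lets the sentence level interact with the sentence-wise contrastive objective later.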
Our MRC module uses the cross-entropy function to compute the loss from these scores:

L_mrc = f_CE(f_start_logits, A_i^s) + f_CE(f_end_logits, A_i^e)

where f_CE denotes the cross-entropy function, and A_i^s and A_i^e denote the starting and ending position labels of the answer, respectively. The encoding result and the linear-layer weights are updated by back-propagation.

Contrastive learning in context representation space
In order to solve the problem of a dense representation space in traditional MRC models, we introduce contrastive learning into the MRC model, corresponding to the contrastive learning module described above. Because of the particularities of the extractive MRC task, we adapt contrastive learning and call the result Contrastive Learning in Context Representation Space (CLCRS). CLCRS is a type of supervised contrastive learning; it can be seen as contrastive learning in the representation space of the context. Common contrastive learning, which compares positive and negative samples taken from the same batch, has little effect in distinguishing answer sentences from misleading sentences. To deal with this, CLCRS uses a sampling strategy tailored to MRC models: it samples sentences containing context information from the context as positive and negative samples, expanding the distance between the answer sentence and the other sentences in the context. CLCRS thereby solves the dense-representation problem of the original pre-trained MRC model by enlarging the distance between different sentence vectors. CLCRS is shown on the right of Fig. 1. Its input has the same form as that of the MRC module and can be expressed as ctx = {c_i^1, ..., c_i^n, Q, A}, where c_i^j denotes the j-th sentence of the context, Q denotes the question sentence, and A denotes the sentence containing the answer. Different from the previous mainstream strategy of sampling negatives from the same batch, we sample sentences in the context as negative samples. Specifically, following Gao et al. [18], we generate the positive sample corresponding to A by dropout and select the other sentences in the input ctx, except the question Q, as negative samples for contrastive learning. The effect of CLCRS on the representation ability of the model is shown in Fig. 2.
For the input ctx, the encoding result encoder_output ∈ R^(m×d) is obtained after encoding. We divide encoder_output into sentence vectors according to the original sentence boundaries and use mean pooling to generate the sentence representations cl_output ∈ R^(k×m), where k denotes the number of sentences in the context and m denotes the dimension of the hidden layer. In CLCRS, we use InfoNCE as the loss function:

L_cl = -log( exp(S(z_i, z_i^+)/τ) / Σ_j exp(S(z_i, z_j)/τ) )

where S(·) denotes the cosine similarity function, τ is a temperature hyperparameter, z_i^+ denotes the positive sample, and z_j ranges over the negative samples. By optimizing this loss function, the distance between the sentences in the context is enlarged and the context representation space is expanded.
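The sentence-level InfoNCE objective can be sketched as below, assuming the pooled sentence vectors are already available; the function name `clcrs_loss`, the default τ = 0.05, and the plain-list vector format are illustrative assumptions, not details from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity S(.) between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def clcrs_loss(anchor, positive, negatives, tau=0.05):
    """InfoNCE loss for one answer sentence.

    anchor:    mean-pooled vector of the answer sentence A
    positive:  a second, dropout-perturbed encoding of the same sentence
    negatives: pooled vectors of the other context sentences (question excluded)
    """
    pos = math.exp(cosine(anchor, positive) / tau)
    total = pos + sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    # -log( exp(S(z_i, z_i+)/tau) / sum_j exp(S(z_i, z_j)/tau) )
    return -math.log(pos / total)
```

Minimizing this loss pulls the two dropout views of the answer sentence together while pushing the answer sentence away from every other sentence in the same context, which is exactly the "expansion" of the context representation space described above.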

Multi task learning
Contrastive learning can expand the context representation space. In order to expand the representation space of the MRC model, we introduce a multi-task learning strategy: we combine the loss function of MRC with the loss function of contrastive learning and optimize the two modules simultaneously during training. CLCRS only works in the training stage. Specifically, we share the encoder parameters between the MRC module and CLCRS.
Referring to the work of Liebel and Körner [22], we combine the loss functions of the two modules into a joint loss function:

L_union = f(L_mrc, L_cl) = 1/(2a²) L_mrc + log(1 + a²) + 1/(2b²) L_cl + log(1 + b²)

where a and b are learnable parameters, and L_mrc and L_cl are the loss functions of the MRC module and CLCRS, respectively.
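A minimal sketch of this uncertainty-weighted combination follows. The closed form is reconstructed from the Liebel and Körner regularizer cited above (the original equation is truncated in the source), so treat the exact form, and the function name `joint_loss`, as assumptions.

```python
import math

def joint_loss(l_mrc, l_cl, a, b):
    """Uncertainty-weighted joint loss in the style of Liebel and Koerner.

    l_mrc, l_cl: current values of the MRC and contrastive losses
    a, b:        learnable scalars (updated by backpropagation in training);
                 the log(1 + a^2) terms keep them from growing unboundedly
    """
    return (l_mrc / (2 * a * a) + math.log(1 + a * a)
            + l_cl / (2 * b * b) + math.log(1 + b * b))
```

Because a and b are learned, the model itself balances the two tasks: if one loss is noisy, its weight 1/(2a²) shrinks, while the log regularizer penalizes simply ignoring a task.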

Experiments
In order to verify the performance of our algorithm, we carried out several experiments and analyzed the results. First, we introduce the datasets and experimental settings. Second, we evaluate our method on adversarial datasets with two families of baseline pre-trained language models and compare it with other methods. Finally, we conduct an ablation study to verify the effectiveness of each module in MRCCL.

Datasets
We only use the SQuAD1.1 training set to train our model. For the problem we want to solve, we generate an adversarial test set, AddCfa, from the SQuAD1.1-dev set to evaluate the robustness of the model. Following Jia and Liang [3], AddCfa is generated as follows: first, similar words from GloVe [23] replace the named entities and numbers in the answer sentence; then, antonyms from WordNet [24] replace the nouns and adjectives in the answer sentence to obtain a misleading sentence; finally, the misleading sentence is inserted directly after the answer sentence in the context. The example in Table 1 is taken from AddCfa. We choose DEV [1], AddSent [3], AddCfa, and AddSentMod [3] as test sets to evaluate our approach.

Fig. 2 The strategy of contrastive learning in our method. Q denotes question sentences, C denotes the context, c_i denotes the sentences in the context other than the answer sentence, A denotes the answer sentence, and V denotes misleading sentences. Two sentences with similar colors represent a pair of positive samples generated by dropout. When the model encodes adversarial samples, the contrastive learning module effectively distances the answer sentence from other sentences, such as misleading sentences.
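The AddCfa construction can be sketched as follows. This is a toy illustration of the three steps, not the generation script: the substitution dictionaries stand in for real GloVe nearest-neighbour and WordNet antonym lookups, and all names are assumptions.

```python
def make_misleading(answer_sentence, entity_subs, antonym_subs):
    """Build a misleading sentence from the answer sentence.

    entity_subs:  maps named entities / numbers to embedding-space neighbours
                  (toy stand-in for GloVe similarity)
    antonym_subs: maps nouns / adjectives to their antonyms
                  (toy stand-in for WordNet)
    """
    out = []
    for w in answer_sentence.split():
        if w in entity_subs:          # step 1: entity / number replacement
            out.append(entity_subs[w])
        elif w in antonym_subs:       # step 2: antonym replacement
            out.append(antonym_subs[w])
        else:
            out.append(w)
    return " ".join(out)

def insert_after_answer(context_sentences, answer_idx, misleading):
    """Step 3: insert the misleading sentence right after the answer sentence."""
    return (context_sentences[:answer_idx + 1] + [misleading]
            + context_sentences[answer_idx + 1:])
```

Because the misleading sentence is placed adjacent to the answer sentence and shares most of its surface form, it is exactly the kind of near-duplicate that a dense representation space fails to separate.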

Experiment settings
We selected five pre-trained language models for our experiments: BERT-base [25], BERT-large [25], BERT-large-wwm, RoBERTa-base [26], and RoBERTa-large [26]. For the common settings, the AdamW optimizer is used during training; all parameters required for multi-task joint training are optimized by AdamW. The maximum input length of our model is set to 384. To deal with long text, we chunk it into equally spaced segments using a sliding window of size 128. We set the number of training epochs to 3. We use 0.
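The long-text chunking described above can be sketched as below. Whether "window of size 128" means the window step or the token overlap is not stated; this sketch assumes a step of 128 (as in the common SQuAD `doc_stride` convention), and `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(tokens, max_len=384, stride=128):
    """Split a long token sequence into overlapping windows.

    max_len: maximum model input length (384 in the paper's settings)
    stride:  step between consecutive window starts (assumed to be the
             "sliding window of size 128" from the paper)
    """
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

Each chunk is then paired with the question and fed to the model; overlapping windows ensure an answer span near a chunk boundary appears whole in at least one window.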

Experiment on baseline model
We selected five pre-trained models of different sizes as baselines to verify our algorithm: BERT-base, BERT-large, BERT-large-wwm, RoBERTa-base, and RoBERTa-large. We report F1 and EM on the four test sets as evaluation metrics. AVG is calculated from the results on DEV, AddSentMod, AddSent, and AddCfa, and is used as the final metric of model robustness. All results are shown in Table 2, with the best results highlighted in bold. In terms of AVG, our model improves robustness across all baseline models. It performs best on the RoBERTa-large model, with an improvement of 2.3; even the smallest improvements, on BERT-large and RoBERTa-base, are 1.4.
The large models not only perform well on non-adversarial samples but also achieve high accuracy on adversarial samples. Compared with the base models, they show smaller performance differences between adversarial and non-adversarial samples, stronger robustness, and less vulnerability to adversarial attacks. A larger model structure has stronger anti-interference ability and can effectively improve the robustness of the model.
The RoBERTa models perform better on the adversarial test sets than the BERT models: RoBERTa-base and RoBERTa-large outperform BERT-base and BERT-large on the AVG metric.

Algorithm comparison
To further illustrate the advantages of our algorithm, we choose the following eleven methods for comparison: QAInfoMax [16], MAARS [9], R.M-Reader [27], KAR [11], BERT+Adv [12], ALUM [13], Sub-part Alignment [14], BERT+DGAdv [8], BERT+PR [15], HKAUP [28], and PQAT [13]. All eleven methods are designed to improve the robustness of MRC models. The results are shown in Tables 3 and 4, with the best results highlighted in bold. QAInfoMax [16] and MAARS [9] use part of the data in AddSent to verify their effectiveness, so we divide the results into two tables; AddSent-small in Table 3 is a subset of AddSent in Table 4. Since most of these algorithms are built on BERT, we compare our results on the BERT-base and BERT-large-wwm baselines with those of the eleven algorithms.
In Table 4, compared with Sub-part Alignment, our algorithm gains 3.3 F1 on AddSent and 15.0 F1 on adversarial samples. Compared with MAARS [9] in Table 3, which outperforms previous state-of-the-art defense techniques, the F1 of MRCCL on AddSent is improved by 4.0 points. This shows that our method is better than the other eleven methods on adversarial test sets. On DEV, BERT-large-wwm+MRCCL ...

The ablation results are shown in Table 5. Both the sentence logits and the contrastive learning module have a beneficial effect on the robustness of the BERT model. Sentence logits improves the accuracy of the BERT and RoBERTa models on adversarial samples, but remains below MRCCL. In fact, since the contrastive learning module is computed at the sentence level, adding sentence logits allows the contrastive learning task to be better introduced into the MRC model, and there is no conflict between the two in terms of model improvement.
Finally, we explore which training strategies suit our approach. We compare the F1 of the model trained sequentially with that of the model trained by multi-task learning on the four test sets; only the training strategy is modified, and all other settings are unchanged. The results are shown in Table 6: the multi-task joint training method performs better than sequential training, so multi-task joint training plays an important role in our method. Pipeline in the table means that we first carry out the contrastive learning task and then train the MRC task.

Fig. 3 The impact of a dense context representation space on model accuracy

Parameter efficiency of MRCCL
MRCCL introduces no additional parameters and has the same number of model parameters as the baseline model. However, since the contrastive learning module requires the construction of positive samples, the amount of training data is doubled, resulting in a longer training time. A comparison of model training times is shown in Table 7.

Discussion
Context representation space of reading comprehension models

CLCRS is the core of our algorithm. It is a special kind of contrastive learning suited to MRC tasks, aiming to solve the over-density of the model's context representation space. In our experiments, we found that the model's ability to represent the context directly influences its robustness: if the distances between the encoded sentences of the context are too small, the accuracy of the model on that example is greatly reduced. We illustrate this through a further experiment. We measured the relationship between the accuracy of the five pre-trained models on adversarial samples and the density of the representation space; the results are shown in Fig. 3. The answer sentence is the key sentence for extracting the answer, so we chose the distance between the answer sentence and the other sentences in the context as the measure of density, computed as the cosine similarity between sentence vectors. We set the threshold t in {0.75, 0.8, 0.85, 0.9} and computed the F1 score over the samples in which the average similarity between the answer sentence and the other sentences exceeded t. As the figure shows, the denser the context representation space, the less accurate the model.
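The density measure used in this experiment can be sketched as follows; the function name `answer_context_density` and the plain-list vector format are illustrative assumptions.

```python
import math

def answer_context_density(answer_vec, other_vecs):
    """Mean cosine similarity between the answer-sentence vector and the
    other sentence vectors in the context. Higher values mean a denser
    representation space; a sample counts toward threshold t when this
    value exceeds t."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    sims = [cosine(answer_vec, v) for v in other_vecs]
    return sum(sims) / len(sims)
```

Grouping evaluation samples by this score and comparing per-group F1 is what produces the trend in Fig. 3: higher density, lower accuracy.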
Experiments demonstrate that the ability of the MRC model to represent each sentence in the context directly affects the robustness of the model.
Next, we experimentally illustrate the improvement of MRCCL on the model representation. As before, we measured the distance between the answer sentence and the other sentences in the context. The results are shown in Table 8: the mean distance is reduced from 0.8 to -0.1, and the range of distances changes from (-0.142, 0.990) to (-0.777, 0.722). MRCCL effectively separates the individual sentences in the context and also generalizes well on adversarial samples.

Conclusion and future work
In this paper, we address the poor robustness of extractive MRC models. More specifically, we tackle the tendency of MRC models to err on instances with additional misleading sentences, known as the oversensitivity problem. We found that the poor robustness of the MRC model is caused by its overly dense context representation space. Therefore, we propose a multi-task machine reading comprehension framework via contrastive learning called MRCCL. By introducing CLCRS into the MRC model, we enhance its representation ability, improve its robustness, and thereby alleviate the oversensitivity problem. The experimental results show that our method further improves model robustness and achieves state-of-the-art performance.
Data Availability The dataset was derived from the following public domain resource: https://github.com/rajpurkar/SQuAD-explorer

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.