Introduction

Automatic grading can save teaching personnel time and effort, while also offering nearly instantaneous, inexhaustible feedback to students. Additionally, it can shift students’ attention from the potential reputation gain or loss in the graders’ eyes to their work (Lipnevich & Smith, 2009). Thus, it is unsurprising that automatic grading has garnered substantial research and commercial attention in recent decades, with platforms like Moodle and Edgenuity even offering automatic short answer grading (ASAG) options. In ASAG, short free-text responses are automatically evaluated based on how wholly and correctly they answer a given question. While the commonly available applications are still based on keyword and string matching, the next logical step is using more sophisticated models.

Transformer models, such as BERT (Devlin et al, 2019), have been shown to perform well on the ASAG task (Camus & Filighera, 2020). On specific datasets, they are even beginning to compare to human performance (Sung et al, 2019). However, such models do not differentiate between valuable and brittle features in their decision process (Ilyas et al, 2019). For example, a model may observe that incorrect answers typically contain fewer punctuation symbols than correct answers. While there is hardly a causal relation between the number of punctuation symbols in a short-answer response and its factual correctness, current neural networks would utilize such spurious correlations for their predictive power.

Unreliable features in automatic grading models mainly pose two problems. Firstly, students may not receive the points they deserve due to misclassifications. This is especially problematic for student populations represented less well in the training data. For example, students with Developmental Language Disorder will likely express themselves differently, such as using fewer or less complex adjectives (Davies et al, 2023; Tribushinina & Dubinkina, 2012), than their typically developing counterparts. Should adjective usage be correlated with high grades in the overall dataset due to sampling effects and high-performing students typically utilizing expressive language, students with language deficiencies will be systematically disadvantaged by neural automatic grading models. Secondly, unreliable features may cause students to receive points for incorrect responses. Noticeable patterns caused by unreliable features may even be purposefully exploited.

Students can exploit a grading model’s weaknesses to achieve better grades, similar to copying from a cheat sheet or other students during an assessment. It is likely that students would be just as willing to employ such methods as more traditional cheating tactics, provided the necessary skills and opportunities. Although the reported percentage of students employing traditional cheating techniques in practice varies greatly across studies (Jordan, 2001), large-scale reviews report around 70–86% of students cheating on exams or assignments during their college career (Klein et al, 2007; Whitley, 1998). Many cheating incidents are never caught (Franklyn-Stokes & Newstead, 1995), which is problematic for two reasons. Firstly, undetected cheating cases can call all assessment results into question, even when most of them are legitimate. Secondly, cheating during a semester correlates with lower learning outcomes in final exams (Palazzo et al, 2010). This indicates that cheaters retain less knowledge than they would learn by actually completing the coursework. Therefore, automatic grading models should not only score well on a given dataset but also make their predictions for the right reasons.

Cheating on ASAG models may be as simple as answering with motley lists of potential keywords, which has proven successful in obtaining good grades from current ASAG systems (Ding et al, 2020). Alternatively, it may also extend to adversarial attacks, which subtly modify inputs to prompt an incorrect prediction. Adversarial attacks have been a hot topic of research in the last few years, with thousands of proposed approaches to exploiting unreliable features and considerable attention from the general public. However, most do not translate well to the automatic grading scenario, where knowledge of the model’s inner workings is scarce and the time and expertise of potential attackers are limited. For this reason, we design an attack to answer the following research question:

How vulnerable are neural automatic short answer grading models to an adversarial attack based on inserting adjectives and adverbs?

In summary, we make the following contributions in this paper:

  • We propose an adversarial attack specifically tailored to assessment scenarios. Querying the model prior to the assessment identifies adjectives and adverbs the model associates with the target class. These adjectives and adverbs can then be inserted into grammatically valid places during testing to fool the model into predicting the target class. A successful attack is depicted in Table 1.

  • We demonstrate the attack’s effectiveness on BERT and T5 (Raffel et al, 2020) using automatic short answer grading and related datasets. Our evaluation shows that BERT’s and T5’s predictions are affected by spurious correlations between adjective and adverb usage and answer correctness.

  • We conduct a human evaluation of our attack to investigate its detectability. Knowing how easily human graders can spot adversarial attacks is vital for estimating the risk of discovery. The riskiness of cheating, in turn, influences how likely students are to employ adversarial attacks in practice.

  • We formulate recommendations for using automatic grading systems more securely in practice. Our recommendations are based on related work, our experiments, and a systematic investigation of the models’ brittleness.

Table 1 A successful adverb insertion causing the automatic grading model to shift its prediction from incorrect to correct

The rest of the paper is structured as follows. In the following section, we state the considerations that affected the requirements and design of our attack. Then, we discuss related approaches to gaming educational systems, automatic short answer grading and adversarial attacks. In "Methods", we describe our attack in detail, followed by the setup of our experiments. Next, "Results" presents our hypotheses, a comparison of our attack with the state-of-the-art, our analysis of the model’s brittleness and the results of our human evaluation. Finally, we discuss our results and provide recommendations for safely employing automatic grading systems in "Discussion & Conclusion".

Adversarial Attack Design Considerations

Most adversarial attacks manipulate specific input instances. In our scenario, this means that they manipulate individual student answers. They are not suited to fool automatic grading systems because one would either have to know what one will answer before the assessment or one would have to run the adversarial attack during the assessment. As most adversarial attacks require time and feedback from the model, they are hard to use in time-constrained assessments.

Universal adversarial attacks, on the other hand, aim to apply to all answers. For example, a model may be vulnerable to specific trigger phrases that increase the model’s likelihood of predicting a target class, regardless of the actual sample (Filighera et al, 2020a). Once found, such trigger phrases can easily be inserted into new answers at test time without necessitating any on-the-fly adaptation or know-how on the student’s side. However, such a cheating strategy is risky as manual graders quickly identify nonsensical trigger token sequences as cheating attempts. This example illustrates some of the unique constraints encountered in the educational assessment scenario, underlining the need for a specifically tailored attack. In summary, our attack is based on the following considerations:

  • Access to the model. Many adversarial attacks use information about the inner workings of a model to inform their search. For example, they may propagate the model’s gradients to find influential words in a sample. Modifying important words is more likely to fool the model successfully. However, students would not typically have access to the grading model’s inner workings. Furthermore, students would not have access to the model’s raw output. Many approaches utilize the class probabilities output by the model to find sequences of perturbations that increase the probability of the target class. This constraint already makes most of the current adversarial attacks proposed in the literature nonviable for the assessment domain. However, we assume that students can receive verification feedback from the model prior to the targeted assessment. For example, this would be the case when students have multiple assignments graded by the model throughout the semester. Alternatively, students may be allowed to submit multiple answers to the model for formative assessment. Prior access to the model is likely, considering one of the main advantages of automatic grading models is their ability to provide an inexhaustible source of verification feedback queryable as often as desired.

  • Detectability. When the perceived cost of cheating is low, students are more willing to engage in academically dishonest behavior (Murdock & Anderman, 2006). One of the main factors influencing the perceived cost is the likelihood of being caught (Murdock & Anderman, 2006). Therefore, the chance of detection will impact the students’ decision whether to employ a given adversarial attack. Detectability includes how easily manipulated samples are spotted, automatically or manually, and how hard it would be to prove a deceptive intent. For example, concatenating the same nonsensical phrase to every answer the student is unsure about could not only quickly be flagged automatically, but a student would also be hard-pressed to provide a believable excuse. In contrast, overusing adjectives or adverbs the model is vulnerable to is much harder to spot and could also be explained away by the student’s writing style.

  • Expertise necessary to utilize attack during test time. In general, we do not expect students to be machine learning experts. While some students may very well have the ability to identify a model’s weakness given enough time and knowledge (Filighera et al, 2020b), it is unlikely that a majority of students will be able to perform a complex adversarial attack under pressure. For this reason, it is essential that any attack would be straightforwardly executable during the assessment.

  • Class equivalency. The modified samples produced by an adversarial attack are called adversarial examples. Their exact definition varies in the literature, but it is common to define them as intentionally modified versions of clean inputs aiming to fool a machine learning technique (Akhtar & Mian, 2018). This definition implies that the adversarial example’s actual class should remain identical to the original clean sample’s. In our case, this means that any perturbation of incorrect answers should not actually make the answers correct. It should only fool the model into predicting them as such.

  • Type of Input. We aim to fool automatic short answer grading systems. Therefore, we expect to deal with short answers between a phrase and a few paragraphs long (Burrows et al, 2015). The evaluation focus is on the semantic content of the response in contrast to the writing style or grammatical correctness. For the design of the attack, this means that linguistic modifications and even grammatical mistakes are acceptable as long as they do not change the response’s meaning significantly.

Related Work

In this section, we discuss related work intersecting with ours. First, we summarize prior art on exploiting educational systems. Moving on, we present automatic short answer grading systems. Finally, we recapitulate various adversarial attack methods found in the field of natural language processing (NLP).

“Gaming" Educational Systems

As in most systems where people stand to gain, educational systems encounter learners that try to achieve their goal through unintended strategies. This is true for traditional physical classrooms and especially relevant in online or distance learning, where students are not restricted to copying from their neighbors but may access the entire internet (Austin & Brown, 1999; Lanier, 2006; Watson & Sottile, 2010).

Beyond plagiarism, there are also cheating strategies unique to digital learning. Students may take screenshots of assessment questions to share with students being assessed later, gain illicit access to the question pool’s repository by exploiting lax security measures or disrupt internet connections to re-take assessments (Rowe, 2004). McGee (2013) even advises against using traditionally popular formats, such as Multiple Choice questions, in online assessments as the correct answer can easily be found on the web. Instead, they recommend constructed response and essay questions where multiple correct answers exist.

In Massive Open Online Courses (MOOCs), gaming the system is prevalent enough to warrant a designation for learners committing to non-learning strategies: fake learners (Alexandron et al, 2018, 2019). Here, some learners set up multiple accounts gathering solutions to assessments to use in their main account (Northcutt et al, 2016; Ruiperez-Valiente et al, 2016). Students may cooperate to share valid answers even when multiple accounts are impossible, like in Small Private Online Courses (SPOCs) (Jaramillo-Morillo et al, 2020).

The work discussed so far mainly investigated academic dishonesty on a system level by exploiting the lack of direct supervision or the structure of an online course. We will now focus on the work closest to our own, namely task-oriented cheating attempts. Such behavior and possible mitigation approaches have been well studied in intelligent tutoring systems (Baker et al, 2006; Muldner et al, 2010, 2011; Peters et al, 2018; Walonoski & Heffernan, 2006a, b). Beyond exploiting systematic weaknesses, such as known savepoints or progressive hints, students may also systematically probe tasks to guess the correct answers (Baker et al, 2008). For instance, they can select every choice in a Multiple Choice question or exhaustively try out different numbers in a math problem. Depending on the tutor, students may also repeatedly submit the same answer or empty answers to prompt the tutor to provide the correct solution (Baker et al, 2010).

Similar to previous work (Ding et al, 2020; Filighera et al, 2020a), we aim to extend this line of research to short answer constructed-response formats that have been less popular in tutors and online assessments due to the difficulty of automatically grading them. As this seems to be changing (Sung et al, 2019), exploring potential weaknesses and cheating detection strategies is essential before ASAG systems see widespread use.

Automatic Short Answer Grading

The challenge of automatically grading short answers was first posed a few decades ago. Earlier ASAG approaches consisted of clustering similar answers (Basu et al, 2013; Zehner et al, 2016), utilizing hand-crafted rules, schemes and ideal answer models (Leacock & Chodorow, 2003; Willis, 2015), or combining manually engineered features with various machine learning models (Marvaniya et al, 2018; Mohler et al, 2011; Saha et al, 2018; Sahu & Bhowmick, 2020; Sultan et al, 2016). Please refer to one of the comprehensive surveys of this field for a more in-depth elaboration of these approaches (Burrows et al, 2015; Galhardi & Brancher, 2018; Roy et al, 2015).

In recent years, deep learning approaches have outperformed classical methods (Kumar et al, 2017; Riordan et al, 2017; Tan et al, 2018, 2020). They mainly treat ASAG as a text similarity or entailment problem and focus on encoding student answers and reference answers in the same vector space. This learned representation of the answers then determines their similarity. Additionally, some approaches consider the question (Lv et al, 2021), student models (Zhang et al, 2020b) or results from True/False questions posed in the same assessment (Uto & Uchida, 2020). Transformer-based approaches are also noteworthy here (Camus & Filighera, 2020; Ghavidel et al, 2020; Lun et al, 2020; Sung et al, 2019). They achieve high performance on the SemEval short answer grading benchmark dataset (Dzikovska et al, 2013). We selected two transformer-based models for grading in this paper: BERT (Devlin et al, 2019), for its high performance in related work, and T5 (Raffel et al, 2020), for its high performance on the SuperGLUE benchmark containing various NLP tasks. Both models are Transformers, meaning they use attention instead of recurrence or convolution to extract information from sequences. They are pretrained by language modeling on large corpora to learn a basic representation of general language. While BERT is pretrained on books and Wikipedia, T5 utilizes a filtered version of a Common Crawl web dump. After pretraining, the models can then be finetuned on task-specific data. Typically, the pretrained weights are only adjusted for a few epochs before the best performance on the task is reached. In contrast to T5, BERT only consists of an encoder. Thus, it is half as large in terms of parameters and requires the addition of a task-specific output layer.

Adversarial Attacks in NLP

In recent years, the number of adversarial example generation methods has grown rapidly (Chakraborty et al, 2021; Huang et al, 2020; Xu et al, 2020; Yuan et al, 2019; Zhang et al, 2020a). Automatic approaches mainly consist of strategically making minor, often meaning-preserving adjustments to the input text.

Changes can be done on a word level by inserting, deleting or replacing words. Proposed replacement strategies include replacing words with their synonyms (Jin et al, 2020; Ren et al, 2019), their closest neighbors in the embedding space (Alzantot et al, 2018), legitimate words that could result from potential typos (Samanta & Mehta, 2017) or other words with a high probability of matching the input context (Zhang et al, 2019). Recently, researchers also utilized BERT to generate adversarial examples by masking parts of the input text (Garg & Ramakrishnan, 2020) or predicting possible token replacements (Li et al, 2020). Belinkov and Bisk (2018) consider character-level modifications, such as word scrambling or swapping adjacent characters. Lastly, paraphrasing approaches aim to modify the structure of whole sentences (Iyyer et al, 2018) or use variational autoencoders to generate adversarial examples from scratch (Ren et al, 2020). Manual or semiautomatic approaches, on the other hand, ask experts (Ettinger et al, 2017; Wallace et al, 2019b) or students (Filighera et al, 2020b) to find adversarial perturbations for specific examples manually.

Important to mention here is the TextFooler attack proposed by Jin et al (2020) since it forms the basis of our comparison with the state-of-the-art in "Results" section. The first step of this attack is to identify important words by deleting them from an input sequence and observing their effect on the outputted classification probabilities. While this can be considered a black-box approach according to common definitions (Zhang et al, 2020a), the raw class probabilities outputted by a model are not usually accessible to the model’s users. However, using this information makes the attack more powerful and, thus, a better representative of state-of-the-art performance. Once important words are identified, they can be replaced by synonyms to fool the target model in the second step of the attack.

All the previously described approaches have in common that they target individual texts. As discussed in "Adversarial Attack Design Considerations", they do not apply to assessment scenarios. Students would have to know exactly what they will answer to the assessment questions beforehand to find adversarial modifications that work for precisely those answers.

Instead, students require input-agnostic strategies that they can then apply to unexpected questions during test time. Universal attacks aim to consistently fool the model on all samples instead of individually manipulating each sample. Sample independence can be achieved by generalizing individual adversarial examples to generally applicable rules (Ribeiro et al, 2018). Ribeiro et al (2018) first translate the input into a pivot language and back to generate paraphrases. Paraphrases that are semantically similar to the original input and cause a misclassification in the target model are abstracted into candidate rules, which are then manually verified to be semantically equivalent. For example, one could observe that doubling question marks in texts often succeeds in fooling the model. So “? \(\to\) ??” would be a legitimate replacement rule for all texts, even if it may not be applicable or successful on every example. However, attacks aiming to find semantically equivalent, general replacement rules often suffer losses to their success rate. Ribeiro et al (2018), for instance, flip the predicted label of 1–4% of the samples in their experiments.

Our proposed attack is similar to Ribeiro et al’s (2018) approach as we also probe the model to find adjectives and adverbs that fool it as often as possible, which we can then insert in grammatically proper places across answers. Whereas Ribeiro et al (2018) constrain their modifications to be semantically equivalent to the original example, we only require the actual class to remain unchanged. While inserting adjectives and adverbs likely changes the sample’s class in some NLP tasks, like sentiment analysis, it is unlikely to make incorrect answers correct—excluding negating adverbs, such as not. Thus, we can find more viable rules with higher success rates by carefully relaxing the equivalency constraint.

Alternatively, Gao and Oates (2019) search for a small perturbation in the embedding space that is then applied to all tokens indiscriminately, similar to adding noise to images. Their attack requires access to the preprocessed and embedded inputs, which students would not typically have. The last category of approaches constructs meaningless trigger sequences of tokens that a model associates with a specific class (Behjati et al, 2019; Filighera et al, 2020a; Song et al, 2021; Wallace et al, 2019a). While these triggers can then be applied straightforwardly to all answers in an assessment, they are detectable due to their nonsensical nature.

Methods

In this section, we will first introduce the details of our proposed attack. Then, we describe our experimental setup for measuring the attack’s quality. As briefly discussed in "Adversarial Attack Design Considerations", we are not only interested in how successfully it can fool victim models but also in its feasibility, the likelihood of being detected, and the validity of the generated samples.

Adversarial Word Insertion

To systematically insert adjectives and adverbs that cause misclassifications, we first require a source of promising adjectives and adverbs. As can be seen in the overview of our attack in Fig. 1, we selected the Brown Corpus (Maverick, 1969) for the extraction of candidates. The corpus contains a broad collection of English texts from various domains. The most significant benefit of this corpus is that the texts are already annotated with their part-of-speech tags. While automatic tagging was used in the annotation, reliability was increased through manual proofreading. The high-quality annotation allows us to identify potential adjectives and adverbs. Since we plan to insert them before nouns and verbs, we analyze all bigrams contained in the corpus to find adjectives and adverbs that appear in the targeted configurations. Specifically, we only retain bigrams of the following forms:

  • (Adjective, Noun)

  • (Adjective, Pronoun)

  • (Adjective, Proper Noun)

  • (Adverb, Verb)

Fig. 1 Schematic overview of the attack

Consequently, our list of adjectives will only contain adjectives that appear directly before a noun or pronoun in the texts. For example, “The hat was alive.” would not yield an adjective for our selection, but “The blue hat was alive.” would. While this limits our potential insertion candidates, it increases the likelihood of grammatically valid insertions later on. We filter out stop-words to reduce the likelihood that our insertions accidentally correct an incorrect answer or significantly degrade its grammatical structure. Fortunately, Bird et al (2009) provide a list of stop-words in their Natural Language Toolkit that also includes meaning-inverting words, such as not, that could easily turn a contradictory response into a correct one. Thus, stop-words and meaning-inverting words are deleted from the candidate list. Finally, we select the 100 most frequent adjectives and adverbs from the filtered lists as the basis for our insertions. Prioritizing commonly used words should make the generated adversarial examples appear more natural compared to “students” suddenly using rare words like contumacious or Rhadamanthine.
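The following sketch illustrates this candidate-extraction step with NLTK’s copy of the Brown Corpus and its universal POS tagset. It is a minimal approximation, not the authors’ released code: the universal tags collapse proper nouns into NOUN, and the list of meaning-inverting words shown here is an assumption.

```python
from collections import Counter

import nltk
from nltk.corpus import brown, stopwords
from nltk.util import bigrams

nltk.download("brown")
nltk.download("universal_tagset")
nltk.download("stopwords")

# Assumed set of meaning-inverting words to exclude in addition to NLTK's stop-words.
NEGATIONS = {"not", "never", "no", "hardly", "barely"}
EXCLUDED = set(stopwords.words("english")) | NEGATIONS

adj_counts, adv_counts = Counter(), Counter()
for (w1, t1), (_, t2) in bigrams(brown.tagged_words(tagset="universal")):
    w1 = w1.lower()
    if w1 in EXCLUDED or not w1.isalpha():
        continue
    # Keep adjectives directly preceding a noun/pronoun and adverbs directly preceding a verb.
    if t1 == "ADJ" and t2 in {"NOUN", "PRON"}:
        adj_counts[w1] += 1
    elif t1 == "ADV" and t2 == "VERB":
        adv_counts[w1] += 1

# The 100 most frequent candidates of each type form the basis for insertions.
adjectives = [w for w, _ in adj_counts.most_common(100)]
adverbs = [w for w, _ in adv_counts.most_common(100)]
```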

Next, we need to identify possible insertion places for our adjectives and adverbs. Commonly, adversarial approaches would utilize the model’s gradients or class probabilities to identify words that have a high impact on the model’s prediction. For example, if deleting a word significantly reduces the probability assigned by the model to the true class, it would be marked as a good replacement candidate. However, we do not believe students will have detailed information on the grading model in practice. Therefore, we take the model-agnostic approach of declaring all nouns, proper nouns and pronouns available for adjective-prepending and, correspondingly, all verbs as targets for prepended adverbs. This process is illustrated under “Viable Positions” in Fig. 1. However, should the grading model become available to students, the number of positions can be constrained to the most promising ones to make the attack more efficient.
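A sketch of this model-agnostic position search is given below, again using NLTK: every noun or pronoun can receive an adjective and every verb an adverb. The naive whitespace detokenization and the helper name candidate_answers are illustrative assumptions.

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("universal_tagset")


def candidate_answers(answer, adjectives, adverbs):
    """Yield (inserted word, modified answer) pairs, one per viable insertion."""
    tokens = nltk.word_tokenize(answer)
    tags = [tag for _, tag in nltk.pos_tag(tokens, tagset="universal")]
    for i, tag in enumerate(tags):
        if tag in {"NOUN", "PRON"}:   # universal tagset folds proper nouns into NOUN
            words = adjectives
        elif tag == "VERB":
            words = adverbs
        else:
            continue
        for w in words:
            # Naive detokenization; a real attack would preserve the original spacing.
            yield w, " ".join(tokens[:i] + [w] + tokens[i:])
```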

Now that we have generated a multitude of adversarial candidates by inserting our adjectives and adverbs into the viable positions, it is time to query the model to see which candidates lead to misclassification. All successful adversarial examples are then collected to determine adjectives and adverbs that cause the most misclassifications. Students could then use these in assessments to improve their automatically assigned grades.
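The probing step can be sketched as follows: each candidate answer is submitted to the grading model, and the inserted word is credited whenever the verdict flips from incorrect to correct. The grade function is a placeholder for whatever query interface students actually have (e.g., a formative feedback endpoint), and the candidate_answers helper is reused from the sketch above.

```python
from collections import Counter


def rank_insertion_words(true_negatives, adjectives, adverbs, grade):
    """true_negatives: (question, reference, answer) triples the model already rejects."""
    wins = Counter()
    for question, reference, answer in true_negatives:
        for word, modified in candidate_answers(answer, adjectives, adverbs):
            # Credit the inserted word whenever the model's verdict flips to "correct".
            if grade(question, reference, modified) == "correct":
                wins[word] += 1
    # Words causing the most misclassifications are the ones to use during the assessment.
    return wins.most_common()
```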

Experiment Setup

This section describes the hyperparameters, datasets and experiment configurations used in this paper. In all our experiments, we use the base-sized BERT and T5 models provided by the huggingface library (Wolf et al, 2019). We perform hyperparameter tuning using 10% of the training data for validation. Each model trains for 8 epochs before the checkpoint with the best macro-averaged F1 score on the validation set is selected. After training, the respective best models are evaluated on the test splits of each dataset. All true negatives, that is, incorrect responses that the model correctly identifies as such, form the basis for the adversarial search. To avoid overestimating the attack’s success, we exclude incorrect answers that the model already misclassifies, as they require no modification.
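As an illustration of this setup, the following sketch fine-tunes the BERT grading model with the huggingface Trainer, using the SciEntsBank hyperparameters listed in "Datasets & Hyperparameters" as an example; T5 is fine-tuned analogously as a text-to-text model. The tokenized train_ds and val_ds objects are assumed to exist, and exact argument names may vary with the transformers version.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)


def macro_f1(eval_pred):
    logits, labels = eval_pred
    return {"macro_f1": f1_score(labels, np.argmax(logits, axis=-1), average="macro")}


args = TrainingArguments(
    output_dir="bert-seb",
    num_train_epochs=8,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # select the checkpoint with the best ...
    metric_for_best_model="macro_f1",  # ... macro F1 on the 10% validation split
)

# train_ds and val_ds are assumed to be tokenized answer pairs with correctness labels.
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=val_ds, compute_metrics=macro_f1)
trainer.train()
```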

Datasets & Hyperparameters

As discussed in the "Automatic Short Answer Grading" Section, automatic short answer grading is often viewed as a textual entailment or paraphrase detection task. For this reason, we also included such tasks from the popular GLUE and SuperGLUE benchmarks in our evaluation. In total, we experiment with the following four datasets, allowing us to investigate our attack’s applicability to a broad range of domains:

  • SciEntsBank (SEB) is a common ASAG benchmark providing questions, reference and student answers from various domains (Dzikovska et al, 2013). The answers stem from primary and middle school classes in the USA. We select the 3-way variant of this dataset, where answers are labeled as correct, incorrect or contradictory. The dataset contains three test sets: unseen answers for training questions (UA), unseen questions (UQ) and questions belonging to unseen domains (UD). The best performing BERT model (found after 3 epochs) used a batch size of 32 and a learning rate of \(2e-5\). The best performing T5 model (found after 7 epochs) trained with a batch size of 8, gradient accumulation over 4 batches and an Adafactor optimizer using relative steps and initial warmup (Shazeer & Stern, 2018). All reported T5 models use the same optimizer settings.

  • Recognizing Textual Entailment (RTE) is a task included in the GLUE and SuperGLUE benchmark. We selected this dataset because the limited amount of data proves to be challenging even for pre-trained transformer-based models. The data set contains sequence pairs of texts and hypotheses, and the model predicts whether the hypothesis can be inferred from the text. Recognizing textual entailment is quite similar to automatic short answer grading, where student answers should entail the reference answer (Dzikovska et al, 2013). The text pairs are labeled as entailment and not_entailment, corresponding to correct and incorrect in SciEntsBank. Since the test set for this benchmark is not public, we report the performance on the development set instead. The best performing BERT model (6 epochs) trained with a batch size of 32 and a learning rate of \(1e-5\). The best T5 model (6 epochs) was found using a batch size of 8 and gradients accumulated over 8 batches.

  • Multi-Genre Natural Language Inference (MNLI) is also a textual entailment task and part of the GLUE benchmark, containing pairs of premises and hypotheses (Williams et al, 2018). In contrast to RTE, the data set is categorized with three labels: entailment, contradictory and neutral. While the labeled test set is not publicly available, two development sets are provided, one of which was used as the test set in our experiments. The best performing BERT model (2 epochs) utilized a batch size of 64, a learning rate of \(2e-5\) and mixed-precision training (FP16). The hyperparameters of the T5 model remained unchanged.

  • The Microsoft Research Paraphrase Corpus (MRPC) aims to teach models to detect paraphrases (Dolan & Brockett, 2005). It makes up a part of the GLUE benchmark as well. Here, sequence pairs are labeled as 1, if the second sequence is semantically equivalent to the first one, and 0 otherwise. Detecting paraphrases is similar to grading short answers, where student answers should be semantically equivalent to the reference solution. Therefore, we can view instances labeled with 0 as incorrect and paraphrases as correct. The best BERT model (3 epochs) trained with a batch size of 32, a gradient accumulation over 2 batches, a learning rate of \(2e-5\) and mixed precision. The best T5 model (3 epochs) used a batch size of 8 and gradient accumulation over 4 batches.

Human Evaluation

While calculating the attack’s success rate is easily done, other quality dimensions are harder to measure. For example, since automatic metrics and models have difficulties capturing the meaning of utterances (Bender & Koller, 2020; Reiter, 2018), we need to rely on human judgment to determine whether our generated samples adhere to the class equivalency constraint. That is, whether the answers are still incorrect after our modification. Similarly, we require human opinions to estimate how easily adversarial examples are detected. While there are attempts to detect adversarial attacks automatically, they are most often bypassable with tweaks to the algorithm (Carlini & Wagner, 2017). Ultimately, we also expect a human grader to have the final say, making their judgment the most important to students. Asking humans to evaluate given texts is a well-known task in Natural Language Generation (NLG). Therefore, we will defer to NLG guidelines when evaluating our manipulated student responses. As judgments are often subjective, it is recommended to collect at least 3 different annotations per text to increase the evaluation’s reliability (Van Der Lee et al, 2019). Thus, we need at least 3 human graders for each experimental condition.

For this purpose, we conducted an online survey with 7 experienced graders. We selected graders based on their teaching and grading experience, English skills and availability. All annotators possessed university degrees and routinely graded short answer tasks for university courses—mainly in the computer science domain. Therefore, they should have the general education required to assess the primary and middle school science questions contained in the ASAG benchmark dataset SciEntsBank. We also included the reference answers in the questionnaire and were available to answer questions about the material to ensure the understanding necessary for grading. The graders filled out the questionnaire independently from each other.

The annotators had diverse backgrounds, hailing from India, Iran, Syria, Slovenia and Germany. While none of them were native English speakers, all of them spoke English fluently. Two of the annotators were female and five were male. We randomly assigned annotators to either the control (N = 4) or the experimental (N = 3) condition. In the control condition, annotators viewed 30 unmodified student answers and rated the answers’ naturalness, correctness and suspiciousness on 5-point Likert scales. Here, naturalness refers to how likely a text was produced by a human, considering only form (Howcroft et al, 2020). Correctness refers to how accurately and completely the question is answered. Suspiciousness, or mistrust, captures how much a person believes the student is trying to cheat an automatic grading system.

After piloting this study, we chose to include explanations with examples for each level on the scale to increase the annotators’ understanding. The exact questions, as well as the hints, can be seen in Fig. 2. When annotators thought the student was cheating (by scoring at least 4 on the mistrust scale), they were also asked whether they would take action based on their opinion. This conditional Yes/No question can be seen in Fig. 3. The experimental group answered the same questions for the adversarially modified but otherwise identical answers.

Fig. 2 Screenshot of survey questions posed in the human evaluation of the attack

Fig. 3 Screenshot of the conditional question whether the annotator would act on their suspicion

The answers were randomly sampled from the successful adversarial examples that fooled the T5 model on the SciEntsBank data set, such that each question appeared only once in the survey. Thus, each test set resulted in a list of questions, each paired with a random student answer and a random adversarial perturbation. To save our experts’ time, we selected from each list the 10 shortest answers that did not reference external material, such as graphs or tables. Since this left only 8 questions stemming from the unseen questions test set, we oversampled the unseen answers test set to compensate. Annotators were informed that some of the responses may have been manipulated to fool an automatic grading system.

In compliance with the guidelines on ethical studies with human participants, we informed our annotators of the study’s risks and benefits, gave our contact information and stated that the study was voluntary and could be aborted at any time. Additionally, we ensured that all given opinions were anonymized prior to analysis and publication. We did not impose any time constraints on filling out the questionnaire. However, the questionnaire was designed to take 45–60 min. We deemed 60 min to be the upper time limit justifiable considering the annotation task’s complexity and the required concentration. Since we estimated that annotators would need 1–2 min per answer, we limited the number of answers to be evaluated to 30. On average, annotators required 53.14 min to complete the survey.

Results

This section presents our hypotheses, compares the effectiveness of our attack to the state-of-the-art attack TextFooler (Jin et al., 2020) and provides a deeper analysis of the models’ brittleness. Finally, we offer the results of our human evaluation and analyze the agreement between our expert graders.

Predictions

The following expectations (E) and hypotheses (H) motivate our experiments. Expectations will be explored descriptively while hypotheses will be tested.

E1

We expect our attack to perform competitively compared to the state-of-the-art attack TextFooler in terms of accuracy degradation.

E2

We expect our attack to exploit spurious correlations between adjectives and adverbs and the target class. Thus, adjectives and adverbs that successfully fool a model should appear more often in correct than incorrect student responses in the model’s training set.

E3

Our attack is primarily successful on low-confidence predictions, that is, predictions where the class probability assigned by the model is considerably smaller than one.

H4

Manipulations generated by our attack do not make incorrect student responses appear more correct to humans.

H5

Humans perceive manipulated responses as less natural compared to unmodified student responses.

H6

Humans do not perceive manipulated responses as more suspicious compared to unmodified student responses.

Comparison to State-of-the-Art Attack TextFooler

First, we want to compare how well our attack can degrade a model’s performance compared to the state-of-the-art. We choose the TextFooler approach by Jin et al (2020) to represent the state-of-the-art for two reasons. First, it has a high success rate compared to other attacks. Second, it is open-source, allowing for quick and easy reproduction of the authors’ approach. Table 2 shows our attack’s and TextFooler’s performance on the datasets introduced in "Datasets & Hyperparameters". We target BERT and T5 models with our attack and the same BERT model with TextFooler. We do not evaluate TextFooler on T5, as the attack utilizes the prediction score for the target class, which we do not have readily available in a text generation model.

Table 2 Comparison of our attack to TextFooler (TF)

As expected, the models’ base performance without adversarial manipulation varies from dataset to dataset, with small datasets, such as RTE and MRPC, and challenging tasks, such as generalizing to unseen questions or domains, lagging in terms of accuracy. Interestingly, the absolute loss in accuracy caused by each attack seems relatively stable across tasks and datasets, even when the original performance varies.

TextFooler requires less computation time than our attack on every dataset. This is expected since TextFooler uses the target label’s prediction scores to find important words in a sequence that it can then manipulate. In contrast, our attack assumes such information to be inaccessible to students and, therefore, does not tailor its manipulations to significant words. This difference is also reflected in our attack finding more adversarial examples, as it tries more possible combinations per student answer. Even though our search is less guided, our attack is slightly more effective at reducing the models’ accuracy on the ASAG task, degrading the accuracy by an additional 0.4–3.8 percentage points across the SciEntsBank test splits. However, since TextFooler outperforms our attack on the other tasks (by 2.9–8.1 percentage points), we conclude that the attacks’ performance is dataset-dependent. Across all models and datasets, our attack deteriorates a model’s accuracy by 8 to 22 percentage points.

Interestingly, our attack seems to be equally or more effective on T5 than BERT, even though T5 is a newer model. Especially for the data splits SEB UQ and MRPC, where T5 originally outperforms BERT, this indicates that at least some of T5’s performance gain is due to unreliable statistical features.

Source of the Model’s Brittleness

Next, we want to investigate possible reasons for the attack’s success. Knowing why the model’s predictions are brittle may allow educators to develop appropriate defense mechanisms or reveal potential warning signs. Since we are mainly interested in our attack’s behavior in automatic grading scenarios, the rest of our analyses will focus on the SciEntsBank dataset. First, we will investigate the distribution of adjectives and adverbs in the training data. We expect that successful adjectives and adverbs found with our attack are more often associated with correct student responses (E2).

In general, the dataset contains slightly more incorrect responses (2462) than correct ones (2008). On average, correct responses are slightly longer than incorrect answers, with 13.4 words per answer compared to 11.7 words per answer. Correct answers also average more adjectives (1.1) and adverbs (0.6) per answer than incorrect ones (0.8 and 0.5, respectively). We mainly observed two patterns when plotting the occurrences of the most successful adjectives and adverbs in each class. Either the adjectives and adverbs were much more common in correct student responses, or they hardly appeared in the training set. Figure 4 illustrates both patterns for the 10 adjectives causing the most misclassifications on the unseen answers test split. Some rare words seem to be synonyms of words common in correct responses, like “complete” and “entire”. Others are also expected to be close in the embedding space, such as “completely”—one of the top ten adverbs. Only one of the most successful insertion words appeared notably more often in incorrect student responses. The adjective “better” occurred 15 times in incorrect responses and only 4 times in correct answers. Thus, we conclude that our evidence supports E2 for most adjectives and adverbs, but not all.
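A sketch of this frequency analysis is shown below: for each class, we count how often the most successful insertion words appear as adjectives or adverbs in the training answers. The training_set structure, the restriction to adjective/adverb usages and the label names are assumptions made for illustration; the NLTK resources from the earlier sketches are required.

```python
from collections import Counter

import nltk


def per_class_counts(training_set, words):
    """training_set: iterable of (answer_text, label) pairs; words: insertion words of interest."""
    counts = {"correct": Counter(), "incorrect": Counter()}
    for answer, label in training_set:
        if label not in counts:
            continue  # e.g., skip "contradictory" answers in the 3-way SciEntsBank setup
        for token, tag in nltk.pos_tag(nltk.word_tokenize(answer), tagset="universal"):
            token = token.lower()
            if token in words and tag in {"ADJ", "ADV"}:
                counts[label][token] += 1
    return counts


# e.g., per_class_counts(seb_train, set(adjectives[:10]) | set(adverbs[:10]))  # data behind Fig. 4
```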

Fig. 4 Number of occurrences of the 10 most successful adjectives (top) and adverbs (bottom) in the SciEntsBank training set per class

Next, we investigate the model’s confidence when classifying adversarial examples. To be specific, we analyze the class probabilities given by a softmax of BERT’s final outputs. We plot them before and after the adversarial insertion in Fig. 5. For reference, we also provide the confidence scores for all incorrect student responses correctly classified by the model. We can see that soon-to-be adversarial examples elicit lower confidence than most predictions before the attack. Most test answers are classified with a confidence score between 0.8 and 1, while the model estimates most soon-to-be adversarial examples to be incorrect with a probability between 0.45 and 0.65. Since we have three classes in the dataset, a class needs at least a probability of 0.33 to be selected. After the attack, adversarial examples tend to elicit similar confidence—but for the target class. These observations are in line with our expectation E3. We will further discuss the ramifications of our results in "Recommendations".
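The confidence scores can be obtained, for instance, as the softmax over BERT’s classification logits, as sketched below. The way the question, reference answer and student answer are paired is an assumption and must match how the model was fine-tuned.

```python
import torch


@torch.no_grad()
def confidence(model, tokenizer, question, reference, answer):
    """Return (top class probability, predicted class index) for one student answer."""
    inputs = tokenizer(question + " " + reference, answer,
                       return_tensors="pt", truncation=True)
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    return probs.max().item(), probs.argmax().item()
```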

Fig. 5 BERT’s confidence levels for all incorrect samples it classifies correctly (left), all examples that will be misclassified after the attack (middle) and all adversarial examples (right)

Human Evaluation

The goal of the following survey was to investigate our attack’s effect on the naturalness, correctness and suspiciousness of student answers. Figure 6 shows the distribution of scores assigned to the answers in the control and experimental group. The means and standard deviations for each question can be found in Table 3.

Fig. 6 Distribution of assigned Likert scores by the annotators in the human evaluation. The top row depicts the ratings given by graders in the control group, while the bottom row shows the same for the experimental group. A one on the Likert scale encodes a low magnitude of the given construct, while a five indicates the answer was very natural, correct or suspicious. The absolute number of each rating and percentage are displayed next to the respective bars

Table 3 Krippendorff’s \(\alpha\), mean (M) and standard deviation (SD) of the graders’ ratings

To test the hypothesis that our attack does not increase the actual correctness of responses (H4), we test for inferiority employing the two one-sided tests (TOST) procedure as discussed by Wellek (2002). We select the non-parametric Mann–Whitney U test since our data is ordinal, and we average the scores assigned by the various graders in a group into a more reliable and independent measurement of each answer’s correctness. As suggested by Lakens (2017), we chose \(-\infty\) as the lower bound to test for inferiority instead of equivalence and 0.5 as the upper bound. Our observations are consistent with H4 (\(n_1 = n_2 = 30\), \(U_{control} = 597.5\), \(U_{adv} = 302.5\), \(p = 0.015\)). Thus, human graders generally awarded fewer or equal points to manipulated answers, indicating that our attack does not make the student answers correct. It only tricks the automatic model into predicting them as such, hence adhering to the class equivalency constraint of adversarial examples.
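Because the lower bound is \(-\infty\), the TOST procedure here reduces to a single one-sided test against the 0.5-point upper margin. One plausible way to encode that margin with scipy, assuming the per-answer scores have already been averaged across graders, is to shift the adversarial group before a one-sided Mann–Whitney U test; this is a sketch, not necessarily the authors’ exact implementation.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def inferiority_test(control_means, adversarial_means, margin=0.5):
    """Test whether adversarial scores exceed control scores by less than `margin`."""
    shifted = np.asarray(adversarial_means) - margin
    # H1: the shifted adversarial distribution is stochastically smaller than the control one.
    stat, p = mannwhitneyu(shifted, control_means, alternative="less")
    return stat, p


# e.g., inferiority_test(control_correctness, adversarial_correctness)  # n1 = n2 = 30 answers each
```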

Next, we assess whether our attack decreases the naturalness of answers (H5) using a left-tailed Mann–Whitney U test. Our collected data is also consistent with H5 (\(n_1 = n_2 = 30\), \(U_{control} = 627\), \(U_{adv} = 273\), \(p = 0.004\), \(Z = -2.6174\), \(r = 0.34\)). This result indicates that human graders perceive student answers with inserted adjectives and adverbs as less natural. We hypothesized that graders would be able to sense the manipulation but not identify it as a cheating attempt (H6).

Analogously to our inferiority test on the responses’ correctness, we utilize two one-sided Mann–Whitney U tests to check whether our attack increases the mistrust of human graders, again with \(-\infty\) as the lower bound and 0.5 as the upper bound. We found that human graders in the experimental group generally thought the students were cheating less often than or as often as in the control group (\(n_1 = n_2 = 30\), \(U_{control} = 576\), \(U_{adv} = 324\), \(p = 0.031\)). A similar trend can be observed when asking whether graders would take action based on their suspicions. In the control group, graders reported the intention of acting 14 times (N = 120). Conversely, graders only wanted to act 5 times (N = 90) on the adversarial examples. For all other answers rated with at least 4 on the mistrust scale, graders declined to speak with the student or a superior or to take disciplinary action. Examples of the most suspicious responses can be seen in Table 4. The examples also illustrate a concerning phenomenon that one of the annotators reported (translated from German): “Generally, I find it difficult to differentiate between bad English and unnatural responses.”

Table 4 Examples of the most suspicious responses from the control (top) and the adversarial group (bottom)

Inter-Annotator Agreement

As discussed in the "Human Evaluation" Section, human judgements can be subjective and inconsistent. For this reason, it is common in the NLP field to employ multiple annotators and report their agreement. The inter-annotator agreement provides a measure for how consistent judgements are across annotators. Similar to related work, we select Krippendorff’s Alpha to estimate our annotators’ agreement. As can be seen in Table 3, \(\alpha\) is relatively low compared to the broadly applied benchmark of 0.67 (Krippendorff, 2018). For the highly subjective and open mistrust question, a low agreement is to be expected. The annotators were informed that some student answers might have been manipulated to fool automatic grading models but not schooled on how such a manipulation could look like. The low agreement (\(\alpha =\) 0.13) and slight systematic disagreement (\(\alpha =\) -0.11) indicate that the annotators developed individual theories of what cheating would entail in an automatically graded environment.

Additionally, there was a moderate negative Spearman’s rank correlation (\(\rho\)) between mistrust and naturalness (\(\rho =\) -0.41) as well as mistrust and correctness (\(\rho =\) -0.51) in the control group. In contrast, the correlations in the experimental group were much weaker with \(\rho =\) 0.20 and \(\rho =\) 0.07, respectively. This indicates that graders suspect poorly written and wrong answers in the absence of other clues. We will further discuss this behavior and possible ramifications in "Recommendations".

While low inter-annotator agreement is a phenomenon commonly observed in natural language evaluation (Amidei et al, 2019), we were surprised to see \(\alpha\) below 0.3 for naturalness. As recommended by Amidei et al (2019), we calculate \(\rho\) for each annotator pair to gain more detailed insight compared to \(\alpha\)’s holistic score. In the control group, one of the annotators is an outlier with pairwise \(\rho\)’s of 0.14, 0.07 and -0.02. The rest of the annotators average a moderate to strong correlation of \(\rho\) = 0.57 (Corder & Foreman, 2011; Dancey & Reidy, 2007). We decided against excluding the outlying annotator from further analysis. Their judgment on the other questions was more in line with the majority, indicating a divergent but potentially valid interpretation of naturalness instead of a systematic disregard for the task. In the experimental group, the average \(\rho\) is 0.47.
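For readers who want to reproduce this kind of agreement analysis, the sketch below computes Krippendorff’s \(\alpha\) with the third-party krippendorff package and the pairwise Spearman correlations with scipy. The (n_graders, n_answers) rating matrix is an assumed input format.

```python
from itertools import combinations

import krippendorff
import numpy as np
from scipy.stats import spearmanr


def agreement(ratings):
    """ratings: array of shape (n_graders, n_answers) with 1-5 Likert scores (np.nan for missing)."""
    alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
    pairwise = {}
    for i, j in combinations(range(len(ratings)), 2):
        rho, _ = spearmanr(ratings[i], ratings[j])
        pairwise[(i, j)] = rho
    return alpha, pairwise


# e.g., agreement(np.array(control_naturalness_ratings))  # 4 graders x 30 answers
```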

For correctness, the agreement levels are \(\alpha =\) 0.51, \(\rho =\) 0.6 in the control group and \(\alpha =\) 0.55, \(\rho =\) 0.61 in the experimental group. Our observed agreement is expected, considering the generally high inter-grader variability of scores assigned in short answer grading tasks (Starch & Elliott, 1913).

Discussion & Conclusion

In summary, we have introduced an adversarial attack strategy developed explicitly for automatic short answer grading scenarios. It first identifies promising adjectives and adverbs during formative assessment in preparation for employing them during the summative assessment. Our proposed attack reduces a model’s accuracy by 8 to 22 percentage points. We demonstrate the attack’s applicability to various domains and datasets, where inserting a single adjective or adverb is unlikely to change an input’s actual class. Thus, the attack is suited for academic disciplines where the factual correctness of responses is essential and may be unsuited to language learning scenarios where linguistic expression is vital.

Further, we conducted a human expert evaluation to measure our attack’s influence on the student answers’ correctness, naturalness and suspiciousness. In our experiments, the attack did not significantly increase the correctness or suspiciousness but significantly reduced the perceived naturalness of student responses. However, the decrease in naturalness is most likely due to the imperfection of the automatic insertion process. When students discover adjectives and adverbs the model associates with correct responses, they are likely to incorporate them more naturally into their responses than our automatic procedure does. Finally, we analyzed the adjective and adverb distribution in the training data and the model’s confidence to investigate possible reasons for the model’s vulnerability. We found that successful adjectives and adverbs appeared more often in the target class or hardly occurred in the training set. Additionally, adversarial examples tended to elicit a lower confidence score in the model than answers that were not vulnerable to this attack.

The following section offers recommendations for educators looking to employ automatic short answer systems in practice. The recommendations are based on our findings and general knowledge about adversarial attacks. Finally, we will discuss the limitations of our experiments and future work in "Limitations & Future Work".

Recommendations

Know Thy Dataset

This is especially important as more and more off-the-shelf models become available for various tasks. This development makes it easy to treat machine learning models as black boxes without considering the possible consequences of their training process. However, a training data analysis can reveal statistical correlations that lead to unreliable prediction features. In our experiments, our attack exploited correlations between adjectives/adverbs and the target class. Beyond our work, non-robust features have been demonstrated for many popular datasets (Ilyas et al, 2019). One can also utilize adversarial attacks during training to automatically uncover unreliable features. This is also known as adversarial training and is one of the most promising defenses against adversarial attacks (Shafahi et al, 2019). However, it is still limited in its effectiveness: it typically comes with a loss in accuracy on clean data and tends to generalize poorly to novel attack strategies. Moreover, knowledge of potential biases in the dataset can help mitigate discrimination of populations that are not well represented in the data (Mehrabi et al, 2021).

Beware of Low Confidence Predictions

The probabilities assigned to each class can be a valuable indication of whether the prediction is trustworthy. While confidence scores are by no means infallible, they can be a warning sign for when a student’s answer should be referred to a manual grader. In our experiments, many of the generated adversarial examples could have been caught this way.
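A minimal sketch of such a safeguard is shown below, reusing the confidence helper sketched in "Source of the Model’s Brittleness"; the 0.8 threshold is illustrative and would need to be calibrated on validation data.

```python
def route_answer(question, reference, answer, model, tokenizer, threshold=0.8):
    """Send low-confidence predictions to a human grader instead of auto-grading them."""
    score, predicted_class = confidence(model, tokenizer, question, reference, answer)
    action = "manual_review" if score < threshold else "auto_grade"
    return action, predicted_class, score
```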

Train Personnel on What to Expect

While automatic grading models are making great strides towards human-like performance on some datasets, we would still recommend employing humans in the grading loop. They can double-check low confidence predictions and perform quality control checks. However, it is vital to educate human control graders on what cheating attempts can look like in the age of automatic grading. In our inter-annotator agreement analysis, we observed graders developing individual theories of what made student responses suspicious. Their mistrust would also correlate with how unnatural and incorrect they perceived student answers to be. So, in the absence of other clues or knowledge, the graders in our study would falsely suspect low-performing students and students with poor language skills. We believe that educating human graders on different kinds of attacks and how they express themselves in responses could mitigate such discrimination. In general, any detection method would have to be carefully implemented to avoid disadvantaging minorities not well represented in the data.

Balance Transparency and Exposure of Vulnerabilities

It is crucial that students comprehend their grades. Understanding why a particular grade was given is essential to foster acceptance and enable learning from feedback. Here, making the model’s decision process transparent to students is a powerful approach to increase understanding. However, transparency may also reveal exploitable weaknesses, such as unreliable features. Having access to the model’s inner workings enables more powerful and efficient adversarial attacks. Therefore, one can argue that keeping grading models secret is sensible. Moreover, one may implement measures that make it harder for adversaries to glean information from querying the model. For example, one can limit the number of times students can receive feedback from the model in a time span to humanly reasonable levels, thus, hindering automatic probing.

Limitations & Future Work

Finally, we will point out a few limitations of our experiments and ideas for future work. This paper focused on the effects of one adversarial attack strategy. As the space of possible adversarial manipulations is quite large, it will be exciting to see how well other strategies perform. We then plan to utilize developed attacks in adversarial training to make grading models more robust and explore the models’ usability and security in practice. Here, one could also investigate other effects of adversarial attack assessment strategies, such as their impact on responses the student would have answered correctly without the adversarial modification. Moreover, we assumed that potential attackers would purposefully aim to fool the model into accepting incorrect responses. It would also be interesting to investigate grading models’ robustness to non-malicious writing styles and mistakes, such as common typos or varying verbosity levels.

Additionally, our experiments could be expanded to other automatic short answer grading architectures. So far, we have explored the attack’s effectiveness on transformer-based models on various datasets. While the existence of adversarial vulnerabilities is generally believed to be a result of neural networks exploiting unreliable correlations in the training data instead of being a bug of a particular architecture or hyperparameter setup (Ilyas et al, 2019), we cannot rule out that other grading models may be significantly less sensitive to our particular attack. Especially classical machine learning models based on engineered features are likely to require attacks tailored to their feature sets.

Lastly, we mainly see two factors restricting the generalizability of our human evaluation. First, the number of samples annotated was not large enough to reliably detect minor effects. Especially for the mistrust hypothesis, a follow-up study with a larger sample size would have to be conducted to rule out the attack making responses slightly more suspicious. Considering our graders took almost an hour to rate 30 responses, we think more annotators and multiple annotation sessions would make sense.

Second, all of our graders stem from engineering fields and work at a university. It would be interesting to see whether our observations also hold for other fields and other educational institutions. Especially American school teachers may be better at differentiating manipulated answers from poorly written ones. While our annotators were accustomed to grading English short answers in their daily lives and speak English proficiently, they were not native speakers. Moreover, they stem from various countries, such as India and Slovenia, and may speak different English dialects. This probably impacted the evaluation of naturalness, as indicated by the low inter-annotator agreement, but we expect only a minor effect on the correctness and mistrust scales.