Introduction

Automatic grading can save teaching personnel time and effort, while also offering nearly instantaneous, inexhaustible feedback to students. Additionally, it can shift students’ attention from the potential reputation gain or loss in the graders’ eyes to their work (Lipnevich & Smith, 2009). Thus, it is unsurprising that automatic grading has garnered substantial research and commercial attention in recent decades, with platforms like Moodle and Edgenuity even offering automatic short answer grading (ASAG) options. In ASAG, short free-text responses are automatically evaluated based on how wholly and correctly they answer a given question. While the commonly available applications are still based on keyword and string matching, the next logical step is using more sophisticated models.

Transformer models, such as BERT (Devlin et al, 2019), have been shown to perform well on the ASAG task (Camus & Filighera, 2020). On specific datasets, they are even beginning to compare to human performance (Sung et al, 2019). However, such models do not differentiate between valuable and brittle features in their decision process (Ilyas et al, 2019). For example, a model may observe that incorrect answers typically contain fewer punctuation symbols than correct answers. While there is hardly a causal relation between the number of punctuation symbols in a short-answer response and its factual correctness, current neural networks would utilize such spurious correlations for their predictive power.

Unreliable features in automatic grading models mainly pose two problems. Firstly, students may not receive the points they deserve due to misclassifications. This is especially problematic for student populations represented less well in the training data. For example, students with Developmental Language Disorder will likely express themselves differently, such as using fewer or less complex adjectives (Davies et al, 2023; Tribushinina & Dubinkina, 2012), than their typically developing counterparts. Should adjective usage be correlated with high grades in the overall dataset due to sampling effects and high-performing students typically utilizing expressive language, students with language deficiencies will be systematically disadvantaged by neural automatic grading models. Secondly, unreliable features may cause students to receive points for incorrect responses. Noticeable patterns caused by unreliable features may even be purposefully exploited.

Students can exploit a grading model’s weaknesses to achieve better grades, similar to copying from a cheat sheet or other students during an assessment. It is likely that students would be just as willing to employ such methods as more traditional cheating tactics, provided the necessary skills and opportunities. Although the reported percentage of students employing traditional cheating techniques in practice varies greatly across studies (Jordan, 2001), large-scale reviews report around 70–86% of students cheating on exams or assignments during their college career (Klein et al, 2007; Whitley, 1998). Many cheating incidents are never caught (Franklyn-Stokes & Newstead, 1995), which is problematic for two reasons. Firstly, undetected cheating cases can call all assessment results into question, even when most of them are legitimate. Secondly, cheating during a semester correlates with lower learning outcomes in final exams (Palazzo et al, 2010). This indicates that cheaters retain less knowledge than they would learn by actually completing the coursework. Therefore, automatic grading models should not only score well on a given dataset but also make their predictions for the right reasons.

Cheating on ASAG models may be as simple as answering with motley lists of potential keywords, which has proven successful in obtaining good grades from current ASAG systems (Ding et al, 2020). Alternatively, it may also extend to adversarial attacks, which subtly modify inputs to prompt an incorrect prediction. Adversarial attacks have been a hot topic of research in the last few years, with thousands of proposed approaches to exploiting unreliable features and considerable attention from the general public. However, most do not translate well to the automatic grading scenario, where knowledge of the model’s inner workings is scarce and the time and expertise of potential attackers are limited. For this reason, we design an attack to answer the following research question:

How vulnerable are neural automatic short answer grading models to an adversarial attack based on inserting adjectives and adverbs?

In summary, we make the following contributions in this paper:

  • We propose an adversarial attack specifically tailored to assessment scenarios. Querying the model prior to the assessment identifies adjectives and adverbs the model associates with the target class. These adjectives and adverbs can then be inserted into grammatically valid places during testing to fool the model into predicting the target class. A successful attack is depicted in Table 1.

  • We demonstrate the attack’s effectiveness on BERT and T5 (Raffel et al, 2020) using automatic short answer grading and related datasets. Our evaluation shows that BERT’s and T5’s predictions are affected by spurious correlations between adjective and adverb usage and answer correctness.

  • We conduct a human evaluation of our attack to investigate its detectability. Knowing how easily human graders can spot adversarial attacks is vital for estimating the risk of discovery. The riskiness of cheating, in turn, influences how likely students are to employ adversarial attacks in practice.

  • We formulate recommendations for using automatic grading systems more securely in practice. Our recommendations are based on related work, our experiments, and a systematic investigation of the models’ brittleness.

Table 1 A successful adverb insertion causing the automatic grading model to shift its prediction from incorrect to correct

The rest of the paper is structured as follows. In the following section, we state the considerations that affected the requirements and design of our attack. Then, we discuss related approaches to gaming educational systems, automatic short answer grading and adversarial attacks. In "Methods", we describe our attack in detail, followed by the setup of our experiments. Next, "Results" presents our hypotheses, a comparison of our attack with the state-of-the-art, our analysis of the model’s brittleness and the results of our human evaluation. Finally, we discuss our results and provide recommendations for safely employing automatic grading systems in "Discussion & Conclusion".

Adversarial Attack Design Considerations

Most adversarial attacks manipulate specific input instances. In our scenario, this means that they manipulate individual student answers. They are not suited to fool automatic grading systems because one would either have to know what one will answer before the assessment or one would have to run the adversarial attack during the assessment. As most adversarial attacks require time and feedback from the model, they are hard to use in time-constrained assessments.

Universal adversarial attacks, on the other hand, aim to apply to all answers. For example, a model may be vulnerable to specific trigger phrases that increase the model’s likelihood of predicting a target class, regardless of the actual sample (Filighera et al, 2020a). Once found, such trigger phrases can easily be inserted into new answers at test time without necessitating any on-the-fly adaptation or know-how on the student’s side. However, such a cheating strategy is risky as manual graders quickly identify nonsensical trigger token sequences as cheating attempts. This example illustrates some of the unique constraints encountered in the educational assessment scenario, underlining the need for a specifically tailored attack. In summary, our attack is based on the following considerations:

  • Access to the model. Many adversarial attacks use information about the inner workings of a model to inform their search. For example, they may propagate the model’s gradients to find influential words in a sample. Modifying important words is more likely to fool the model successfully. However, students would not typically have access to the grading model’s inner workings. Furthermore, students would not have access to the model’s raw output. Many approaches utilize the class probabilities output by the model to find sequences of perturbations that increase the probability of the target class. This constraint already makes most of the current adversarial attacks proposed in the literature nonviable for the assessment domain. However, we assume that students can receive verification feedback from the model prior to the targeted assessment. For example, this would be the case when students have multiple assignments graded by the model throughout the semester. Alternatively, students may be allowed to submit multiple answers to the model for formative assessment. Prior access to the model is likely, considering one of the main advantages of automatic grading models is their ability to provide an inexhaustible source of verification feedback queryable as often as desired.

  • Detectability. When the perceived cost of cheating is low, students are more willing to engage in academically dishonest behavior (Murdock & Anderman, 2006). One of the main factors influencing the perceived cost is the likelihood of being caught (Murdock & Anderman, 2006). Therefore, the chance of detection will impact the students’ decision whether to employ a given adversarial attack. Detectability includes how easily manipulated samples are spotted, automatically or manually, and how hard it would be to prove a deceptive intent. For example, concatenating the same nonsensical phrase to every answer the student is unsure about could not only quickly be flagged automatically, but a student would also be hard-pressed to provide a believable excuse. In contrast, overusing adjectives or adverbs the model is vulnerable to is much harder to spot and could also be explained away by the student’s writing style.

  • Expertise necessary to utilize attack during test time. In general, we do not expect students to be machine learning experts. While some students may very well have the ability to identify a model’s weakness given enough time and knowledge (Filighera et al, 2020b), it is unlikely that a majority of students will be able to perform a complex adversarial attack under pressure. For this reason, it is essential that any attack would be straightforwardly executable during the assessment.

  • Class equivalency. The modified samples produced by an adversarial attack are called adversarial examples. Their exact definition varies in the literature, but it is common to define them as intentionally modified versions of clean inputs aiming to fool a machine learning technique (Akhtar & Mian, 2018). This definition implies that the adversarial example’s actual class should remain identical to the original clean sample’s. In our case, this means that any perturbation of incorrect answers should not actually make the answers correct. It should only fool the model into predicting them as such.

  • Type of Input. We aim to fool automatic short answer grading systems. Therefore, we expect to deal with short answers between a phrase and a few paragraphs long (Burrows et al, 2015). The evaluation focus is on the semantic content of the response in contrast to the writing style or grammatical correctness. For the design of the attack, this means that linguistic modifications and even grammatical mistakes are acceptable as long as they do not change the response’s meaning significantly.

Related Work

In this section, we discuss related work intersecting with ours. First, we summarize prior art on exploiting educational systems. Moving on, we present automatic short answer grading systems. Finally, we recapitulate various adversarial attack methods found in the field of natural language processing (NLP).

“Gaming" Educational Systems

As in most systems where people stand to gain, educational systems encounter learners that try to achieve their goal through unintended strategies. This is true for traditional physical classrooms and especially relevant in online or distance learning, where students are not restricted to copying from their neighbors but may access the entire internet (Austin & Brown, 1999; Lanier, 2006; Watson & Sottile, 2010).

Beyond plagiarism, there are also cheating strategies unique to digital learning. Students may take screenshots of assessment questions to share with students being assessed later, gain illicit access to the question pool’s repository by exploiting lax security measures or disrupt internet connections to re-take assessments (Rowe, 2004). McGee (2013) even advises against using traditionally popular formats, such as Multiple Choice questions, in online assessments as the correct answer can easily be found on the web. Instead, they recommend constructed response and essay questions where multiple correct answers exist.

In Massive Open Online Courses (MOOCs), gaming the system is prevalent enough to warrant a designation for learners committing to non-learning strategies: fake learners (Alexandron et al, 2018, 2019). Here, some learners set up multiple accounts gathering solutions to assessments to use in their main account (Northcutt et al, 2016; Ruiperez-Valiente et al, 2016). Students may cooperate to share valid answers even when multiple accounts are impossible, like in Small Private Online Courses (SPOCs) (Jaramillo-Morillo et al, 2020).

The work discussed so far mainly investigated academic dishonesty on a system level by exploiting the lack of direct supervision or the structure of an online course. We will now focus on the work closest to our own, namely task-oriented cheating attempts. Such behavior and possible mitigation approaches have been well studied in intelligent tutoring systems (Baker et al, 2006; Muldner et al, 2010, 2011; Peters et al, 2018; Walonoski & Heffernan, 2006a, b). Beyond exploiting systematic weaknesses, such as known savepoints or progressive hints, students may also systematically probe tasks to guess the correct answers (Baker et al, 2008). For instance, they can select every choice in a Multiple Choice question or exhaustively try out different numbers in a math problem. Depending on the tutor, students may also repeatedly submit the same answer or empty answers to prompt the tutor to provide the correct solution (Baker et al, 2010).

Similar to previous work (Ding et al, 2020; Filighera et al, 2020a), we aim to extend this line of research to short answer constructed-response formats that have been less popular in tutors and online assessments due to the difficulty of automatically grading them. As this seems to be changing (Sung et al, 2019), exploring potential weaknesses and cheating detection strategies is essential before ASAG systems see widespread use.

Automatic Short Answer Grading

The challenge of automatically grading short answers was first posed a few decades ago. Earlier ASAG approaches consisted of clustering similar answers (Basu et al, 2013; Zehner et al, 2016), utilizing hand-crafted rules, schemes and ideal answer models (Leacock & Chodorow, 2003; Willis, 2015), or combining manually engineered features with various machine learning models (Marvaniya et al, 2018; Mohler et al, 2011; Saha et al, 2018; Sahu & Bhowmick, 2020; Sultan et al, 2016). Please refer to one of the comprehensive surveys of this field for a more in-depth elaboration of these approaches (Burrows et al, 2015; Galhardi & Brancher, 2018; Roy et al, 2015).

In recent years, deep learning approaches have outperformed classical methods (Kumar et al, 2017; Riordan et al, 2017; Tan et al, 2018, 2020). They mainly treat ASAG as a text similarity or entailment problem and focus on encoding student answers and reference answers in the same vector space. This learned representation of the answers then determines their similarity. Additionally, some approaches consider the question (Lv et al, 2021), student models (Zhang et al, 2020b) or results from True/False questions posed in the same assessment (Uto & Uchida, 2020). Transformer-based approaches are also noteworthy here (Camus & Filighera, 2020; Ghavidel et al, 2020; Lun et al, 2020; Sung et al, 2019). They achieve high performance on the SemEval short answer grading benchmark dataset (Dzikovska et al, 2013). We selected two transformer-based models for grading in this paper: BERT (Devlin et al, 2019), for its high performance in related work, and T5 (Raffel et al, 2020), for its high performance on the SuperGLUE benchmark containing various NLP tasks. Both models are Transformers, meaning they use attention instead of recurrence or convolution to extract information from sequences. They are pretrained by language modeling on large corpora to learn a basic representation of general language. While BERT is pretrained on books and Wikipedia, T5 utilizes a filtered version of a Common Crawl web dump. After pretraining, the models can then be finetuned on task-specific data. Typically, the pretrained weights are only adjusted for a few epochs before the best performance on the task is reached. In contrast to T5, BERT only consists of an encoder. Thus, it is half as large in terms of parameters and requires the addition of a task-specific output layer.

Adversarial Attacks in NLP

In recent years, the number of adversarial example generation methods has grown rapidly (Chakraborty et al, 2021; Huang et al, 2020; Xu et al, 2020; Yuan et al, 2019; Zhang et al, 2020a). Automatic approaches mainly consist of strategically making minor, often meaning-preserving adjustments to the input text.

Changes can be done on a word level by inserting, deleting or replacing words. Proposed replacement strategies include replacing words with their synonyms (Jin et al, 2020; Ren et al, 2019), their closest neighbors in the embedding space (Alzantot et al, 2018), legitimate words that could result from potential typos (Samanta & Mehta, 2017) or other words with a high probability of matching the input context (Zhang et al, 2019). Recently, researchers also utilized BERT to generate adversarial examples by masking parts of the input text (Garg & Ramakrishnan, 2020) or predicting possible token replacements (Li et al, 2020). Belinkov and Bisk (2018) consider character-level modifications, such as word scrambling or swapping adjacent characters. Lastly, paraphrasing approaches aim to modify the structure of whole sentences (Iyyer et al, 2018) or use variational autoencoders to generate adversarial examples from scratch (Ren et al, 2020). Manual or semiautomatic approaches, on the other hand, ask experts (Ettinger et al, 2017; Wallace et al, 2019b) or students (Filighera et al, 2020b) to find adversarial perturbations for specific examples manually.

Important to mention here is the TextFooler attack proposed by Jin et al (2020) since it forms the basis of our comparison with the state-of-the-art in "Results" section. The first step of this attack is to identify important words by deleting them from an input sequence and observing their effect on the outputted classification probabilities. While this can be considered a black-box approach according to common definitions (Zhang et al, 2020a), the raw class probabilities outputted by a model are not usually accessible to the model’s users. However, using this information makes the attack more powerful and, thus, a better representative of state-of-the-art performance. Once important words are identified, they can be replaced by synonyms to fool the target model in the second step of the attack.

All the previously described approaches have in common that they target individual texts. As discussed in "Adversarial Attack Design Considerations", they do not apply to assessment scenarios. Students would have to know exactly what they will answer to the assessment questions beforehand to find adversarial modifications that work for precisely those answers.

Instead, students require input-agnostic strategies that they can then apply to unexpected questions during test time. Universal attacks aim to consistently fool the model on all samples instead of individually manipulating each sample. Sample independence can be achieved by generalizing individual adversarial examples to generally applicable rules (Ribeiro et al, 2018). Ribeiro et al (2018) first translate the input into a pivot language and back to generate paraphrases. Paraphrases that are semantically similar to the original input and cause a misclassification in the target model are abstracted into candidate rules, which are then manually verified to be semantically equivalent. For example, one could observe that doubling question marks in texts often succeeds in fooling the model. So “? \(\to\) ??” would be a legitimate replacement rule for all texts, even if it may not be applicable or successful on every example. However, attacks aiming to find semantically equivalent, general replacement rules often suffer losses to their success rate. Ribeiro et al (2018), for instance, flip the predicted label of 1–4% of the samples in their experiments.

Our proposed attack is similar to Ribeiro et al’s (2018) approach as we also probe the model to find adjectives and adverbs that fool it as often as possible, which we can then insert in grammatically proper places across answers. Whereas Ribeiro et al (2018) constrain their modifications to be semantically equivalent to the original example, we only require the actual class to remain unchanged. While inserting adjectives and adverbs likely changes the sample’s class in some NLP tasks, like sentiment analysis, it is unlikely to make incorrect answers correct—excluding negating adverbs, such as not. Thus, we can find more viable rules with higher success rates by carefully relaxing the equivalency constraint.

Alternatively, Gao and Oates (2019) search for a small perturbation in the embedding space that is then applied to all tokens indiscriminately, similar to adding noise to images. Their attack requires access to the preprocessed and embedded inputs, which students would not typically have. The last category of approaches constructs meaningless trigger sequences of tokens that a model associates with a specific class (Behjati et al, 2019; Filighera et al, 2020a; Song et al, 2021; Wallace et al, 2019a). While these triggers can then be applied straightforwardly to all answers in an assessment, they are detectable due to their nonsensical nature.

Methods

In this section, we will first introduce the details of our proposed attack. Then, we describe our experimental setup for measuring the attack’s quality. As briefly discussed in "Adversarial Attack Design Considerations", we are not only interested in how successfully it can fool victim models but also in its feasibility, the likelihood of being detected, and the validity of the generated samples.

Adversarial Word Insertion

To systematically insert adjectives and adverbs that cause misclassifications, we first require a source of promising adjectives and adverbs. As can be seen in the overview of our attack in Fig. 1, we selected the Brown Corpus (Maverick, 1969) for the extraction of candidates. The corpus contains a broad collection of English texts from various domains. The most significant benefit of this corpus is that the texts are already annotated with their part-of-speech tags. While automatic tagging was used in the annotation, reliability was increased through manual proofreading. The high-quality annotation allows us to identify potential adjectives and adverbs. Since we plan to insert them before nouns and verbs, we analyze all bigrams contained in the corpus to find adjectives and adverbs that appear in the targeted configurations. Specifically, we only retain bigrams of the following forms:

  • (Adjective, Noun)

  • (Adjective, Pronoun)

  • (Adjective, Proper Noun)

  • (Adverb, Verb)

Fig. 1 Schematic overview of the attack

Consequently, our list of adjectives will only contain adjectives that appear directly before a noun or pronoun in the texts. For example, “The hat was alive.” would not yield an adjective for our selection, but “The blue hat was alive.” would. While this limits our potential insertion candidates, it increases the likelihood of grammatically valid insertions later on. We filter out stop-words to reduce the likelihood that our insertions accidentally correct an incorrect answer or significantly degrade its grammatical structure. Fortunately, Bird et al (2009) provide a list of stop-words in their Natural Language Toolkit that also includes meaning-inverting words, such as not, that could easily turn a contradictory response into a correct one. Thus, stop-words and meaning-inverting words are deleted from the candidate list. Finally, we select the 100 most frequent adjectives and adverbs from the filtered lists as the basis for our insertions. Prioritizing commonly used words should make the generated adversarial examples appear more natural compared to “students” suddenly using rare words like contumacious or Rhadamanthine.
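The following sketch illustrates this candidate-extraction step with NLTK’s copy of the Brown Corpus and its universal POS tagset. It is a minimal approximation, not the authors’ released code: the universal tags collapse proper nouns into NOUN, and the list of meaning-inverting words shown here is an assumption.

```python
from collections import Counter

import nltk
from nltk.corpus import brown, stopwords
from nltk.util import bigrams

nltk.download("brown")
nltk.download("universal_tagset")
nltk.download("stopwords")

# Assumed set of meaning-inverting words to exclude in addition to NLTK's stop-words.
NEGATIONS = {"not", "never", "no", "hardly", "barely"}
EXCLUDED = set(stopwords.words("english")) | NEGATIONS

adj_counts, adv_counts = Counter(), Counter()
for (w1, t1), (_, t2) in bigrams(brown.tagged_words(tagset="universal")):
    w1 = w1.lower()
    if w1 in EXCLUDED or not w1.isalpha():
        continue
    # Keep adjectives directly preceding a noun/pronoun and adverbs directly preceding a verb.
    if t1 == "ADJ" and t2 in {"NOUN", "PRON"}:
        adj_counts[w1] += 1
    elif t1 == "ADV" and t2 == "VERB":
        adv_counts[w1] += 1

# The 100 most frequent candidates of each type form the basis for insertions.
adjectives = [w for w, _ in adj_counts.most_common(100)]
adverbs = [w for w, _ in adv_counts.most_common(100)]
```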

Next, we need to identify possible insertion places for our adjectives and adverbs. Commonly, adversarial approaches would utilize the model’s gradients or class probabilities to identify words that have a high impact on the model’s prediction. For example, if deleting a word significantly reduces the probability assigned by the model to the true class, it would be marked as a good replacement candidate. However, we do not believe students will have detailed information on the grading model in practice. Therefore, we take the model-agnostic approach of declaring all nouns, proper nouns and pronouns available for adjective-prepending and, correspondingly, all verbs as targets for prepended adverbs. This process is illustrated under “Viable Positions” in Fig. 1. However, should the grading model become available to students, the number of positions can be constrained to the most promising ones to make the attack more efficient.
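A sketch of this model-agnostic position search is given below, again using NLTK: every noun or pronoun can receive an adjective and every verb an adverb. The naive whitespace detokenization and the helper name candidate_answers are illustrative assumptions.

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("universal_tagset")


def candidate_answers(answer, adjectives, adverbs):
    """Yield (inserted word, modified answer) pairs, one per viable insertion."""
    tokens = nltk.word_tokenize(answer)
    tags = [tag for _, tag in nltk.pos_tag(tokens, tagset="universal")]
    for i, tag in enumerate(tags):
        if tag in {"NOUN", "PRON"}:   # universal tagset folds proper nouns into NOUN
            words = adjectives
        elif tag == "VERB":
            words = adverbs
        else:
            continue
        for w in words:
            # Naive detokenization; a real attack would preserve the original spacing.
            yield w, " ".join(tokens[:i] + [w] + tokens[i:])
```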

Now that we have generated a multitude of adversarial candidates by inserting our adjectives and adverbs into the viable positions, it is time to query the model to see which candidates lead to misclassification. All successful adversarial examples are then collected to determine adjectives and adverbs that cause the most misclassifications. Students could then use these in assessments to improve their automatically assigned grades.
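The probing step can be sketched as follows: each candidate answer is submitted to the grading model, and the inserted word is credited whenever the verdict flips from incorrect to correct. The grade function is a placeholder for whatever query interface students actually have (e.g., a formative feedback endpoint), and the candidate_answers helper is reused from the sketch above.

```python
from collections import Counter


def rank_insertion_words(true_negatives, adjectives, adverbs, grade):
    """true_negatives: (question, reference, answer) triples the model already rejects."""
    wins = Counter()
    for question, reference, answer in true_negatives:
        for word, modified in candidate_answers(answer, adjectives, adverbs):
            # Credit the inserted word whenever the model's verdict flips to "correct".
            if grade(question, reference, modified) == "correct":
                wins[word] += 1
    # Words causing the most misclassifications are the ones to use during the assessment.
    return wins.most_common()
```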

Experiment Setup

This section describes the hyperparameters, datasets and experiment configurations used in this paper. In all our experiments, we use the base-sized BERT and T5 models provided by the huggingface library (Wolf et al, 2019). We perform hyperparameter tuning using 10% of the training data for validation. Each model trains for 8 epochs before the checkpoint with the best macro-averaged F1 score on the validation set is selected. After training, the respective best models are evaluated on the test splits of each dataset. All true negatives, that is, incorrect responses that the model correctly identifies as such, form the basis for the adversarial search. To avoid overestimating the attack’s success, we exclude incorrect answers that the model already misclassifies, as they require no modification.
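As an illustration of this setup, the following sketch fine-tunes the BERT grading model with the huggingface Trainer, using the SciEntsBank hyperparameters listed in "Datasets & Hyperparameters" as an example; T5 is fine-tuned analogously as a text-to-text model. The tokenized train_ds and val_ds objects are assumed to exist, and exact argument names may vary with the transformers version.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)


def macro_f1(eval_pred):
    logits, labels = eval_pred
    return {"macro_f1": f1_score(labels, np.argmax(logits, axis=-1), average="macro")}


args = TrainingArguments(
    output_dir="bert-seb",
    num_train_epochs=8,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # select the checkpoint with the best ...
    metric_for_best_model="macro_f1",  # ... macro F1 on the 10% validation split
)

# train_ds and val_ds are assumed to be tokenized answer pairs with correctness labels.
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=val_ds, compute_metrics=macro_f1)
trainer.train()
```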

Datasets & Hyperparameters

As discussed in the "Automatic Short Answer Grading" Section, automatic short answer grading is often viewed as a textual entailment or paraphrase detection task. For this reason, we also included such tasks from the popular GLUE and SuperGLUE benchmarks in our evaluation. In total, we experiment with the following four datasets, allowing us to investigate our attack’s applicability to a broad range of domains:

  • SciEntsBank (SEB) is a common ASAG benchmark providing questions, reference and student answers from various domains (Dzikovska et al, 2013). The answers stem from primary and middle school classes in the USA. We select the 3-way variant of this dataset, where answers are labeled as correct, incorrect or contradictory. The dataset contains three test sets: unseen answers for training questions (UA), unseen questions (UQ) and questions belonging to unseen domains (UD). The best performing BERT model (found after 3 epochs) used a batch size of 32 and a learning rate of \(2e-5\). The best performing T5 model (found after 7 epochs) trained with a batch size of 8, gradient accumulation over 4 batches and an Adafactor optimizer using relative steps and initial warmup (Shazeer & Stern, 2018). All reported T5 models use the same optimizer settings.

  • Recognizing Textual Entailment (RTE) is a task included in the GLUE and SuperGLUE benchmark. We selected this dataset because the limited amount of data proves to be challenging even for pre-trained transformer-based models. The data set contains sequence pairs of texts and hypotheses, and the model predicts whether the hypothesis can be inferred from the text. Recognizing textual entailment is quite similar to automatic short answer grading, where student answers should entail the reference answer (Dzikovska et al, 2013). The text pairs are labeled as entailment and not_entailment, corresponding to correct and incorrect in SciEntsBank. Since the test set for this benchmark is not public, we report the performance on the development set instead. The best performing BERT model (6 epochs) trained with a batch size of 32 and a learning rate of \(1e-5\). The best T5 model (6 epochs) was found using a batch size of 8 and gradients accumulated over 8 batches.

  • Multi-Genre Natural Language Inference (MNLI) is also a textual entailment task and part of the GLUE benchmark, containing pairs of premises and hypotheses (Williams et al, 2018). In contrast to RTE, the data set is categorized with three labels: entailment, contradictory and neutral. While the labeled test set is not publicly available, two development sets are provided, one of which was used as the test set in our experiments. The best performing BERT model (2 epochs) utilized a batch size of 64, a learning rate of \(2e-5\) and mixed-precision training (FP16). The hyperparameters of the T5 model remained unchanged.

  • The Microsoft Research Paraphrase Corpus (MRPC) aims to teach models to detect paraphrases (Dolan & Brockett, 2005). It makes up a part of the GLUE benchmark as well. Here, sequence pairs are labeled as 1, if the second sequence is semantically equivalent to the first one, and 0 otherwise. Detecting paraphrases is similar to grading short answers, where student answers should be semantically equivalent to the reference solution. Therefore, we can view instances labeled with 0 as incorrect and paraphrases as correct. The best BERT model (3 epochs) trained with a batch size of 32, a gradient accumulation over 2 batches, a learning rate of \(2e-5\) and mixed precision. The best T5 model (3 epochs) used a batch size of 8 and gradient accumulation over 4 batches.

Human Evaluation

While calculating the attack’s success rate is easily done, other quality dimensions are harder to measure. For example, since automatic metrics and models have difficulties capturing the meaning of utterances (Bender & Koller, 2020; Reiter, 2018), we need to rely on human judgment to determine whether our generated samples adhere to the class equivalency constraint. That is, whether the answers are still incorrect after our modification. Similarly, we require human opinions to estimate how easily adversarial examples are detected. While there are attempts to detect adversarial attacks automatically, they are most often bypassable with tweaks to the algorithm (Carlini & Wagner, 2017). Ultimately, we also expect a human grader to have the final say, making their judgment the most important to students. Asking humans to evaluate given texts is a well-known task in Natural Language Generation (NLG). Therefore, we will defer to NLG guidelines when evaluating our manipulated student responses. As judgments are often subjective, it is recommended to collect at least 3 different annotations per text to increase the evaluation’s reliability (Van Der Lee et al, 2019). Thus, we need at least 3 human graders for each experimental condition.

For this purpose, we conducted an online survey with 7 experienced graders. We selected graders based on their teaching and grading experience, English skills and availability. All annotators possessed university degrees and routinely graded short answer tasks for university courses—mainly in the computer science domain. Therefore, they should have the general education required to assess the primary and middle school science questions contained in the ASAG benchmark dataset SciEntsBank. We also included the reference answers in the questionnaire and were available to answer questions about the material to ensure the understanding necessary for grading. The graders filled out the questionnaire independently from each other.

The annotators had diverse backgrounds, hailing from India, Iran, Syria, Slovenia and Germany. While none of them were native English speakers, all of them spoke English fluently. Two of the annotators were female and five were male. We randomly assigned annotators to either the control (N = 4) or the experimental (N = 3) condition. In the control condition, annotators viewed 30 unmodified student answers and rated the answers’ naturalness, correctness and suspiciousness on 5-point Likert scales. Here, naturalness refers to how likely a text was produced by a human, considering only form (Howcroft et al, 2020). Correctness refers to how accurately and completely the question is answered. Suspiciousness, or mistrust, captures how much a person believes the student is trying to cheat an automatic grading system.

After piloting this study, we chose to include explanations with examples for each level on the scale to increase the annotators’ understanding. The exact questions, as well as the hints, can be seen in Fig. 2. When annotators thought the student was cheating (by scoring at least 4 on the mistrust scale), they were also asked whether they would take action based on their opinion. This conditional Yes/No question can be seen in Fig. 3. The experimental group answered the same questions for the adversarially modified but otherwise identical answers.

Fig. 2 Screenshot of survey questions posed in the human evaluation of the attack

Fig. 3 Screenshot of the conditional question whether the annotator would act on their suspicion

The answers were randomly sampled from the successful adversarial examples that fooled the T5 model on the SciEntsBank data set, such that each question appeared only once in the survey. Thus, each test set resulted in a list of questions, each paired with a random student answer and a random adversarial perturbation. To save our experts’ time, we selected from each list the 10 shortest answers that did not reference external material, such as graphs or tables. Since this left only 8 questions stemming from the unseen questions test set, we oversampled the unseen answers test set to compensate. Annotators were informed that some of the responses may have been manipulated to fool an automatic grading system.

In compliance with the guidelines on ethical studies with human participants, we informed our annotators of the study’s risks and benefits, gave our contact information and stated that the study was voluntary and could be aborted at any time. Additionally, we ensured that all given opinions were anonymized prior to analysis and publication. We did not impose any time constraints on filling out the questionnaire. However, the questionnaire was designed to take 45–60 min. We deemed 60 min to be the upper time limit justifiable considering the annotation task’s complexity and the required concentration. Since we estimated that annotators would need 1–2 min per answer, we limited the number of answers to be evaluated to 30. On average, annotators required 53.14 min to complete the survey.

Results

This section presents our hypotheses, compares the effectiveness of our attack to the state-of-the-art attack TextFooler (Jin et al., 2020) and provides a deeper analysis of the models’ brittleness. Finally, we offer the results of our human evaluation and analyze the agreement between our expert graders.

Predictions

The following expectations (E) and hypotheses (H) motivate our experiments. Expectations will be explored descriptively while hypotheses will be tested.

E1

We expect our attack to perform competitively compared to the state-of-the-art attack TextFooler in terms of accuracy degradation.

E2

We expect our attack to exploit spurious correlations between adjectives and adverbs and the target class. Thus, adjectives and adverbs that successfully fool a model should appear more often in correct than incorrect student responses in the model’s training set.

E3

Our attack is primarily successful on low-confidence predictions, that is, predictions where the class probability assigned by the model is considerably smaller than one.

H4

Manipulations generated by our attack do not make incorrect student responses appear more correct to humans.

H5

Humans perceive manipulated responses as less natural compared to unmodified student responses.

H6

Humans do not perceive manipulated responses as more suspicious compared to unmodified student responses.

Comparison to State-of-the-Art Attack TextFooler

First, we want to compare how well our attack can degrade a model’s performance compared to the state-of-the-art. We choose the TextFooler approach by Jin et al (2020) to represent the state-of-the-art for two reasons. First, it has a high success rate compared to other attacks. Second, it is open-source, allowing for quick and easy reproduction of the authors’ approach. Table 2 shows our attack’s and TextFooler’s performance on the datasets introduced in "Datasets & Hyperparameters". We target BERT and T5 models with our attack and the same BERT model with TextFooler. We do not evaluate TextFooler on T5, as the attack utilizes the prediction score for the target class, which we do not have readily available in a text generation model.

Table 2 Comparison of our attack to TextFooler (TF)

As expected, the models’ base performance without adversarial manipulation varies from dataset to dataset, with small datasets, such as RTE and MRPC, and challenging tasks, such as generalizing to unseen questions or domains, lagging in terms of accuracy. Interestingly, the absolute loss in accuracy caused by each attack seems relatively stable across tasks and datasets, even when the original performance varies.

TextFooler requires less computation time than our attack on every dataset. This is expected since TextFooler uses the target label’s prediction scores to find important words in a sequence that it can then manipulate. In contrast, our attack assumes such information to be inaccessible to students and, therefore, does not tailor its manipulations to significant words. This difference is also reflected in our attack finding more adversarial examples, as it tries more possible combinations per student answer. Even though our search is less guided, our attack is slightly more effective at reducing the models’ accuracy on the ASAG task, degrading the accuracy by an additional 0.4–3.8 percentage points across the SciEntsBank test splits. However, since TextFooler outperforms our attack on the other tasks (by 2.9–8.1 percentage points), we conclude that the attacks’ performance is dataset-dependent. Across all models and datasets, our attack deteriorates a model’s accuracy by 8 to 22 percentage points.

Interestingly, our attack seems to be equally or more effective on T5 than BERT, even though T5 is a newer model. Especially for the data splits SEB UQ and MRPC, where T5 originally outperforms BERT, this indicates that at least some of T5’s performance gain is due to unreliable statistical features.

Source of the Model’s Brittleness

Next, we want to investigate possible reasons for the attack’s success. Knowing why the model’s predictions are brittle may allow educators to develop appropriate defense mechanisms or reveal potential warning signs. Since we are mainly interested in our attack’s behavior in automatic grading scenarios, the rest of our analyses will focus on the SciEntsBank dataset. First, we will investigate the distribution of adjectives and adverbs in the training data. We expect that successful adjectives and adverbs found with our attack are more often associated with correct student responses (E2).

In general, the dataset contains slightly more incorrect responses (2462) than correct ones (2008). On average, correct responses are slightly longer than incorrect answers, with 13.4 words per answer compared to 11.7 words per answer. Correct answers also average more adjectives (1.1) and adverbs (0.6) per answer than incorrect ones (0.8 and 0.5, respectively). We mainly observed two patterns when plotting the occurrences of the most successful adjectives and adverbs in each class. Either the adjectives and adverbs were much more common in correct student responses, or they hardly appeared in the training set. Figure 4 illustrates both patterns for the 10 adjectives causing the most misclassifications on the unseen answers test split. Some rare words seem to be synonyms of words common in correct responses, like “complete” and “entire”. Others are also expected to be close in the embedding space, such as “completely”—one of the top ten adverbs. Only one of the most successful insertion words appeared notably more often in incorrect student responses. The adjective “better” occurred 15 times in incorrect responses and only 4 times in correct answers. Thus, we conclude that our evidence supports E2 for most adjectives and adverbs, but not all.
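A sketch of this frequency analysis is shown below: for each class, we count how often the most successful insertion words appear as adjectives or adverbs in the training answers. The training_set structure, the restriction to adjective/adverb usages and the label names are assumptions made for illustration; the NLTK resources from the earlier sketches are required.

```python
from collections import Counter

import nltk


def per_class_counts(training_set, words):
    """training_set: iterable of (answer_text, label) pairs; words: insertion words of interest."""
    counts = {"correct": Counter(), "incorrect": Counter()}
    for answer, label in training_set:
        if label not in counts:
            continue  # e.g., skip "contradictory" answers in the 3-way SciEntsBank setup
        for token, tag in nltk.pos_tag(nltk.word_tokenize(answer), tagset="universal"):
            token = token.lower()
            if token in words and tag in {"ADJ", "ADV"}:
                counts[label][token] += 1
    return counts


# e.g., per_class_counts(seb_train, set(adjectives[:10]) | set(adverbs[:10]))  # data behind Fig. 4
```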

Fig. 4 Number of occurrences of the 10 most successful adjectives (top) and adverbs (bottom) in the SciEntsBank training set per class

Next, we investigate the model’s confidence when classifying adversarial examples. To be specific, we analyze the class probabilities given by a softmax of BERT’s final outputs. We plot them before and after the adversarial insertion in Fig. 5. For reference, we also provide the confidence scores for all incorrect student responses correctly classified by the model. We can see that soon-to-be adversarial examples elicit lower confidence than most predictions before the attack. Most test answers are classified with a confidence score between 0.8 and 1, while the model estimates most soon-to-be adversarial examples to be incorrect with a probability between 0.45 and 0.65. Since we have three classes in the dataset, a class needs at least a probability of 0.33 to be selected. After the attack, adversarial examples tend to elicit similar confidence—but for the target class. These observations are in line with our expectation E3. We will further discuss the ramifications of our results in "Recommendations".
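The confidence scores can be obtained, for instance, as the softmax over BERT’s classification logits, as sketched below. The way the question, reference answer and student answer are paired is an assumption and must match how the model was fine-tuned.

```python
import torch


@torch.no_grad()
def confidence(model, tokenizer, question, reference, answer):
    """Return (top class probability, predicted class index) for one student answer."""
    inputs = tokenizer(question + " " + reference, answer,
                       return_tensors="pt", truncation=True)
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    return probs.max().item(), probs.argmax().item()
```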

Fig. 5 BERT’s confidence levels for all incorrect samples it classifies correctly (left), all examples that will be misclassified after the attack (middle) and all adversarial examples (right)

Human Evaluation

The goal of the following survey was to investigate our attack’s effect on the naturalness, correctness and suspiciousness of student answers. Figure 6 shows the distribution of scores assigned to the answers in the control and experimental group. The means and standard deviations for each question can be found in Table 3.

Fig. 6 Distribution of assigned Likert scores by the annotators in the human evaluation. The top row depicts the ratings given by graders in the control group, while the bottom row shows the same for the experimental group. A one on the Likert scale encodes a low magnitude of the given construct, while a five indicates the answer was very natural, correct or suspicious. The absolute number of each rating and percentage are displayed next to the respective bars

Table 3 Krippendorff’s \(\alpha\), mean (M) and standard deviation (SD) of the graders’ ratings

To test the hypothesis that our attack does not increase the actual correctness of responses (H4), we test for inferiority employing the two one-sided tests (TOST) procedure as discussed by Wellek (2002). We select the non-parametric Mann–Whitney U test since our data is ordinal, and we average the scores assigned by the various graders in a group into a more reliable and independent measurement of each answer’s correctness. As suggested by Lakens (2017), we chose \(-\infty\) as the lower bound to test for inferiority instead of equivalence and 0.5 as the upper bound. Our observations are consistent with H4 (\(n_1 = n_2 = 30\), \(U_{control} = 597.5\), \(U_{adv} = 302.5\), \(p = 0.015\)). Thus, human graders generally awarded fewer or equal points to manipulated answers, indicating that our attack does not make the student answers correct. It only tricks the automatic model into predicting them as such, hence adhering to the class equivalency constraint of adversarial examples.
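Because the lower bound is \(-\infty\), the TOST procedure here reduces to a single one-sided test against the 0.5-point upper margin. One plausible way to encode that margin with scipy, assuming the per-answer scores have already been averaged across graders, is to shift the adversarial group before a one-sided Mann–Whitney U test; this is a sketch, not necessarily the authors’ exact implementation.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def inferiority_test(control_means, adversarial_means, margin=0.5):
    """Test whether adversarial scores exceed control scores by less than `margin`."""
    shifted = np.asarray(adversarial_means) - margin
    # H1: the shifted adversarial distribution is stochastically smaller than the control one.
    stat, p = mannwhitneyu(shifted, control_means, alternative="less")
    return stat, p


# e.g., inferiority_test(control_correctness, adversarial_correctness)  # n1 = n2 = 30 answers each
```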

Next, we assess whether our attack decreases the naturalness of answers (H5) using a left-tailed Mann–Whitney U test. Our collected data is also consistent with H5 (\(n_1 = n_2 = 30\), \(U_{control} = 627\), \(U_{adv} = 273\), \(p = 0.004\), \(Z = -2.6174\), \(r = 0.34\)). This result indicates that human graders perceive student answers with inserted adjectives and adverbs as less natural. We hypothesized that graders would be able to sense the manipulation but not identify it as a cheating attempt (H6).

Analogously to our inferiority test on the responses’ correctness, we utilize two one-sided Mann–Whitney U tests to check whether our attack increases the mistrust of human graders, again with \(-\infty\) as the lower bound and 0.5 as the upper bound. We found that human graders in the experimental group generally thought the students were cheating less often than or as often as in the control group (\(n_1 = n_2 = 30\), \(U_{control} = 576\), \(U_{adv} = 324\), \(p = 0.031\)). A similar trend can be observed when asking whether graders would take action based on their suspicions. In the control group, graders reported the intention of acting 14 times (N = 120). Conversely, graders only wanted to act 5 times (N = 90) on the adversarial examples. For all other answers rated with at least 4 on the mistrust scale, graders declined to speak with the student or a superior or to take disciplinary action. Examples of the most suspicious responses can be seen in Table 4. The examples also illustrate a concerning phenomenon that one of the annotators reported (translated from German): “Generally, I find it difficult to differentiate between bad English and unnatural responses.”

Table 4 Examples of the most suspicious responses from the control (top) and the adversarial group (bottom)

Inter-Annotator Agreement

As discussed in the "Human Evaluation" Section, human judgements can be subjective and inconsistent. For this reason, it is common in the NLP field to employ multiple annotators and report their agreement. The inter-annotator agreement provides a measure for how consistent judgements are across annotators. Similar to related work, we select Krippendorff’s Alpha to estimate our annotators’ agreement. As can be seen in Table 3, \(\alpha\) is relatively low compared to the broadly applied benchmark of 0.67 (Krippendorff, 2018). For the highly subjective and open mistrust question, a low agreement is to be expected. The annotators were informed that some student answers might have been manipulated to fool automatic grading models but not schooled on how such a manipulation could look like. The low agreement (\(\alpha =\) 0.13) and slight systematic disagreement (\(\alpha =\) -0.11) indicate that the annotators developed individual theories of what cheating would entail in an automatically graded environment.

Additionally, there was a moderate negative Spearman’s rank correlation (\(\rho\)) between mistrust and naturalness (\(\rho =\) -0.41) as well as mistrust and correctness (\(\rho =\) -0.51) in the control group. In contrast, the correlations in the experimental group were much weaker with \(\rho =\) 0.20 and \(\rho =\) 0.07, respectively. This indicates that graders suspect poorly written and wrong answers in the absence of other clues. We will further discuss this behavior and possible ramifications in "Recommendations".

While low inter-annotator agreement is a phenomenon commonly observed in natural language evaluation (Amidei et al, 2019), we were surprised to see \(\alpha\) below 0.3 for naturalness. As recommended by Amidei et al (2019), we calculate \(\rho\) for each annotator pair to gain more detailed insight compared to \(\alpha\)’s holistic score. In the control group, one of the annotators is an outlier with pairwise \(\rho\)’s of 0.14, 0.07 and -0.02. The rest of the annotators average a moderate to strong correlation of \(\rho\) = 0.57 (Corder & Foreman, 2011; Dancey & Reidy, 2007). We decided against excluding the outlying annotator from further analysis. Their judgment on the other questions was more in line with the majority, indicating a divergent but potentially valid interpretation of naturalness instead of a systematic disregard for the task. In the experimental group, the average \(\rho\) is 0.47.
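For readers who want to reproduce this kind of agreement analysis, the sketch below computes Krippendorff’s \(\alpha\) with the third-party krippendorff package and the pairwise Spearman correlations with scipy. The (n_graders, n_answers) rating matrix is an assumed input format.

```python
from itertools import combinations

import krippendorff
import numpy as np
from scipy.stats import spearmanr


def agreement(ratings):
    """ratings: array of shape (n_graders, n_answers) with 1-5 Likert scores (np.nan for missing)."""
    alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
    pairwise = {}
    for i, j in combinations(range(len(ratings)), 2):
        rho, _ = spearmanr(ratings[i], ratings[j])
        pairwise[(i, j)] = rho
    return alpha, pairwise


# e.g., agreement(np.array(control_naturalness_ratings))  # 4 graders x 30 answers
```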

For correctness, the agreement levels are \(\alpha =\) 0.51, \(\rho =\) 0.6 in the control group and \(\alpha =\) 0.55, \(\rho =\) 0.61 in the experimental group. Our observed agreement is expected, considering the generally high inter-grader variability of scores assigned in short answer grading tasks (Starch & Elliott, 1913).

Discussion & Conclusion

In summary, we have introduced an adversarial attack strategy developed explicitly for automatic short answer grading scenarios. It first identifies promising adjectives and adverbs during formative assessment in preparation for employing them during the summative assessment. Our proposed attack reduces a model’s accuracy by 8 to 22 percentage points. We demonstrate the attack’s applicability to various domains and datasets, where inserting a single adjective or adverb is unlikely to change an input’s actual class. Thus, the attack is suited for academic disciplines where the factual correctness of responses is essential and may be unsuited to language learning scenarios where linguistic expression is vital.

Further, we conducted a human expert evaluation to measure our attack’s influence on the student answers’ correctness, naturalness and suspiciousness. In our experiments, the attack did not significantly increase the correctness or suspiciousness but significantly reduced the perceived naturalness of student responses. However, the decrease in naturalness is most likely due to the imperfection of the automatic insertion process. When students discover adjectives and adverbs the model associates with correct responses, they are likely to incorporate them more naturally into their responses than our automatic procedure does. Finally, we analyzed the adjective and adverb distribution in the training data and the model’s confidence to investigate possible reasons for the model’s vulnerability. We found that successful adjectives and adverbs appeared more often in the target class or hardly occurred in the training set. Additionally, adversarial examples tended to elicit a lower confidence score in the model than answers that were not vulnerable to this attack.

The following section offers recommendations for educators looking to employ automatic short answer systems in practice. The recommendations are based on our findings and general knowledge about adversarial attacks. Finally, we will discuss the limitations of our experiments and future work in "Limitations & Future Work".

Recommendations

Know Thy Dataset

This is especially important as more and more off-the-shelf models become available for various tasks. This development makes it easy to treat machine learning models as black boxes without considering the possible consequences of their training process. However, a training data analysis can reveal statistical correlations that lead to unreliable prediction features. In our experiments, our attack exploited correlations between adjectives/adverbs and the target class. Beyond our work, non-robust features have been demonstrated for many popular datasets (Ilyas et al, 2019). One can also utilize adversarial attacks during training to automatically uncover unreliable features. This is also known as adversarial training and is one of the most promising defenses against adversarial attacks (Shafahi et al, 2019). However, it is still limited in its effectiveness: it typically comes with a loss in accuracy on clean data and tends to generalize poorly to novel attack strategies. Moreover, knowledge of potential biases in the dataset can help mitigate discrimination of populations that are not well represented in the data (Mehrabi et al, 2021).

Beware of Low Confidence Predictions

The probabilities assigned to each class can be a valuable indication of whether the prediction is trustworthy. While confidence scores are by no means infallible, they can be a warning sign for when a student’s answer should be referred to a manual grader. In our experiments, many of the generated adversarial examples could have been caught this way.
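A minimal sketch of such a safeguard is shown below, reusing the confidence helper sketched in "Source of the Model’s Brittleness"; the 0.8 threshold is illustrative and would need to be calibrated on validation data.

```python
def route_answer(question, reference, answer, model, tokenizer, threshold=0.8):
    """Send low-confidence predictions to a human grader instead of auto-grading them."""
    score, predicted_class = confidence(model, tokenizer, question, reference, answer)
    action = "manual_review" if score < threshold else "auto_grade"
    return action, predicted_class, score
```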

Train Personnel on What to Expect

While automatic grading models are making great strides towards human-like performance on some datasets, we would still recommend employing humans in the grading loop. They can double-check low confidence predictions and perform quality control checks. However, it is vital to educate human control graders on what cheating attempts can look like in the age of automatic grading. In our inter-annotator agreement analysis, we observed graders developing individual theories of what made student responses suspicious. Their mistrust would also correlate with how unnatural and incorrect they perceived student answers to be. So, in the absence of other clues or knowledge, the graders in our study would falsely suspect low-performing students and students with poor language skills. We believe that educating human graders on different kinds of attacks and how they express themselves in responses could mitigate such discrimination. In general, any detection method would have to be carefully implemented to avoid disadvantaging minorities not well represented in the data.

Balance Transparency and Exposure of Vulnerabilities

It is crucial that students comprehend their grades. Understanding why a particular grade was given is essential to foster acceptance and enable learning from feedback. Here, making the model’s decision process transparent to students is a powerful approach to increase understanding. However, transparency may also reveal exploitable weaknesses, such as unreliable features. Having access to the model’s inner workings enables more powerful and efficient adversarial attacks. Therefore, one can argue that keeping grading models secret is sensible. Moreover, one may implement measures that make it harder for adversaries to glean information from querying the model. For example, one can limit the number of times students can receive feedback from the model in a time span to humanly reasonable levels, thus, hindering automatic probing.

Limitations & Future Work

Finally, we will point out a few limitations of our experiments and ideas for future work. This paper focused on the effects of one adversarial attack strategy. As the space of possible adversarial manipulations is quite large, it will be exciting to see how well other strategies perform. We then plan to utilize developed attacks in adversarial training to make grading models more robust and explore the models’ usability and security in practice. Here, one could also investigate other effects of adversarial attack assessment strategies, such as their impact on responses the student would have answered correctly without the adversarial modification. Moreover, we assumed that potential attackers would purposefully aim to fool the model into accepting incorrect responses. It would also be interesting to investigate grading models’ robustness to non-malicious writing styles and mistakes, such as common typos or varying verbosity levels.

Additionally, our experiments could be expanded to other automatic short answer grading architectures. So far, we have explored the attack’s effectiveness on transformer-based models on various datasets. While the existence of adversarial vulnerabilities is generally believed to be a result of neural networks exploiting unreliable correlations in the training data instead of being a bug of a particular architecture or hyperparameter setup (Ilyas et al, 2019), we cannot rule out that other grading models may be significantly less sensitive to our particular attack. Especially classical machine learning models based on engineered features are likely to require attacks tailored to their feature sets.

Lastly, we mainly see two factors restricting the generalizability of our human evaluation. First, the number of samples annotated was not large enough to reliably detect minor effects. Especially for the mistrust hypothesis, a follow-up study with a larger sample size would have to be conducted to rule out the attack making responses slightly more suspicious. Considering our graders took almost an hour to rate 30 responses, we think more annotators and multiple annotation sessions would make sense.

Second, all of our graders stem from engineering fields and work at a university. It would be interesting to see whether our observations also hold for other fields and other educational institutions. Especially American school teachers may be better at differentiating manipulated answers from poorly written ones. While our annotators were accustomed to grading English short answers in their daily lives and speak English proficiently, they were not native speakers. Moreover, they stem from various countries, such as India and Slovenia, and may speak different English dialects. This probably impacted the evaluation of naturalness, as indicated by the low inter-annotator agreement, but we expect only a minor effect on the correctness and mistrust scales.