Corpora
Each of the short-answer corpora available for research was collected to serve a specific research interest, and as a result, there are substantial differences in corpus characteristics. For example, SAG corpora may contain a single set of grade annotations or multiple grades by different annotators. The goal of our study is to consider as many major SAG corpora as possible; however, given the research question, we can only include corpora with multiple human grade annotations. Notably, this means we exclude the popular SemEval-2013 data (Dzikovska et al., 2013), which was created for a shared task with the intention of bringing together research on SAG with research on textual entailment (Dagan et al., 2009), but which only provides a single grade annotation for each item.
Furthermore, we also disregard the pioneering data from Mohler et al. (2011). The double annotation available for this corpus has been found to be quite inconsistent (Mieskes & Padó, 2018), probably because no grading guidelines (in the form of rubrics or reference answers) were available (Mohler et al., 2011). This differs from the annotation process for the other SAG corpora and introduces an extraneous source of grader disagreement that we cannot hope to model.
This leaves us with the corpora shown in Table 3: the training portion of ASAP, a corpus of English-language school test items for science and English, and ASAP-de, a collection of German answers to translations of three of the ten questions in English ASAP (Horbach et al., 2018). Further, we use CREE (Meurers et al., 2011b) and CREG (Meurers et al., 2011a), an English-language and a German-language collection of language learner answers to reading comprehension questions. Since the standard CREG version (CREG-1032) by design only contains answers that were graded in agreement, we extracted additional answers to the questions documented and annotated in CREG-1032 from the larger CREG-23k corpus version (which also contains answers with disagreeing annotations).
CSSAG (Padó & Kiefer, 2015) is a collection of German answers to university-level test items for a programming class. Finally, we include the English-language Powergrading corpus (PG; Basu et al., 2013), which contains answers to questions from the US immigration quiz, collected through crowdsourcing.
Grader agreement
Our target variable is grader agreement. As discussed above, we want to capture the agreement between the graders down to the level of the individual answer. For use as the target variable, we need to model it in a way that is as comparable between corpora as possible.
One important source of incomparability between corpora is the grading scheme (the number of levels available to the annotators), which can vary by question. In our data, only CREE, CREG and PG use the same grading scheme for all answers (namely binary grades, i.e., correct/incorrect). The two ASAP variants and CSSAG vary the grading scheme by question. ASAP uses either a three-point or a four-point scale, while the majority of CSSAG is graded on a three- or five-point scale. More grading levels potentially lead to more disagreement between the annotators, but when annotators choose adjacent categories, the disagreement is less serious than on a scale with fewer levels. Therefore, using the absolute difference in assigned grades as a measure of disagreement is misleading.
We attempt to standardize across grading schemes by investigating binarized annotator agreement. This relativizes the differences between grader decisions on fine-grained grading scales while preserving major disagreements and decisions on coarser scales. In all corpora, we define agreement to hold if the annotators assigned scores that differ by less than one half of the available points. On a four-point scale, this allows graders to differ by one point, but a difference of two points counts as disagreement. On a two-point scale, the grades have to be identical to count as agreement.
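The following minimal sketch illustrates this rule (the function name and scale encoding are ours, not part of the corpora's documentation):

```python
def binarized_agreement(grade_a: int, grade_b: int, scale_points: int) -> int:
    """Return 1 if two assigned grades count as agreement, 0 otherwise.

    Agreement holds if the absolute grade difference is smaller than
    half of the points available on the question's grading scale.
    """
    return int(abs(grade_a - grade_b) < scale_points / 2)

# Hypothetical examples:
assert binarized_agreement(3, 2, scale_points=4) == 1  # adjacent grades on a four-point scale agree
assert binarized_agreement(3, 1, scale_points=4) == 0  # two points apart counts as disagreement
assert binarized_agreement(1, 0, scale_points=2) == 0  # binary scale: grades must be identical
```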
The choice of one half of the available points is, in principle, a parameter of our study. However, with a more lenient definition, hardly any cases of disagreement remain, while a stricter definition throws the differences in grading granularity among corpora into sharper relief.
As with binning procedures generally, this process loses some information. However, a binary measure of agreement represents the least common denominator of the corpora we consider, since three of the corpora in our study (CREE, CREG and PG) only use a two-point grading scale. The other corpora use between three and seven points, depending on the individual question. Thus, more fine-grained measures of agreement are only applicable to substantially smaller subsets of our data set. A second, more theoretical motivation is that our choice of binarized agreement is equivalent to Fleiss’ \(\kappa \) when \(\kappa \) is computed at the level of individual data points for two annotators (viz., 1 for agreement and 0 for disagreement, Mieskes, 2009).
Table 2 Class distributions for inter-grader agreement for various corpora

Table 2 lists the distribution of binarized agreements and disagreements in the different corpora. We also show the log odds between classes to quantify the prevalence of disagreement. High log odds mean that one class strongly predominates; log odds near 0 indicate an equal distribution of classes. The log odds vary between 3.15 (PG, least amount of disagreement) and 1.47 (CSSAG, highest amount of disagreement), with most values around 2. The table shows no effect of the size of the grading scale on binarized consistency: the corpora that use multi-point scales and might show inflated consistency due to the binarization rank second, fifth and sixth in consistency, while a corpus with a binary grading scale is the most consistent.
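For reference, the log odds for a corpus with \(n_{\text{agree}}\) agreeing and \(n_{\text{disagree}}\) disagreeing answers are computed as (we assume the natural logarithm here; the counts in the example are hypothetical):

\[ \text{log odds} = \ln \frac{n_{\text{agree}}}{n_{\text{disagree}}} \]

A hypothetical corpus with 960 agreeing and 40 disagreeing answers thus yields \(\ln(960/40) = \ln 24 \approx 3.18\), while an equal split yields \(\ln 1 = 0\).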
Properties
This section discusses the properties of the selected SAG data collections that are presumably relevant to grader agreement. We first present properties on the corpus level, then properties that vary with each question, and finally properties that vary with each answer.
Corpus level
Table 3 shows the corpora in our study and some of their properties. On the corpus level, one obvious difference between the data sets is the assessment task, i.e., the underlying goal of the assessment questions answered by the students. Assessment is focused either on content or on (second) language skills. With the exception of ASAP, assessment task is located on the corpus level. The content assessment (CA) corpora are ASAP-de, CSSAG and PG; CREE and CREG are pure language assessment (LA) corpora. For ASAP, half the questions are CA questions and half are LA questions (see also Sect. 3.3.2).
Table 3 Properties of short answer corpora (CA content assessment, LA language assessment, en/de English/German, Stand. Standardized testing, Classr. Classroom testing)

Assessment task co-varies with another important variable: with the exception of CREE and CREG, which are corpora of language learner answers only, all respondents are assumed to be native or near-native speakers of the target language. Therefore, while ASAP contains LA questions, these are more complex than the LA questions in CREE and CREG, since they are aimed at (near-)native speakers of the target language instead of learners. Conversely, the assumption that CA respondents are native speakers does not always hold. In principle, language ability should not be a factor in SAG grading reliability, because language correctness is not taken into account during grade assignment: mistakes by non-native respondents should not adversely influence their grades. However, Weiss et al. (2019) found a disproportionate impact of error rate (but not linguistic complexity) on final grades when investigating teacher grading of German written final exams. Therefore, language errors and the presence or absence of linguistic complexity might conceivably have biased graders in our data sets as well. In the absence of information on students’ language ability, we do not investigate this further.
Another very visible difference is the target language. We have three German and three English corpora that are mostly unrelated; the pair ASAP and ASAP-de is currently the best attempt to control the factor of language in SAG corpora. We do not expect the language to influence grader consistency in a systematic fashion—there is no reason why speakers of English should consistently grade English answers more strictly or leniently than speakers of German grade German answers.
Another variable of potential influence is the collection context of the corpus. The ASAP data was collected from standardized testing in US schools and is accordingly graded by the test provider’s trained annotators according to standardized rubrics. CREE, CREG and CSSAG assemble answers from small-scale ad-hoc testing, while data for ASAP-de and PG were collected by crowdsourcing. For all of these corpora, grading was done by the teachers, the researchers or research assistants.
Corpus sizes vary with collection context. In standardized testing, large amounts of data are available, and ASAP is the largest corpus in our data set. Ad-hoc testing yields the smallest corpora. We do not model corpus size as a predictor variable but account for it in the estimation of our model (cf. Sect. 3).
Finally, pseudonymized student identities sometimes appear as a corpus-level feature, when each student is given an individual ID code so their performance can be tracked across different questions in the corpus (this is the case in CSSAG, for example). For most corpora, however, student ID numbers appear to be re-initialized for each question, placing the property on the question level, which we discuss next.
Question level
As discussed directly above, assessment task is strictly speaking a question-level variable in our study, since the ASAP corpus contains both content assessment and language assessment questions. All other corpora contain data for a single task.
Another question-level property that we consider is question difficulty level. We assume that harder questions have more complex answers that are in turn harder to grade consistently. For content assessment, difficulty can be described with the Cognitive Process dimensions of Bloom’s Taxonomy (Anderson & Krathwohl, 2014). There are six levels describing the different cognitive processes necessary to answer questions. The first three levels are relevant here (none of the more advanced levels are present in the corpus data). They are remember (pure factual retrieval), understand (demonstrating comprehension of concepts, e.g., by explaining, comparing or classifying) and apply (using knowledge to solve a new problem). For language assessment, the taxonomy by Day and Park (2005) is more appropriate, since it is targeted directly at specifying the different tasks relevant for reading comprehension questions in language teaching. This taxonomy also comprises six levels, of which only the three most basic are relevant here. They are literal comprehension (repeating information from the text), reorganization (combining several explicit statements from the text), and inference (reasoning about information from the text). If several tasks are required to answer a question, the category several can be assigned. CREE already contains Day & Park difficulty level annotation.
The average answer length differs widely from question to question. We investigate this property to consider length effects on grading.
Finally, we look at answer set homogeneity, that is, the average similarity among the answers to one question. Previous research has shown that this property is correlated with question difficulty (Padó, 2017) and varies between CA and LA corpora (Padó, 2016). We expect that it should be easier to grade consistently if all the answers to a question are similar and the same grading decision can be re-applied multiple times. Annotator agreement will thus improve if both annotators consistently apply the same grade to the homogeneous answer set. However, the two annotators might each consistently apply a different grade; the result would then be maximal inconsistency between the annotators on the question level.
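As a rough sketch of how such a measure can be computed (we assume TF-IDF cosine similarity purely for illustration; it is not necessarily the similarity measure used in our experiments), answer set homogeneity can be estimated as the mean pairwise similarity among all answers to a question:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def answer_set_homogeneity(answers: list[str]) -> float:
    """Mean pairwise similarity among all answers to a single question."""
    tfidf = TfidfVectorizer().fit_transform(answers)
    sim = cosine_similarity(tfidf)                 # n x n similarity matrix
    mask = ~np.eye(len(answers), dtype=bool)       # drop self-similarities on the diagonal
    return float(sim[mask].mean())
```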
Answer level
On the answer level, we consider an answer’s similarity to the other answers to the same question. We include this property to complement the question-level factor of answer set homogeneity. On the question level, answer set homogeneity tells us how similar all answers to one question are on average. On the answer level, answer similarity indicates whether the answer in question is very similar to other answers to the same question, or whether it is an outlier. Outliers might be easy to grade if they are clearly wrong or just a statement like “I don’t know”, or they might be hard to grade if they misinterpret the question or are highly original. Conversely, typical answers might be easier to grade because the same grading decision made elsewhere can just be re-applied.
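Under the same illustrative assumptions as the question-level sketch above (TF-IDF cosine similarity as a stand-in for the actual measure), the answer-level property is the mean similarity of each individual answer to all other answers to the same question:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def per_answer_similarity(answers: list[str]) -> np.ndarray:
    """For each answer, the mean similarity to the other answers to the same question."""
    tfidf = TfidfVectorizer().fit_transform(answers)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)                     # exclude self-similarity
    return sim.sum(axis=1) / (len(answers) - 1)
```

Averaging these per-answer values over all answers to a question recovers the question-level homogeneity measure sketched above.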
A related concept, question–answer similarity, might also seem promising as a property at first glance, assuming that answers containing words that occur in the question address the correct topic. However, it is well known that students use repetition of question words as a filler strategy, so question–answer similarity is not informative. Therefore, we do not consider this property (cf. Mohler et al., 2011).
The final relevant property is the correctness of an answer. Correct answers are arguably easier for human graders to grade consistently since they can, for many difficulty levels, be compared against a reference answer or a positive statement in a scoring rubric. However, such matching strategies are not equally effective for all difficulty levels: the more creative the student has to be, e.g., in reorganization, the smaller the benefit for correct answers. We therefore expect an interaction between difficulty level and correctness (see Sect. 4.4).