As mentioned in the previous section, the tool must analyse the code and classify the comments. For the static code analysis, we refer to Listing 3 (part of the solution given to the assignment presented in Section “Application Scenario and Educational Impact”).
Extracting the command (line 1), respective output (lines 3–6), and comment (line 8) is straightforward, as is the comparison of the command and output with the solution given by the professor (e.g., whether it is the same command or whether it has produced the same output).
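To make the extraction step concrete, the following minimal sketch (in Python) splits a solution chunk such as Listing 3 into command, output, and comment lines. The "##" prefix for console output is the knitr default and an assumption here, as is the function name; the actual tool may parse the chunk differently.

from typing import Dict, List

def split_chunk(chunk: str) -> Dict[str, List[str]]:
    """Separate an R chunk into command, printed output, and free-text comment lines."""
    parts = {"command": [], "output": [], "comment": []}
    for line in chunk.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("##"):      # printed R output (knitr convention, assumed)
            parts["output"].append(stripped.lstrip("# "))
        elif stripped.startswith("#"):     # the student's free-text comment
            parts["comment"].append(stripped.lstrip("# "))
        else:                              # the R command itself
            parts["command"].append(stripped)
    return parts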
The comment is passed to the NLP engine, which extracts the features that are then used by the R engine for classification. For this purpose, a supervised model was trained to build a classifier that determines whether a comment gives the correct answer. In general, the goal of this classification step was to capture the similarity between the student’s answer and the correct answer, with a high similarity leading to a pass judgement. Two real examples from the dataset are reported below, translated into English.
Correct answer:
(Student) The p-value is less than 0.05; hence, the result on the sample is statistically significant and can be extended to the population.

Incorrect answer:
(Student) The p-value is > 0.05; hence, the result is statistically significant, or in other words, the wound appearance does not depend on the type of surgery.
(Gold) Given that the p-value is < 0.05, I can generalise the result observed in my sample to the population; hence, there is a statistically significant difference in the appearance of the wound between Surgery A and Surgery B.
The task was extremely challenging because the sentence pairs in the dataset showed a high level of word overlap, and the only discriminant between a correct and an incorrect answer could be the presence of “<” instead of “>”, or a negation.
Data
To train the classification model, we built the dataset available at Angelone et al. (2019), which contains a list of comments written by students, each with a unique ID, its type (i.e., whether it was given for the hypothesis test or the normality test), its ‘correctness’ in a range from 0 to 1, and its fail/pass result, accompanied by (i) the gold standard (i.e., the correct answer) and (ii) an alternative gold standard. All students’ answers were collected from the results of real exams; all gold standard answers, correctness scores, and pass/fail judgements were assigned by the professor who evaluated those exams.
To increase the number of training instances and achieve a better balance between the two classes, we also manually negated a set of correct answers and reversed the corresponding fail/pass result, thereby adding a set of (iii) negated gold standard sentences. All student and gold standard sentences were in Italian. In summary, the dataset contained 1,069 student/gold standard answer pairs, 663 of which were labelled as “pass” and 406 as “fail”.
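Purely as an illustration of the data layout described above, the snippet below loads the pairs with pandas and counts the two classes; the file and column names are hypothetical placeholders, not the actual field names of the published dataset.

import pandas as pd

# hypothetical file and column names standing in for: unique ID, comment type,
# correctness score, pass/fail result, gold standard, alternative gold standard
df = pd.read_csv("answer_pairs.csv")
print(df[["id", "type", "correctness", "result", "gold", "alt_gold"]].head())
print(df["result"].value_counts())   # expected: 663 "pass" vs 406 "fail"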
Feature Description
The NLP engine returns two different types of features to assess the similarity between two sentences. The first set of features consists of the sentence embeddings of the two solutions to represent the semantics of the two texts in a distributional space. The second consists of distance-based features between the student’s comment and the correct answer to assess the similarity of the two sentences at the lexical and semantic levels.
For the creation of the embeddings, we relied on fastText (Bojanowski et al. 2016; Grave et al. 2018), a library developed at Facebook to learn word representations, which can also easily produce sentence representations when text snippets are given as input, as in our case. One of the main advantages of fastText is that several pre-trained models for different languages are available from the project website, with no need to retrain such computationally intensive models. More importantly, it handles rare words using subword information, which mitigates the issue of unseen words. The representation of a sentence is then obtained by combining vectors encoding information on both words and subwords. For our task, we adopted the precomputed Italian language model trained on Common Crawl and Wikipedia. This was particularly suitable for our task because Wikipedia includes scientific and statistics pages, so the specific domain of the exam under consideration is well represented by the model. The embeddings were created using continuous bag-of-words with position-weights, a dimension of 300, character n-grams of length 5, a window of size 5, and 10 negatives.
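A minimal sketch of this step with the official fastText Python bindings is shown below; the two Italian sentences are illustrative, not taken from the dataset.

import fasttext
import fasttext.util

# download (once) and load the pre-trained Italian model trained on Common Crawl + Wikipedia
fasttext.util.download_model("it", if_exists="ignore")
model = fasttext.load_model("cc.it.300.bin")

student = "il p-value è minore di 0.05, quindi il risultato è statisticamente significativo"
gold = "dato che il p-value è minore di 0.05, posso generalizzare il risultato alla popolazione"

emb_student = model.get_sentence_vector(student)   # 300-dimensional vector
emb_gold = model.get_sentence_vector(gold)         # 300-dimensional vector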
We relied on recent work on natural language inference to choose how to combine the representations of the student’s answer and the correct answer into a feature vector. Indeed, our task could be cast in a manner similar to inference, as the student’s answer should be entailed by the correct answer to obtain a pass judgement. Therefore, we opted for a concatenation of the embeddings of the premise and hypothesis (Kiros and Chan 2018; Bowman et al. 2015), which has proven effective for the task of textual inference. Because each sentence is represented as an embedding of 300 dimensions, the concatenation yields a 600-dimensional vector. This representation was then input directly to the classifier.
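Continuing the previous sketch, the two 300-dimensional sentence vectors are simply concatenated; whether the gold standard or the student answer comes first is an assumption here.

import numpy as np

# premise (gold standard) followed by hypothesis (student answer)
pair_vector = np.concatenate([emb_gold, emb_student])   # shape: (600,)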
In addition to sentence embeddings, we extracted a set of seven distance-based features, which should capture the lexical and semantic similarity between the students’ answers and correct answers. We preprocessed the answers by removing stopwords (e.g. articles or prepositions) and transcribing the mathematical notations into natural language (e.g., “>” as “maggiore” (greater)). The text was then processed with the TINT NLP Suite for Italian (Aprosio and Moretti 2018) to obtain part-of-speech (PoS) tagging, lemmatisation, and affix recognition.
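A simplified sketch of the preprocessing is given below; the symbol map and the stopword list are illustrative, and in the actual pipeline lemmas, PoS tags, and affixes come from TINT rather than from this plain tokenisation.

import re

SYMBOLS = {"<": "minore", ">": "maggiore", "=": "uguale"}
STOPWORDS = {"il", "lo", "la", "i", "gli", "le", "di", "a", "da", "in", "che", "e"}

def preprocess(text: str) -> list:
    """Transcribe mathematical symbols into Italian words, lowercase, tokenise, drop stopwords."""
    for sym, word in SYMBOLS.items():
        text = text.replace(sym, f" {word} ")
    tokens = re.findall(r"[a-zàèéìòù0-9.]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("il p-value è > 0.05"))   # ['p', 'value', 'è', 'maggiore', '0.05']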
The output from TINT was then used to compute the following distance-based features:
- Token/Lemmas overlap: A feature representing the number of overlapping tokens/lemmas between the two sentences, normalised by their length. This feature captures the lexical similarity between the two strings.
- Presence of negations: This feature indicates whether a content word is negated in one sentence and not in the other. For each sentence, we identify negations according to the ‘NEG’ PoS tag or the affix ‘a-’ or ‘in-’ (e.g., “indipendente”), and then consider the first content word occurring after the negation. We extract two features, one for each sentence, and the values are normalised by their length. A simplified sketch of these two lexical features is given after this list.
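The sketch below is a simplified version of the two lexical features above, operating on preprocessed token lists; the negation cues and the normalisation by the length of the first sentence are assumptions, since the real features rely on TINT’s ‘NEG’ tags and affix recognition.

NEG_CUES = {"non"}   # simplified stand-in for tokens tagged as 'NEG'

def token_overlap(tokens_a, tokens_b):
    """Number of shared tokens (or lemmas), normalised by sentence length."""
    shared = set(tokens_a) & set(tokens_b)
    return len(shared) / max(len(tokens_a), 1)

def negated_words(tokens):
    """First content word occurring after a negation cue.
    (The real feature also uses negative affixes such as 'a-'/'in-' detected by TINT.)"""
    negated = set()
    for i, tok in enumerate(tokens):
        if tok in NEG_CUES and i + 1 < len(tokens):
            negated.add(tokens[i + 1])
    return negated

def negation_feature(tokens_a, tokens_b):
    """Words negated in the first sentence but not in the second, normalised by length."""
    only_a = negated_words(tokens_a) - negated_words(tokens_b)
    return len(only_a) / max(len(tokens_a), 1)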
Four additional distance-based features were computed using the sentence embeddings generated in the previous step and the single word embeddings, obtained again with fastText (Bojanowski et al. 2017):
- Cosine of sentence embeddings: We computed the cosine between the sentence embeddings of the students’ answers and that of the correct answers. When represented in the same multidimensional space, the embeddings of two sentences with similar meanings were expected to be closer.
- Cosine of (lemmatised) sentence embeddings: the same feature as the previous item, with the only difference being that the sentences were first lemmatised before creating the embeddings.
- Word mover’s distance (WMD): WMD is a similarity measure based on the minimum amount of distance that the embedded words of one document must move to match the embedded words of another document (Kusner et al. 2015). Compared with other existing similarity measures, it functions well when two sentences have a similar meaning despite having limited words in common. We applied this algorithm to measure the distance between the solutions proposed by the students and the correct solutions. Unlike the previous features, this measure was computed by considering the embedding of each word composing a sentence.
- WMD (lemmatised): The same feature as the previous item, with the only difference that the sentences were lemmatised before creating the embeddings. A sketch of these embedding-based features is given after this list.
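The embedding-based distances can be sketched as follows, reusing the sentence vectors and the preprocess helper from the earlier sketches; depending on the gensim version, the wmdistance call may additionally require the pyemd or POT package.

import numpy as np
from gensim.models import KeyedVectors

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# cosine between the two fastText sentence embeddings
cos_sent = cosine(emb_student, emb_gold)

# word mover's distance over the individual word vectors (fastText .vec text format)
wv = KeyedVectors.load_word2vec_format("cc.it.300.vec")
wmd = wv.wmdistance(preprocess(student), preprocess(gold))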
In the classification experiments, we grouped and compared the features as follows:
- Sentence embeddings + distance features: We concatenated the 600-dimensional vector encoding the student and correct answer with the seven distance features described previously;
- Only sentence embeddings: the classification model was built using only the 600-dimensional vectors, without explicitly encoding the lexical and linguistic features of the student’s comment and the correct answer.
As a baseline, we also computed the classification results obtained without sentence embeddings, using only the seven distance features.
Parameter Setting
The classifier was implemented as a tuned SVM (Scholkopf and Smola 2001). We first searched for the best C and γ parameters using grid search (Hsu et al. 2016) with 10-fold cross-validation, to prevent over-fitting and to better cope with the limited size of the dataset. The best parameters were C = 10^4 and γ = 2^−6 for the complete setting (i.e., embeddings + distance features). With the same approach, we also tuned the classifier whose input consisted only of the concatenated sentence embeddings (i.e., without the distance-based features), finding best parameters of C = 10^3 and γ = 2^−3.
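A sketch of this tuning with scikit-learn is shown below; the exact parameter grids and the scoring function are assumptions, chosen so that the grids include the best values reported above, and X_train/y_train stand for the feature vectors and pass/fail labels.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [10**k for k in range(0, 6)],       # includes C = 10^3 and C = 10^4
    "gamma": [2**k for k in range(-8, 0)],   # includes gamma = 2^-6 and 2^-3
}

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    cv=10,                        # 10-fold cross-validation
    scoring="balanced_accuracy",  # assumed scoring function
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)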
Results
Evaluation of Answer Classification
The tuned models presented in the previous subsection yielded the results summarised in Table 1.
Table 1 Accuracy, balanced accuracy, F1 score, and Cohen’s K

The results indicate only a slight improvement in performance when using distance-based features in addition to sentence embeddings. This outcome highlights the effectiveness of sentence embeddings in representing the semantic content of the answers in tasks where the students’ answers and the gold standards are similar. In fact, the sentence pairs in our dataset showed a high level of word overlap, and the discriminant between a correct and an incorrect answer was sometimes only the presence of “<” instead of “>”, or a negation. We manually inspected the misclassified and correctly classified pairs to verify whether there were common patterns shared by the two groups. From a surface point of view, no differences could be observed: the average sentence length, the lexical overlap between pairs, and the presence of negations were comparable in the misclassified and correctly classified pairs. The only remarkable difference was the presence, among the false negatives, of pairs that belonged to the Pass class yet had been manually graded as “partially correct”, mainly because the student’s wording was not fully precise, whereas the professor could infer that the meaning of the answer was correct. Such cases were not necessarily clear-cut for a human grader either, and they mostly led the classifier into error.
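The measures listed in Table 1 can be computed with scikit-learn as follows; y_true and y_pred are placeholders for the gold labels and the classifier predictions on the evaluation folds.

from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, f1_score)

print("accuracy:         ", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1 (Pass class):  ", f1_score(y_true, y_pred, pos_label="pass"))
print("Cohen's K:        ", cohen_kappa_score(y_true, y_pred))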
The baseline classifier, obtained using only the seven distance-based features, yielded an F1 score of 0.758 for the Pass class and 0.035 for the Fail class, indicating that these features help in assessing matching pairs but provide little useful information about the other class. For comparison, we also experimented with feature vectors obtained using BERT (Devlin et al. 2019), which processes the words of a sentence in relation to all the other words, rather than one by one in order. BERT models can therefore judge the full context of a word by considering the words that precede and follow it. We tested these models because they have achieved state-of-the-art performance in natural language inference tasks (Zhang et al. 2018), which inspired our choice to concatenate the students’ answers and the correct answers. They have also been successfully applied to short-answer grading on English data, with larger datasets than ours (Liu et al. 2019; Sung et al. 2019). In our experiments, we adopted the Base Multilingual Cased model, covering 104 languages, with 12 layers, 768-dimensional hidden states, 12 heads, and 110M parameters. We fine-tuned the classifier using a maximum sequence length of 128, a batch size of 32, and a learning rate of 2^−5, running it for five epochs. The model, however, did not converge, and no stable results were achieved, likely because of the limited size of the training set. We concluded that such a transformer-based approach could be adopted only if more training data were collected.
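For reference, a fine-tuning sketch with the Hugging Face transformers library is given below; whether the original experiments used this library is not stated, the learning rate is a common default assumed here, and student_answers, gold_answers, and labels are placeholders.

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

# student_answers and gold_answers are lists of strings; labels: 0 = fail, 1 = pass
encodings = tokenizer(student_answers, gold_answers,
                      truncation=True, padding="max_length", max_length=128)

class PairDataset(torch.utils.data.Dataset):
    def __init__(self, enc, labels):
        self.enc, self.labels = enc, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(output_dir="mbert-grader",
                         per_device_train_batch_size=32,
                         num_train_epochs=5,
                         learning_rate=2e-5)   # assumed value

trainer = Trainer(model=model, args=args,
                  train_dataset=PairDataset(encodings, labels))
trainer.train()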
Comparison of Manual vs Automated Grading
This subsection reports on the quality of the automated grading with respect to manual grading. Specifically, the automated grading tool was used to grade the solutions of a set of assignments used in one year of exams, from December 2018 to September 2019. The results were then compared with the grades assigned by the professor, as detailed below.
We analysed 122 solutions belonging to 11 different assignments, all with the same structure as the assignment presented in Section “Application Scenario and Educational Impact”. The solutions were submitted in ten different assessment rounds. The average grade was 24/30 (s.d. 7); 18% of the solutions were considered insufficient to pass the exam.
Comparing the manual and automated grades, we measured an intraclass correlation coefficient of 0.893, with a 95% confidence interval of [0.850, 0.924]. Such a value can be interpreted as excellent agreement between the two sets of grades (Cicchetti 1994).
Figure 3 displays the manual and automated grades and the linear regression line with confidence intervals, where the automated grade is the independent variable and the manual grade is the dependent variable. The linear regression model was statistically significant (p < 0.001), with an R^2 of 0.829. This result indicates that the data were close to the regression line and, therefore, that there was an excellent linear relationship between what was measured automatically and what was assessed by the professor. Furthermore, the regression line (MANUAL = 3.54 + 0.902 ⋅ AUTOMATED) shows that the automated grading tool was more “conservative” than the professor, i.e., it returned lower grades on average. Finally, compared with the results achieved with the first implementation of the tool (see Section “Background”), we measured an increase of 0.089 points in R^2.
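The two analyses can be reproduced, for instance, with pingouin (intraclass correlation) and SciPy (linear regression); manual and automated are placeholder sequences of grades, and the column names are illustrative.

import pandas as pd
import pingouin as pg
from scipy import stats

# long-format table: one row per (solution, grader) pair
long = pd.DataFrame({
    "solution": list(range(len(manual))) * 2,
    "grader":   ["manual"] * len(manual) + ["automated"] * len(automated),
    "grade":    list(manual) + list(automated),
})
print(pg.intraclass_corr(data=long, targets="solution", raters="grader", ratings="grade"))

# manual grade as a function of the automated grade
reg = stats.linregress(automated, manual)
print(f"MANUAL = {reg.intercept:.2f} + {reg.slope:.3f} * AUTOMATED, R^2 = {reg.rvalue ** 2:.3f}")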
By transforming the numerical grades into dichotomous pass/fail outcomes, the agreement measured with Cohen’s K was 0.71, a value close to satisfactory. However, ten solutions judged insufficient by the tool and another two judged sufficient were graded as the opposite by the professor. These exams correspond to the points in the top-left and bottom-right portions of the Cartesian plane in Fig. 3.
The final step was the calibration of the entire grading process, i.e., determining the best values for the weights w_m, w_d, and w_c of Eq. (1) such that, on average, the automated grade equalled the manual grade. A maximum-likelihood estimation (Rossi 2018) was performed, which returned the following values: w_m = 0.050630, w_d = 0.014417, and w_c = 0.026797. Using these weights, the prediction of the manual grade from the automated grade is depicted in Fig. 4.