1 Introduction

In recent years, there has been a growing number of students enrolled at universities worldwide.Footnote 1 Large courses have thousands of students participating, especially when using virtual classrooms. In introductory computer science and software engineering courses, class sizes of up to 1700 students are no longer an exception, with a fivefold growth in the last ten years. The free Stanford Massive Open Online Course (MOOC) “Intro to Artificial Intelligence”Footnote 2 started in 2011, quickly reaching 160,000 students [1]. After a period of decline, MOOCs are making a comeback,Footnote 3 also due to the outbreak of the COVID-19 pandemic. As a result of COVID-19, higher education institutions all over the world have moved to deliver courses online [2,3,4,5], facing issues similar to those of MOOCs.

Large lectures pose a problem for instructors when grading textual exercises, and the shift of learning contexts from physical to virtual classrooms has made the evaluation of students even more difficult. In addition, in [6] the authors highlighted the influence of favoritism and of the emotional mindset on the assessment procedure. Therefore, automatic short answer grading (ASAG) systems have been introduced both in scientific research [7] and in commercial solutions.Footnote 4 Moreover, automatically scoring short student answers is important for building intelligent tutoring systems. In general, computer-aided assessment systems are particularly useful because scoring by humans can become monotonous and tedious [8]. Automatic scoring systems can save teachers a great deal of time otherwise spent on the repetitive marking of students’ homework.

Fig. 1: Example of a question with its reference answer, and a student’s answer

One of the primary challenges in ASAG—especially with closed-ended questions—is dealing with variations in the surface representations of key concepts in student-answer/reference-answer pairs [9]. In some cases, the student answers are syntactic and lexical variations of the model answers. The answer pairs may contain synonyms, polysemous words and statements that are paraphrases of each other (see Fig. 1).

The problem of finding the semantic similarity of a pair of strings has been well studied in the NLP literature [10]. Consequently, a variety of such similarity measures have been used—when feasible, based on the type of questions/assignments—in different ASAG systems, e.g., [11, 12]. Other systems, e.g., [13], focused on questions with no reference answers (which often appear in reading comprehension assessments), providing a grade by integrating both domain-general and domain-specific information.

Limitations

As we will see in more detail in Sect. 2, the literature on ASAG mainly shows the following limitations: (i) there are only a few publicly available datasets on ASAG, and they are generally very limited in size; (ii) the available works generally did not perform experiments on all the available datasets, and in most cases used just a single one; (iii) there are no tools or software for end-users, hence no clues on how teachers and instructors could interact with such systems and on their user satisfaction; and (iv) the validation of experiments conducted in the previous literature is limited, because authors have generally not split the datasets by question (even within the same subject, two questions may be totally different and require two completely different answers, both lexically and semantically) or have employed only straightforward strategies such as a single split between training and testing sets. Lastly, (v) there is no consensus on the metrics to use, the experiments to perform and the way they should be conducted.

The aim of the work

In this paper, we focus on developing an ASAG framework, namely GradeAid, to support (i) instructors in evaluating students’ questionnaires (see Fig. 2 for a sketch of our proposal) and (ii) students’ learning by providing numerical feedback. Our attention is directed toward making it usable and available to the potential stakeholders (students, instructors). Specifically, we want to answer the following research questions:

  • What is the best approach for ASAG in GradeAid? In this respect, we aim to study which kind of features/text characteristics it should take into account and which machine learning method is most suitable for ASAG.

  • How does the solution perform in different scenarios (e.g., heterogeneous datasets)?

By answering these research questions, this paper aims to fill the gaps in the literature highlighted in the previous paragraph, thereby offering researchers in this field a new dataset for benchmarking, a methodological framework for experimenting with and validating ASAG solutions, as well as the code used, so as to ease comparison with our tool by other researchers. The code is intended to serve as a “scaffold”, a basis on which to build and validate future ASAG systems. The foreseen impact is to ease the development of ASAG systems by offering them greater comparability and reproducibility.

Fig. 2: Visual abstract of our proposal. An instructor has to score the short answers (SAs) of students in a class. The answers are input to GradeAid, which supports the instructor by providing a score

The proposed approach

Our idea is to exploit both lexical and semantic features of answers to provide a final grade. First, we gathered the different publicly available datasets for ASAG, namely ASAP, SAG, SciEntBank and CU-NLP, which are presented and discussed in depth in Sect. 3.1. For each dataset, we split the data into coherent subsets, each representing one question and the related students’ answers. The proposed approach integrates the traditional natural language processing methodology belonging to the bag-of-words family with state-of-the-art deep learning methods involving the semantics of texts. Therefore, each answer is represented as the fusion of its semantic and lexical characteristics. Mixing the two approaches has proven useful (as seen in, e.g., [14], fusing lexical and embedding features) for feature augmentation and provides a richer feature matrix to be processed with machine learning approaches. Further details and the motivation behind this implementation are available in Sect. 5. For the validation phase, given the size of the datasets, we employ the well-known leave-one-out cross-validation (LOO-CV). From the experiments carried out, we found that single regressors show better results than ensembles of them. Moreover, the best regressors were adaptive boosting, random forest and support vector regression across all the datasets tested, and we obtained root-mean-squared errors as low as 0.25 (with students’ answers scored from 0 to 5).

Our method’s performance is comparable with (and sometimes better than) that of state-of-the-art approaches. Furthermore, we collected a new dataset for ASAG in the Italian language and performed experiments with our method on this new dataset, which confirmed its effectiveness, achieving a root-mean-squared error as low as 0.42. To give concreteness to our proposal, we embedded our method into the GradeAid framework.

Methodological contributions

The primary contributions of our work can be summarized as follows:

  • We provide GradeAid, a framework for ASAG whose Python-like pseudo-code is available in Appendix B; its full code will be made available shortly in a GitHub repository;Footnote 5

  • GradeAid is novel because it jointly considers the lexical and semantic features of the short answers, where the lexical features are computed with the TF-IDF method and the semantic features are computed with a BERT cross-encoder and condensed into a similarity score. Words (and the exact choice of words by students) are very important in several higher-education (HE) subjects, and thus we had to include lexical analysis in GradeAid. Moreover, the similarity between students’ answers and reference answers is crucial for the overall assessment as well as for specific subjects. GradeAid, in this version, brings together both types of clues and provides a solution that is not computationally demanding.

  • Compared to the literature, we performed more robust experiments on ASAG with our method on all the publicly available datasets; the results show that our method’s performance is comparable with (and sometimes better than) that of state-of-the-art approaches;

  • We introduce a new dataset for ASAGFootnote 6 that has been used as a testbed for our solution. The dataset will be published in the near future for all interested researchers.

Contribution to ASAG research field

We remark that, based on the literature review further expanded in Sect. 2, this research field lacks a standardized and well-defined approach for comparing the performance of the proposed methodologies: researchers use different datasets and performance indicators to benchmark their methods and often do not share the code, pseudo-code or software needed to reproduce their tests. With this work, we intend to provide:

  • a standardized set of datasets (based on those most commonly used in the literature), available in a public repository and open to further contributions from other authors;

  • a standard evaluation approach for ASAG solutions;

  • a standard set of indicators to benchmark the performance of other ASAG solutions;

  • a public repository for the proposed method, open to be improved by other researchers.

Therefore, the milestone of this paper is to lay the foundation for the development of more open, transparent and robust ASAG solutions.

Organization

The rest of the paper is structured as follows. Section 2 presents related works in the literature and a schematic comparison between such works and the one presented here. Section 3 offers an overview of the methods and materials employed. Section 4 formulates and details the problem faced in this work. Section 5 presents the GradeAid framework with its components and the methods evaluated. Section 6 shows the obtained results and offers a thorough discussion of the insights gained. Finally, Sect. 7 concludes with final remarks and future developments.

2 Related work

In this section, we report and offer details about previous literature related to the work presented here.

The issue of ASAG has been studied for a long time; indeed, different reviews and surveys concerning ASAG are available in the literature. In the following, we briefly discuss the main goals and findings of such studies in chronological order (from oldest to newest).

In [7], the authors surveyed ASAG systems, analyzing 80 papers published between 1996 and 2014. They mainly focused on the advancement of methods and approaches. The authors found that statistical methods were the most used for tackling automatic grading, and natural language processing techniques were widely adopted for extracting lexical, morphological, semantic and syntactic features from data. Moreover, they observed that this body of work was still emerging; there were barriers to the advancement of the research due to the impossibility of publishing the datasets employed for privacy reasons. Later, the authors in [15] pushed the analysis forward, including 44 papers published between 2003 and 2016. They studied the datasets employed, the machine learning (ML) techniques adopted, the commonly used features and the quality of results. The authors found that the most used ML method was SVM (no deep learning methods were used) and that answers were mainly represented through bag-of-words and word2vec.

In [16], the authors reviewed works on ASAG published before 2019. Their aim was to understand how such systems work, mainly with regard to the features employed, and what challenges in education they face and address. One of the drawbacks found was related to feature extraction: the vast majority of proposals rely on feature engineering, which can be time-consuming, since features need to be carefully handcrafted and selected to fit the appropriate model.

More recently, the authors in [17,18,19], reviewing a narrower set of articles, detected an increasing use of deep learning techniques in recent years, with quality results that could favor real-world applications.

For the sake of clarity, here we zoom into the recent articles on ASAG to which we will compare. In particular, using the ScopusFootnote 7 and Web of ScienceFootnote 8 databases, we downloaded the latest articles (from 2018). The query used is available in Appendix C. After applying the method proposed by [20] and the quality filtering proposed by [21], we gathered a total of 32 articles about automatic short answer (or essay) grading (or scoring). All such articles employed a dataset of answers written in the English language.

In the following, we first describe each work, highlighting key differences with the present one, and then compare such papers on different criteria.

First of all, we can distinguish between works dealing with ASAG as a classification problem and those that faced it as a regression problem, as we do here. We exclude from the deeper analysis the work by De Clercq et al. [22]. In that paper, the authors proposed an analysis of handcrafted features for ASAG and their correlation with the final score assigned to a student’s answer. They did not develop a new method or experiment with ML models, and hence they had a different focus with respect to all the other articles surveyed here.

2.1 ASAG as classification problem

This section presents the works that have approached ASAG as a classification problem.

Tay et al. [23] proposed a new method based on the SKIPFLOW mechanism, which models relationships between snapshots of the hidden representations of an LSTM network as it reads. Subsequently, the semantic relationships between multiple snapshots are used as additional features for scoring students’ answers. They obtained an average quadratic weighted Kappa (QWK)Footnote 9 of 0.764. They compared the solution with different models, including a commercial one, namely EASE. It is unclear whether the solution is portable to real-world scenarios, as it has many more parameters to tune than the other works. In fact, the authors analyzed the parameters’ impact, observing that small variations in training time and in the LSTM’s parameters lead to performance degradation (below that of competitors) on scoring. The proposed solution only works for the English language.

Cai et al. [24] proposed an RNN architecture trained on a fusion of word embedding features (via GloVe) and handcrafted features (spelling mistakes, article length, etc.) for scoring students’ answers. They evaluated different RNN models and measured the correlation of the handcrafted features with the final scores. They obtained a QWK of up to 0.80. Surprisingly, the size of the deep learning models did not significantly affect the results. Deeper RNNs might overfit the small training data and yield inferior results compared to shallower networks; an RNN with 50 hidden neurons might be powerful enough to learn latent information and features for the ASAP dataset. Furthermore, the usage of both handcrafted and GloVe features improves the QWK. The solution has not been compared with others in the literature and only works for English.

Chen et al. [25] proposed a novel CNN with a final ordinal regression layer for scoring. The model is trained on one-hot encodings of the answers’ words (after segmentation and tokenization). They obtained a QWK of 0.826. With respect to using LSTM and CNN models alone, the accuracy of the proposed solution is greatly improved when combining the CNN with ordinal regression. However, it is unclear whether the method can work with languages other than English.

Chimingyan et al. [26] fine-tuned a word2vec model for the embedding of students’ answers and added handcrafted features (grammar errors, answer length, average word length, etc.) for automatic scoring through an LSTM network. They compared the LSTM with logistic regression, obtaining an accuracy of 0.32 and a QWK of 0.94. They found that the LSTM performs much better than logistic regression (QWK = 0.65) in scoring students’ answers. They also found that grammar errors profoundly affect the final score, as they are the most important among the handcrafted features. The method developed works only with the English language. Furthermore, as for other related works, the researchers did not compare their solution against other methods in the literature.

Wiranto et al. [27] introduced a method based on a transfer-learning Siamese dependency tree-LSTM network for scoring students’ answers. The model is trained with GloVe-embedded answers and relevant words thereof combined with their synonyms generated through autoencoders. The experimental phase foresaw different tests with and without synonyms, deleting words, adding noisy words, and with and without transfer learning. They obtained a QWK of 0.8574 and an accuracy of 70.80%. Furthermore, they found that the proposed method generally achieves the best evaluation QWK and accuracy with transfer learning and without data augmentation. However, the validation method is not precisely described.

Hussein et al. [28] developed a framework that scores students’ answers and essay traits (ideas, style, organization, conventions). They explored multiple deep learning models for the automatic essay scoring task. The authors obtained a QWK of 0.851. The results show that predicting the trait scores enhances the prediction of the overall score. However, the authors envisage the use of essay trait prediction for giving feedback to students, but such a scenario has not been evaluated. Moreover, other models compared are faster than the ones developed.

In [29], the authors collected, and performed experiments on, a small dataset of answers in the Slovenian language, which are graded into three labels. The experiments encompass two-label and three-label settings. Answers are represented with word2vec, and the final grade is given through a state-of-the-art neural network model. One of the limitations of this work is that the authors only performed experiments with neural networks (no other methods were compared), and they adopted a training/testing split only.

In [30], the authors developed a bi-LSTM inspired by [31] with a lexical attention mechanism to score students’ answers. The model is trained with word2vec-embedded answers. The attention mechanism relies on a feature capture layer that computes the weights for each word in the essay [31]. The method scored a QWK of 0.83. Their method can show the critical words and sentences that lead to the final assigned/predicted score; thus, it has more interpretability than other deep learning approaches found in the literature. The model has not been evaluated against others in the literature. It does not consider the sentiment of the essay (the attitude of the student toward the essay). The model has not been tested on other datasets.

For the ASAG task, some researchers started using the state-of-the-art method for natural language processing, BERT. In Lun et al. [32], the authors proposed a method involving the usage of BERT. In particular, they fine-tuned the BERT base model to classify the students’ answers. Besides, the authors adopted data augmentation strategies: back-translation, using the correct answer as the reference answer, and swapping content. The performance achieved by this model is measured using accuracy and F1 on the SciEntBank dataset, showing a peak \(accuracy = 0.8277\) for the binary classification. The limitation of this method is that the score provided is just binary, rather than a finer-grained score of fit to the reference answer. Moreover, fine-tuning the BERT model is a resource-intensive task. Similarly, in [33], the researchers used the SciEntBank dataset and BERT for automatic grading, fine-tuning the model, in this case, with a softmax layer to provide the final score. In addition, the authors used XLNET (Extra Long Network) to compare performances. Here, the BERT approach provided better results than XLNET, with a MeanF1 score close to 80% for binary tasks. The authors only split the dataset once randomly, with 20% of the data used for validation, therefore not performing more robust testing such as cross-validation. Also in [34], the authors proposed a method relying on the BERT model. In this case, the BERT base model was further trained using domain-specific knowledge from textbooks and further fine-tuned with question-answer pairs. The further training increased performance in test runs on proprietary, not publicly available datasets, achieving a MeanF1 = 83.95%. However, this approach is not directly reproducible since the dataset is not publicly available.

Lastly, an interesting approach was conceived in [35], where the authors represented the students’ answers and the reference ones as graphs. Each sentence is represented as a node, and it is connected with an edge to another type of node representing the uni-grams and bi-grams contained in that sentence. The method used to grade the answer is then based on a graph neural network that accounts for the similarities between the reference answer’s graph and the student’s one. Furthermore, the authors provide a deep analysis of the graph features most correlated with the final grade. However, they used only the ASAP dataset, and their experiments involve a straightforward approach of splitting the dataset into training and testing sets, then reserving 10% of the training set for validation purposes.

2.2 ASAG as regression problem

This group of works is closer to the one presented here because the problem of ASAG has been approached as a regression one. Therefore, to better highlight differences and similarities, we first describe each element of this group and then sketch and summarize the significant points of each of them (and ours) in Table 1. The table focuses on the following aspects:

  • Dataset, i.e., the set of datasets employed;

  • Features, i.e., how text answers have been represented;

  • Method, i.e., how scores have been computed (regressors, similarity methods, and so on);

  • Validation, i.e., whether authors have only performed training and testing (TT) or cross-validation (CV), and/or split the dataset by question (QS), and/or cross-validation with nested feature computation (CV-FC), as we did here, or, lastly, whether the validation is not clear from the paper (NA);

  • Code, i.e., the availability of the code.

Hassan et al. [36] compared different embedding techniques for students’ answers, i.e., word2vec, GloVe, fastText and skip-thoughts. The proposed model employs embeddings of students’ answers and reference answers to generate vector representations of the answers. The cosine similarity between the vectors of a student’s answer and the reference answer is used as the feature vector to train a ridge regression model for predicting students’ scores. The method achieved an RMSE of 0.821. Paragraph embedding techniques proved the most effective for short answer grading. The authors did not compare different regression methods for obtaining the final score, and the proposal has not been evaluated on diverse datasets.

Prabhudesai et al. [14] proposed a method to score students’ answers based on (i) word embeddings via GloVe combined with handcrafted features (e.g., number of words, length of answer, number of unique words, average word length, etc.) and (ii) a customized LSTM that takes as input the embeddings, the handcrafted features and the reference/golden answer. Moreover, they compared several LSTM architectures, such as plain, deep and bidirectional (also with and without handcrafted features), to find the most suitable for the scoring problem. Lastly, the authors compared their solution with those previously available in the literature. They obtained an MAE of 0.618 and an RMSE of 0.889. Their work shows how including handcrafted features improves results, especially with regard to MAE (by about 0.25). However, the approach was not evaluated on other datasets (like ASAP).

Sahu et al. [37] faced the scoring problem both as a regression and as a classification task. They experimented with a rich set of text similarity features, CBoW and TF-IDF to train machine learning regressors and classifiers whose output is then conveyed into a fusion model for the final scoring. The proposed method was also evaluated on the ASAP dataset. They achieved an RMSE of 0.793 and F1 scores from 0.50 to 0.93. The stacked ensemble method exhibits better performance than single methods. Moreover, the alignment-based and lexical overlapping features, in addition to the semantic similarity features employed in the proposed system, contribute substantially to the enhancement in labeling performance compared to previous systems in the literature. The paper does not compare further ensemble methods with the one proposed. Very short answers are not graded correctly, and in some cases the system fails to capture the sequence information of answers.

Gomaa et al. [38] proposed ans2vec, a method based on the skip-thought mechanism to embed student and reference answers and measure their similarity. The proposed method was also evaluated on the SciEntBank dataset. They achieved an RMSE of 0.91 and F1 scores from 0.54 to 0.58. The result is comparable with others in the literature without using any handcrafted feature. The method is not applicable to languages other than English due to its dependency on the ans2vec method.

Besesio et al. [39] proposed a series of experiments for ASAG on the ASAP datasets, but they rescaled the scores to three integer values, i.e., 1, 2 and 3. The authors performed experiments, using cross-validation, with word2vec, with handcrafted features and with a fine-tuned BERT model. They found that the combination of all three representations of the answers (students’ and reference) minimizes the regression error. Differently from most related works, here the hyper-parameters of the BERT model are transparent to the research community, but the final method used to provide the score is hard to infer from the paper.

In [40], the authors proposed a novel fusion model which takes (a) the string of the reference answer and the student’s answer to obtain a first output and (b) the string of the question and the student’s answer to obtain a second output. Finally, the model feeds the two outputs obtained above into a feature fusion layer filtering out multiple facets of semantic features, followed by an output layer generating the final score. The proposed method was also evaluated on the ASAP and SciEntBank datasets. It achieved an RMSE of 0.678 and F1 scores from 0.62 to 0.768. The proposed model outperformed previously developed scoring systems on the datasets employed for the evaluation. The authors found that the proposed system performs poorly on long-tailed datasets. Lastly, the authors compare their solution with a narrow set of previous works, overlooking many contributions we have identified in the literature.

Tulu et al. [41] proposed an automatic scoring system based on SemSpace [42], a novel embedding approach that learns word vectors by weighting semantic relations, used to feed a Manhattan LSTM network (which is specifically designed for similarity tasks [43]). The proposed method was also evaluated on the CU-NLP dataset. It obtained an RMSE of 0.040 on the SAG dataset. The proposed method also showed its reliability and consistency on the CU-NLP dataset. The authors show that their solution outperforms previous ones in the literature (by at least 0.80). However, SemSpace cannot handle misspelled words or grammar errors.

Overall, as anticipated in the introduction of the paper, we remark that the literature on ASAG mainly shows the following limitations:

  • there are only a few publicly available datasets on ASAG, and they are generally very limited in size. Besides those listed in Sect. 3.1, no other work has released the dataset employed;

  • the available works generally did not perform experiments on all the (few) available datasets, and in most cases used just a single one;

  • there are no tools or software for end-users, hence no clues on how teachers and instructors could interact with such systems and on their user satisfaction;

  • experimental settings are not uniform: the validation conducted in the previous literature is limited because authors have generally not split the datasets by question or have employed only straightforward strategies such as a single split between training and testing sets.

Table 1 Comparison on different criteria among works identified in literature and this work

3 Materials and methods

In this section, we provide basic background knowledge useful for better understanding the subsequent sections. In particular, we detail the datasets employed (Sect. 3.1), the techniques exploited to represent students’ answers (Sect. 3.2), the machine learning methods evaluated (Sect. 3.3) and the metrics used to evaluate the performance of GradeAid (Sect. 3.4).

3.1 Dataset(s)

In this work, the datasets reported in Table 2 have been exploited for building the GradeAid framework. The table lists all the datasets found by analyzing the related literature.

Table 2 Publicly available datasets identified in reviewed papers

Each dataset is built differently from the others. The ASAP dataset provides scores by two annotators in the range \({\mathbb {R}}[0,3]\). We scaled such scores to the range \({\mathbb {R}}[0,5]\), as for the other datasets in Table 2, and computed the average of the two annotators’ scores. We remark that the ASAP dataset is composed of short text answers that the official page refers to as essays. We have used them as short answers, and we had to include the reference answers. To obtain such reference answers, our strategy is to use the highest-scored answers among those provided by students.

Within the SciEntBank dataset, we only found correct/not correct labels. Therefore, to make this dataset uniform with the others in Table 2, we asked two experts in the field of science and chemistry to provide grades for the students’ short answers (as similarly done for the ASAP dataset). We measured their agreement through Cohen’s kappa [51], obtaining a value of \(k=0.642\), thus a substantial agreement. The scores were subsequently obtained by averaging the experts’ grades.

Lastly, we obtained the binary assessment of students’ answers, for all datasets missing it, as follows: each answer whose score was greater than or equal to 2.5 was assigned to the “correct” class, otherwise to the “not correct” class (Figs. 3, 4, 5, 6).

Fig. 3: ASAP dataset distribution of scores (and their binary labels counterpart)

Fig. 4: SAG dataset distribution of scores (and their binary labels counterpart)

Fig. 5: CU-NLP dataset distribution of scores (and their binary labels counterpart)

Fig. 6: SciEntBank dataset distribution of labels (and their scores counterpart)
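As an illustration, the dataset-preparation steps just described (score rescaling, annotator averaging, agreement measurement and binarization) can be sketched as follows; the column names and the use of pandas/scikit-learn are assumptions made for the example, not taken from the released code.

```python
# A minimal sketch of the dataset-harmonization steps described above.
# Column names ("score_1", "score_2") are hypothetical.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def harmonize_asap(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale the two annotators' scores from [0, 3] to [0, 5] and average them."""
    df = df.copy()
    for col in ("score_1", "score_2"):
        df[col] = df[col] * 5.0 / 3.0
    df["score"] = (df["score_1"] + df["score_2"]) / 2.0
    return df

def annotator_agreement(grades_1, grades_2) -> float:
    """Cohen's kappa between the two experts' (discrete) grades."""
    return cohen_kappa_score(grades_1, grades_2)

def binarize(df: pd.DataFrame, threshold: float = 2.5) -> pd.DataFrame:
    """Derive the binary label for datasets lacking correct/not-correct annotations."""
    df = df.copy()
    df["label"] = (df["score"] >= threshold).map({True: "correct", False: "not correct"})
    return df
```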

3.2 Short answers representation

There are different ways to represent short answers: a lexical representation and a semantic representation. Concerning the lexical representation, one of the most used techniques is Term Frequency-Inverse Document Frequency (TF-IDF). This is a well-known approach to natural language processing (NLP) tasks and belongs to the family of techniques representing textual data as a bag of words (BoW). The name derives from the concept of ignoring the order of words in documents and focusing only on the presence of words in a specific document. Many NLP tasks have proven manageable with this simplification (e.g., email phishing detection [52], management research [53], fake news detection [54] and ASAG as well [55]). Since word order is not considered, these methodologies usually derive no benefit from grammar and from generic words such as “the”, “with”, “be” and “go”. The TF-IDF methodology creates a document-term matrix, where the documents are represented on the rows and the columns correspond to the vocabulary of all the words present in the corpus. Each cell in the matrix contains the frequency of a term in the document, multiplied by the inverse of the frequency of that word across all the documents in the corpus. The first factor is useful for managing documents of different lengths, and the second reduces the weight of words that are too common in the corpus, as is typical of stopwords.

Formally, let k be a term/keyword, d a document and D the set of all documents. We define \(TF(k,d) = \frac{{\texttt {count}}(k,d)}{{\texttt {wordcount}}(d)}\), i.e., the number of times k appears in d (count(k, d)) divided by the total number of words/terms in d (wordcount(d)). We define \(IDF(k,D)=\log _e\frac{|D|}{{\texttt {countDocs}}(D,k)}\), where countDocs(D, k) returns the number of documents containing k. Then, the TF-IDF is computed as follows: \(\texttt {{TF-IDF}}(k,d,D) = TF(k,d) \times IDF(k,D)\). Thus, d is described as an array whose length is equal to the number of keywords/terms in D, i.e., \(A_d=[{\texttt {TF-IDF}}(k_1,d,D),\ldots , {\texttt {TF-IDF}}(k_N,d,D)]\), where N is the total number of terms in D.
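For concreteness, the formulation above can be implemented directly as in the following sketch, using a toy corpus of tokenized answers (production code would typically rely on a library such as scikit-learn, whose default TF-IDF formulation differs slightly from the one given here).

```python
# A minimal, self-contained sketch of the TF-IDF formulation above
# (natural-log IDF, no smoothing). The toy corpus is invented.
import math
from collections import Counter

def tf(term: str, doc: list[str]) -> float:
    counts = Counter(doc)
    return counts[term] / len(doc)

def idf(term: str, corpus: list[list[str]]) -> float:
    n_docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_docs_with_term)  # term occurs in >= 1 document

def tfidf_vector(doc: list[str], corpus: list[list[str]], vocabulary: list[str]) -> list[float]:
    """Represent a (tokenized) answer as the array A_d over the corpus vocabulary."""
    return [tf(k, doc) * idf(k, corpus) for k in vocabulary]

corpus = [["gravity", "pulls", "objects", "down"],
          ["objects", "fall", "because", "of", "gravity"],
          ["the", "sky", "is", "blue"]]
vocab = sorted({w for doc in corpus for w in doc})
print(tfidf_vector(corpus[0], corpus, vocab))
```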

Concerning the semantic representation, several techniques have been introduced in the NLP field to capture the meaning of words or sentences, for purposes ranging from question answering [56] to opinion mining [57], cybersecurity studies [58] and law [59]. In recent years, word embedding has established itself as one of the most popular methods for representing a document's vocabulary [60]. Among its capabilities are capturing the context of a word in a document, semantic and syntactic similarity, and relations with other words. Word2vec [61, 62] is the most popular technique in this field. It uses the conditional probability \(P(w \mid c)\) to predict the target word w based on its context c. It has been used for a variety of tasks, e.g., finance-related text mining [63] and question answering [64], as well as in ASAG systems [26, 44, 45, 50].

The success of neural network-based methods for computing word embeddings has motivated the proposal of several methods for generating semantic embeddings of longer pieces of text, such as sentences, phrases or short paragraphs [65]. These methods embed a full sentence into an n-dimensional vector space. The resulting sentence embeddings retain some properties, as they inherit features from their underlying word embeddings (e.g., semantic similarity).

The current state of the art for NLP tasks relies on the usage of BERT (Bidirectional Encoder Representations from Transformers) [66] and its related extensions, such as RoBERTa [67]. BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers [66]. BERT is a two-step framework. The first step is pre-training; during this step, the model is trained on unlabeled data over different pre-training tasks. The second step is fine-tuning; the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream task [66]. The pre-training phase works by jointly training the transformer network on two tasks: (i) predicting the masked words in a sentence (masked language model) and (ii) predicting the next sentence (next sentence prediction). Further details of the BERT implementation are discussed in [66]. Once the pre-trained BERT model is obtained, it can be fine-tuned with additional output layers to create models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications [66]. For this research, we employed a BERT cross-encoder: in this approach, two sentences are provided jointly to the model, which produces an output value between 0 and 1, indicating the semantic similarity between the sentences.
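As an illustration, a cross-encoder similarity score of this kind can be obtained with the sentence-transformers library as sketched below; the example sentences are invented, and the pre-trained model name is the one mentioned in Sect. 5.

```python
# A minimal sketch of cross-encoder scoring with the sentence-transformers library.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-large")

reference_answer = "Gravity pulls objects toward the centre of the Earth."
student_answers = [
    "Objects fall because gravity attracts them to the Earth.",
    "The sky is blue because of Rayleigh scattering.",
]

# Each (reference, student) pair is passed jointly to the model, which
# directly outputs a semantic-similarity score for that pair.
pairs = [(reference_answer, a) for a in student_answers]
similarities = model.predict(pairs)
print(similarities)  # e.g., a high score for the first pair, a low one for the second
```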

3.3 Machine learning methods for ASAG

One of the main techniques used in ML consists in training a model through a supervised learning algorithm, which analyzes training data (observed samples) and produces an inferred function that can be used for mapping new samples. ASAG systems in the literature—see Sect. 2—face the problem of giving a score to students’ answers in two different ways, i.e., as a regression task (approximating a mapping function from input variables to continuous output variables) or as a classification task (approximating a mapping function from input variables to discrete output variables). In our approach, GradeAid—through supervised learning—exploits regressors (the mapping function) to map the model/reference answers and the students’ ones (input) to a suitable score for each specific student answer.

Several regression methods have been proposed in the literature. To build our system, we considered the following: support vector regressor (SVR), random forest regressor (RF), ElasticNet (EN), ridge regression (RR) and adaptive boosting regressor (Ada). The choice of such methods is justified by their widespread usage in several studies and applications. All the methods are available in the scikit-learn Python library.Footnote 10
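For illustration, the regressors compared in this work can be instantiated as sketched below; the hyper-parameters shown (scikit-learn defaults) are an assumption, not necessarily those used in our experiments.

```python
# A minimal sketch of the regressors compared in this work (scikit-learn defaults).
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import ElasticNet, Ridge

regressors = {
    "SVR": SVR(),
    "RF": RandomForestRegressor(random_state=0),
    "EN": ElasticNet(random_state=0),
    "RR": Ridge(random_state=0),
    "Ada": AdaBoostRegressor(random_state=0),
}

# X: feature matrix (lexical + semantic features), y: known scores.
# for name, reg in regressors.items():
#     reg.fit(X_train, y_train)
#     y_pred = reg.predict(X_val)
```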

3.4 Metrics for ASAG performance assessment

In this work, we adopt the standard metrics used by previous ASAG systems, and, as we will see (Sect. 4), we model the ASAG problem as a regression task. In the following, we report the metrics employed to measure the performance of GradeAid.

Regression. Let \(y_i\) be the target value (known score), \(\hat{y_i}\) the model’s predicted value (predicted score), and N the number of samples. To evaluate GradeAid on the regression task, we use the following metrics:

  • Mean Squared Error (MSE): \(MSE=\frac{1}{N} \sum \nolimits _{i=1}^N (y_i - \hat{y_i})^2\) It measures the square of differences between predictions and target values and computes the mean of them.

  • Root-Mean-Squared Error (RMSE): \(RMSE = \sqrt{MSE}\) The use of RMSE is very common, and it is considered an excellent general-purpose error metric for numerical predictions.

  • Mean Absolute Error (MAE): \(MAE=\frac{1}{N} \sum \nolimits _{i=1}^N \vert y_i - \hat{y_i}\vert \) It is known as a scale-dependent accuracy and therefore cannot be used to make comparisons between series using different scales.

  • Normalized Root-Mean-Squared Error (NRMSE): \(NRMSE = \frac{RMSE}{{\overline{y}}}\) While RMSE provides an absolute error metric, it is strictly related to the absolute scale of the distribution. Therefore, it lacks the capability to directly show the performance of the proposed method on different datasets. The normalized version of this indicator scales the error by the true mean of the distribution. This index allows us to compare the performance of the method on different datasets, and thus the error made when varying the dataset. MAE and MSE are dependent on the dataset on which the regression is performed.
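For clarity, the metrics above can be computed as in the following sketch (the example values are illustrative only).

```python
# A minimal sketch of the evaluation metrics defined above, on arrays of
# true and predicted scores.
import numpy as np

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(y_true - y_pred))
    nrmse = rmse / np.mean(y_true)  # normalized by the true mean of the distribution
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "NRMSE": nrmse}

y_true = np.array([5.0, 3.0, 4.5, 2.0])
y_pred = np.array([4.5, 3.5, 4.0, 2.5])
print(regression_metrics(y_true, y_pred))
```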

4 Problem description

The ASAG problem has been modeled as a supervised machine learning problem. Let \(Q=\{q_1,q_2,\ldots ,q_m\}\) be the set of questions posed to students. Let \(S=\{s_1,s_2,\ldots ,s_n\}\) be the set of students answering the different questions in Q. \(A = \{a_1,a_2,\ldots , a_k\}\) represents the set of k responses such that a unique one-to-one mapping exists between Q, A and S. Lastly, let \(R=\{r_1,r_2,\ldots ,r_m\}\) be the set of reference/model answers to the questions such that \(r_i\) is the reference answer to \(q_i\). Furthermore, we define a set of grades \(G=\{g_1,g_2,\ldots ,g_h\}\) that are assigned to students’ answers. Thus, the ASAG problem is formulated with the following input and output specifications:

  • Input: A pair of short answers \((r_i, a_j)\) representing reference answer \(r_i\) to a question \(q_i\) and student answer \(a_j\) to \(q_i\);

  • Output: Assignment of score s to student answer (regression task) based on the extent of similarity between the student answer and the respective reference answer.

The problem of ASAG (regression task) can be mathematically formulated as follows:

  • the ASAG regression model is trained with a dataset of \(\langle a_j, r_i, g_j\rangle \) tuples. We want to learn a regression model \(reg:(a_j,r_i)\rightarrow s\in {\mathbb {R}}\), i.e., \(s=w\cdot X\), where w refers to the regression coefficients and X refers to the input feature vector obtained by processing \((a_j,r_i)\).

  • The scoring performance of the regression model is measured using the following evaluation metrics: RMSE, MSE, MAE and NRMSE (see Sect. 3.4).

5 GradeAid: our framework

The framework proposed in this research aims to provide numerical feedback to students for learning purposes. A graphical representation of the framework is available in Fig. 7. The main challenge for a real application of this methodology is usually related to the datasets’ size: there are no large training datasets available to properly train the models. Moreover, the framework, namely GradeAid, is meant to be applicable to different languages, minimizing the need for specific tuning.

Fig. 7: GradeAid: the framework. It takes the students’ answers and the reference answer (translating the texts into English if needed) and splits the provided dataset based on the questions. During cross-validation, it computes, at the same time, the similarity between students’ answers and the reference answer, and the TF-IDF matrix. Finally, the lexical features (TF-IDF) and the semantic features (similarity) are concatenated for the prediction of the score

GradeAid works by creating a model specific to each open-ended question subject to evaluation. To create the model, the input is the training dataset of answers labeled with a score between 0 and 5, together with the reference answer (which can be either the answer provided by the teacher or, if not available, the answer with the highest score).

For answers provided in a language different from English, the first block of the framework implies the usage of the EasyNMT package, a state-of-the-art machine translator supporting more than 100 languages.Footnote 11 It is used to translate the corpus into English.
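As an illustration, the translation step can be sketched as follows; the choice of the “opus-mt” model and the example sentence are assumptions made for the example.

```python
# A minimal sketch of the translation step with the EasyNMT package.
from easynmt import EasyNMT

translator = EasyNMT("opus-mt")  # model choice is an assumption for this example
answers_it = ["La media aritmetica è la somma dei valori divisa per il loro numero."]
answers_en = translator.translate(answers_it, target_lang="en")
print(answers_en)
```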

The next block involves the construction of the feature matrix from the corpus. This step consists of two parallel operations: the well-established feature extraction method, TF-IDF, is applied to the dataset to build the matrix of words occurring in the documents; in parallel, a BERT cross-encoder with the pre-trained “cross-encoder/stsb-roberta-large” model is run to build the similarity scores between each student’s answer and the reference answer. The BERT cross-encoder model, as opposed to BoW methodologies, is trained to output the semantic similarity score between one document (text) and another. This implementation differs from bi-encoder implementations, where the sentences are passed separately to BERT to compute the embeddings (BERT being used as a feature extractor), with a final function that computes the similarity. In our case, the sentences are passed to the model simultaneously, and it directly provides the similarity as output. This is clearer in Fig. 8. The similarity is expressed as a value between \(-1\) (opposite meaning) and 1 (same meaning). The details of the implementation of this BERT configuration are further explored in [67].

Fig. 8: SBERT model used in this paper (left) vs BERT model as extractor through sentence embedding (right). Adapted from https://www.sbert.net/examples/applications/cross-encoder/README.html

The usage of these two methodologies for the construction of the feature matrix relies on two relevant considerations about the evaluation of questionnaires: (i) the usage of specific words, coherent with the question, is usually a good indicator of the answer’s quality, and (ii) it is also relevant for the final evaluation and scoring that the students’ sentences/answers are semantically correct and coherent with the reference answer.

The concatenation of the TF-IDF matrix with the similarity scores forms the final feature matrix to be processed by the regression methods. For the TF-IDF processing, the corpus is prepared with typical NLP preprocessing steps: lowercasing, stopword removal (with the stopword list taken from the R Text Mining Solution projectFootnote 12) and removal of non-alphabetic words. Instead, the processing with the BERT cross-encoder is performed directly on the raw corpus. The feature matrix generated can be processed by any machine learning method. In the experiments, we compared several regression methods, which can be found in Sect. 3.3. The final output is the score of the answer.
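The feature-construction step just described can be sketched as follows. The preprocessing shown (scikit-learn’s English stopword list and a simple alphabetic token pattern) only approximates the preprocessing described above, and the TF-IDF fitting is shown on a single set of answers; the fold-aware variant used in the experiments is sketched in Sect. 5.1.

```python
# A minimal sketch of GradeAid's feature construction: TF-IDF on the
# preprocessed answers plus a cross-encoder similarity column, concatenated
# into a single feature matrix.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import CrossEncoder

def build_features(student_answers, reference_answer):
    # Lexical features: TF-IDF with lowercasing, an English stopword list and
    # alphabetic tokens only (an approximation of the preprocessing in the text).
    vectorizer = TfidfVectorizer(lowercase=True,
                                 stop_words="english",
                                 token_pattern=r"(?u)\b[a-zA-Z]+\b")
    tfidf = vectorizer.fit_transform(student_answers)

    # Semantic features: cross-encoder similarity to the reference answer,
    # computed on the raw (unpreprocessed) answers.
    encoder = CrossEncoder("cross-encoder/stsb-roberta-large")
    sims = encoder.predict([(reference_answer, a) for a in student_answers])
    sims = csr_matrix(np.asarray(sims).reshape(-1, 1))

    # Final feature matrix: lexical and semantic features side by side.
    return hstack([tfidf, sims], format="csr"), vectorizer
```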

5.1 Experiment

Our experimental setting revolves around four significant points: (1) the split by question, (2) the TF-IDF strategy used, (3) the cross-validation approach and (4) the combination thereof:

  1.

    Implementing an experimental strategy for this task presents some difficulties: often, datasets have a small size, while in other cases they are large. Thus, finding a strategy to make results effectively comparable is not straightforward. Moreover, the small size of datasets limits rigorous benchmarking. Considering the aim of ASAG and specifically of GradeAid, i.e., providing feedback to students for learning, we decided to implement a strategy simulating the most realistic use case. We treat each question (and therefore the students’ and model answers) in the datasets as a separate dataset, even if some questions share the same domain/context. While some approaches could benefit from the fusion of different questions, our implementation would not benefit from it and would, on the contrary, be misled by the mixture.

  2.

    The TF-IDF algorithm builds the matrix and defines its number of columns based on the vocabulary extracted from the training dataset. Usually, each question has its own typical dictionary of answers. Therefore, mixing answers would lead to the creation of a largely sparse TF-IDF matrix (Fig. 9a). Instead, applying TF-IDF to a single question creates less sparse feature matrices (Fig. 10).

  3.

    Concerning the validation, cross-validation is the primary approach we used to measure the performance of our methods. For large datasets, such as ASAP, we performed tenfold cross-validation, while for the smaller datasets, SAG, SciEntBank and CU-NLP, we used leave-one-out cross-validation (LOO-CV). In the former case, for each question, the answers are split into ten blocks (folds), and in ten iterations nine blocks are used for training the model and one is used to validate the results. In the latter case, the reduced size of the dataset suggested running as many iterations as the number of samples in the dataset, leaving one sample out for validation of the model in each iteration.

  4.

    Lastly, another significant point of our experimental methodology is the following. In many NLP works, TF-IDF is computed before cross-validation. We decided to apply TF-IDF after the division between training and validation sets. This helps to simulate the actual use case, where the model is used to predict (totally) new answers of new students, with new words that have not been accounted for by the training TF-IDF. In this regard, Fig. 10 helps to describe the approach followed: considering a dataset of \(N+M\) answers, this is split into two parts, i.e., the training set with N documents and the validation set with M documents. The TF-IDF algorithm is applied to the training dataset, generating the feature matrix based on the dictionary of K words used in training. The dictionary and the IDF information are retained and used as a basis to compute the feature matrix for the validation corpus, which will have the same number of columns as the training one (K), and whose term frequencies are weighted with the IDF weights of the training dataset.

We ran the whole experiment 50 times for each dataset and for each parameter combination so as to achieve more robust results.
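A minimal sketch of this per-question validation strategy, with the TF-IDF vectorizer fitted inside each fold, is given below; SVR is used as an example regressor, and the size-based choice between tenfold CV and LOO-CV is a simplification of the dataset-based choice described above.

```python
# A minimal sketch of the validation described above: for each question, the
# TF-IDF vectorizer is fitted on the training fold only, and the held-out
# answers are transformed with the training dictionary and IDF weights.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.svm import SVR

def cross_validate_question(answers, similarities, scores, n_folds=10):
    answers = np.asarray(answers)
    similarities = np.asarray(similarities, dtype=float).reshape(-1, 1)
    scores = np.asarray(scores, dtype=float)
    # Tenfold CV for larger question sets, LOO-CV otherwise (a simplification).
    splitter = KFold(n_splits=n_folds) if len(scores) >= n_folds else LeaveOneOut()

    errors = []
    for train_idx, val_idx in splitter.split(answers):
        vec = TfidfVectorizer()
        X_train = hstack([vec.fit_transform(answers[train_idx]),
                          csr_matrix(similarities[train_idx])], format="csr")
        # Words unseen during training are simply ignored in the validation fold.
        X_val = hstack([vec.transform(answers[val_idx]),
                        csr_matrix(similarities[val_idx])], format="csr")
        reg = SVR().fit(X_train, scores[train_idx])
        errors.append(np.sqrt(np.mean((reg.predict(X_val) - scores[val_idx]) ** 2)))
    return float(np.mean(errors))
```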

Fig. 9: How the TF-IDF algorithm can be applied in ASAG problems

Fig. 10: TF-IDF algorithm (this paper): split by question, the dictionary resulting from training is used for applying the algorithm on the validation set

5.2 Results

Tables 3, 5, 7, 9 and 13 show the results of the proposed framework over all the datasets described in Sect. 3.1. The tables report the MAE, RMSE and NRMSE achieved by all the regressors in Sect. 3.3 when scoring every short answer (split by question) in each dataset. For the MSE, please refer to Appendix D.

In the tables, lower values indicate a better fit of the method. MAE, MSE and RMSE, however, are not absolute indices of goodness of fit, because their maximum value depends on the specific dataset, while NRMSE is scale independent and can be compared across datasets (see Sect. 3.4 for more details).

To increase the interpretability of the results, in the tables we also provide, for each question, the respective values of MAE, RMSE and NRMSE for the middle regressor (Middle reg.), the average regressor (Avg. reg.) and the worst regressor (Worst reg.). These can be considered “dummy regressors” and are useful as baselines against which to compare the performance of our methods. Worst reg. always predicts the worst value, i.e., its output for each answer is the admissible score furthest from the true value; therefore, its performance represents the worst that can be achieved by a regressor. Avg. reg. always predicts the average (mean) value of the answers to that question; in this case, the MAE, MSE and RMSE of this regressor represent, respectively, the average absolute deviation, the variance and the standard deviation. Lastly, Middle reg. always predicts the score in the middle between the minimum and maximum possible: in our case, scores go from 0 to 5, and therefore the middle score is 2.5.
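For clarity, the three baselines can be sketched as follows; whether Avg. reg. uses the mean of the training fold or of the whole question is an implementation detail assumed here, and the 0–5 score range is the one used throughout the paper.

```python
# A minimal sketch of the three baseline ("dummy") regressors.
import numpy as np

def middle_reg(n_answers, low=0.0, high=5.0):
    """Always predicts the midpoint of the admissible score range (2.5 here)."""
    return np.full(n_answers, (low + high) / 2.0)

def avg_reg(y_question, n_answers):
    """Always predicts the mean score of the answers to that question."""
    return np.full(n_answers, np.mean(y_question))

def worst_reg(y_true, low=0.0, high=5.0):
    """For each answer, predicts the admissible score farthest from the true one."""
    y_true = np.asarray(y_true, dtype=float)
    return np.where(y_true <= (low + high) / 2.0, high, low)
```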

The most evident insight is that the proposed method, with every regressor, always outperforms the dummy regressors. Considering the single-regressor methods, the best performing are SVR, RF and Ada. Poorer results are achieved by the ensemble of the selected methods. The goodness of the results varies according to the dataset and the questions. The best results are achieved on the ASAP dataset.

Tables 4, 6, 10, 14, and 8 show the comparison of the performance of the methods proposed in this research with Avg. reg., since this dummy regressor was the best among them. This comparison allows us to directly compare the performance of the regressors. The values reported are given by the ratio between the regressor’s NRMSE and the average regressor’s NRMSE (equivalent to the ratio of their RMSEs, since both are normalized by the same mean). The formula is

$$\begin{aligned} Performance\,Ratio = 1 - \frac{NRMSE_{regressor}}{NRMSE_{Avg.Reg.}}. \end{aligned}$$

When the PerformanceRatio is equal to 1.00, it means a full improvement of the regressor (i.e., the regressor has an \(RMSE = 0\)). When it is equal to 0.00, it indicates an “improvement” comparable to the average regressor (\(NRMSE_{regressor} = NRMSE_{Avg.Reg.}\)). Lastly, values lower than 0.00 indicate a worsening of the regressor’s performance.
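The performance ratio can be computed trivially as follows; the numbers in the example are illustrative only and are not taken from the tables.

```python
# A minimal sketch of the performance-ratio computation defined above.
def performance_ratio(nrmse_regressor: float, nrmse_avg_reg: float) -> float:
    """1.0 = perfect regressor (NRMSE of 0); 0.0 = no improvement over Avg. reg.;
    negative values = worse than Avg. reg."""
    return 1.0 - nrmse_regressor / nrmse_avg_reg

print(performance_ratio(0.20, 0.50))  # illustrative values -> 0.6
```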

The best PerformanceRatio (the peak) is 0.59 for ASAP, and the two best-performing regressors are SVR and RF. For CU-NLP, RF performed very well and Ada outperformed the other regressors with a peak of 0.38. On SAG, RF outperformed the other regressors, with a peak of 0.47. It is very interesting to observe that the SVR regressor, in this case, performed worse than the baseline. Concerning the SciEntBank dataset, the best performance was achieved by RF as well, with a peak of 0.25.

In summary, the proposed methods are always able to perform ASAG with good results. However, the results are not stable, and although two regressors give generally promising results (SVR and RF), the performance tends to oscillate between questions and datasets. We expected the ensemble to provide better results compared to single learners, but the tables show how some single learners have more potential.

Table 3 Results obtained on ASAP dataset with tenfold CV
Table 4 Comparison of methods’ performance with respect to Avg. Reg. (it always predicts the average value among the outputs) on ASAP dataset
Table 5 Results obtained on Cu-NLP dataset with LOO-CV
Table 6 Comparison of methods’ performance with respect to Avg. Reg. (it always predicts the average value among the outputs) on Cu-NLP dataset
Table 7 Results obtained on SAG dataset with LOO-CV. We report here only the worst, best and the average results obtained on the questions due to the size of the dataset (83 questions)
Table 8 Comparison of methods’ performance with respect to Avg. Reg. (it always predicts the average value among the outputs) on SAG dataset
Table 9 Results obtained on SciEntBank dataset with LOO-CV
Table 10 Comparison of methods’ performance with respect to Avg. Reg. (it always predicts the average value among the outputs) on SciEntBank dataset

6 Discussion

In the following, we provide additional experiments to validate the proposed approach (Sect. 6.1). The discussion is split into two “Additional questions” paragraphs with related experiments. Next, further specifications and motivations regarding implementation choices are provided (Sects. 6.2–6.4).

6.1 Additional experiments

Additional question 1

Are there significant differences between the joint analysis of lexical and semantic features and the analysis of lexical features alone?

Table 11 shows that the answer to this question is not straightforward. For each dataset, we used the Shapiro–Wilk test [68] to assess the normality of the data distribution and then, based on its results, checked the statistical difference in performance between the two approaches with the appropriate test. If we accept the null hypothesis \(H_0\) = “data are normally distributed”, Student’s t test is employed to check the difference [69]; otherwise, the Mann–Whitney U test [70] is exploited (for non-normally distributed data).
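A minimal sketch of this testing procedure with SciPy follows; an unpaired formulation is shown, which is an assumption, since the exact variant is not detailed here.

```python
# A minimal sketch of the statistical procedure described above.
from scipy import stats

def compare_mse(mse_with_semantic, mse_lexical_only, alpha=0.05):
    """Shapiro-Wilk normality check, then Student's t test or Mann-Whitney U."""
    normal = (stats.shapiro(mse_with_semantic).pvalue > alpha
              and stats.shapiro(mse_lexical_only).pvalue > alpha)
    if normal:
        result = stats.ttest_ind(mse_with_semantic, mse_lexical_only)
    else:
        result = stats.mannwhitneyu(mse_with_semantic, mse_lexical_only)
    return normal, result.pvalue
```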

Table 11 Statistical significance of the difference in MSE results obtained with both lexical and semantic features against those obtained with lexical features only

These tests confirm the significance of the better results, in terms of MSE, obtained using semantic features on the ASAP and CU-NLP datasets, while the difference is not significant for SAG and SciEntBank (also illustrated in Figs. 11, 12). However, Fig. 11 shows that on the SAG dataset, although the statistical difference is not significant, the performance obtained by the joint exploitation of semantic and lexical features is always better (lowering the MSE in almost every condition) than that achieved with the lexical features alone. In summary, we can consider the semantic features effective in improving the scoring precision of the methods, even if the benefits are not always large for all datasets.

Fig. 11: Difference in MSE obtained with and without semantic features with SVR on the SAG dataset. The x-axis represents the questions in SAG, and the y-axis the MSE obtained by the SVR using different kinds of features. The blue line represents the MSE achieved with lexical and semantic features, while the red bars represent the MSE achieved with the lexical features only. Lower MSE values are better (color figure online)

Fig. 12: Difference in MSE obtained with and without semantic features with the best regressors on the SciEntBank dataset. Lower MSE values are better

Additional question 2

How does GradeAid perform on non-English datasets?

To fully answer this research question, we used an Italian dataset, namely STITA. The STITA dataset is made publicly available on GitHub (see Table 12); however, only the English-translated versionFootnote 13 is publicly available, for privacy reasons. For some preliminary information about this dataset, we refer the reader to Table 12 (where the subject and the number of samples are reported) and Fig. 13 (where we show the distribution of scores).

On STITA, we performed the same experiments previously carried out for the other datasets (Sect. 5.1). As the results in Table 13 show, the performance of GradeAid is effective also on non-English datasets. The best regressor, in this case, is Ada, with an NRMSE ranging from 0.54 down to 0.17. This regressor shows an improvement with respect to Avg. reg. of 0.17 for q5, 0.21 for q1, 0.41 for q2, 0.43 for q4 and 0.70 for q3 (the best improvement). Another very promising regressor is RF, which outperforms Ada on q5, showing an improvement with respect to Avg. reg. of 0.45.

In addition, also for the STITA dataset, we performed an experiment with and without semantic features, and we employed the same statistical tests as for the other datasets. The result, in this case, is that there is a significant difference in the results obtained using the semantic features (\(p = 0.01511 < 0.05\)).

These results should be taken with care, as the test has been performed only on a dataset in a specific language, i.e., Italian, translated into English with an open machine translation tool, and on a specific subject, i.e., statistics. More robust proof would involve using datasets in different languages (e.g., Arabic, German, Brazilian Portuguese, etc.), which, to the best of our knowledge, are, to date, not available to researchers, as the performance of the translator could vary. Moreover, different subjects could easily impact the results. Nevertheless, the strategy of translating the answers and processing them proved successful in our experimental setting.

Table 12 STITA dataset information
Fig. 13
figure 13

Distribution of scores (and their grade counterparts) in the STITA dataset

Table 13 Results obtained on STITA dataset with LOO-CV
Table 14 Comparison of methods’ performance with respect to Avg. Reg. (it always predicts the average value among the outputs) on STITA dataset

6.2 Motivation for translation in STITA experiment

A transformation/translation step inevitably adds some bias to the dataset. However, the machine translation system used in our experiments, i.e., EasyNMT, is the current state of the art among non-commercial tools. We did not use a human translation because we wanted to test our methodology under realistic conditions on datasets that are not in English: anyone adopting an ASAG system is clearly not willing to translate answers manually when they are in another language. It is therefore realistic that future stakeholders will use machine translation (commercial or non-commercial), paying the price of introducing some error into the evaluation. In addition, a translation step (manual or automatic) is often required, since many languages have no sophisticated embedding models available.
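For reference, the translation step can be reproduced with a few lines of EasyNMT; the specific model choice (‘opus-mt’) and the example sentence are assumptions for illustration, not necessarily what was used in our pipeline.

from easynmt import EasyNMT

# Load an open-source translation model (model choice is an assumption)
model = EasyNMT('opus-mt')
italian_answers = ["La media campionaria è uno stimatore corretto della media della popolazione."]
english_answers = model.translate(italian_answers, source_lang='it', target_lang='en')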

6.3 Motivation for integrating lexical and semantic features

In the preliminary experiments we carried out using only our semantic features, the performance of GradeAid was not promising on all datasets. Therefore, we added lexical features, which are also widely employed in the ASAG literature. Moreover, we used lexical features to capture the kind of vocabulary used by the student. As reported in the paper, and as also emerged from informal focus groups held with teachers and instructors from different education levels prior to development, the exact choice of words by students is very important in several higher-education subjects, especially technical ones. We are aware that the hidden layers of BERT and other semantic embedding models can also be used to extract lexical representations. However, as highlighted by a recent EU policy document (COM(2021) 206), applications of artificial intelligence in education are considered high risk, and their development should pursue more inspectable and interpretable solutions ensuring human oversight. Therefore, since our objective is to support instructors and students, we cannot deliver a method whose outputs are hard to explain.
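The following sketch illustrates, under our assumptions, how lexical (TF-IDF) and semantic (BERT-based similarity) information can be combined into a single feature vector for a reference/student answer pair; the sentence-embedding model name is illustrative and not necessarily the one used by GradeAid.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def answer_features(reference_answer, student_answer, answer_corpus):
    # Lexical part: TF-IDF representation of the student answer,
    # fitted on the corpus of answers to the same question
    tfidf = TfidfVectorizer().fit(answer_corpus)
    lexical = tfidf.transform([student_answer]).toarray().ravel()
    # Semantic part: cosine similarity between BERT sentence embeddings
    encoder = SentenceTransformer('all-MiniLM-L6-v2')  # assumed model
    ref_emb, stu_emb = encoder.encode([reference_answer, student_answer])
    semantic = cosine_similarity([ref_emb], [stu_emb])[0, 0]
    # Concatenate lexical features with the semantic similarity score
    return np.concatenate([lexical, [semantic]])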

6.4 Motivation for keeping the datasets’ data distribution

We kept the distribution as it is for three reasons: (a) Related works did not change it; to compare against them as directly as possible, we kept the datasets’ distributions as they were. (b) After a discussion with instructors and teachers (from universities, high schools, etc.), we assessed that the differences in the distribution of the scores were acceptable for our task. (c) The unbalanced score distribution of the datasets is itself a prior on the scores of the answers. This effect may be undesirable, but for our experiments we decided to keep it. Further experiments, depending on the specific application of GradeAid (e.g., when the training dataset is applied in a learning environment in a different context), may call for a different choice, and using a balanced dataset may become mandatory or at least more convenient.

7 Conclusion

The presented research addresses two primary points in the automatic short answer grading (ASAG) area: (i) it defines a standardized pattern to benchmark research progress and methods, and (ii) it proposes a new framework for ASAG, namely GradeAid, combining semantic features (from the BERT model) with lexical features (from the TF-IDF algorithm). Furthermore, differently from related works (see Sect. 2), we benchmarked GradeAid on all the publicly available datasets in the ASAG literature. The proposed ASAG methodology combines the traditional bag-of-words approach with recent progress in state-of-the-art NLP. The inclusion of semantic features, through the BERT model, proved beneficial for all datasets considered in this research, although in some cases the benefit was not statistically significant.

Expected impact

Education and learning are principal drivers of innovation and well-being. Improving and automating assessment benefits the learning process and lowers the cost of education, making it more accessible, also in developing countries. This research aims to be a cornerstone for ASAG by proposing a framework, a feasible standard methodology for benchmarking, a new dataset and an overview of the works in the ASAG area. The results of this work are promising for ASAG, and the public availability of the datasets and code should lead to fairer and directly comparable benchmarks for the proposal of new methods. In addition, this work can act as a basis for application in production environments, overcoming the current gap in the ASAG literature.

Limitations and future steps of the project

Despite the promising results of GradeAid, the framework has some intrinsic limitations. The use of regressors generates some paradoxes in the final scoring: regressors are not capped, and thus they can theoretically estimate values below the minimum (zero) and above the maximum score (five). Although a simple solution is to manually cap the regressors’ output, this issue will need attention in the near future. The use of methods other than plain regressors, such as ordinal regression, could help address the issue and improve the quality of the results: humans do not score answers over an infinite range of values, but across a limited set. Clearly, this could also be addressed by rounding the regressor’s output to the nearest (integer) value; however, an ordinal regressor would be a more efficient and natural way to assign scores to students’ answers.
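A minimal sketch of the capping and rounding workaround mentioned above (the function name is illustrative; the 0–5 range follows the text):

import numpy as np

def postprocess_scores(raw_predictions, low=0.0, high=5.0, round_to_int=True):
    # Clip out-of-range regression outputs to the valid score interval
    capped = np.clip(raw_predictions, low, high)
    # Optionally round to the nearest integer score, mimicking human grading
    return np.rint(capped) if round_to_int else capped

# postprocess_scores([-0.3, 2.4, 5.7])  ->  array([0., 2., 5.])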

For the future, we aim to address these limitations by integrating our framework with other features and adopting new datasets on specific subjects. In this research, we used both lexical and semantic features, the latter based on the semantic similarity provided by a pre-trained BERT model. Such a model could be further trained on more domain-specific texts to improve its semantic evaluation, as done, for example, in [34]. In addition, we plan to experiment with the latest transformer-based pre-trained models such as XLM, XLM-Align and InfoXLM [71, 72]. Further directions involve the development of a methodology able to make the results comprehensible, following the guidelines of human-centered AI [73,74,75,76,77,78]. Since the main aim of this study is to provide an automated system for feedback during learning, it is of paramount interest for learners to know the reasons behind the computed score, so as to improve the quality of their learning. This last point has spillovers on the cheating phenomenon [79]. The problem is severe, and an extensive use of automatic assessment tools like GradeAid could bring new challenges in facing it. Currently, GradeAid is not intended to provide a final score for students’ answers, but to act as a supporting tool for both students and teachers. However, cheating is surely an aspect worth investigating in our research. To this end, before releasing GradeAid as a production tool, we will consider a series of user studies, involving students and teachers, to evaluate, among other aspects, the cheating phenomenon.

In the near future, we also plan to perform experiments with deep learning models for scoring and to study their advantages and disadvantages with respect to the current GradeAid.

Our ultimate goal is to embed GradeAid in a web application with an intuitive user interface. Indeed, we aim to experiment with an adaptation of GradeAid that provides contextual tags for the students’ answers. Such tags would help teachers and instructors better understand the tool’s scores, thus enhancing the support for the final evaluation [80]. For example, an interesting evolution could be highlighting common keywords between students’ and reference answers.
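As a toy illustration of that idea (a hypothetical helper, not part of GradeAid), common content words between the student and reference answers could be marked as follows:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def highlight_common_keywords(reference_answer, student_answer):
    # Content words of the reference answer (very common words removed)
    ref_words = {w.lower().strip(".,;:!?") for w in reference_answer.split()}
    ref_words -= set(ENGLISH_STOP_WORDS)
    marked = []
    for word in student_answer.split():
        key = word.lower().strip(".,;:!?")
        # Mark words that also appear in the reference answer
        marked.append(f"[{word}]" if key in ref_words else word)
    return " ".join(marked)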

Lastly, we hope this paper nudges the research community toward more comparable, open and reproducible algorithms for ASAG.