1 Introduction

One of the most common definitions of reading describes it as the recognition of words that allows the connection between information and prior knowledge to construct meaning from a written message (Mohseni Takaloo & Ahmadi, 2017). The aphorism is a text structure that can be used to evaluate the inference skill, which reflects expertise in reading comprehension. An aphorism conveys a truth of general importance in short, pithy sentences, characteristics that make it well suited to evaluating reading comprehension through short-answer inferences. Written evaluations such as short answer questions (SAQs) are necessary to identify students’ skills and knowledge in reading comprehension. Nevertheless, grading SAQs is an expensive task because the evaluator needs to read every student’s answer and take into account different syntactic and semantic aspects before assigning a grade, making it time-consuming (Mohler et al., 2011; Drolia et al., 2017). Moreover, it is difficult for evaluators to assess these aspects consistently and simultaneously, because subjective perception comes into play in the human evaluator, which increases grading variance (Gomaa & Fahmy, 2011).

For the reasons stated above, the development of an effective Automatic Short Answer Grading (ASAG) system contributes to a high-quality educational environment, and, in fact, the task has received considerable attention in recent years (Heilman & Madnani, 2015).

1.1 Automated short answer scoring (ASAG)

The consistency of ASAG systems, which provide standardization, has been shown to be much higher than that of human evaluation (Haley et al., 2007). The purpose of ASAG systems is to save assessment time in fields that use short answer questions to evaluate students (Walia et al., 2019). A related task, Automated Essay Scoring (AES), is also used in education, but it works on longer answers, where categories such as spelling, grammar and coherence are more relevant than for ASAG systems. For example, ASAG responses typically average from 1 to around 50 words, whereas AES answers average between 90 and 600 words (Shermis, 2014, 2015; Basu et al., 2013). In ASAG, students’ answers are typically evaluated against a target or model answer, and as such, grammar or coherence are not relevant in many approaches (Magooda et al., 2016). Interest in ASAG and AES systems has increased in recent years thanks to the accelerated integration of technology with learning and education (Maduabuchi & Emechebe, 2016; Sychev et al., 2020), with several educational (e-learning) platforms and institutions promoting the use of digital technology-based tools to support learning (Pollo Cattaneo et al., 2016).

The development of an effective Automatic Short Answer Grading (ASAG) system relies heavily on having access to a well-designed dataset that includes a sufficient number of diverse short written answer examples. It is important to ensure that the dataset represents the population of interest and contains samples in the target language. By having a diverse dataset, the performance level of the target group can be evaluated accurately. To enhance the performance of the ASAG system, deep learning language models can be utilized through transfer learning to identify patterns and capture the semantic nuances of the language. However, most publicly available datasets are restricted to the English language, which presents a significant obstacle for the development of ASAG systems in other languages.

1.2 Research question

In countries where students are native Spanish speakers, the construction of a Spanish-grounded ASAG system is necessary to support undergraduate students in improving their reading comprehension skills. In the Colombian context, undergraduate students show a low performance level in reading comprehension, which is reflected in PISA tests. Thus, building an ASAG system for Spanish to support undergraduate students in improving their reading comprehension skills can enhance the development of Information and Communication Technologies tools in the region. This can offer an e-learning environment that provides students with a self-learning approach to improving and practicing their skills every day, anytime and anywhere. To validate the system, its grade predictions are statistically compared with actual assessments by experts in order to evaluate how the system assesses student answers. Given the problem, the following research question can be considered: How effective is a deep-learning-based ASAG system for grading the reading comprehension skills of undergraduate students?

1.3 Reading comprehension and NLP

The inference skill is the hardest part of reading comprehension, because the reader has to find hidden patterns between the lines of the text. This skill is necessary in order to understand what aphorisms communicate. To evaluate the correctness of the interpretation of an aphorism, it is possible to use artificial intelligence approaches. Several techniques for the evaluation of written text have evolved with respect to preprocessing and score prediction. The branch of artificial intelligence that encompasses the representation and understanding of human language is called Natural Language Processing (NLP) (Young et al., 2018). The NLP field is concerned with giving computers the ability to understand human language through text and speech (IBM Cloud Education, 2020). In recent years, NLP has become more widespread (Kumar & Boulanger, 2021). Some applications of NLP are automatic machine translation, text classification, search engines and chatbots (Eisenstein, 2019). An important task in natural language processing is Automated Short Answer Grading (Adams et al., 2016), which contributes to the automated assessment of students and the implementation of more sophisticated self-learning systems.

The task of representing words and documents has been the core of almost every NLP application (Almeida & Xexéo, 2019; Camacho-Collados & Pilehvar, 2020). Embeddings are dense vector representations of words or sentences capable of mapping syntactic and semantic relationships onto a vector space. Word embeddings fall into two types: count-based embeddings, whose representation is derived from word counts and frequencies, and prediction-based embeddings, derived from a word’s context. The second type arises from the neural language model approach (Adamuthe, 2020). The count-based approach includes techniques such as TFIDF (TFIDF–BT, 2010) and Bag of Words. Today the most frequently used embeddings belong to the prediction-based family (Gutiérrez & Keith, 2018), such as GloVe (Pennington et al., 2014), Word2Vec (Mikolov et al., 2013) and FastText (Bojanowski et al., 2017). These are unsupervised approaches based on the hypothesis that words that occur in the same contexts tend to have similar meanings. Such word embeddings represent the first significant evolution of NLP in the last decade. More recent approaches, however, represent a text by a vector of features that considers both the words it contains and the order in which they appear. To capture the sequential relation between words, a word can receive different vector representations depending on its context, which diverges from previous models such as Word2Vec in which a word is assigned a static representation.
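As a brief illustration of the count-based family, the following minimal sketch (assuming scikit-learn is available) builds TF-IDF vectors for two hypothetical Spanish answers and compares them with cosine similarity; prediction-based embeddings such as Word2Vec would instead be learned from word contexts.

```python
# Minimal sketch: a count-based representation (TF-IDF) of two short answers
# and their cosine similarity. The example answers are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answers = [
    "El aforismo invita a reflexionar sobre la brevedad de la vida.",
    "La frase sugiere que la vida es corta y debe aprovecharse.",
]

vectorizer = TfidfVectorizer()              # builds the vocabulary and IDF weights
tfidf = vectorizer.fit_transform(answers)   # sparse matrix: one row per answer

similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"TF-IDF cosine similarity: {similarity:.3f}")
```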

The first model that used this technology was ELMo (Peters et al., 2018). It used a deep bi-directional LSTM (Long Short-Term Memory) model to create dynamic word embeddings. Deep contextualized word representations are computed by sequential deep neural networks such as LSTMs, GRUs (Gated Recurrent Units) and other, more advanced RNN (Recurrent Neural Network) configurations (Schmidhuber & Hochreiter, 1997; Cho et al., 2014), as well as by Transformers (Ghavidel et al., 2020), whose Attention mechanism outperforms previous models on different natural language tasks (Gong & Yao, 2019).

1.4 Sentence embeddings

Sentence embeddings are vector representations that capture the semantic information of sentences. This approach is very useful when comparing or deriving the relation between two passages. Sentence embeddings are commonly used in ASAG because of the need to obtain the relation between a target answer and a given student answer.

1.4.1 Skip-thought vector

Skip-Thought is an unsupervised model based on an encoder-decoder RNN with a GRU architecture that converts sentences into vectors (Kiros et al., 2015). Essentially, this conversion is achieved by predicting the neighboring sentences of a given central sentence. This approach is an adaptation of the Skip-Gram model used in the Word2Vec embedding model. Skip-Thought was tested by its authors on several applications, including paraphrase detection and semantic relatedness, obtaining suitable results. The model was trained on a large corpus of 74 million sentences taken from the BookCorpus dataset, which was built from books written by unpublished authors across 16 different literary genres (Zhu et al., 2015), e.g., Romance (2,865 books), Teen (430 books), Fantasy (1,479 books), Science Fiction (786 books). To determine the relatedness between two sentences, the authors perform the following operations on their respective vector representations u and w.

$$\begin{aligned} u\odot w = r \end{aligned}$$
(1)
$$\begin{aligned} \mid u - w \mid = v \end{aligned}$$
(2)
$$\begin{aligned} r^\frown v \end{aligned}$$
(3)

Equation (1) is the element-wise product of the two vectors, while (2) is the absolute value of their difference. Those two results are then concatenated as indicated in (3). This final vector represents the sentence pair composed of u and w (Kiros et al., 2015).
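The following NumPy sketch illustrates how Eqs. (1)–(3) combine two sentence vectors into a single pair representation; the vectors here are random stand-ins for actual Skip-Thought embeddings.

```python
# Illustrative sketch of Eqs. (1)-(3): building the pair feature vector from
# two sentence embeddings u and w (random stand-ins for the 4800-dimensional
# Skip-Thought vectors used later in this paper).
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=4800)            # embedding of the first sentence
w = rng.normal(size=4800)            # embedding of the second sentence

r = u * w                            # Eq. (1): element-wise product
v = np.abs(u - w)                    # Eq. (2): absolute difference
features = np.concatenate([r, v])    # Eq. (3): concatenation -> 9600 features

print(features.shape)                # (9600,)
```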

1.4.2 Bidirectional Encoder Representations from Transformers (BERT)

BERT is a model that uses a variation of the Transformer: it only utilizes the encoder part, because its goal is to generate a language model (Devlin et al., 2018). A Transformer is a seq2seq model that can capture the relationships between words regardless of their position (Camacho-Collados & Pilehvar, 2020); as such, sequential order is no longer important. Two main features characterize the Transformer architecture: the attention mechanism and agnosticism to sequential data.

The BERT architecture is available in several versions depending on the application. The base version has 12 Transformer layers, 12 attention heads and a hidden size of 768, whereas the large version has 24 Transformer layers and a hidden size of 1024. Furthermore, there are fine-tuned versions that can be used for specific tasks such as classification, question answering and named entity recognition.

The original version is pretrained in English, but there are also Spanish (Canete et al., 2020) and multilingual versions. The BERT model innovates through two pretraining strategies that define a bidirectional prediction goal: Masked Language Model (MLM) and Next Sentence Prediction (NSP). The MLM strategy selects 15% of the words in each sentence; from this selected group, 80% are replaced with a [MASK] token, 10% are left as the original token and the remaining 10% are replaced with a random word. In this way, the model learns to predict the original value of the masked words based on the context provided by the non-masked words. The NSP strategy, on the other hand, takes a pair of sentences as input; 50% of the time the second sentence is the one that follows the first in the original document, and the other 50% of the time it is a random sentence.
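As a hedged illustration, the sketch below uses the HuggingFace tokenizer for a public multilingual BERT checkpoint (an assumed choice, not necessarily the one used in this work) to show what an NSP-style sentence pair and an MLM-style masked input look like at the token level.

```python
# Sketch of BERT's two pretraining signals at the input level.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# NSP-style input: two sentences packed as [CLS] A [SEP] B [SEP]
encoded = tokenizer("La vida es breve.", "Hay que aprovechar el tiempo.",
                    return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))

# MLM-style input: one token replaced by the [MASK] token
ids = tokenizer("La vida es breve.", return_tensors="pt")["input_ids"]
ids[0, 3] = tokenizer.mask_token_id   # mask an arbitrary position
print(tokenizer.convert_ids_to_tokens(ids[0]))
```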

The vector representation obtained from the previous models can be used as input features for a predictor model, which can be trained to predict a score given the comparison between two answers (Gomaa & Fahmy, 2019), one corresponding to the student’s answer and the other corresponding to a target answer (denoted as correct). In this approach, it is important to consider the robustness of the dataset, which contains questions, model answers and student answers with their respective rating given by human experts. In particular, the work in Burrows et al. (2015) describes a general pipeline, shown in Fig. 1, for an ASAG system. This diagram serves as a framework to guide the workflow of this type of systems.

Fig. 1
figure 1

A pipeline for automated short answer scoring (Burrows et al., 2015)

In this study, we analyze the performance of two ASAG approaches that integrate BERT and Skip-thought sentence embeddings with different setups, in order to capture the most important semantic information in Spanish short answer aphorism-based questions to feed a grade regressor. The main purpose of this research is to build a baseline that can serve for future work with the proposed dataset. The main contributions of our work can be summarized as follows: (1) A Spanish dataset for ASAG task with questions based on aphorisms to assess reading comprehension; (2) The analysis of the BERT and Skip-Thought performance in ASAG task using different configurations; (3) Methodological steps for future ASAG implementation in other languages. This paper is organized as follows: Section 2 briefly introduces the related work to this study. Section 3 describes the approach, Section 4 describes the experimental setup. Analysis and experimental results are discussed in Section 5 and finally Section 6 concludes the paper and considers future work.

2 Related work

The different approaches proposed for the ASAG problem have been classified in Burrows et al. (2015) and Ziai et al. (2012) into four main groups, taking as reference the pipeline shown in Fig. 1: Concept Mapping, Information Extraction, Vector Space Model and Machine Learning. The machine learning approach is the most popular, since it has demonstrated superior results in capturing patterns and relationships between words in text by using deep learning neural networks.

In Drolia et al. (2017), the authors propose the use of features such as bag of words (BOW), part-of-speech (POS) counts, orthography, and the fluency and dexterity of writers based on sentence and paragraph counts and average sentence length, and compare the test essay against a model essay by computing their similarity through Latent Semantic Analysis (LSA). They then use the five features previously mentioned to train a linear regression model for rating the essays.

On the other hand, the authors in Magooda et al. (2016) propose a set of representation techniques and similarity measures to find the best results for automatic short answer grading. To represent the answers, word embeddings such as GloVe and Word2Vec are used in combination with operations between vectors to find a suitable representation. A neighbor block then determines the similarity between both answers (student answer and model answer). These similarity values are used as features for a Support Vector Regression (SVR) model, which maps the similarity score from 0-1 to a grade from 0-5. Three datasets are used to test the model: the University of North Texas, Cairo University and SemEval 2013 datasets, obtaining Pearson correlations of 0.55 and 0.84, and RMSE values of 0.91 and 0.89, respectively.

In Gomaa and Fahmy (2019), Skip-Thought embeddings are used to convert answers into vector representations that capture the semantic relationships between words. The authors implement a short answer scoring system in which a vector containing the relatedness between the target answer and the student answer is the input to a scoring module implemented through a linear regressor. The results are compared with previous implementations using Pearson’s correlation and RMSE. The implementation obtains better performance than previous works, with a Pearson coefficient of 0.63 and an RMSE of 0.91 on the University of North Texas dataset (Mohler et al., 2011). The dataset contains 87 questions from ten different assessments with four to seven questions each and two more exams with ten questions each. Answers have a median length of 13 words, a minimum length of 1 word and a maximum length of 53 words. A total of 2442 student answers are recorded, graded by two human judges on a 0.0 to 5.0 scale, with the final score being the average of both judges.

Recently, Huang et al. (2018) proposed a model based on an RNN with a fully connected layer at the output as a framework to score reading comprehension short answer questions. The authors use CBOW embeddings to obtain a vector representation of the student answer. Transfer learning is performed to obtain an enriched vocabulary related to the specific task.

The vector representation of the student answer serves as input to a single-layer LSTM model. After the RNN stage, the authors add an MLP classifier. The metric used to evaluate the system is the quadratic weighted kappa (QWK). The best result was 0.9847 on a dataset of 2579 answers written in Chinese.

On the other hand, the authors in Sung et al. (2019) compare the performance of a BERT-based ASAG system with a modified version of BERT. This modified version was retrained with a corpus related to the specific topics of questions. The authors apply transfer learning with textbooks of topics such as Philosophy, Government and Psychology. The results showed that the BERT model with augmented training data obtained better performance than BERT with only fine-tuning. The metric used in this study was the macro-averaged F1.

A similar study using a Transformer-based architecture is proposed by Ghavidel et al. (2020). The authors propose an approach free from manually engineered features using the BERT and XLNET models. The general approach uses the (model answer, student answer) pair as input. This pair is transformed by the model into a contextualized vector representation, and these features then feed a fully connected classifier with a Softmax output layer. The results show that the approaches achieve better and competitive results compared with previous works, obtaining an accuracy of 79.8%. The authors state that these approaches differ from others because they only use the (model answer, student answer) pair as input without any preprocessing.

Unlike previous studies, the authors in Lun et al. (2020) propose a method based on multiple data augmentation strategies for improving performance on automatic short answer grading. Three strategies are proposed: generating a second twin dataset using backtranslation, setting correct answers as model answers to generate new samples, and swapping content by using the aforementioned second twin dataset. The authors used the SemEval-2013 dataset, which contains reference answers and student answers for 197 questions in 15 different science domains. There are different versions of the dataset labeled with 2, 3 and 5 different labels. The combination of these techniques shows suitable performance compared with previous works, obtaining accuracies of 82%, 76% and 70%, respectively, for each dataset version.

The work in this paper compares the performance of two ASAG approaches based on different sentence embeddings and configurations on a Spanish dataset. Previous work helps us identify the most successful techniques employed during the last five years, such as RNNs and Transformers, in order to build a baseline for reading comprehension ASAG models in the Spanish language. RNNs and Transformers have been the most successful models for ASAG, but they have not been applied to a Spanish corpus, which frames the relevance of the present work as filling that void.

3 Proposed approach

This research aims to find the best approach to build a suitable ASAG system with a dataset of reading comprehension questions and answers in the Spanish language, and to open the way for further implementations in other languages through its methodology. The general pipeline of the proposed approach is illustrated in Fig. 2. It requires the construction of a dataset from a test taken by students and is composed of several stages: dataset construction, sentence embedding, grade prediction and evaluation. The main problem is the selection of a suitable sentence embedding to represent the student answer, target answer and question in such a way that the semantic component is captured effectively. Furthermore, hyperparameter selection of the grade regressor is important for fine-tuning.

Fig. 2
figure 2

General Pipeline of proposed approach

3.1 Dataset description

To train a grader model, it is necessary to collect and build a training dataset of answers labeled with a score given by experts. To this end, twenty (20) exercises based on Spanish-language aphorisms were selected with the criterion of having an objective target answer, which allows for the effective evaluation of the inference skill. The target answer for each question was defined by consensus among a group of language professors in which two co-authors participated.

For the construction of the dataset, a reading short-answer-question test was built with the 20 selected exercises. The test was conducted in May 2021 and taken by a total of 199 college students at Universidad del Norte. A diverse group of students was considered to avoid skew in the quality of the answers. Figures 3 and 4 show the distribution of participating students by career program and semester.

Fig. 3
figure 3

Distribution of participating students by career

Fig. 4
figure 4

Distribution of participating students by semester

Fig. 5
figure 5

Spanish aphorisms dataset distribution

After dropping duplicates and badly formatted answers, the resulting dataset has 3772 answers. Then, as suggested by current practice in the field, two experts evaluated every single answer with a score between 0 (wrong) and 5 (correct), as well as a binary correct/wrong (1/0) evaluation. The dataset is available in a GitHub repository.Footnote 1 Figure 5 shows the distribution of the collected answers with the score given by the average of the two experts’ evaluations. The two experts are language teachers with graduate degrees in the subject matter who have been working on reading comprehension assessment with undergraduate students for more than 15 years. We consider the average grade of the two annotators as the standard against which the system’s output is compared.

The annotators were given no explicit instructions on how to assign grades other than using a range from 0 to 5. This grading standard was chosen based on previous work such as Mohler et al. (2011), where 0 means that an answer is completely incorrect and grades form a continuous range up to the maximum grade of 5, which means that an answer is completely correct. As seen in Table 1, both annotators gave almost the same grade 35.5% of the time, and grades less than 2 points apart 46.7% of the time. The full breakdown can be seen in Table 1. The average grade given by Grader 1 is 3.42 and the average grade given by Grader 2 is 2.69; furthermore, one annotator tends to be stricter than the other most of the time. The dataset is biased towards central scores close to 3.0. We claim all the above issues correctly mirror real-world grading tasks. Table 2 shows two question examples with their respective target answer, three student answers and the grade provided by each human expert grader.

Table 1 Annotators analysis
Table 2 Sample questions with short answers provided by students and graded by the two human experts

3.2 Sentence embedding

To implement the sentence embedding, two deep learning approaches were considered: Skip-Thought and BERT. BERT is a model based on Transformers, essentially a combination of fully connected neural networks and self-attention mechanisms, which allow important fragments of text related to a specific word to be captured regardless of the sequence order. On the other hand, Skip-Thought is a bi-GRU encoder-decoder model that uses neighboring sentences as decoder output to map the input sentence into a vector representation, considering the sequential nature of the words. Each approach is described below.

3.2.1 BERT Approach

The BERT approach is further divided into six (6) versions. Three models pretrained on different languages are tested to select the best-performing one. One is a BERT model pretrained in Spanish, another is a multilingual model covering 104 languages including Spanish, and the last is the original BERT version in English. In order to use the original version, the dataset is translated to English using DeepL,Footnote 2 a machine translation service chosen for its recognized performance and context-aware accuracy. The main purpose of using this version is to have a strong point of reference to previous work, most of which uses English datasets.

Another factor is the text input; two options were explored:

  • Feeding the model with a single sentence that contains the concatenation of Question (Q), Student Answer (ST) and Target Answer (TA).

  • Feeding the model with two separated sentences: Student Answer (ST) and Target Answer (TA).

Figure 6 shows the setup established in the BERT approach. The variable L denotes language, and the variable K denotes the input format. Tables 3 and 4 show the possible values of the K and L variables respectively and their correspondence.
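A tokenizer-level sketch of the two input formats is shown below; the Spanish checkpoint name and the example texts are illustrative assumptions rather than the exact setup used in the experiments.

```python
# Sketch of the two input formats (the K variable above) for a BERT tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

question = "¿Qué sugiere el aforismo sobre el tiempo?"
student_answer = "Que el tiempo es valioso y no debe desperdiciarse."
target_answer = "El tiempo es un recurso limitado que hay que aprovechar."

# Single sequence: Question, Student Answer and Target Answer concatenated
single = tokenizer(" ".join([question, student_answer, target_answer]),
                   return_tensors="pt")

# Sentence pair: Student Answer and Target Answer as two segments, which lets
# the model exploit its Next Sentence Prediction pretraining
pair = tokenizer(student_answer, target_answer, return_tensors="pt")

print(single["input_ids"].shape, pair["token_type_ids"].unique())
```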

Fig. 6
figure 6

BERT Approach pipeline

Table 3 K Variable’s possible values

3.2.2 Skip-thought approach

In this approach, a model pretrained in English is used to obtain the numerical representation of sentences. Figure 7 shows the Skip-Thought approach. In order to use this model, it is necessary to translate every student-target answer pair; we again used the DeepL machine translation service, obtaining the same English-translated dataset used with BERT. Every pair of sentences is then converted into a vector representation using the Skip-Thought sentence embedding. The dimension of the output vectors is 4800, so, after applying the operations described in (1), (2) and (3) to both vectors, a set of 9600 features is obtained to feed the grade predictor.

Table 4 L Variable’s possible values
Fig. 7
figure 7

SkipThought Approach pipeline

3.3 Grade predictor

The proposed grade regressor stage is a Multilayer Perceptron (MLP). The regressor predicts a score between 0 and 5. Figure 8 shows the general architecture of the grade regressor. The architecture used with the BERT model is composed of two dense hidden layers of 768 neurons, the first with a Tanh activation function and the second with a linear activation function, with a dropout of 0.1. The output layer has a single neuron, so the model performs regression. On the other hand, an MLP with two hidden layers of 50 neurons each was chosen for the Skip-Thought model; this decision was made after performing a set of experiments with different hyperparameter configurations and choosing the configuration with the best performance. More details are presented in Section 4.2 below.
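A minimal PyTorch sketch of the BERT-side regressor head is shown below; the layer sizes and activations follow the description above, while the exact wiring is an assumption.

```python
# Sketch of the grade-regressor head: two 768-unit hidden layers (Tanh, then
# linear), dropout 0.1, and a single output neuron producing the grade.
import torch
import torch.nn as nn

class GradeRegressor(nn.Module):
    def __init__(self, input_dim=768, dropout=0.1):
        super().__init__()
        self.hidden1 = nn.Linear(input_dim, 768)   # first hidden layer (Tanh)
        self.hidden2 = nn.Linear(768, 768)         # second hidden layer (linear)
        self.dropout = nn.Dropout(dropout)
        self.output = nn.Linear(768, 1)            # single neuron -> grade

    def forward(self, x):
        x = torch.tanh(self.hidden1(x))
        x = self.dropout(self.hidden2(x))
        return self.output(x)

# Example: a batch of 8 pooled BERT vectors mapped to predicted grades
pooled = torch.randn(8, 768)
print(GradeRegressor()(pooled).shape)   # torch.Size([8, 1])
```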

Fig. 8
figure 8

Grade Predictor

The MLP regressor receives the vector representation of the answered exercise as input and is trained with a backpropagation algorithm that updates the neuron weights. The algorithm minimizes an MSE loss between the actual grades given by experts and the predicted grades. After the output layer, the predicted score passes through a quantizer, which acts as a hard decision layer: predicted scores are rounded to the usual values given by experts; for example, a predicted score of 4.45 is rounded to 4.5.
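A possible implementation of such a quantizer, assuming grades are snapped to steps of 0.5 and clipped to the 0-5 scale, is sketched below.

```python
# Hedged sketch of the quantizer "hard layer".
import numpy as np

def quantize(grades, step=0.5, low=0.0, high=5.0):
    """Round predicted grades to the nearest step and clip to the valid range."""
    return np.clip(np.round(np.asarray(grades) / step) * step, low, high)

print(quantize([4.45, 5.2, -0.1, 2.74]))   # [4.5 5.  0.  2.5]
```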

3.4 Evaluation

The evaluation metrics are Pearson’s correlation (r) and the Root Mean Squared Error (RMSE), because both are widely used in the state of the art to measure performance. Pearson’s correlation coefficient measures the statistical relationship between two continuous variables. The correlation coefficient is defined by (4), where n is the sample size, \(y_i\) and \(x_i\) are individual sample points, and \(\overline{y}\) and \(\overline{x}\) are the sample means. The Pearson correlation ranges from -1 to 1; a high positive correlation indicates better performance in the regression task.

$$\begin{aligned} r_{xy} = \frac{{}\sum _{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum _{i=1}^{n} (x_i - \overline{x})^2}\sqrt{\sum _{i=1}^{n} (y_i - \overline{y})^2}} \end{aligned}$$
(4)

On the other hand, the RMSE, defined by (5), measures the differences between the values predicted by a model and the observed values. The RMSE is always non-negative, and a value of 0 indicates a perfect fit to the data. These metrics are widely used in the field of automated short answer grading to compare the predicted scores with the human experts’ scores.

$$\begin{aligned} RMSE = \sqrt{\frac{\sum _{t=1}^{T}(\hat{y_{t}}-y_{t})^2}{T}} \end{aligned}$$
(5)

In (5), the variable T is equal to the number of observations, \(\hat{y_{t}}\) is the predicted sample and \(y_t\) is the actual sample.
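For reference, both metrics can be computed as in the following sketch, which uses SciPy and NumPy on a handful of hypothetical grade pairs.

```python
# Minimal sketch computing Eqs. (4) and (5) for predicted vs. expert grades.
import numpy as np
from scipy.stats import pearsonr

actual = np.array([4.5, 3.0, 2.5, 5.0, 1.0])
predicted = np.array([4.0, 3.5, 2.5, 4.5, 1.5])

r, _ = pearsonr(actual, predicted)                  # Eq. (4): Pearson correlation
rmse = np.sqrt(np.mean((predicted - actual) ** 2))  # Eq. (5): RMSE
print(f"Pearson r = {r:.2f}, RMSE = {rmse:.2f}")
```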

4 Experimental setup

To choose the best approach, a set of experiments was carried out to test the performance of the proposed alternatives. The experiments performed for each approach are described below.

4.1 BERT approach setup

For the BERT approach, six (6) different combinations of models and input configurations were tested. The main difference between this approach and previous ones is that the model can receive more than one sequence of text as input. Three different pretrained BERT models were tested, first with a single-sentence input (Question/Student Answer-Target Answer) and then with a sentence-pair input (Student Answer and Target Answer). In the latter setup, the model evaluates the relationship between both sentences while taking advantage of BERT’s Next Sentence Prediction (NSP) feature. In order to use the original BERT in English, the dataset had to be translated; we performed the translation to English using the DeepL machine translation service. These translations were validated by a group of experts, who found that all passages were coherent with the original content. The experiments were performed in a Google Colab environment with the following characteristics: 12 GB of RAM and a Tesla P100-PCIE-16GB GPU. The model is implemented using PyTorch and the Transformers Python library by HuggingFace. The BertForSequenceClassification model is used with the \(num\_labels\) flag set to 1, so that the model acts as a regressor. This approach was trained and tested with a dataset split of 90% for training and 10% for testing. The optimization algorithm was Adam, with a learning rate of \(2\times 10^{-5}\) and an epsilon value of \(1\times 10^{-8}\). Each BERT variant was trained for 15 epochs with batches of 32 samples. The pretrained BERT models were fine-tuned with the regressor layer on top.
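A condensed sketch of this fine-tuning setup is shown below; the Spanish checkpoint name, the example pair and the single-step update are illustrative assumptions rather than the exact training script.

```python
# Sketch: a pretrained BERT loaded as a one-output regressor (num_labels=1),
# fed a (student answer, target answer) pair and optimised with Adam.
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

checkpoint = "dccuchile/bert-base-spanish-wwm-cased"   # assumed Spanish model
model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, eps=1e-8)

batch = tokenizer(["El tiempo es valioso."], ["El tiempo debe aprovecharse."],
                  return_tensors="pt", padding=True)
labels = torch.tensor([[4.5]])                         # expert grade for the pair

outputs = model(**batch, labels=labels)                # MSE loss when num_labels=1
outputs.loss.backward()
optimizer.step()
```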

4.2 Skip-thought approach setup

The Skip-Thought-based approach maps the textual content of the Student Answer and the Target Answer into vectors. A set of operations is then performed to derive a vector representation that combines both answers. The grade regressor is then fed with the Skip-ThoughtFootnote 3 output vector that represents the semantic relation between the Target Answer and the Student Answer. To use the Skip-Thought model, the dataset is translated with the same tool used for the English version of BERT.

For the training process, the dataset was likewise split into 90% for training and 10% for testing. The Adam algorithm was used to perform backpropagation and optimize the learning of the neural network. The experiments were performed on a laptop with the following characteristics: Windows 10 operating system, Intel Core i5 processor and 12 GB of RAM. An Anaconda virtual environment was created with all the necessary Python packages. The MLP regressor is implemented using the Scikit-learn Python library. The neural network was trained and tested with a single hidden layer, given that no significant improvement was found by increasing the number of layers. The hyperparameters of the multilayer perceptron, such as learning rate, activation function, number of neurons and maximum iterations, were varied to find the best performance, as shown in Section 5.
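The following scikit-learn sketch mirrors this setup with the best-performing hyperparameters reported in Section 5; the random feature matrix and grades stand in for the real 9600-dimensional Skip-Thought pair representations and expert scores.

```python
# Sketch of the Skip-Thought grade regressor with scikit-learn's MLPRegressor.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9600))        # stand-in for Skip-Thought pair features
y = rng.uniform(0, 5, size=300)         # stand-in for expert grades

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=0)
regressor = MLPRegressor(hidden_layer_sizes=(50,), activation="logistic",
                         learning_rate_init=0.001, max_iter=200, solver="adam")
regressor.fit(X_train, y_train)
print(regressor.predict(X_test[:5]))
```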

5 Analysis and discussion

This section presents the results obtained in the experiments with each proposed approach. Table 5 shows a set of experiments varying the multilayer perceptron hyperparameters to find the best tuning. The Skip-Thought model obtains its best performance, a Pearson correlation of 0.65 and an RMSE of 0.78, using a learning rate of 0.001, a logistic activation function, 50 neurons and 200 maximum iterations. The correlation is positive and above 0.5; therefore, a moderate correlation exists between the predicted grades and the actual grades, and the RMSE is the lowest among the performed experiments. The hyperparameters with the greatest effect on the metrics are the learning rate and the activation function.

Table 5 Experiments carried out with the Skip-Thought approach
Table 6 Experiments carried out with the BERT Approach

On the other hand, Table 6 shows the results of the experiments for the BERT approach. BERT-1-EN (Student-Target answer pair with an English pretrained BERT) obtained a Pearson correlation of 0.80 and an RMSE of 0.62, which outperforms the results obtained by the Skip-Thought approach on both metrics. BERT-2-EN (Q-SA-TA with an English pretrained BERT) obtained a Pearson correlation of 0.73 and an RMSE of 0.72. The multilingual models BERT-1-MU and BERT-2-MU obtained very similar performances; hence it can be concluded that, for the pretrained multilingual BERT model, the choice between the Next Sentence Prediction input and the single-sentence (MASK) input does not have an important impact on grade regression performance. However, a lower RMSE is obtained with BERT-1-MU, so this variant attains slightly better performance. In the case of the Spanish pretrained BERT model, BERT-2-ES obtained a Pearson correlation of 0.78 and an RMSE of 0.66, while BERT-1-ES obtained a Pearson correlation of 0.83 and an RMSE of 0.59, the best performance among all variants.

From the experiments with the BERT variants, it is possible to conclude that the BERT models with a Student-Target answer pair input performed better than the Q-SA-TA approach. Furthermore, a deeper analysis showed that the Q-SA-TA input approach exhibits a weakness when student answers are similar to the question or reuse words contained in it. In that case it is not possible to assess the students’ reading comprehension appropriately, because the student has access to the respective question.

Fig. 9
figure 9

BERT-1-ES Loss Function

Fig. 10
figure 10

BERT-1-ES Difference between predicted and actual grades

From the results, it is relevant to highlight that the second-best approach was BERT-1-EN, which indicates that using the BERT architecture for ASAG with our dataset is more effective when models are trained specifically for the language being assessed and take advantage of Next Sentence Prediction (NSP). This is more evident when we compare BERT-1-EN with BERT-1-MU.

The model pretrained in the native language of the dataset yields the best performance. After identifying BERT-1-ES as the best model, a deeper analysis of the accuracy and distribution of its predicted scores was performed. Figure 9 shows the loss curves of BERT-1-ES; notably, the lowest validation loss is reached at the last epoch. To evaluate and visualize the difference between the predicted and actual grades, Fig. 10 shows their distribution. Most predicted grades have a difference from the actual grades equal to zero, and the maximum absolute difference is under 1.5, ignoring some atypical values. Figure 11a and b show the statistical distribution of the difference between actual and predicted grades, which allows a deeper analysis based on quartiles. Half of the answers are scored with a grade very close to the actual grade; 25% of the differences are under \(-\)0.2 and 75% are under 0.35. The system therefore tends to be stricter than the human evaluators, because 75% of graded answers receive a grade equal to or lower than the actual grade. Since the dataset mostly contains samples with scores between 2.5 and 4, as seen in Fig. 5, the model may struggle to accurately predict scores of 5 due to a lack of sufficient examples to generalize from. To address this limitation, future work could focus on acquiring a better-distributed dataset that includes a larger number of samples in the higher score range. On the other hand, the comparison between Fig. 11a and b shows that the quantizer does not modify the statistical distribution of the difference between predicted and actual grades; it is therefore feasible to use it as a hard layer that generates grades in steps of 0.5 from the predicted grade.

Fig. 11
figure 11

BERT-1-ES: Box-and-whisker comparison between regressor with quantizer layer and without it

Table 7 shows a set of graded answers from the test dataset.

The predicted grade is very close to the actual grade in each case; thus, there is evidence that the proposed model generalizes well when grading new students’ answers. Figure 12 shows the predicted score (orange dots) and the actual score (blue dots) for every sample in the testing dataset. Figure 12a shows the results without the quantizer and Fig. 12b shows the results with the quantizer.

Table 7 Predicted answer score examples

The findings obtained using the BERT model as sentence embedding are consistent with results obtained in previous studies such as Lun et al. (2020) and Ghavidel et al. (2020), where accuracies of 82% and 79.8% were attained. Although these earlier studies perform classification tasks and hence use the accuracy metric, the comparison shows consistency, given that the Pearson correlation coefficient takes values close to 1, indicating a direct correlation with the human annotations and thus confidence with respect to the actual grades. On the other hand, the results obtained with both proposed sentence embedding models can be compared with those obtained by Gomaa and Fahmy (2019), where a Pearson correlation of 0.63 and an RMSE of 0.91 are obtained using a linear regression model as grade predictor. The latter work trains the model on the University of North Texas dataset, which contains 2442 student answers from a Computer Science class. Table 8 compares the results of the best-performing models, BERT-1-EN and BERT-1-ES, with previous works.

Fig. 12
figure 12

BERT-1-ES: Predicted scores and Human Scores comparison

Table 8 Comparison with previous works

In the state of the art, it was not possible to find similar studies using Spanish-language datasets; therefore, the results obtained in this work can serve as a baseline for future research in automatic reading comprehension assessment, not only for Spanish but also, methodologically, for other languages. The distribution of grades obtained in the dataset reflects the general reading comprehension level of the experimental sample of undergraduate students from our university. Since most of the grades are at an intermediate level, there may be an imbalance at the extremes, which could be corrected in the future by collecting more samples. Balancing the dataset may improve the model, since it would then be able to predict a grade for the vast majority of possible answers, that is, to generalize better. The results and products of this research can enhance the development of new ICT tools that promote the creation of practice environments for improving the inference skill in reading comprehension in different languages. Furthermore, they can encourage researchers in the fields of NLP and ASAG to find new techniques and methods that improve such systems and their applications in educational environments.

6 Conclusion

The evaluation and grading of reading comprehension short answer questions is a time-intensive, difficult task, in no small part because different subjective aspects arise in the assessment. For these reasons, the design of a system that automates short answer scoring is desirable. In this paper, a dataset of students’ answers to Spanish reading comprehension questions based on aphorisms was built to develop automated short answer grading. To map the answers to a numerical representation, the Skip-Thought and BERT models were considered as alternatives under different configurations. Results showed that the BERT model pretrained in Spanish and fed with a sentence pair (Student Answer - Target Answer) outperformed the other alternatives. Furthermore, NSP (Next Sentence Prediction) with a BERT model pretrained in the specific target language performed better than the other approaches. The results obtained in this study with the Spanish ASAG dataset can be used as a baseline for future research. In future studies, a more balanced dataset could be obtained by adding a third annotator or by applying data augmentation techniques. Additionally, other ASAG approaches could be considered in order to test the built dataset.