1 Introduction

Artificial intelligence (AI) refers to the capability of computer systems to perform tasks that usually require human intelligence (e.g., adaptation, learning, reasoning) (Sarker, 2022; Cooper, 2023). The recent technological advancements within the AI field have led to relevant changes in business and research, the economy, and society (i.e., mega-trends) that are predicted to continue (Estigarribia et al., 2022; Haluza & Jungwirth, 2023; Rasa et al., 2023).

The change that best exemplifies the impact of the mega-trends above is the transformer architecture, which contributes to managing long-term dependencies in natural language processing (NLP) (Tay et al., 2023). Transformers are the basis of the innovative, cutting-edge large language models (LLMs) that have generated considerable buzz in several fields and industrial sectors (MacNeil et al., 2022a). In this line, ChatGPT reached more than 1 million users within the first 5 days of its release. Accordingly, LLMs have been used in economics and finance (Alshater, 2022), journalism (Pavlik, 2023), medicine (O’Connor & ChatGPT, 2023), and education (Sallam, 2023), among others. However, as with other technological advancements, LLMs have met with resistance from parts of the community, a common phenomenon in evolutionary and social psychology (Tobore, 2019).

Regarding the learning field, education has experienced a profound change in methods and content during the twenty-first century. Specifically, a flexible and multidisciplinary environment is sought, where students can actively participate in their learning process, promoting more autonomous and ubiquitous studying thanks to AI advancements (Baidoo-Anu & Ansah, 2023; Li et al., 2023). The scientific community has researched the use of AI techniques such as machine learning (ML) models for training purposes ever since their inception (Hochberg et al., 2018; Talan, 2021; Huang & Qiao, 2022), leading to progressive advances towards autonomous, high-quality learning (Han, 2018; Demircioglu et al., 2022). Mainly, AI has driven the technological transition in this field regarding instructional applications, contents, platforms, resources, techniques, tools, and network infrastructure (Roll & Wylie, 2016). This transition also involves changes in the leading roles of education systems, teachers, and students, since this new digital education environment requires new digital competencies and reasoning patterns (Jensen et al., 2018; Zhou et al., 2023). However, although promising, the advances offered by AI are still far from becoming standardized tools in the educational field, given their early stage of development and the training end-users need to take full advantage of these solutions.

Of particular interest is the impact of applications that leverage LLMs, framed within the generative AI field and based on ML techniques. They enable hands-on learning and are common practice in the classroom nowadays. Compared to previous AI solutions and traditional methodologies, which focused primarily on modifying the textual input using correction, paraphrasing, and sentence completion techniques, LLMs generate on-the-fly human-like utterances, hence their popularity, especially among students and teachers (Rudolph et al., 2023). Current advanced LLMs can enhance pedagogical practice and provide personalized assessment and tutoring (Sok & Heng, 2023). Consideration should be given to the cooperation between LLM-based systems and humans, given the experience and scientific knowledge of the latter, along with their capabilities for creativity and emotional intelligence (Zhang et al., 2020; Korteling et al., 2021). Note that these AI-based systems present advantages in specific educational tasks as self-learning tools and virtual tutors. Specifically, they enable automatic answer grading (Ahmed et al., 2022), explanation generation (Humphry & Fuller, 2023), question generation (Bhat et al., 2022), and problem resolution (Zong & Krishnamachari, 2022). Furthermore, when used for text summarization (Phillips et al., 2022), they help synthesize content and improve the student’s abstraction capabilities. Their use as learning software in virtual assistants is highly relevant to flexible learning (Wang et al., 2022; Yamaoka et al., 2022). Furthermore, their language intelligence capabilities make them an appropriate tool for code correction (MacNeil et al., 2022b).

More in detail, LLMs are trained with massive textual data sets to produce human-like utterances. They perform a wide variety of NLP tasks, taking advantage of pre-training and fine-tuning pipelines (Kasneci et al., 2023). Note the relevance of both the pre-training and fine-tuning stages: the first refers to training LLMs on miscellaneous large data sets, while the second refers to adapting the model to a particular task (Kasneci et al., 2023). Consequently, the quality of the LLM output highly depends on the input data and the prompt designed, also known as prompt engineering (Cooper, 2023). The latter technique ranges from zero-shot learning, widely popular when applied to LLMs (Russe et al., 2024), to few-shot learning. In zero-shot learning, the model follows the task instructions alone, since the end-user provides no examples; in contrast, in few-shot learning, the model learns from the demonstrations available in the prompt (i.e., few-shot text prompts).
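
To make this distinction concrete, the sketch below builds the two prompting styles in Python for a simple question-generation task. The prompt wording and example texts are illustrative and not drawn from any of the reviewed works.

```python
# Illustrative sketch: contrasting a zero-shot prompt, which only states the
# task, with a few-shot prompt, which prepends worked demonstrations for the
# model to imitate. The task and texts below are made up for illustration.

SOURCE_TEXT = "Photosynthesis converts light energy into chemical energy."

zero_shot_prompt = (
    "Generate one multiple-choice question with four options "
    f"from the following text:\n{SOURCE_TEXT}"
)

few_shot_prompt = (
    "Generate one multiple-choice question with four options from the text.\n\n"
    "Text: Water boils at 100 degrees Celsius at sea level.\n"
    "Question: At what temperature does water boil at sea level?\n"
    "A) 50 C  B) 75 C  C) 100 C  D) 150 C\n"
    "Answer: C\n\n"
    f"Text: {SOURCE_TEXT}\n"
    "Question:"
)

if __name__ == "__main__":
    print("--- zero-shot ---", zero_shot_prompt, sep="\n")
    print("--- few-shot ---", few_shot_prompt, sep="\n")
    # Either prompt would then be sent to the LLM of choice; the exact client
    # call depends on the provider and is intentionally omitted here.
```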

Among the most popular LLMs, BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), GPT-3.5, GPT-4, and T5 (Raffel et al., 2020) deserve attention. BERT (Bidirectional Encoder Representations from Transformers) was released by Google in October 2018 (slightly after GPT-1, dated June 2018). It is a pre-trained transformer-based encoder model that can be fine-tuned on specific NLP tasks such as named entity recognition (NER), question answering, and sentence classification. Moreover, the GPT-3, GPT-3.5, and GPT-4 (Generative Pre-trained Transformer) models were released by OpenAI in 2020, 2022, and 2023, respectively. More in detail, GPT-4 is already deployed in the ChatGPT application, which, compared to other LLMs, can generate context-aligned responses and interact naturally with end-users as a peer. This model goes beyond producing reports and translating assessments by creating source code (Haleem et al., 2022) and responding to complex questions posed by students in real time (George et al., 2023). It can also show creativity to some extent in writing (Baidoo-Anu & Ansah, 2023). The T5 (Text-to-Text Transfer Transformer) model was released by Google in 2020, following the encoder-decoder transformer architecture. Even though its configuration is similar to BERT, it differs in some steps of the pipeline, such as pre-normalization (Pipalia et al., 2020).
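
As an illustration of the fine-tuning step mentioned above, the sketch below adapts a pre-trained BERT encoder to a sentence classification task with the Hugging Face transformers and datasets libraries. The toy examples and label scheme are purely illustrative and do not correspond to any of the reviewed systems.

```python
# Minimal sketch of fine-tuning BERT for sentence classification with the
# Hugging Face transformers and datasets libraries. The toy examples and the
# label scheme (0 = incorrect statement, 1 = correct statement) are
# illustrative only and not taken from any reviewed work.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

toy_data = Dataset.from_dict({
    "text": ["The mitochondrion produces ATP.", "The sun orbits the earth."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length",
                     max_length=64)

toy_data = toy_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=toy_data,
)
trainer.train()
```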

Given the widespread adoption of AI-based solutions in our everyday lives, and particularly the popularity of advanced NLP-based chatbots for generating and evaluating educational materials, this review seeks to provide a comprehensive overview of the systems that exploit LLMs and were explicitly designed for educational purposes (i.e., virtual tutors for question generation and assessment), thus involving students or teachers at the design or evaluation levels. Works in which applying the solution to educational use cases was feasible but was not the original design goal were excluded. The ultimate objective is to promote the advancement of these existing solutions in a collaborative environment between academia (i.e., researchers and developers) and end-users (i.e., students and teachers). To the best of our knowledge, this is the first review in this regard. Note that the few existing review works focus on specific related fields such as health care education (Sallam, 2023) or specific features like the responsible and ethical use of LLMs (Mhlanga, 2023) and their impact on academic integrity (Perkins, 2023).

The rest of this paper is organized as follows. Section 2 describes the methods and materials used in the review. Section 3 presents the discussion on the selected relevant works. Finally, Section 4 concludes the article and details future research.

2 Methodology

The review methodology comprises two steps: (i) data gathering (Section 2.1) and (ii) screening and eligibility criteria (Section 2.2). Figure 1 details the methods and materials used. More in detail, this review aims to gather knowledge to answer the following research questions:

  • RQ1: Which LLM-based solutions are being developed for educational purposes (e.g., for assessment tasks), excluding multidisciplinary solutions that were not specifically intended for learning assistance?

  • RQ2: Which educational solutions based on LLMs involved students or teachers at any level of the development process (e.g., design, evaluation)?

Fig. 1 Review pipeline

2.1 Data Gathering

The data were extracted using Google Scholar with two search queries, specially designed to gather works within the educational field that leverage LLMs:

  1. "education" AND "student" AND ("large language model" OR "GPT-3" OR "GPT-3.5" OR "GPT-4" OR "ChatGPT") -"review"

  2. "education" OR "student" AND ("large language model" OR "GPT-3" OR "GPT-3.5" OR "GPT-4" OR "ChatGPT") -"review"

Both queries were restricted to works published from 2020 onward, and the second query was applied to the title content exclusively. Note that duplicated records and works that do not use LLMs, or do not indicate which model is exploited, were not considered. The same applies to works that merely assess the performance of LLMs. In the end, 342 records were identified.

2.2 Screening and Eligibility Criteria

This process was designed to identify works within the field of study that were written in English, while discarding theoretical and review contributions (i.e., those that do not propose an LLM-based solution but review existing solutions or hypothesize on the impact of LLMs for educational purposes). The manual screening based on the above eligibility criteria resulted in 29 records. Note that this process distinguishes published journal articles and conference papers from pre-printed and non-peer-reviewed records. The criteria for selection and exclusion are presented in Table 1.

Table 1 Criteria for selection and exclusion

Figures 2 and 3 detail the distribution of the LLMs used and of the applications in the works selected. Firstly, the most popular model is BERT, followed by GPT-3, T5, and GPT-3.5. The low representation of the most recent GPT models contrasts with their popularity; this is because the data gathering corresponds to the first quarter of 2023, that is, shortly after their release. Thus, new works exploiting them are expected to be published shortly. Furthermore, the most common applications of these models in the selected works are virtual assistance and question generation, as shown in Fig. 3, followed by answer grading and code explanation/correction. Note that most works were published in 2022, with few records in 2021, showing a growth trend in 2023.

Fig. 2 Distribution of the LLMs in the records selected

Fig. 3 Distribution of the tasks in the records selected (AG answer grading, CE code explanation, EG explanation generation, LS learning software, PR problem resolution, QG question generation, TS text summarization, VA virtual assistant)

3 Analysis and Discussion

Table 2 lists the articles published in journals and conference proceedings, taking into account their application, the model used, and code and data availability. Note that only the works by Liu et al. (2022), Mendoza et al. (2022), Tyen et al. (2022), Zong & Krishnamachari (2022), Humphry & Fuller (2023), and Nasution (2023) provide enough information for reproducibility, while Bhat et al. (2022), Essel et al. (2022), Mendoza et al. (2022), Moore et al. (2022), Phillips et al. (2022), Yamaoka et al. (2022), and Nasution (2023) involved either teachers or students in the design or experimental plan.

Table 2 Selected articles published in journals or presented at conferences

Regarding answer grading applications, Ahmed et al. (2022) used the BERT model. They exploited a modified version of the model based on triplets and the Siamese network, specially designed to produce semantically meaningful sentence embeddings. The data set used is the one presented by Mohler & Mihalcea (2009). The authors applied the question demoting technique as part of the preprocessing, thus removing from the answer those words also contained in the question. The authors performed the experiments with two different combinations of input data: (i) the reference and student answers and (ii) the concatenation of the question and the reference answer, plus the answer provided by the student. Evaluation metrics include Pearson’s correlation coefficient (PCC) and root mean square error (RMSE). The results are approximately 0.8 PCC and 0.7 RMSE. Moore et al. (2022) presented another answer grading solution based on GPT-3. Unlike Ahmed et al. (2022), the input data were gathered from an introductory chemistry course at the university level with almost 150 students. Moreover, the GPT-3 model was trained with the LearningQ data set (Chen et al., 2018), as in Bhat et al. (2022). Based on the assessment of the questions posed to experts in the chemistry field, the model was able to correctly evaluate 32% of the questions.
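
A simplified sketch of this family of approaches is shown below: it embeds the reference and (question-demoted) student answers with an off-the-shelf Sentence-BERT model, uses cosine similarity as the predicted grade, and evaluates with PCC and RMSE. The checkpoint, answers, and grades are made up; this is not the triplet-trained model of Ahmed et al. (2022).

```python
# Illustrative sketch of embedding-based answer grading (not the exact
# triplet-trained model of Ahmed et al., 2022): question demoting, cosine
# similarity between reference and student answers as the predicted grade,
# and evaluation with PCC and RMSE. All data below are made up.
import numpy as np
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT checkpoint

def demote(question: str, answer: str) -> str:
    """Remove from the answer the words already present in the question."""
    question_words = set(question.lower().split())
    return " ".join(w for w in answer.split() if w.lower() not in question_words)

question = "What does the CPU cache store?"
reference = "Recently accessed data and instructions."
student_answers = [
    "It stores recently used data and instructions.",
    "It stores files permanently on disk.",
    "It stores the monitor output.",
]
human_grades = np.array([5.0, 2.0, 0.0])  # hypothetical grades on a 0-5 scale

reference_embedding = model.encode(reference, convert_to_tensor=True)
predicted_grades = []
for answer in student_answers:
    embedding = model.encode(demote(question, answer), convert_to_tensor=True)
    similarity = float(util.cos_sim(reference_embedding, embedding))
    predicted_grades.append(5.0 * similarity)  # rescale similarity to 0-5
predicted_grades = np.array(predicted_grades)

pcc, _ = pearsonr(human_grades, predicted_grades)
rmse = float(np.sqrt(np.mean((human_grades - predicted_grades) ** 2)))
print(f"PCC = {pcc:.2f}  RMSE = {rmse:.2f}")
```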

Few works exist on code explanation and general explanation generation, learning software, and problem resolution. Firstly, MacNeil et al. (2022b) proposed a GPT-3-based solution for code explanation based on 700 prompts. Note that it does not identify or correct errors. The main functionalities of the system encompass (i) execution tracing, (ii) identifying and explaining common bugs, and (iii) output prediction. However, no results were provided. Humphry & Fuller (2023) proposed a solution based on GPT-3.5 to write conclusion statements about chemistry laboratory experiments. The evaluation of the solution relied on a discussion of features like readability and orthographic correctness of the generated text. Unlike the works above, which focused on textual input data, Yamaoka et al. (2022) used the GPT-3 model to exploit social media data, particularly from Instagram, for learning purposes. The proposed pipeline comprises (i) detecting the relevant objects in the images, (ii) extracting keywords to generate sentences related to those keywords, and (iii) providing linguistic information about the words that compose the sentences. The ultimate objective was to acquire new vocabulary. The experiments consisted of a small pilot study with three students from Osaka Metropolitan University. The only result reported was the average number of unknown words in the generated sentences, 2.2. Finally, Zong & Krishnamachari (2022) used GPT-3 to identify and generate math problems involving systems of two linear equations. The experiments consisted of (i) problem classification into five categories, (ii) equation extraction from word problems, and (iii) generation of similar exercises. The authors prepared the input data ad hoc. The accuracy obtained in each of the three tasks above was 75% (averaging the five categories), 80% (with fine-tuning), and 60% (also averaging the five categories), respectively.
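
As an illustration of how the equation extraction step could be checked programmatically, the sketch below solves a system of two linear equations, assumed to have been returned by the model as plain text, with SymPy and compares the result against an answer key. The output format and the example problem are assumptions for illustration only.

```python
# Illustrative check of the equation extraction step: the two linear equations
# below are assumed to be the plain-text output of an LLM for the word problem
# "The sum of two numbers is 10 and their difference is 4."
from sympy import Eq, solve, symbols
from sympy.parsing.sympy_parser import parse_expr

llm_equations = ["x + y = 10", "x - y = 4"]  # assumed model output
answer_key = {"x": 7, "y": 3}

x, y = symbols("x y")
system = []
for equation_text in llm_equations:
    lhs, rhs = equation_text.split("=")
    system.append(Eq(parse_expr(lhs), parse_expr(rhs)))

solution = solve(system, (x, y))  # {x: 7, y: 3}
is_correct = all(solution[sym] == answer_key[str(sym)] for sym in (x, y))
print(solution, "correct:", is_correct)
```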

Regarding question generation, several representative examples were found in the literature. Bhat et al. (2022) used both the GPT-3 and T5 models: GPT-3 for question generation combined with a concept hierarchy extraction model, and T5 for evaluating the learning usefulness of the generated questions. The input data consisted of textual learning materials from a university data science course. More in detail, the concept hierarchy extraction method exploited the MOOCCubeX pipeline (Yu et al., 2021), which extracts key concepts following a semi-supervised approach. Note that the evaluation also involved computing the information score metric and manual assessment by human annotators. The experimental results obtained with the LearningQ data set (Chen et al., 2018) show that almost 75% of the generated questions were considered useful by the GPT-3 model, with an agreement slightly higher than 65% when compared to manual evaluation. Similarly, Dijkstra et al. (2022) created EduQuiz with GPT-3, a multiple-choice quiz generator for reading comprehension exploiting the EQG-RACE data set (Jia et al., 2021). The authors evaluated the performance of EduQuiz using standard metrics, BLEU-4, ROUGE-L, and METEOR, attaining 36.11, 11.61, and 25.42, respectively. Additionally, Sharma et al. (2022) proposed a fine-tuning pipeline composed of context recognition and paraphrasing, filtering of irrelevant output, and translation to other languages for question generation at different levels using the T5 model. The authors used the data set by Mohler et al. (2011) (an updated version of the data set used in Ahmed et al. (2022)). The evaluation metrics computed were BLEU (Papineni et al., 2002) and METEOR (Lavie & Agarwal, 2007), with results of 0.52 and 57.66, respectively. Thus, compared with the question generation solution by Dijkstra et al. (2022), Sharma et al. (2022) obtained a more competitive METEOR value. Ultimately, Nasution (2023) used GPT-3.5 for question generation. To assess the generated questions’ reliability or internal consistency, Cronbach’s alpha coefficient (Taber, 2018) was computed, resulting in 0.65. Answers from a survey of almost 300 students show that 79% of the generated questions were relevant, 72% were moderately clear, and 71% had sufficient depth.
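
For reference, the sketch below shows how the BLEU and METEOR figures reported by these works can be computed for a single generated question with NLTK; the reference and generated questions are invented for illustration.

```python
# Illustrative computation of BLEU-4 and METEOR for one generated question
# against one reference question with NLTK; both sentences are made up.
import nltk
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # required by the METEOR scorer
nltk.download("omw-1.4", quiet=True)

reference = "what does the cpu cache store ?".split()
generated = "what is stored in the cpu cache ?".split()

# Default weights (0.25, 0.25, 0.25, 0.25) correspond to BLEU-4.
bleu4 = sentence_bleu([reference], generated,
                      smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([reference], generated)
print(f"BLEU-4 = {bleu4:.3f}  METEOR = {meteor:.3f}")
```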

In contrast, Phillips et al. (2022) used GPT-3 to create summaries of students’ chats in collaborative learning. Moreover, this solution detected confusion and frustration in the students’ utterances. Input data were gathered from secondary school students playing an ecosystem game. The authors briefly discussed how the system could provide teachers with advantageous knowledge about student interaction in a collaborative learning environment, but no further analysis or results were provided. Conversely, Prihar et al. (2022) proposed a learning assistant based on the BERT model and its variations (i.e., SBERT and MathBERT) to generate support messages from chat logs of interactions between students and a live UPchieve tutor available at the ASSISTments learning platform. Even though 75% of the generated messages were identified as relevant by manual human evaluation, these messages had a negative impact on the students’ learning process, as the authors explained.

The most common application uses LLMs as virtual assistants. Sophia & Jacob (2021) created EDUBOT exploiting Dialogflow. Its main limitation lies in its basic language understanding capabilities (i.e., low variability in the responses provided), particularly regarding the user’s emotions. Baha et al. (2022) developed Edu-Chatbot exploiting the Xatkit framework. The system comprises an encoder based on CamemBERT and a decoding module for student intent recognition. Unfortunately, the intent classification decoder is based on a pre-defined set of recognized actions (e.g., simple questions, animations, videos, and quizzes). Thus, the language intelligence of the solution is limited. Furthermore, no evaluation was performed. Calabrese et al. (2022) presented a virtual assistant prototype for Massive Open Online Courses (MOOCs). Their objective was to reduce the teaching load while maintaining the quality of learning. Thus, its architecture allows the teacher to intervene in those questions that have not been resolved satisfactorily. More in detail, they used a personalized version of BERT. The questions answered by the teacher are included in an additional document that allows the BERT model to be improved. In contrast, Essel et al. (2022) involved 68 undergraduate students in evaluating the solution developed using FlowXO and integrated into WhatsApp. In the qualitative evaluation, 58.8% of the end-users preferred the virtual assistant over traditional interaction approaches with the teachers. Additionally, Liu et al. (2022) presented a virtual assistant for online courses to resolve general and repetitive doubts about content and teaching materials. This system incorporates a sentiment analysis module to analyze response satisfaction based on the student’s dialogue. They used two fine-tuned versions of the BERT model, with accuracies of 82% and 90% for the correct detection of the content and the student’s sentiment, respectively. Moreover, Mahajan (2022) created a system for students to improve their knowledge of the English language that allows them to obtain information on the meaning of words, make translations, and resolve pronunciation doubts. The authors exploited the RoBERTa model, with an accuracy greater than 98% in communication intent detection. Similarly, Mendoza et al. (2022) created a virtual assistant intended for academic and administrative tasks, but exploiting Dialogflow. Cronbach’s alpha coefficients during the evaluation of the system exceeded 0.7. In contrast, Topsakal & Topsakal (2022) presented a foreign language virtual assistant based on the GPT-3.5 model combined with augmented reality. The authors claim that this combination attracted students’ attention and motivated them through entertaining learning thanks to gamification. In this case, the language model was used to establish a dialogue with the end-users. Unfortunately, no results were discussed. Moreover, Tyen et al. (2022) proposed a virtual assistant for second language learning with difficulty level adjustment in the decoder module, evaluated by experienced teachers. The system exploits RoBERTa fine-tuned with a Cambridge exams data set and attained Spearman and Pearson coefficients of 0.755 and 0.731, respectively. Finally, Wang et al. (2022) developed an educational domain-specific chatbot. Its goal is to reduce pressure on teachers in virtual environments and improve response times by easing communication between students and teachers. They used natural language understanding (NLU) techniques on variations of the BERT model for intent classification and response generation. The system attained an accuracy of 88% in intent detection but values below 50% in semantic analysis.
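
Since several of the assistants above rely on BERT variants for intent recognition, the following sketch illustrates how such a classifier could be queried through the Hugging Face pipeline API; the checkpoint name is hypothetical and stands for a model fine-tuned on the assistant's own intent labels.

```python
# Minimal sketch of BERT-based intent detection for an educational chatbot
# using the Hugging Face pipeline API. The checkpoint name is hypothetical and
# stands for a BERT model fine-tuned on the assistant's own intent labels
# (e.g., content question, quiz request, greeting).
from transformers import pipeline

intent_classifier = pipeline(
    "text-classification",
    model="my-org/bert-edu-intents",  # hypothetical fine-tuned checkpoint
)

for utterance in [
    "Can you explain what a binary tree is?",
    "Give me a quiz on photosynthesis.",
]:
    prediction = intent_classifier(utterance)[0]
    print(f"{utterance} -> {prediction['label']} ({prediction['score']:.2f})")
```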

Table 3 lists the selected pre-printed or non-peer-reviewed works, taking into account their application, model used, and reproducibility feature. In this case, da Silva et al. (2022), Zhang et al. (2022), and Christ (2023) provide enough information for reproducibility, while Zhang et al. (2022) involved either teachers or students in the design or experimental plan.

Table 3 Selected pre-printed or non peer-reviewed records

The distribution of applications is similar to the peer-reviewed records. Hardy (2021) developed an automatic evaluation system for reading and writing exercises. The system uses the SBERT model, among others, to capture semantic data and provide valuable insights related to the student’s skills, using the ASAP-AES and ASAP-SAS data sets. Particularly, they exploited a passage-dependent Sentence-BERT model trained using curriculum learning (Graves et al., 2016). The results for the quadratic weighted kappa (QWK) metric reached 0.76 on average.
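
QWK, the standard agreement metric in automated scoring, can be computed directly with scikit-learn, as sketched below on hypothetical human and model scores.

```python
# Quadratic weighted kappa (QWK), the agreement metric reported by Hardy
# (2021), computed with scikit-learn on hypothetical human vs. model scores.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 1, 3, 2, 4, 0]  # hypothetical rubric scores (0-4)
model_scores = [2, 3, 3, 1, 4, 2, 4, 1]

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```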

Regarding code correction, Zhang et al. (2022) presented MMAPR, an error identification and correction system for code development based on the OpenAI Codex model. The system fixes semantic and syntax errors by combining iterative querying, multi-modal prompts, program chunking, and test-case-based few-shot selection. Results obtained with almost 300 students reached a 96.50% code correction rate with the few-shot-based approach. Phung et al. (2023) presented a similar solution to MMAPR for code correction named PyFiXV. The main difference is that the Codex model, combined with prompt engineering, explains the detected errors. Moreover, the explanations are also validated in terms of suitability for the students. The system has been tested with the TigerJython (Kohn & Manaris, 2020) and Codeforces data sets. The precision attained 76% in the most favorable scenario with the TigerJython data set.
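
A much simplified sketch of the test-case-based selection step is shown below: among several candidate repairs (which MMAPR would obtain from the LLM), the one passing the most input/output test cases is kept. The candidates and tests are invented for illustration.

```python
# Simplified sketch of test-case-based selection: among several candidate
# repairs (which MMAPR would obtain from the LLM), keep the one that passes
# the most input/output test cases. Candidates and tests are invented here.
candidate_repairs = [
    "def add(a, b):\n    return a - b",  # buggy candidate
    "def add(a, b):\n    return a + b",  # correct candidate
]
test_cases = [((1, 2), 3), ((0, 5), 5), ((-1, 1), 0)]

def passed_tests(source: str) -> int:
    namespace: dict = {}
    try:
        exec(source, namespace)  # define the candidate function
    except Exception:
        return 0
    passed = 0
    for args, expected in test_cases:
        try:
            if namespace["add"](*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed

best_repair = max(candidate_repairs, key=passed_tests)
print("Selected repair:\n" + best_repair)
```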

Cobbe et al. (2021) compiled a data set of 8.5 K grade school mathematical problems called GSM8K. Then, the GPT-3 model was used to generate comprehensible explanations of these problems, combining natural language and mathematical expressions. The authors trained verifiers to enhance the performance of the model beyond fine-tuning and ultimately concluded that this approach enhanced the overall performance.

Subsequently, da Silva et al. (2022) developed QUERAI, an automatic questionnaire generation system that uses the T5 model with fine-tuning. The T5 model was evaluated with the skip-thought vectors (STV), embedding average cosine similarity (EACS), vector extrema cosine similarity (VECS), and greedy matching score (GMS) metrics, with results higher than 0.8 except for VECS. Summing up, the accuracy of this pay-per-subscription solution is 91%. Similarly to da Silva et al. (2022), Raina & Gales (2022) developed a multiple-choice question generation solution to generate both the questions and the set of possible answers using, apart from T5, the GPT-3 model, both trained with the RACE++ data set (Liang et al., 2019) composed of middle-, high-, and college-level questions. The results obtained are similar between the two models, with an accuracy of 80% (11 percentage points lower than the solution by da Silva et al. (2022)). Note that the authors also measured the number of grammatical errors and other features like diversity and complexity. The lowest values are related to the diversity of the questions generated; in the best scenario, the T5 model attained approximately 60% accuracy. More recently, Christ (2023) used BERT to generate SQL-query exercises automatically. Experiments with knowledge graphs and natural language building were also performed as a baseline. The authors concluded that the DistilBERT-based approach generates descriptions that are, on average, almost 50% shorter and with a 20% decrease in term frequency compared to the NLP baseline.
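
For reference, EACS, one of the metrics used to evaluate QUERAI, averages the word embeddings of each sentence and takes the cosine similarity of the two averages; the sketch below uses toy three-dimensional embeddings instead of real pretrained vectors.

```python
# Sketch of embedding average cosine similarity (EACS): average the word
# embeddings of each sentence and take the cosine similarity of the averages.
# The toy 3-dimensional embeddings below are illustrative; real systems would
# rely on pretrained word vectors.
import numpy as np

toy_embeddings = {
    "what": np.array([0.1, 0.3, 0.2]), "is": np.array([0.0, 0.2, 0.1]),
    "a": np.array([0.1, 0.1, 0.0]), "cpu": np.array([0.9, 0.4, 0.7]),
    "the": np.array([0.1, 0.1, 0.1]), "processor": np.array([0.8, 0.5, 0.6]),
}

def sentence_vector(tokens):
    return np.mean([toy_embeddings[t] for t in tokens], axis=0)

def eacs(reference_tokens, generated_tokens):
    u = sentence_vector(reference_tokens)
    v = sentence_vector(generated_tokens)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(eacs(["what", "is", "a", "cpu"], ["what", "is", "the", "processor"]), 3))
```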

GPT-3 and the different adaptations of the BERT model are the most popular alternatives in the sample regarding answer grading, code explanation and general explanation generation, learning software, problem resolution, question generation, and text summarization. When it comes to their use as virtual assistants, the variety of models used increases. The current lower costs of the GPT-3.5 model are likely to motivate a rapid increase in its use in the coming years. However, BERT’s robustness as an entity detector, with evaluation metrics above 80% in several of the discussed works, has made it a reference for developing educational software tools. Unfortunately, most works reviewed do not provide the code or data used for their analysis, making reproducibility difficult. Finally, the risks of exploiting LLMs for educational tasks are transversal (e.g., they affect automatic question generation and virtual assistants, the two most popular applications identified). The lack of transparency of the models (i.e., the rationale behind their functioning, such as difficulty adjustment in question generation) could negatively impact the end-users. Regarding their use as virtual tutors, reinforcement learning from human feedback is essential to gain control over their operation and ensure fairness. Ultimately, the risk of poor accuracy must be mitigated by including the probabilistic confidence of their responses.

4 Conclusion

LLMs represent an undeniable mega-trend of the current century in many fields and industrial sectors. In the particular case of learning, these generative AI-based solutions have produced a considerable buzz. Accordingly, they enable hands-on learning and are commonly used in classrooms nowadays. Compared to previous AI solutions and traditional methodologies, which focused primarily on modifying the textual input, advanced LLMs can generate on-the-fly human-like utterances, enhancing pedagogical practice and providing personalized assessment and tutoring.

Given the popularity of LLMs, this work is the first to contribute a comprehensive overview of their application within the educational field, paying particular attention to those works that involved students or teachers in the design or experimental plan. From the 342 records obtained during data gathering, 29 works passed the screening stage by meeting the eligibility criteria. They were discussed, taking into account their application within the educational field, the model used, and code and data availability. Results show that the most common applications are virtual assistance and question generation, followed by answer grading and code correction and explanation. Moreover, the most popular model continues to be BERT, followed by the GPT-3, T5, and GPT-3.5 models. In the end, this review identified 9 reproducible works and 8 solutions that involved either teachers or students in the design or experimental plan.

Due to the recent launch of the GPT-4 model within the ChatGPT application, new works are expected to be published soon and will be analyzed as part of future work. Moreover, we will study the ethical implications of LLMs (i.e., their transparency and the fairness behavior caused by the training data and privacy concerns) and how the solutions discussed can be integrated into education curricula, as well as their shortcomings and risks to academic integrity (e.g., plagiarism concerns). Finally, attention will be paid to works that propose innovative teaching practices with LLMs and to the use of ad hoc solutions based on personal language models in the field.