1 Introduction

Massive open online courses (MOOCs; e.g., Coursera, Udemy, and edX) are flexible, open-access learning resources that have gained popularity among learners, who can participate in these courses at their convenience through the Internet. The number of students enrolled in MOOCs increases every year. MOOCs enable large-scale interactive participation and provide various learning materials, such as videos, text, and exercises. The consistent growth in the number of MOOC users has enabled large quantities of learning data to be collected [1]; thus, the application of artificial intelligence to MOOCs to analyze these data has drawn considerable attention. Researchers have described various problems associated with MOOCs. For example, although the self-regulated learning structure of MOOCs provides considerable flexibility, many learners cannot complete a course precisely because of this low-pressure learning environment [2]. Several studies have developed recommendation systems for MOOCs, and many MOOC providers have used such systems to encourage continued engagement. A recommendation system recommends customized resources to help students learn from various learning materials at their own pace [3, 4]. Consequently, research has increasingly focused on integrating recommendation systems into MOOCs to provide personalized recommendations to learners [5].

Various methods, such as collaborative filtering (CF), have been proposed for constructing recommendation systems. In classical recommendation methods, learning exercises are recommended on the basis of similarities between students and their learning behaviors. Most studies have adopted strategies that recommend exercises that students previously answered incorrectly. Such strategies have several limitations; for example, they cannot recommend exercises with an appropriate difficulty level. Whether students can answer the recommended exercises correctly affects their engagement with and satisfaction toward a recommendation system.

In this study, we developed a recommendation system with a LINE bot (LINE is a social media tool commonly used in Taiwan). We propose a learning exercise recommendation system based on a reinforcement learning (RL) algorithm. Our system accounts for “review” and “difficulty” objectives. Students generally follow in-class lessons and then complete exercises. The proposed system recommends exercises that students have not completed or have answered incorrectly. Because students acquire knowledge gradually, the difficulty of consecutive exercises should not vary considerably [3]. Therefore, the system recommends personalized exercises with suitable difficulty levels and concepts. We selected “NTHU MOOCs,” a platform developed by National Tsing Hua University in Taiwan, as the research platform. The contributions of this study are as follows:

  1. To the best of our knowledge, the system is the first to use the actor–critic framework of RL to recommend personalized exercises.

  2. The system can encourage certain learning behaviors in students and increase learning effectiveness.

2 Related work

2.1 Deep Q learning-based recommendation systems

Deep Q learning is a type of RL in which deep learning is used to learn actions in an environment. Two types of deep Q network (DQN) architectures are used. In the first, only the state is used as the input, and the Q values of all actions are the output; this architecture cannot handle a large action space. In the second, the state and action are fed as inputs; this architecture does not need to store the Q values of all actions in memory and can thus handle a large action space. However, because it must compute a separate Q value for each candidate action, it has high computational complexity. To combine the advantages of both architectures, Peters proposed the actor–critic framework [6]. Artificial intelligence (AI) systems can use this framework to learn, for example, how to delay the fall of a pole on a cart in the Cart–Pole challenge or how to swing a baseball bat. RL has been used in various fields, such as gaming [7] and robotics [8], and it has also been used in recommendation systems to account for users’ feedback [9]. Peng employed a comparative method to analyze and infer the coupling between recurrent neural networks and used continuous attractors to evaluate the effectiveness of smart learning [10]. Nima and Ahmad [11] proposed a Q learning-based framework that uses Web data for recommendations. Hasan et al. developed a Q learning-based parking recommendation system that recommends nearby parking spots. DQNs are widely used in recommendation systems, and the actor–critic framework outperforms the two aforementioned architectures; therefore, we selected the actor–critic framework as the agent of our recommendation system.
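For illustration, the two DQN designs described above can be sketched as follows. This is a minimal PyTorch sketch; the layer sizes and class names are our own assumptions rather than an implementation from any cited work.

```python
# Hypothetical sketch of the two DQN input/output designs discussed above (PyTorch).
import torch
import torch.nn as nn

class StateOnlyDQN(nn.Module):
    """Takes only the state and outputs a Q value for every action at once."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, state):            # (batch, state_dim) -> (batch, n_actions)
        return self.net(state)

class StateActionDQN(nn.Module):
    """Takes a (state, action) pair and outputs a single Q value for that pair."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, state, action):    # must be called once per candidate action
        return self.net(torch.cat([state, action], dim=-1))
```

The first design scores all actions in one pass but must enumerate them in its output layer; the second handles arbitrary action spaces but pays a per-action scoring cost, which is the trade-off the actor–critic framework is meant to resolve.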

2.2 Recommendation systems for online learning

Recommendation systems developed into an independent research field in the 1990s [12]. As the number of options available to users has increased, so has the importance of recommendation systems that facilitate decision-making. Several methods have been proposed for recommending learning exercises. Recommendation systems for MOOCs intelligently suggest actions to learners [13]. Collaborative filtering (CF) is a traditional technique used in recommendation systems; it filters the items that a user (student) might require on the basis of the learning processes of similar users. A common method of learning exercise recommendation involves identifying users with similar answering processes. Imran et al. [14] developed a framework for recommending learning materials on the basis of content similarity. To design their recommendation system, Rama et al. used a deep autoencoder for feature learning; specifically, they embedded the features of the autoencoder into a novel discriminative deep neural network model [15].

In content-based filtering (CBF), previously used content is analyzed, and a recommendation mechanism is constructed from the features of these items. Sivaramakrishnan et al. proposed a hybrid Bayesian stacked denoising autoencoder that uses interest analysis and matrix factorization to perform collaborative filtering and provide high-quality personalized recommendations [16]. Huang et al. [17] used CBF and natural language processing to filter out similar courses.

DQNs are also used in MOOC recommendation systems. Huang et al. [18] proposed a DQN-based system for recommending learning exercises to students. In this system, states and actions are used as inputs: states are the learning exercises that students have solved previously, and actions are the exercises recommended to students. RL models learn from feedback; however, obtaining feedback from the real world or a simulated environment is difficult. Therefore, Huang et al. used students’ exercise logs as feedback. They tested their model on a math course and a Java course and compared it with other models; the DQN-based model outperformed the others. In the math course, their model achieved NDCG@10 and NDCG@15 values of 0.6114 and 0.7813, respectively, and in the Java course, it achieved NDCG@10 and NDCG@15 values of 0.4538 and 0.5907, respectively. However, when states and actions are used as inputs, a considerable amount of time is required to compute the Q values of all actions. Therefore, we used the actor–critic architecture to construct a recommendation model based on course logs from prior years. The model sends recommendations through a LINE chatbot to new students so that feedback can be collected.

3 System architecture

This section introduces the architecture of the MOOC exercise recommendation system (MOOCERS) and describes the problem statement of this system. Figure 1 illustrates the architecture of the MOOCERS, which contains an exercise module and a recommendation module.

Fig. 1 Architecture of the MOOCERS

3.1 NTHU MOOCs platform

The NTHU MOOCs platform [19] has been the main MOOC platform of National Tsing Hua University since 2019. This platform provides not only an efficient learning and teaching environment for online lecturers and learners but also valuable supplementary resources, such as self-study resources, a performance visualization function, and knowledge maps.

3.2 Exercise module

The exercise module determines the knowledge concept encoding, exercise type, and exercise difficulty and then concatenates these three vectors. Exercises can be single-choice, multiple-choice, or fill-in-the-blank exercises and are transformed into vectors through one-hot encoding [20]; the knowledge concept encoding is transformed using the same method. With regard to exercise difficulty, students are first divided into high- and low-scoring groups on the basis of their accuracy during exercises. The accuracy of each exercise is then calculated from the accuracies of these two groups, and the exercise difficulty is computed from this final accuracy. The accuracy of an exercise is calculated using Formula (1):

$$R_{{{\text{correct}}}} = \frac{{\mathop \sum \nolimits_{i = 1}^{N} C_{i} }}{{\mathop \sum \nolimits_{i = 1}^{N} A_{i} }}$$
(1)

where \(R_{{{\text{correct}}}}\) is the accuracy of an exercise, \(C_{i}\) is the number of correct answers provided by student \(i\) in the exercise, and \(A_{i}\) is the total number of answers provided by student \(i\) in the exercise. For example, a student who provides five correct answers out of 13 total answers has an accuracy of 5/13 ≈ 0.38 (i.e., 38%). If the difficulty level were calculated directly from each student’s accuracy, problems would emerge: when the percentages of correct answers for two questions are the same, it is impossible to determine which group of students provided the higher number of correct answers.

Thus, high- and low-scoring groups are considered to calculate the degree of difficulty instead of the percentage of correct answers. The high-scoring group, low-scoring group, and exercise difficulty are defined as follows:

  • High-scoring group: students with the top 27% accuracy.

  • Low-scoring group: students with the bottom 27% accuracy.

  • Exercise difficulty: (average accuracy of the high-scoring group + average accuracy of the low-scoring group)/2.

On the basis of Kelley’s derivation for a normal distribution, the top and bottom 27% of accuracies are used to categorize students [21]. The total accuracy of an exercise is the average of the mean accuracies of the high- and low-scoring groups. After the exercise difficulty is determined, it is concatenated with the exercise type and knowledge concept vectors.
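As a concrete illustration, the difficulty computation described above can be sketched as follows. This is a minimal Python sketch; the data structures and function name are assumptions for illustration, not the platform’s implementation.

```python
# Sketch of the exercise-difficulty computation using Kelley's 27% rule.
def exercise_difficulty(overall_accuracy, exercise_correct):
    """overall_accuracy: {student_id: overall accuracy across all exercises}
       exercise_correct: {student_id: 1 if this exercise was answered correctly, else 0}"""
    # Rank students who attempted this exercise by their overall accuracy
    ranked = sorted(exercise_correct, key=lambda s: overall_accuracy[s], reverse=True)
    k = max(1, round(0.27 * len(ranked)))          # Kelley's 27% rule
    high_group, low_group = ranked[:k], ranked[-k:]

    def group_accuracy(group):
        return sum(exercise_correct[s] for s in group) / len(group)

    # Exercise difficulty = average of the two groups' mean accuracies on this exercise
    return (group_accuracy(high_group) + group_accuracy(low_group)) / 2
```

The returned value is the aggregated accuracy defined in the bullet list above; a lower value corresponds to a harder exercise.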

3.3 Recommendation module

3.3.1 Problem statement

In a MOOC environment, several students (U) and exercises (N) exist. Information on students’ exercise-answering processes is obtained using the data collection application programming interface (API) introduced in Sect. 3.2. The exercise-answering process of a student is given by u = {(\(e_{1}\), \(P_{1}\)), (\(e_{2}\), \(P_{2}\)), …, (\(e_{S}\), \(P_{S}\))}, where u \(\in\) U and \(e_{n}\) \(\in\) N. Parameter \(P_{t}\) represents a student’s answer to exercise t: if the student provides the correct answer, \(P_{t}\) is equal to 1; otherwise, \(P_{t}\) is equal to 0. A course has a total of N exercises, and each exercise is described by the tuple e = {d, k, t}, where d denotes the exercise difficulty, k denotes the corresponding knowledge concept of the exercise, and t denotes the exercise type.

The Markov decision process (MDP) is used in the recommendation process. The MDP involves states, actions, and rewards and is represented by the tuple (S, A, R, P), which is described as follows [20]:

  • State Space S: Parameter S represents the exercise-answering process of a student. State \(S_{t}\) = {\(S_{t}^{1}\),… \(S_{t}^{P}\)} \(\in\) S represents the exercise-answering process at time t. Each element in \(S_{t}\) is the concatenation of e = {d, k, t} and \(P_{t}\).

  • Action Space A: Parameter A represents all the exercises of the course. Action \(A_{t}\)  = {\(A_{t}^{1}\),… \(A_{t}^{N}\)} \(\in\) A contains all the exercises. Making an exercise recommendation at time t is equivalent to taking an action \(A_{t}^{1}\).

  • Reward R: When the recommender agent takes action \(A_{t}^{1}\) at time t, the student clicks on or answers the recommended exercise. The recommender agent then receives reward R(\(S_{t}\), \(A_{t}\)) on the basis of the student’s feedback.

  • Transition P: Parameter P(\(S_{t + 1} \mid S_{t} , A_{t}\)) refers to the probability of transitioning from state \(S_{t}\) to state \(S_{t + 1}\) after taking action \(A_{t}\).

Rewards can be maximized by using recommendation policy \(\pi : S \to A\).
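The MDP elements above can be illustrated with a minimal Python sketch; the field and type names are assumptions for illustration, not the platform’s schema.

```python
# Illustrative data structures for the MDP described in Sect. 3.3.1.
from dataclasses import dataclass
from typing import List

@dataclass
class Exercise:
    difficulty: float          # d
    concepts: List[int]        # k, knowledge-concept indices
    etype: int                 # t, exercise type

@dataclass
class Step:
    exercise: Exercise
    correct: int               # P_t: 1 if answered correctly, 0 otherwise

State = List[Step]             # the exercise-answering history S_t
Action = Exercise              # recommending an exercise corresponds to taking an action

def transition(state: State, action: Action, correct: int) -> State:
    """Append the newly answered exercise to the history to form S_{t+1}."""
    return state + [Step(action, correct)]
```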

3.4 Reward function design

This section describes the design of the reward function, which plays a crucial role in the training of the proposed framework. Traditional recommendation systems follow a single rule, such as CF [17] or CBF [18], to recommend learning exercises. However, such systems cannot recommend suitable exercises for students, and students are not fully satisfied with them. Difficulty and knowledge concept coverage are crucial attributes of exercises [22]. Therefore, we combined three objectives, namely the review, difficulty, and learning objectives, in our reward function design [20].

(1) Design for the review objective. In Eq. (2), \(\gamma_{1}\) corresponds to the review objective. If a student answers an exercise incorrectly at a certain time but the recommender agent then recommends an exercise with a completely different knowledge concept, the recommendation is assigned a negative reward.

$$R_{1} = \left\{ \begin{array}{ll} \gamma_{1}, & {\text{if}}\;P_{t} = 0\;{\text{and}}\;K_{t + 1} \cap K_{t} = \emptyset \\ \gamma_{2}, & {\text{otherwise}} \end{array} \right.$$
(2)

Parameter \(\gamma_{1}\) denotes a negative reward, and we set \(\gamma_{1}\) to − 2 in our training scenario. Parameter \(\gamma_{2}\) denotes a positive reward, and we set \(\gamma_{2}\) to 1 in our training scenario.

(2) Design for the difficulty objective. If a student has only learned the definition of a limit in a calculus course but the recommender agent recommends an exercise on integration by parts, the student would find the exercise too difficult. Therefore, the squared loss function in Formula (3) is used to meet the difficulty objective:

$$R_{2} = - \left( d_{t} - d_{t + 1} \right)^{2}$$
(3)

where \(d_{t}\) is the difficulty of the exercise answered at time t and \(d_{t + 1}\) is the difficulty of the exercise recommended at time t + 1.

(3) Design for the learning objective. The recommender agent recommends exercises to students through LINE. In Formula (4), if a student clicks on and answers a recommended exercise, a positive reward is generated; otherwise, a negative reward is generated.

$$R_{3} = \left\{ \begin{array}{ll} \gamma_{3}, & {\text{if the user clicks and answers}} \\ \gamma_{4}, & {\text{otherwise}} \end{array} \right.$$
(4)

Parameter \(\gamma_{3}\) denotes a positive reward, whereas \(\gamma_{4}\) denotes a negative reward.

The three rewards are merged into a total reward by using balance coefficients \(\alpha_{1}\), \(\alpha_{2}\), and \(\alpha_{3}\) as follows:

$$R = \alpha_{1} R_{1} + \alpha_{2} R_{2} + \alpha_{3} R_{3}, \quad \alpha_{1} ,\alpha_{2} ,\alpha_{3} \in \left[ {0,1} \right]$$
(5)

In Eq. (5), the values of \(\alpha_{1}\), \(\alpha_{2}\), and \(\alpha_{3}\) are between 0 and 1. This equation provides a flexible method for adjusting our recommendation system; different balance coefficients can be used to achieve different goals. For instance, to focus on difficulty, \(\alpha_{1}\) and \(\alpha_{3}\) can be set to 0; alternatively, to focus on knowledge concepts, \(\alpha_{2}\) and \(\alpha_{3}\) can be set to 0.
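A minimal sketch of the reward design in Eqs. (2)–(5) is given below. The \(\gamma_{1}\) and \(\gamma_{2}\) values follow the training scenario described above, whereas the \(\gamma_{3}\), \(\gamma_{4}\), and \(\alpha\) defaults are illustrative placeholders.

```python
# Sketch of the three reward terms and their weighted combination.
def review_reward(correct_t, concepts_t, concepts_t1, gamma1=-2.0, gamma2=1.0):
    # Eq. (2): penalize recommending a disjoint concept after a wrong answer
    if correct_t == 0 and not (set(concepts_t) & set(concepts_t1)):
        return gamma1
    return gamma2

def difficulty_reward(d_t, d_t1):
    # Eq. (3): squared loss on the difficulty gap between consecutive exercises
    return -(d_t - d_t1) ** 2

def learning_reward(clicked_and_answered, gamma3=1.0, gamma4=-1.0):
    # Eq. (4): positive reward if the student clicks and answers the pushed exercise
    # (gamma3/gamma4 values here are placeholders, not reported settings)
    return gamma3 if clicked_and_answered else gamma4

def total_reward(r1, r2, r3, alpha1=1.0, alpha2=1.0, alpha3=1.0):
    # Eq. (5): weighted combination with balance coefficients in [0, 1]
    return alpha1 * r1 + alpha2 * r2 + alpha3 * r3
```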

3.5 LINE platform (API)

An API provides a set of functions that applications and websites can call. LINE is a popular instant messaging application with a comprehensive public API [23]. It can be installed on various platforms, such as smartphones, tablets, and personal computers. LINE provides a messaging API, which enabled us to develop a two-way communication service between a LINE chatbot and LINE users.

3.6 Actor–critic framework

The actor–critic framework combines the advantages of critic-only and actor-only methods [24] and can produce continuous actions. The input of the actor is the current state, and its output is the parameters of a state-specific scoring function. The recommender agent then scores candidate items and selects among them. The critic learns the value function (Q value) to determine whether the selected action matches the current state. Finally, the actor updates its policy parameters in subsequent iterations on the basis of the critic’s judgment. The actor–critic framework is appropriate for large action spaces, such as that of our exercise recommendation system. The critic is used to learn a value function approximation [25]; the function Q(\(S_{t}\), \(A_{t}\)) judges whether action \(A_{t}\) matches state \(S_{t}\). The critic is used to obtain an approximate solution to the Bellman equation [26], as shown in Formula (6):

$$Q^{*} (S_{t} ,A_{t} ) = E_{{S_{t + 1} }} \left[ r_{t} + \gamma \mathop {{\text{max}}}\limits_{{A_{t + 1} }} Q^{*} (S_{t + 1} ,A_{t + 1} ) \mid S_{t} ,A_{t} \right]$$
(6)

In a real-world scenario, the state and action spaces are large, and computing all state–action pairs is highly difficult. In addition, not all state–action pairs appear in real-world scenarios. The MOOC environment considered in this study contains numerous students, but some students might complete only a few exercises. Therefore, updating the relevant state–action pairs and calculating the transition probabilities for exercises that students do not complete are difficult. The action-value function is nonlinear, and deep neural networks provide accurate approximations of nonlinear functions. In accordance with the method in [27], we used a deep neural network to address this problem and approximated the function as Q(S, A) ≈ Q(S, A; \(\theta\)). Our training procedure minimizes the loss function L(\(\theta_{\mu }\)) in Formula (7):

$$\begin{gathered} L(\theta_{\mu } ) = E_{{S_{t} ,A_{t} ,R_{t} ,S_{t + 1} }} \left[ (Y_{t} - Q(S_{t} ,A_{t} ;\theta_{\mu } ))^{2} \right], \hfill \\ {\text{where}}\quad Y_{t} = E_{{S_{t + 1} }} \left[ R_{t} + \gamma Q^{{\prime }} (S_{t + 1} ,A_{t + 1} ;\theta_{\mu } ) \mid S_{t} ,A_{t} \right] \hfill \\ \end{gathered}$$
(7)
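The critic update implied by Eqs. (6) and (7) can be roughly sketched as follows. This is a PyTorch sketch under our own assumptions about the network interfaces and the discount factor; it is not the authors’ exact implementation.

```python
# Sketch of the critic loss L(theta_mu) from Eq. (7) with a target actor and target critic.
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, target_actor, batch, gamma=0.99):
    """batch: tensors (states, actions, rewards, next_states) sampled from memory M."""
    states, actions, rewards, next_states = batch
    with torch.no_grad():
        next_actions = target_actor(next_states)                  # A_{t+1} from the target policy
        # Y_t = R_t + gamma * Q'(S_{t+1}, A_{t+1}); gamma is an assumed discount factor
        y = rewards.view(-1, 1) + gamma * target_critic(next_states, next_actions)
    q = critic(states, actions)                                   # Q(S_t, A_t; theta_mu)
    return F.mse_loss(q, y)                                       # squared error in Eq. (7)
```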

4 Implementation

This section describes the algorithms of the proposed MOOCERS.

4.1 Data collection

By developing RESTful APIs, we established a connection between the databases and the course website. We conducted experiments with a dataset collected from a real MOOC environment to evaluate the performance of the proposed deep RL framework. The experimental dataset was collected from the NTHU MOOCs platform (Table 1). We used data from a calculus course taught in 2020 as training data to construct a recommendation model. We then used the trained recommendation model to send recommendations to students in a calculus course taught in 2021. To send these recommendations, we asked the students to connect with our LINE bot; approximately 700 students (approximately 60% of all students) chose to do so. The course lasts for 12 weeks, and its topics range from limits to L’Hopital’s rule and improper integrals. The course comprises 143 exercises, each of which corresponds to a specific concept; the total number of concepts is 108.

Table 1 Courses

The exercise logs were randomly split into a training set and testing set at a ratio of 8:2, as expressed in Eq. (8).

$$D_{training} :D_{testing} = 0.8:0.2$$
(8)

\(D_{{{\text{training}}}}\) represents the data used for model training, and \(D_{{{\text{testing}}}}\) represents the data used for model testing.
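For reference, the 8:2 split in Eq. (8) can be sketched as follows; the fixed seed is an assumption added for reproducibility.

```python
# Sketch of the random 8:2 split of exercise logs into training and testing sets.
import random

def split_logs(exercise_logs, train_ratio=0.8, seed=42):
    logs = list(exercise_logs)
    random.Random(seed).shuffle(logs)
    cut = int(train_ratio * len(logs))
    return logs[:cut], logs[cut:]        # D_training, D_testing
```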

Each student in the calculus course can use the exercise recommender agent, but its use is optional. The binding ratio for the “2021 Calculus (I)” course was approximately 60%.

Table 2 presents the data recorded for each exercise. Every user, course, chapter, and exercise was assigned a unique ID (i.e., “userId,” “courseId,” “chapterId,” and “exerId,” respectively). When a user completed a question, we recorded their answer, the correct answer, and their score, using “True” and “False” to denote correct and incorrect answers, respectively. The time spent answering each question was also recorded.

Table 2 Exercise data
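An illustrative record layout for one log entry is sketched below; the ID fields mirror those named in the text, whereas the remaining field names are assumptions rather than the platform’s actual schema.

```python
# Hypothetical structure of a single exercise log entry (cf. Table 2).
from dataclasses import dataclass

@dataclass
class ExerciseLog:
    userId: str
    courseId: str
    chapterId: str
    exerId: str
    userAnswer: str        # the answer the student submitted
    correctAnswer: str     # the expected answer
    isCorrect: bool        # stored as "True" / "False" in the raw data
    score: float
    timeSpentSec: float    # time spent answering the question
```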

4.2 Personalized recommendations

In RL methods, data are collected by constructing an agent in an environment. In classical applications such as robotics, this process can be time-consuming and costly [28]; therefore, simulation is often used as an alternative for RL. However, simulated data cannot be used to obtain accurate recommendations in our system because the MOOC environment is more complex than robotics and gaming environments. Data for the system can only be obtained from real-world environments, such as MOOC platforms. In addition, the system can obtain rewards only from an explored state space; to obtain a reward from an unexplored space, the system must recommend an exercise to a student and obtain their feedback. To solve this problem, we used students’ exercise logs. The procedure is presented in Algorithm 1.

(1) First, memory space M = {\(M_{1}\), \(M_{2}\)…} is constructed to store a student’s state–action–reward pair ((\(S_{t}\), \(A_{t}\)), \(R_{t}\)).

(2) The current state (line 4), current actions (exercises), and rewards are observed and stored in the memory (line 7).

(3) The state space is updated by adding the action to the end of the state space. For instance, if the recommender agent recommends {\(A_{1}\), \(A_{2}\), \(A_{3}\)} to a student and the student clicks the \(A_{1}\) exercise, the state space is updated to {\(S_{1}\), \(S_{2}\), …, \(A_{1}\)} (Fig. 2).

Fig. 2 Online memory algorithm
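A rough sketch of the online memory procedure (Algorithm 1) is given below; the class and function names are illustrative, not the authors’ code.

```python
# Sketch of the online memory: store (state, action, reward) tuples built from exercise
# logs and extend the state with the exercise the student actually clicked.
class OnlineMemory:
    def __init__(self):
        self.buffer = []                       # M = {M_1, M_2, ...}

    def store(self, state, action, reward):
        self.buffer.append((list(state), action, reward))

def update_state(state, recommended, clicked):
    """If the student clicks exercise A_i from the recommended list, append it to the state."""
    if clicked in recommended:
        return state + [clicked]               # {S_1, ..., S_t, A_i}
    return state
```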

4.2.1 Training procedure

This section describes the training procedures of the algorithm and the parameter update process. The deep deterministic policy gradient algorithm was used for model training. This algorithm is presented in Algorithm 2 [20].

In every iteration (line 4), the recommender agent recommends a list of actions (exercises) \(A_{t}\) = {\(A_{t}^{1}\),… \(A_{t}^{K}\)} based on the current state \(S_{t}\) (line 5). The recommender agent then observes the reward list \(R_{t}\) = {\(R_{t}^{1}\),… \(R_{t}^{K}\)} (line 6) and obtains the new state \(S_{t + 1}\) (line 7) by using Algorithm 1. The transition is stored in memory M (line 8), and state \(S_{t}\) is updated to \(S_{t + 1}\) (line 9). Next, the recommender agent executes the parameter update procedure by sampling a minibatch of transitions (S, A, R, \(S^{\prime}\)) from M (line 10) and updating the parameters of the actor–critic network (lines 11–17). The network sizes of the actor and critic are presented in Table 3. For the actor, because the state of a student at time t comprises a series of exercise-answering steps, a GRU is used to process the series and predict the exercise that should be recommended; specifically, 105 output nodes are used to represent the exercises, and the node with the highest score represents the exercise that should be recommended. For the critic, a final dense layer with a single node is used to predict the reward. The learning rate of the actor is 0.0001, and that of the critic is 0.001. When the networks are updated, only 0.1% of the actor and critic parameters are changed at each step (a soft update) (Fig. 3).

Table 3 Network sizes of the actor and critic
Fig. 3 Deep deterministic policy gradient algorithm
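The GRU actor and the 0.1% soft update described above can be sketched as follows. This is a PyTorch sketch; the hidden size and default arguments are assumptions, and only the 105-node output layer and the 0.1% update fraction come from the text.

```python
# Sketch of a GRU-based actor and the soft (0.1%) parameter update.
import torch
import torch.nn as nn

class GRUActor(nn.Module):
    def __init__(self, step_dim: int, hidden: int = 64, n_exercises: int = 105):
        super().__init__()
        self.gru = nn.GRU(step_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_exercises)     # one score node per exercise

    def forward(self, history):                        # (batch, time, step_dim)
        _, h = self.gru(history)
        return self.head(h[-1])                        # highest-scoring node = recommended exercise

def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.001):
    # Move each target parameter 0.1% toward the trained network at every update step
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_((1.0 - tau) * t_param.data + tau * s_param.data)
```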

4.2.2 Offline evaluation

A successful recommendation can be made in two situations depending on the reward setting: one in which the correct answer is obtained for \(e_{t}\) and the other in which an incorrect answer is obtained for \(e_{t}\) [20].

(1) Correct answer for \(e_{t}\).

When a student answers an exercise that covers a certain knowledge concept correctly, the recommender agent can recommend more difficult questions with similar knowledge concepts. If at least half of the knowledge concepts of \(e_{t}\) overlap with those of \(e_{t + 1}\) and the difficulty of \(e_{t + 1}\) is higher than that of \(e_{t}\), the recommendation is successful.

(2) Incorrect answer for \(e_{t}\).

When a student answers an exercise incorrectly, the student must continue with exercises on similar concepts and at a similar level of difficulty. If the difference in difficulty between the recommended exercise (\(e_{t + 1}\)) and \(e_{t}\) is lower than 0.2 and at least half of the concepts of \(e_{t}\) overlap with those of \(e_{t + 1}\), the recommendation is successful.
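The two success criteria can be combined into a single check, sketched below under the assumption that each exercise is represented by its difficulty and concept set (as in the e = {d, k, t} tuple of Sect. 3.3.1).

```python
# Sketch of the offline success check for a recommendation e_{t+1} following e_t.
def successful_recommendation(e_t, e_t1, answered_correctly):
    overlap = len(set(e_t["concepts"]) & set(e_t1["concepts"]))
    enough_overlap = overlap >= len(e_t["concepts"]) / 2       # at least half the concepts shared
    if answered_correctly:
        # Correct answer: recommend a harder exercise on similar concepts
        return enough_overlap and e_t1["difficulty"] > e_t["difficulty"]
    # Incorrect answer: stay on similar concepts at a similar difficulty
    return enough_overlap and abs(e_t1["difficulty"] - e_t["difficulty"]) < 0.2
```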

4.2.3 Exercise completion rate

The exercise completion rate (ECR) represents the ratio of the number of exercises solved by a student to the total number of exercises. The ECR of each student is calculated using Formula (9):

$${\text{Completion}}\_{\text{Rate}}_{e} = \frac{{{\text{Number}}\;{\text{of}}\;{\text{solved}}\;{\text{exercises}}}}{{{\text{Total}}\;{\text{number of exercises}}}}$$
(9)

4.2.4 Hit rate

In typical recommender settings, the hit rate is the proportion of users for whom a relevant item is included in the recommendation list. In this study, the hit rate is calculated using Formula (10):

$${\text{Hit}}\_{\text{Rate}}_{{{\text{exercise}}}} = \frac{{{\text{Number}}\;{\text{of}}\;{\text{clicks}}}}{{{\text{Total}}\;{\text{pushed messages}}}}$$
(10)

The number of clicks represents the total number of students who clicked on the recommended exercises, and the total pushed messages represent the total number of exercises pushed by the recommender agent.
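Equations (9) and (10) translate directly into code; the following sketch simply mirrors the two ratios.

```python
# Direct translations of Eqs. (9) and (10); inputs are simple counts.
def exercise_completion_rate(solved_exercises: int, total_exercises: int) -> float:
    return solved_exercises / total_exercises            # Eq. (9)

def hit_rate(clicks: int, pushed_messages: int) -> float:
    return clicks / pushed_messages                      # Eq. (10)
```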

4.2.5 Recommendation through LINE

We examined the hit rate, the ECR, and students’ satisfaction with the system. Students were divided into a control group and an experimental group: the recommender agent recommended exercises to the experimental group, whereas the control group received randomly selected exercises. Screenshots of pushed messages on LINE are presented in Fig. 4. For example, the recommender agent determined that exercises W1 Ex01 Q1 and W2 Ex23 25 Q1 were the most suitable exercises for students A and B, respectively. Students received recommended exercises through LINE, completed them after clicking on the links, and then received their results.

Fig. 4 Recommendation message received through a chatbot

5 Results and discussion

5.1 Accuracy of the proposed model

A recommendation system usually outputs a ranked list. Therefore, instead of proxy metrics such as the mean squared error, evaluation metrics that capture ranking quality should be used. Commonly used metrics for recommendation systems are top-K ranking metrics [29, 30], including NDCG@K [31] and Hit Rate@K. The order of recommendations is crucial when evaluating a recommendation list, and DCG assigns a penalty to recommendations that appear later in the list, as expressed in Eq. (11). Although DCG considers the order of a recommendation list, it does not indicate whether the list is appropriate. Therefore, NDCG, which is the ratio of the actual DCG to the ideal DCG, is calculated [Eq. (12)]; the closer the recommendation list is to the ideal list, the closer NDCG is to 100%. Huang et al. [18] used a standard DQN framework to recommend exercises to students and achieved NDCG@10 scores of 0.6114 and 0.4538 in a math course and a Java course, respectively, in offline testing. Because we treated offline training as a ranking task, we used NDCG@K for offline performance evaluation and the hit rate for online performance evaluation. As shown in Table 4, the NDCG@5, NDCG@8, and NDCG@10 values of the deep RL model were 0.65, 0.665, and 0.684, respectively. For NDCG@k, the model selects the top k exercises on the basis of their Q values and then calculates the scores under the reward settings described in Sect. 4.2.2: when the correct answer is obtained, the recommendation is considered successful if the number of overlapping concepts between \(e_{t}\) and \(e_{t + 1}\) is at least half and the difficulty of \(e_{t + 1}\) is higher than that of \(e_{t}\); when an incorrect answer is obtained, the recommendation is considered successful if the difference in difficulty between \(e_{t}\) and \(e_{t + 1}\) is lower than 0.2.

Table 4 NDCG values obtained in this study

We will continue to collect feedback from users on the chatbot and hope to encourage students to continue completing the exercises.

$${\text{DCG}}_{k} = \mathop \sum \limits_{i = 1}^{k} \frac{{r_{i} }}{{\log_{2} \left( {i + 1} \right)}}$$
(11)
$${\text{NDCG}}_{k} = \frac{{{\text{DCG}}_{k} }}{{{\text{IDCG}}_{k} }}$$
(12)
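The following is a minimal sketch of DCG@k and NDCG@k as defined in Eqs. (11) and (12), where each relevance value \(r_{i}\) is taken to be 1 for a successful recommendation at rank i and 0 otherwise.

```python
# Sketch of the DCG and NDCG computations used for offline evaluation.
import math

def dcg_at_k(relevances, k):
    # i starts at 0 here, so log2(i + 2) corresponds to log2(rank + 1) in Eq. (11)
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)   # ideal ordering
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```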

5.2 ECR

Exercises are a crucial learning resource in MOOCs. Learning through MOOCs is flexible, and few students proactively complete exercises. For this reason, one goal of developing the system was to increase the ECR. The ECR of the students who used our system (89.97%) was higher than that of the students who did not (47.23%; Table 4; Fig. 5). We divided the students who used our system into the following four groups on the basis of their ECRs:

  1. Very low group: ECR < 1%

  2. Low group: 1% ≤ ECR < 25%

  3. Middle group: 25% ≤ ECR < 50%

  4. High group: 50% ≤ ECR.

Fig. 5 ECRs of students who used and did not use the system by number of uses

The ECR of the students in the high group increased from 68.69% to 94.72% (Fig. 6). The ECRs of the very low and low groups increased from 0.1% to 90.07% and from 17.46% to 87.07%, respectively, and the ECR of the middle group increased from 56.7% to 89.26%. Therefore, our system considerably increased the ECR, especially for the very low, low, and middle groups.

Fig. 6 ECRs of groups by number of uses

5.3 Hit rate

In our experiment, we distributed messages four times; 520, 640, 694, and 738 students accessed them through LINE each time, respectively. For the fourth message (Table 5), 40 (5.4%), 69 (9.3%), 88 (11.9%), and 105 (14.2%) students completed the recommended exercises within 4 h of receiving the message, by the end of the day of receiving the message, by the day after receiving the message, and within 2 days after receiving the message, respectively.

Table 5 Number of students in each group who completed the recommended exercises

5.4 Midterm score

Students who exhibited higher learning effectiveness spent more time on the NTHU MOOCs platform completing exercises. Thus, learning behavior is correlated with learning effectiveness, and students with a higher ECR tended to receive higher midterm scores (Fig. 7). The students who used and did not use the system had average midterm scores of 64.7 and 58.2, respectively.

Fig. 7 Midterm scores by ECR

5.5 Students’ assessment of the system

We distributed an online questionnaire to the students who used the system to obtain their feedback; a total of 227 valid questionnaires were collected. The questionnaire was based on the self-regulated learning strategy proposed by Zimmerman [32, 33]. The students responded to the questions by using a Likert scale [34], the most widely used scale in survey research; the scale ranges from 1 to 5, with 1 representing “strongly disagree” and 5 representing “strongly agree.” The questionnaire contained 20 questions on usability, usefulness, attitude toward learning, perceived value, and metacognition.

Approximately 85% of the students provided a score of more than 4 points for usability, indicating that they found the system easy to use and clear. Approximately 50% provided a score of more than 4 points for usefulness, indicating that the system recommended helpful exercises and increased the students’ willingness to complete exercises. Approximately 76% provided a score of more than 4 points for attitude toward learning, indicating that the system motivated the students to learn and enabled them to answer exercises more efficiently. Approximately 90% provided a score of more than 4 points for perceived value, indicating that they were satisfied with the system. Approximately 82% provided a score of more than 4 points for metacognition, indicating that the system helped them understand key concepts and consider feasible learning methods.

6 Conclusion

To the best of our knowledge, the proposed system is the first to use the actor–critic framework to provide personalized exercise recommendations on a MOOC platform. Our system recommends exercises with suitable difficulty levels and concepts for each student; the NDCG@10 value of the proposed model is 0.684. In contrast to other systems, our system was integrated into LINE to provide personalized exercise recommendations; thus, students did not need to log in to the platform and could receive recommendations from LINE in the form of text messages. Evaluating the effectiveness of recommendations in the real world is difficult. We examined the hit rate of each recommendation provided by our system and found that 40% of the participating students answered the exercises in the final recommendation. The system can encourage certain learning behaviors in students: those who used the system exhibited an ECR of 89.97%, compared with 47.23% for those who did not. The system can also increase learning effectiveness: the students who used and did not use the system exhibited average midterm scores of 64.73 and 58.21, respectively. The questionnaire revealed that 88.2% of the students who used the system intended to use it again and that 89.5% were satisfied with it. In the future, we intend to combine our system with natural language processing methods to make our LINE bot fully interactive.