Dynamic multi-objective sequence-wise recommendation framework via deep reinforcement learning

Sequence-wise recommendation, which recommends exercises to each student step by step, is one of the most exciting tasks in intelligent tutoring systems (ITS). It is important to develop a personalized sequence-wise recommendation framework that immerses students in learning and helps them acquire as much necessary knowledge as possible, rather than merely focusing on providing non-mastered exercises, which amounts to optimizing a single objective. However, due to students' differing knowledge levels and the large scale of exercise banks, it is difficult to generate a personalized exercise recommendation for each student. To fully exploit the multifaceted beneficial information collected from e-learning platforms, we design a dynamic multi-objective sequence-wise recommendation framework via deep reinforcement learning, i.e., DMoSwR-DRL, which automatically selects the most suitable exercise for each student based on well-designed domain-objective rewards. Within this framework, the interaction between students and exercises can be explicitly modeled by integrating the actor-critic network and the state representation component, which greatly helps the agent perform effective reinforcement learning. Specifically, we carefully design a state representation module with a dynamic recurrent mechanism, which integrates concept information and exercise difficulty level, thus generating a continuous state representation of the student. Subsequently, a flexible reward function is designed to simultaneously optimize the four domain-specific objectives of difficulty, novelty, coverage, and diversity, providing students with a trade-off sequence-wise recommendation. To set up the online evaluation, we test DMoSwR-DRL in a simulated environment that can model the qualitative development of a student's knowledge level and predict their performance on a given exercise.
Comprehensive experiments are conducted on four classical exercise-answer datasets, and the results show the effectiveness and advantages of DMoSwR-DRL in terms of recommendation quality.


Introduction
With the rise of the Internet and online education, data-driven intelligent education services, such as massive open online courses (MOOCs) [1], give students open access to world-class instruction and unlimited learning materials (e.g., exercises). However, MOOCs lack the ability to capture learners' dynamic knowledge evolution, which directly leads to inflexible online learning content and high dropout rates in practice.
In recent decades, personalized recommendation systems have received significant attention in academia; they help learners acquire knowledge by analyzing individual learning differences rather than leaving them to search on their own. Representative conventional recommendation methods, such as collaborative filtering (CF) [2], matrix factorization (MF) [3], and their variants [4], have been widely applied in intelligent e-learning scenarios. However, traditional recommendation methods only focus on learners' long-term static preferences, ignoring how their preferences shift over time. Thus, traditional methods are hardly practical in the education scenario, because they may result in unsatisfactory recommendations.
Recently, deep learning (DL) [5], which has shown significant potential in various challenging sequential decision-making scenarios, such as learning path recommendation [6,7] and student performance prediction [8,9], has become an essential technique in online education systems. Some researchers have strived to use learning-related contextualized factors plus a long short-term memory (LSTM) network to optimize the top-N recommendation list. Moreover, the work in [10] presented a hybrid approach, which combines the DKVMN [11] model and the deep reinforcement learning technique to improve exercise recommendation accuracy.
Although considerable progress has been made, certain limitations remain. On one hand, most existing exercise recommendation approaches focus on providing a recommended list for each student; in other words, their models are trained on students' offline logged exercising data and remain static after deployment [11,12]. In practice, offline evaluation cannot dynamically make sequential recommendations from instant feedback. On the other hand, solely focusing on a single objective, i.e., the exercise difficulty factor, cannot achieve ideal recommendation results. More specifically, several features inherent in the sequence-wise recommendation task that previous work failed to recognize are as follows: (1) The repeated appearance of knowledge concepts that students have already mastered causes the recommendations to lose pertinence. (2) In practice, an exercise usually contains multiple knowledge concepts; that is to say, a monotonous knowledge type may also lead to one-sided learning. (3) The targeted student gets bored with repeated knowledge concepts that they have gone through several times in past learning.
To address the limitations of previous works, we propose a dynamic multi-objective sequence-wise recommendation framework via deep reinforcement learning, i.e., DMoSwR-DRL. To address the first challenge, inspired by DKT [13], we employ the model to build a simulator as the environment and use it to calculate real-time rewards according to the predicted performance on recommended exercises. To address the second challenge, this research discusses four objectives inherent to the sequence-wise recommendation task that prior work failed to recognize: (1) Difficulty: the difficulty of recommended exercises should correspond to the evolution of students' knowledge mastery. (2) Novelty: concepts with no answers or poor mastery should appear more often in future learning. (3) Coverage: exercises should be chosen to cover as many concepts as possible. (4) Diversity: a learning procedure that consists of more categories of exercises is considered better. Figure 1 depicts the proposed solution in our recommendation framework, which considers difficulty, novelty, coverage, and diversity as four optimization objectives.
To address the dynamic events, reinforcement learning (RL) [14] is an ideal proposal. In essence, the exercise recommending process can be viewed as a Markov decision process (MDP), in which the agent should successively determine the right action; i.e., according to the four educational domain-specific objectives, the recommender (agent) automatically selects the best suited exercise at each step so as to optimize the predefined long-term objectives. As a well-understood RL method, the Q-learning [15] algorithm needs to create a Q-table to estimate all exercise transitions. Clearly, there are too many exercises to choose from in our task, and thus the large state-action space may cause poor convergence. We look for inspiration in the actor-critic (AC) network [16] to solve the above challenges. The main idea behind it is a combination of the value function and the policy gradient algorithm. In general, the actor network is responsible for generating an action according to the student's state, and the critic network learns how to evaluate the agent's actions when interacting with the environment and assists the actor in finding the optimal action strategy. Thus, these two networks work together and are trained simultaneously. The actor network, which does not need an optimal action sample, seeks the optimal action based on the exploration mechanism and the feedback of the critic network [17]. Consequently, the AC model can be applied, as an efficient algorithm, to the field of online exercise recommendation. In summary, the novelties of this work are fourfold:
1. Methodology. We formulate the exercise recommendation task as a Markov decision process and employ an actor-critic algorithm that automatically determines which exercise to recommend next and whose parameters are dynamically updated during the recommendation. In particular, this approach eliminates the need to estimate and store all state-action pairs.
2. State representation.
To effectively leverage historical exercising data in the online scenario, we specially design a novel state representation component for DMoSwR-DRL, which greatly helps the agent perform effective reinforcement learning by explicitly modeling the interactions between students and exercises.
3. Objective. DMoSwR-DRL, to our knowledge, is the first framework that simultaneously trades off four domain objectives, including difficulty, novelty, coverage, and diversity, supporting dynamic multi-objective sequence-wise recommendation.
4. Validation. We build an online simulator as the environment and several metrics to measure the effectiveness of all rewards, and extensive experiments conducted on four real-world datasets show that DMoSwR-DRL achieves satisfactory performance.

Exercise recommendation
Previous studies on the exercise recommendation task can be classified into three categories: collaborative filtering, knowledge-based modeling, and hybrid approaches.

Collaborative filtering
The typical collaborative filtering (CF) method has been widely adopted in recommender systems; it recommends items by considering the similarity among students or exercises [2]. For instance, Toledo et al. [18] built an online programming recommendation platform for students, which jointly performs collaborative filtering and similarity measurement. The matrix factorization (MF) model, as an advanced CF approach, seeks to decompose the student-exercise matrix into a product of two lower-dimensional matrices, representing students' preferences and exercises' properties, respectively [19,20]. Although MF methods achieve more desirable recommendation performance, it is worth noting that they adopt a static view during the recommendation process and ignore the dynamic characteristics of personalized recommendation scenarios.

Knowledge-based modeling
In the domain of educational psychology, many scholars use machine learning and data mining techniques to generate a reasonable recommended list of exercises by capturing students' knowledge states from historical exercising logs [21]. Jiang et al. designed a new exercise recommendation model, which uses a weighted graph of knowledge concept relationships and generates personalized exercise recommendations based on students' knowledge mastery [22]. The work in [23] first acquires a learner's mastery level from their previous exercise records using a cognitive diagnostic technique, and then uses probabilistic matrix factorization for recommendation.

Hybrid approaches
Recently, hybrid recommendation methods have achieved tremendous success; they combine deep learning techniques with conventional methods in different ways. Zhu et al. [23] proposed a joint deep recommendation model based on probabilistic matrix factorization and cognitive diagnosis, termed PMF-CD, which adopts prior parameters to improve the explainability of recommendation. However, this method requires experts to annotate the exercise-knowledge-concept correlation matrix. In addition, Wu et al. [24] designed a novel recommendation approach that can recommend an exercise of a given difficulty without setting the difficulty level of each exercise. Since most previous deep recommendation methods ignore the essential relationships between knowledge points, Lv et al. [25] proposed a weighted knowledge graph recommendation framework, which takes the knowledge concepts weighted by a student's ability as entities, with an arrowed edge between two knowledge concepts representing their prerequisite relationship. The authors in [26] used learning-related contextualized factors plus a personalization mechanism to enhance students' knowledge. That algorithm is capable of exploiting the learner's contextual information to enhance recommendation performance, but it is inefficient because it requires a subjective Q-matrix created manually by domain experts. The work in [10] presented a hybrid approach, which combines the DKVMN [11] model and the deep reinforcement learning technique to improve exercise recommendation accuracy. The work in [20] applies a contextual multi-armed bandit algorithm to recommend exercises, maximizing a student's immediate success, i.e., her performance on the next exercise.

Reinforcement learning for recommendation
As a distinguished direction, deep reinforcement learning (DRL) algorithms have achieved remarkable breakthroughs in many applications, such as Atari games [27] and Go [28].
With the spur of RL techniques, several typical recommendation tasks (e.g., sequential recommendation [29,30], interactive recommendation [31,32], point-of-interest recommendation [33,34], and diversified recommendation [35,36]) have recently been researched quite extensively. Unlike traditional recommendation models, DRL formulates the recommendation task as a sequential decision problem between the users and the agent. Chen et al. [3] developed a generative adversarial network to model a user's entire sequential behavior, where a combinatorial recommendation policy can be obtained by cascading DQNs. The work in [37] designed a hierarchical reinforcement learning model with multi-goal abstraction for consumers, which combines long-term conversion and short-term clicks to explore users' hierarchical purchase interests. Indeed, designing a good state representation has been shown to be a pivotal factor in enhancing reinforcement learning performance [38]. As such, careful design of the state representation of the environment should be a concern. Extending this idea, the work in [30] proposed that both a user's positive and negative feedback should be incorporated into state representation modeling. Liu et al. [39] employed the actor-critic (AC) framework to construct a DRL-based interactive recommendation model and recommend items based on a combination of four state representation mechanisms.
Overall, the above studies suffer from the following problems. On one hand, they only roughly model the student's state with a conventional recurrent neural network, ignoring the effects of other information, such as knowledge concepts and exercise difficulty, which can also yield beneficial recommendations for students. On the other hand, the conflict among multiple objectives is another obstacle to enhancing recommendation quality, and no existing study addresses this issue.

Preliminaries
In this section, we present main preliminaries of DMoSwR-DRL. First, we introduce the problem definition of exercise recommendation. Then, we introduce the basic notations and framework in detail. Subsequently, we summarize the theoretical and practical implications of our study.

Problem definition
In the exercise recommendation scenario, we record a student's response logs s = {(e_1, r_1), (e_2, r_2), ..., (e_t, r_t)}, where e_t ∈ E refers to the one-hot vector of the t-th exercise, and r_t ∈ {0, 1} indicates the corresponding binary response (1 means the exercise was answered correctly; 0 means the opposite). Formally, we employ e_i = {k, d} to denote a certain exercise e_i, where k represents the knowledge concepts involved in exercise i, and d represents the difficulty of exercise i. We denote the whole recommendation sequence as L = {e_1, e_2, ..., e_|L|}. Especially, we denote the association between the exercise set and the concept set as a binary relation Z ⊆ E × K, where e(k) = 0 if the j-th knowledge concept k_j does not appear in e_i, and e(k) = 1 otherwise.
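To make the notation concrete, here is a tiny illustrative sketch of the response log s, an exercise e_i = {k, d}, and the binary relation Z; the exercise IDs, concepts, and difficulty values are hypothetical:

```python
# Each exercise e_i is a dict holding its concept set k and difficulty d.
exercises = {
    "e1": {"k": {"algebra"}, "d": 0.4},
    "e2": {"k": {"algebra", "geometry"}, "d": 0.7},
}

# Response log s = {(e_1, r_1), ..., (e_t, r_t)} with binary responses r in {0, 1}.
log = [("e1", 1), ("e2", 0)]

def e_k(exercise_id, concept):
    """Binary relation Z: returns 1 if the concept appears in the exercise, else 0."""
    return 1 if concept in exercises[exercise_id]["k"] else 0

print(e_k("e2", "geometry"))  # 1
print(e_k("e1", "geometry"))  # 0
```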
The exercise recommendation scenario involves multiple conflicting optimization objectives. Difficulty and novelty are crucial factors, which focus on reviewing non-mastered knowledge to fix students' knowledge holes. Unlike them, coverage and diversity reflect other aspects of the recommendation model's performance. Consequently, the research goal of our work is as follows: an advanced exercise recommendation framework is needed that can simultaneously optimize the multiple conflicting objectives of Difficulty, Novelty, Coverage, and Diversity, supporting adaptive recommendation step by step.

Basic notations
Formally, the exercise recommendation problem can be defined as a 4-tuple MDP (S, A, R, T). In our recommendation scenario, the environment is composed of the student set and the exercise bank, and the agent is our DMoSwR-DRL model. We describe the state, action, reward, and other details below.

State S:
The state s_t is defined as a particular student's past exercising records, which contain the sequence of the student's practice process from time step 1 to t: s_t = {(e_1, r_1), (e_2, r_2), ..., (e_t, r_t)}.
Action A: During the learning process, the agent in state s_t selects action a_t according to the predefined selection policy; that is, it recommends an exercise e_i to the student.
Reward R: After the agent takes action a_t in state s_t, i.e., recommends an exercise to the target student, the student answers this exercise, and the agent receives an immediate reward r_t = R(s_t, a_t) based on the student's performance.
Transition probability T: The transition probability T(s_{t+1} | s_t, a_t) defines the probability of the state transitioning from s_t to s_{t+1} when the agent takes action a_t. We assume that the MDP satisfies the Markov property T(s_{t+1} | s_t, a_t, ..., s_1, a_1) = T(s_{t+1} | s_t, a_t).
For our sequence-wise recommendation task, the recommender agent aims to explore the environment and determine an optimal policy π : S → A that selects the best suited exercise at each step, such that the long-term multi-objective cumulative reward is maximized. In the process of recommendation, we can instantly observe a new interaction learning record for each recommended exercise. Figure 2 presents the overview of our ER framework. The environment consists of the student set and the exercise bank. The agent is trained by a flexible actor-critic network. At each time step, the agent executes an action a_t (i.e., recommending an exercise) according to the student's state s_t (i.e., historical exercising records); the student then answers this exercise. Subsequently, the agent receives a scalar reward r_t based on her performance p_t, and the environment transitions to the next state s_{t+1} ~ T(s_t, a_t).
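The interaction loop described above can be sketched as follows; the policy, simulator, and reward used here are toy placeholders standing in for the trained agent and the DKT-based environment:

```python
import random

def interaction_loop(policy, simulator, episode_len=5, gamma=0.9, seed=0):
    """One episode of the sequence-wise recommendation MDP: the agent observes
    state s_t, recommends an exercise a_t, the simulated student answers, and a
    scalar reward is accumulated with discount factor gamma."""
    random.seed(seed)
    state = []                                # s_t: history of (exercise, response)
    ret = 0.0
    for t in range(episode_len):
        action = policy(state)                # a_t = pi(s_t)
        response = simulator(state, action)   # student's answer r in {0, 1}
        reward = 1.0 if response == 1 else -1.0   # placeholder reward
        ret += (gamma ** t) * reward
        state = state + [(action, response)]  # s_{t+1}
    return state, ret

# Toy policy and simulator (hypothetical stand-ins).
policy = lambda s: "e%d" % (len(s) + 1)
simulator = lambda s, a: random.randint(0, 1)
history, ret = interaction_loop(policy, simulator)
print(len(history))  # 5
```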

Theoretical and practical implications
In practice, it is vital for e-learning to establish a multi-objective sequence-wise recommendation framework that tracks students' cognitive levels and generates stepwise exercise recommendations for target students. Moreover, it is natural to think that a multi-objective exercise recommendation strategy is more useful than a traditional single-objective algorithm, since it has the potential to provide timely, proactive, and actionable feedback to each student at each step of the practice.
In addition, the transaction behavior of users in traditional recommendation scenarios and the learning behavior of students in e-learning platforms often share similar characteristics. Therefore, our proposed framework can also be applied to other recommendation scenarios, such as e-commerce and information retrieval.

Figure 3 illustrates the general overview of the DMoSwR-DRL framework, which mainly consists of two core mechanisms: state representation modeling (SR) and the actor-critic recommender (AC). Specifically, at each time step, we first obtain the vectorized embedding representations of concepts, difficulties, and scores. Then, we implement a bi-directional GRU neural network to capture the long-term dynamics of the forward and backward logged data along the depth direction, to effectively model the state of the environment. After obtaining the well-designed state from the previous module, AC determines which exercise to recommend next, and its parameters are dynamically updated during the recommendation. At the end of each time step, an immediate reward is delivered to DMoSwR-DRL.

Figure 4 shows a single DMoSwR-DRL unit at the t-th time step. First, the embedding vectors of knowledge concept k, exercise difficulty d, and student performance p are fed into a well-designed state representation module. In the state representation module, the difficulty d is derived from the exercise's historical error rate, e.g., the difficulty value is 0.8 if the error rate is 20% [40]. Subsequently, a Bi-GRU is introduced to model the student's historical learning logs, and the output of the Bi-GRU at step t is input to the actor-critic network as the student's current state. Adding recurrence allows the network to better estimate the underlying student state [41]. Finally, the agent is trained by the actor-critic network.
The actor network generates an action according to the student's learning state s t , and the critic network estimates the Q-value Q(s, a) of the latent state and the joint action.

State representation module
It has been shown that state representation plays a crucial role in enhancing reinforcement learning performance [38].
Thus, the first step in DMoSwR-DRL is to come up with an accurate state representation.
The state representation module aims at designing an appropriate structure that explicitly captures the interactions between the student and the exercises, and generates a state representation s_t according to the student's long-term exercising trajectories. Therefore, as shown in the lower left of Fig. 3, the embedding vectors of knowledge concept, exercise difficulty, and student performance from the historical logs are input into the state representation module. Specifically, given M knowledge concepts, a concept k can be set to a one-hot representation vector k ∈ {0, 1}^{2M}. However, such encoding methods are nearly inapplicable to a large-scale exercise bank. Inspired by the compressed sensing representation proposed in the literature [13], a student learning record can be exactly encoded by assigning it a fixed random Gaussian input vector of length log_2(M), denoted as u_k. Therefore, at time step t, the exercise is encoded as a vector x_t that combines the concept vector u_k with the exercise difficulty. Methodology-wise, we apply the zero-padding strategy proposed in the literature [42], aiming to improve encoding efficiency: by merging the exercise vector x_t and the zero vector 0 = (0, 0, ..., 0)^T according to the student's response, we obtain the interaction representation at each time step t.

In an online learning environment, students' learning status is dynamic, with early learning status influencing current recommendations. An increasing number of studies have shown that Bi-GRU can capture bi-directional regularities from the forward and backward directions [43]. In this paper, considering the dynamic characteristics of long-term series dependence, a Bi-GRU is introduced to acquire the current state s_t from the student's whole historical exercising records. Figure 5 illustrates the forward GRU at time step t. The core idea of the GRU is to transmit the information flow by utilizing the cell state.
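As an illustration of the zero-padding strategy, the sketch below follows the common DKT-style convention (the placement of x_t by correct/incorrect response is an assumption, as are the toy values of M and the vector length):

```python
import numpy as np

def interaction_repr(x_t, r_t):
    """Zero-padding interaction encoding: concatenate the exercise vector x_t
    with a zero vector, the position of x_t indicating whether the answer was
    correct (a common DKT-style convention, assumed here)."""
    zero = np.zeros_like(x_t)
    return np.concatenate([x_t, zero]) if r_t == 1 else np.concatenate([zero, x_t])

rng = np.random.default_rng(0)
M = 8                                     # number of knowledge concepts (toy value)
dim = max(1, int(np.ceil(np.log2(M))))    # compressed length ~ log2(M)
x_t = rng.normal(size=dim)                # fixed random Gaussian vector u_k
v_t = interaction_repr(x_t, r_t=1)
print(v_t.shape)  # (6,)
```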
The reset gate is designed to control how much historical information should be ignored. Specifically, at time step t, the reset gate r_t is calculated from the vectorized embedding x_t and the hidden state h_{t-1} at the previous moment, as in Eq. (2):

r_t = σ(W_r x_t + U_r h_{t-1} + b_r)  (2)

Then, the reset gate is used to form the candidate hidden state h̃_t, which is described by Eq. (3):

h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)  (3)

The update gate determines the degree of retention between the hidden state h_{t-1} and the current input x_t, which is expressed by Eq. (4):

z_t = σ(W_z x_t + U_z h_{t-1} + b_z)  (4)

Therefore, the current hidden state h_t based on the update gate z_t is written as

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t  (5)

In Eqs. (2)-(5), σ and tanh represent the sigmoid and tanh activation functions, respectively, ⊙ is the Hadamard product, and W and U are learnable weight matrices. Finally, the forward and backward hidden states are concatenated into F_t, and we take F_t as the student's state s_t, i.e., s_t = F_t.
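The Bi-GRU state computation can be sketched as follows. This is a minimal NumPy illustration of Eqs. (2)-(5) with random, untrained weights, biases omitted, and toy dimensions; it is not the trained DMoSwR-DRL module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell; weights are random for illustration only."""
    def __init__(self, d_in, d_h, seed=0):
        rng = np.random.default_rng(seed)
        self.Wr, self.Ur = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
        self.Wz, self.Uz = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
        self.Wh, self.Uh = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

    def step(self, x, h):
        r = sigmoid(self.Wr @ x + self.Ur @ h)             # reset gate, Eq. (2)
        h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h))  # candidate state, Eq. (3)
        z = sigmoid(self.Wz @ x + self.Uz @ h)             # update gate, Eq. (4)
        return (1 - z) * h + z * h_cand                    # hidden state, Eq. (5)

def bi_gru_state(xs, d_h=4):
    """Concatenate the final forward and backward hidden states: s_t = F_t."""
    fwd = GRUCell(len(xs[0]), d_h, seed=1)
    bwd = GRUCell(len(xs[0]), d_h, seed=2)
    hf, hb = np.zeros(d_h), np.zeros(d_h)
    for x in xs:
        hf = fwd.step(x, hf)
    for x in reversed(xs):
        hb = bwd.step(x, hb)
    return np.concatenate([hf, hb])

xs = [np.ones(3), np.zeros(3), np.ones(3)]   # toy interaction vectors
state = bi_gru_state(xs)
print(state.shape)  # (8,)
```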

Actor-critic network
After obtaining the well-designed state s_t from the previous module, the actor-critic network determines which exercise to recommend next according to the optimal policy π_θ. In essence, the actor-critic method is a type of policy gradient method, which decouples the value and policy functions into two separate networks. The actor network, also known as the policy function, is shown in the upper left of Fig. 4. The actor network maintains weights θ, also called the policy π(a|s, θ), from which action a is sampled based on the student's state s. We use multiple rectified linear unit (ReLU) layers and a hyperbolic tangent (Tanh) function to obtain the output of the actor network. It converts the state representation s into an action a_t = π(s, θ), and then finds the optimal action a on the basis of the feedback generated by the critic network.
The critic network, also known as the value function, is depicted in the lower right of Fig. 4. The critic network uses a fully connected neural network parameterized with weights w as Q(s, a; w), i.e., the Q-value function, to approximate the true value function Q_π(s, a) and estimate the expected return at a given state s_t. The critic assists the actor in choosing an optimal action through a policy update algorithm. To this end, these two networks are trained simultaneously.
More specifically, the input of the critic network is the student's state s generated by the state representation module and the action a generated by the actor network, and the output is the Q value, which is a scalar. Based on the exploration mechanism and the feedback of the critic network (i.e., the Q value), the action strategy of the actor network is updated to determine the exercises that best suit each student. Finally, the actor network is trained using the sampled policy gradient algorithm. The policy gradient is calculated as follows:

∇_θ J(θ) ≈ (1/N) Σ_t ∇_a Q(s, a; w)|_{s=s_t, a=π(s_t|θ)} ∇_θ π(s|θ)|_{s=s_t}

where Q(s, a; w) represents the parameterized critic network, and π(s|θ) represents the parameterized actor network.
In addition, the critic network is updated by minimizing the temporal difference (TD) error [44]. The loss L is calculated as follows:

L = (y_t − Q(s_t, a_t; w))², with y_t = r_t + γ Q(s_{t+1}, π_θ(s_{t+1}); w),

where r_t = R(s_t, a_t) denotes the reward and γ denotes the discount factor.
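The critic's TD update can be illustrated numerically; the helper names and the toy values below are for illustration only:

```python
def td_target(reward, q_next, gamma=0.9):
    """TD target y_t = r_t + gamma * Q(s_{t+1}, pi(s_{t+1}))."""
    return reward + gamma * q_next

def td_loss(q_value, reward, q_next, gamma=0.9):
    """Squared TD error (y_t - Q(s_t, a_t))^2 that the critic minimizes."""
    return (td_target(reward, q_next, gamma) - q_value) ** 2

# One hypothetical transition: r_t = 1.0, Q(s_t, a_t) = 0.5, Q(s_{t+1}, .) = 0.8.
loss = td_loss(q_value=0.5, reward=1.0, q_next=0.8, gamma=0.9)
print(round(loss, 3))  # 1.488, i.e., (1.0 + 0.9 * 0.8 - 0.5)^2
```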

Multi-objective rewards
Difficulty: As previously mentioned, exercise difficulty is a pivotal factor influencing recommendation effectiveness. Although existing ER approaches consider students' knowledge states and the expected difficulty set by the teacher, they ignore the importance of maintaining students' enthusiasm for learning, which affects the learning experience. Intuitively, a natural choice is to adaptively adjust the difficulty of the recommended exercises based on the student's recent level of development. Specifically, if a student has done well recently, the recommended difficulty level should be increased; if not, it should be lowered. Therefore, the Difficulty reward is defined as follows:

r_dif = −|Window_N − δ|

where δ ∈ [0, 1] is the "desired difficulty" factor. Here, we design the Window_N factor to capture recent student performance; it is obtained by averaging the student's performance over the N most recent records. With the help of Window_N, the model can adaptively adjust the difficulty of the recommended exercises, so that the student's average score over the most recent N records is close to δ. The research in [45] found that easy exercises (e.g., error rate close to 10%) are more suitable for short-term engagement, whereas difficult exercises (e.g., error rate close to 30%) are more suitable for long-term engagement and learning. As such, we choose an error rate of 30% in the experiment; that is, the desired learning goal δ is set to 0.7. The core idea of the difficulty indicator is that the recent average development level of the student, Window_N, should approach the desired learning goal of 0.7. Therefore, if the discrepancy between Window_N and 0.7 is large, the agent receives a strong punishment and adjusts the difficulty of the next recommendation accordingly.
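A minimal sketch of this reward, assuming the absolute-discrepancy form described above (the function name and window handling are illustrative):

```python
def difficulty_reward(recent_scores, delta=0.7, N=5):
    """Difficulty reward sketch: Window_N is the average of the last N binary
    responses, and the reward penalizes its distance from the desired
    difficulty delta. The exact functional form is an assumption; the text
    only states that a large discrepancy yields a strong punishment."""
    window = recent_scores[-N:]
    window_n = sum(window) / len(window)
    return -abs(window_n - delta)

# Student answered 4 of the last 5 exercises correctly: Window_N = 0.8.
print(round(difficulty_reward([1, 1, 0, 1, 1]), 2))  # -0.1
```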
Novelty: For the recommendation problem, this study considers a novel exercise to be one involving knowledge concepts that the student has poorly mastered or never encountered in the historical exercising records. Nonetheless, few existing works care about this factor. Intuitively, we apply the Jaccard similarity to calculate the correlation between the recommended exercise and the concepts already mastered by the student. The Novelty reward is defined as follows:

r_nov = 1 − Jaccsim(H(e(k)_t), H(e(k)_t^r))

where H(e(k)_t) represents the set of knowledge concepts contained in exercise e_t, H(e(k)_t^r) represents the set of knowledge concepts covered by all correctly answered exercises in the historical exercising records, and Jaccsim(·) represents the Jaccard similarity between H(e(k)_t) and H(e(k)_t^r). Therefore, if the system recommends an exercise with poorly mastered concepts to the target student, the agent receives this feedback and gets a stimulation accordingly. Conversely, if the algorithm recommends an exercise that the target student has already answered correctly, a penalty of −1 is incurred.
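A sketch of the Novelty reward under these definitions; the `1 - Jaccsim` form and the subset test triggering the -1 penalty are assumptions consistent with the description:

```python
def jaccard(a, b):
    """Jaccard similarity between two concept sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def novelty_reward(exercise_concepts, mastered_concepts):
    """Novelty reward sketch: low overlap with already-mastered concepts is
    rewarded; an exercise whose concepts were all answered correctly before
    is penalized with -1."""
    if exercise_concepts and exercise_concepts <= mastered_concepts:
        return -1.0
    return 1.0 - jaccard(exercise_concepts, mastered_concepts)

print(novelty_reward({"fractions"}, {"algebra"}))        # 1.0
print(novelty_reward({"algebra"}, {"algebra", "sets"}))  # -1.0
```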

Coverage:
The more concepts an exercise contains, the more comprehensive the student's learning is likely to be. For this purpose, we design a knowledge coverage reward function with an incremental property. To be specific, as the number of knowledge concepts related to an exercise increases, the stimulation to the model gradually increases from 0 to 1.
where knt(k, e) represents the number of knowledge concepts contained in the i-th exercise.
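Since the text specifies only the incremental 0-to-1 behavior of this reward, a saturating form such as the following is one possible sketch (the functional form and the `scale` parameter are hypothetical):

```python
def coverage_reward(num_concepts, scale=1.0):
    """Coverage reward sketch: monotonically increasing in the number of
    knowledge concepts knt(k, e) the exercise contains, saturating from 0
    toward 1 as that count grows."""
    return 1.0 - 1.0 / (1.0 + scale * num_concepts)

print(coverage_reward(0))  # 0.0
print(coverage_reward(1))  # 0.5
print(coverage_reward(9))  # 0.9
```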

Diversity:
The repeated occurrence of the same knowledge concepts in the recommendation process may decrease students' learning enthusiasm. Here, for simplicity, we use a naive cosine similarity to measure the difference between two exercises, where cos(e(k)_t, e(k)_i) represents the similarity between exercise e(k)_t and an exercise e(k)_i in the historical exercise list H(e). The larger cos(e(k)_t, e(k)_i) is, the closer the two exercises are, and the higher the punishment the system receives.

Based on the above analysis, a flexible reward function is designed to simultaneously optimize the four domain-specific objectives of difficulty, novelty, coverage, and diversity, providing students with a trade-off sequence-wise recommendation. Specifically, we merge the above four rewards with the weight constraints γ_1, γ_2, γ_3, γ_4, and propose the overall reward function r as follows:

r = γ_1 r_dif + γ_2 r_nov + γ_3 r_cov + γ_4 r_div

where γ_1, γ_2, γ_3, and γ_4 must satisfy the condition γ_1 + γ_2 + γ_3 + γ_4 = 1.
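The weighted combination of the four rewards can be sketched as follows (pure Python; the cosine helper operating on binary concept vectors and the equal weights are illustrative choices):

```python
def cosine(u, v):
    """Cosine similarity between two binary concept vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def total_reward(r_dif, r_nov, r_cov, r_div, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum r = g1*r_dif + g2*r_nov + g3*r_cov + g4*r_div, with the
    weights required to sum to 1 as stated in the text."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * r for w, r in zip(weights, (r_dif, r_nov, r_cov, r_div)))

print(round(cosine([1, 0, 1], [1, 0, 1]), 6))       # 1.0 (identical exercises)
print(round(total_reward(-0.1, 1.0, 0.5, 0.2), 2))  # 0.4
```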

Online evaluation simulators
It is worth noting that existing logged data only contain static exercising records, which brings a potential problem: we cannot train the agent to make sequential recommendations because offline logs lack real-time feedback. To set up the online evaluation, we test DMoSwR-DRL in a simulated environment that can model the qualitative development of a student's knowledge level and provide real-time rewards for a given exercise.
Deep knowledge tracing (DKT) does not require experts to pre-label the difficulty levels of the exercises, which is practical in intelligent education scenarios. Specifically, DKT seeks to monitor the student's changing knowledge proficiency via a simple LSTM network. Figure 6 presents an outline of the standard DKT framework. The input (x_N) of the DKT is the student's past response logs, and the prediction (M_t) is a vector representing the probability of a specific exercise being answered correctly. The details of the DKT can be found in [13].
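To convey the simulator's role, here is a deliberately simplified toy environment with the same interface idea (predict a correctness probability, then update proficiency). The real environment is the LSTM-based DKT model; the scalar-proficiency dynamics below are purely illustrative:

```python
import math

class ToySimulator:
    """Simplified stand-in for the DKT simulator: tracks a scalar proficiency
    per concept and maps (proficiency - difficulty) to a correctness
    probability via a sigmoid."""
    def __init__(self):
        self.proficiency = {}

    def predict(self, concept, difficulty):
        p = self.proficiency.get(concept, 0.0)
        return 1.0 / (1.0 + math.exp(-(p - difficulty)))

    def update(self, concept, correct):
        # Proficiency grows on correct answers, decays slightly otherwise.
        p = self.proficiency.get(concept, 0.0)
        self.proficiency[concept] = p + (0.5 if correct else -0.1)

sim = ToySimulator()
before = sim.predict("algebra", difficulty=0.0)
sim.update("algebra", correct=True)
after = sim.predict("algebra", difficulty=0.0)
print(before < after)  # True: predicted success rises after a correct answer
```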

Experiments
In this section, we first describe four classic exercise-answer datasets and introduce the experimental setup. Then, we compare DMoSwR-DRL with several competing algorithms and conduct a series of sensitivity experiments to investigate how different parameters impact our model. We carry out experiments to answer the following research questions:
• RQ1 How does DMoSwR-DRL perform compared with several competing algorithms in terms of difficulty, novelty, coverage, and diversity?
• RQ2 Can DMoSwR-DRL perform better than several competing algorithms in terms of the cumulative reward?
• RQ3 How do the four domain-specific rewards of DMoSwR-DRL perform for online recommendations?
• RQ4 How do the hyper-parameter settings affect the proposed model's performance?
Datasets

• ASSISTments0910 is supplied by the ASSISTments online tutoring system. We run our tests on the modified "Skill-Builder" dataset, where an exercise can contain multiple knowledge concepts. For this dataset, we discarded students with no concept labels or fewer than three records during preprocessing.
• Algebra0506 is part of the dataset from the KDD Cup 2010 EDM Challenge. Again, we removed students with no concept labels or fewer than three records. After preprocessing, there are 574 students, 436 concepts, 1084 exercises, and 607,025 interaction records.

• Statics2011 is collected from an OLI Engineering Statics course in Fall 2011 and contains 45,002 interactions on 87 concepts by 335 students. Note that this dataset is the densest of the four.
• ASSISTments2017 is likewise collected by the ASSISTments system. There are 1709 students with 942,816 interactions and 102 concepts in this dataset. Note that the mean number of records per student is much larger than in the other datasets. In particular, every exercise in this dataset contains only one knowledge concept.

Table 1 summarizes the statistics of the four datasets.

Evaluation metrics
• Difficulty metric According to the difficulty stationarity principle, we use the variance of the difficulty of the recommended list L = {e_1, e_2, ..., e_|L|} as the difficulty indicator: Dif(L) = (1/|L|) Σ_{l=1}^{|L|} (d_l − d̄_L)², where d_l is the difficulty level of the l-th recommended exercise and d̄_L denotes the mean difficulty level of the entire recommended list.

• Novelty metric
Here, we regard as novel those exercises whose knowledge concepts the student has poorly mastered or has never encountered in the historical exercising records. Thus, we define the novelty index of an episode as Nov(L) = (1/|L|) Σ_{l=1}^{|L|} (1 − Jaccsim(H(e_l^(k)), H(e_l^{v(k)}))), where H(e_l^(k)) represents the set of knowledge concepts contained in exercise e_l^(k), and H(e_l^{v(k)}) represents the set of knowledge concepts covered by all correctly answered exercises in the historical exercising records. |L| represents the length of an episode, and Jaccsim(·) represents the Jaccard similarity between H(e_l^(k)) and H(e_l^{v(k)}).

• Coverage metric
The coverage index reflects the comprehensiveness of the recommended exercises. In the evaluation plan, we design a coverage metric that calculates the proportion of knowledge concepts covered by the recommended exercises: Cov(L) = |∪_{l=1}^{|L|} H(e_l^(k))| / |K|, where |K| denotes the number of knowledge concepts contained in a specific dataset, and e_l denotes the recommended exercise at time step l.

• Diversity metric The diversity is formally defined as the average pairwise distance between exercises in an episode, as calculated in Eq. 20: Div(L) = (2 / (|L|(|L| − 1))) Σ_{i<j} (1 − sim(i, j)), where sim(i, j) represents the similarity of two exercises in an episode.
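Both indicators can be sketched compactly; again the function names are illustrative, and the diversity sketch assumes a precomputed pairwise similarity matrix:

```python
def coverage_metric(episode_concepts, total_num_concepts):
    """Cov(L): fraction of all concepts in the dataset covered by the
    union of the recommended exercises' concept sets."""
    covered = set().union(*episode_concepts) if episode_concepts else set()
    return len(covered) / total_num_concepts

def diversity_metric(sim_matrix):
    """Div(L): average pairwise distance (1 - sim) over all exercise
    pairs in an episode, given a symmetric similarity matrix."""
    n = len(sim_matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(1.0 - sim_matrix[i][j] for i, j in pairs) / len(pairs)
```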

Baseline methods
In this work, we adopt representative approaches including RAND and SB-CF [2]. Moreover, two competing RL techniques are utilized: Q-learning [50] and DQN [51]. We also employ two bandit algorithms, HLinUCB [32] and MIF-TS [52], as compared baselines. The baseline models are described in detail as follows.
• RAND: A baseline method that follows a simple random strategy to recommend an exercise at each step.
• SB-CF: A general user-based collaborative filtering method, which randomly recommends an exercise to the target student from the response records of similar students.
• Q-learning: The most classical reinforcement learning model, where a look-up Q-table is adopted to store the values of all state-action pairs.

Experimental setups
Data partitioning: For the four datasets, we partition each student's historical interactions into two parts: 50% of the data is used to train the DKT simulator, and the other 50% to train the DMoSwR-DRL framework. To evaluate DKT, we sample 10% of the records to validate its predictive performance. The recommendation step is set to 20 for all models.
Framework setting: We set the dimension d_k = 10 for the concept embedding, d_l = 100 in DMoSwR-DRL, and the number of fully connected layers to 2. The embedding dimension used in DKT is 50, and the hidden dimension of the LSTM is 100. The dimensions of the two layers in the actor-critic network are discussed in Sect. 5.5. In addition, we set the weight constraints γ1 = γ2 = γ3 = γ4 = 0.25.
Training setting: For our model, all parameters are initialized using Xavier initialization [38], which fills the weights with random values sampled from N(0, std²), where std = √(2 / (n_in + n_out)); here n_in and n_out represent the numbers of neurons feeding in and feeding out, respectively. The Adam optimizer [53] is employed with a learning rate of 0.001. We also introduce dropout [54] to avoid overfitting.
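The initialization rule above can be sketched in a few lines; the function name and seed handling are illustrative:

```python
import math
import random

def xavier_normal(n_in, n_out, seed=0):
    """Sample an (n_in x n_out) weight matrix from N(0, std^2)
    with std = sqrt(2 / (n_in + n_out)), as in Xavier initialization."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]
```

Keeping the weight variance tied to the fan-in and fan-out sizes preserves the scale of activations and gradients across layers at the start of training.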

Performance comparison (RQ1)
To answer RQ1, we employ the above evaluation metrics to compare the performance of DMoSwR-DRL with the competing algorithms. Figure 7 reports the difficulty index over 20 recommended steps for the different approaches. For a more intuitive display, the corresponding boxplots show the comparison results; the plus signs (+) mark outliers. From the figures, we can conclude that the difficulty levels of recommendations made by DMoSwR-DRL are distributed around 0.7, demonstrating its stability and excellence. Besides, we observe that the traditional, simple algorithms, i.e., RAND and SB-CF, perform unfavorably on all four datasets, because they follow simple strategies and do not consider the student's recent average performance. Although DQN shows a certain stability, both its maximum and median are clearly lower than those of our proposed DMoSwR-DRL. These results show that, for the sequence-wise recommendation task, the proposed DMoSwR-DRL outstrips the benchmark models.
Next, Fig. 8 compares the novelty indicator of all models on the four datasets. As we can observe, DMoSwR-DRL outperforms all baselines during the test, indicating that our approach can help students explore more knowledge. Among the RL-based methods, the performance of Q-learning and DQN is consistently inferior to that of our proposed DMoSwR-DRL, owing to the positive effect of adopting the actor-critic algorithm for sequence-wise recommendation. Among all baselines, SB-CF performs the worst. We infer that this may be because SB-CF is dedicated to finding students similar to the target student, resulting in the recommendation of repeated exercises. With respect to the two bandit-based recommenders, the MIF-TS algorithm is more advantageous in terms of novelty.
The experimental results in Fig. 9 demonstrate that the concept coverage of all methods increases rapidly with the number of recommended steps. The figure shows that, at the beginning of the recommendation, the performance of all methods is almost the same; once the number of recommendation steps reaches 4, the agent trained by DMoSwR-DRL gains an advantage. In general, Fig. 9 shows that the concept coverage of our method achieves satisfactory results. Although the Statics2011 dataset contains a limited number of exercises, its exercise types are relatively rich, and the coverage of the recommended exercises is significantly better than on the other datasets. However, the coverage of all models grows slowly on the ASSISTments2017 dataset, because every exercise in this dataset contains only one knowledge concept, whereas an exercise in the other datasets can contain one or more knowledge concepts.
For visual display, Fig. 10 shows histograms of the diversity indicator for the different methods. Figure 10 illustrates that, on all four datasets, DMoSwR-DRL achieves better performance than the other models for sequence-wise recommendation in terms of the diversity indicator, because our diversity reward effectively guides the agent toward increasing the richness of the exercises. Since SB-CF recommends exercises similar to those the student liked in the past, the diversity of its recommendations is inferior. In addition, MIF-TS delivers better recommendation diversity than the other baselines, which is attributed to its action space division mechanism that increases the probability of exploring actions.

Cumulative reward comparison (RQ2)

Figure 11 reports the cumulative reward convergence over 1000 episodes for all models, where the horizontal axis represents the number of training episodes and the vertical axis represents the cumulative reward of each episode. The most obvious observation across the four datasets is that DMoSwR-DRL converges faster, performs more stably, and reaches a larger final reward value than the baseline methods. The performance of DQN is generally second only to MIF-TS, but its reward fluctuates greatly. As can be seen from the curves, the MIF-TS method, with its carefully designed reward strategy, selects suitable exercises for students, leading to a higher reward than the other baseline algorithms. Note that the RAND and SB-CF models perform unsatisfactorily on the Algebra0506 dataset; the key explanation may lie in the characteristics of this dataset, which such simple strategies fail to exploit.

Reward strategy comparison (RQ3)
In addition, to further investigate the potential impact of the four domain objectives on recommendation quality, we design four variants of the DMoSwR-DRL algorithm: DMoSwR-DRL-r1, DMoSwR-DRL-r2, DMoSwR-DRL-r3, and DMoSwR-DRL-r4. DMoSwR-DRL-r1 highlights the role of the difficulty reward by removing the other domain-specific objectives, while DMoSwR-DRL-r2 highlights the role of the novelty reward in the same way. Similarly, the other two variants consider only coverage and diversity, respectively, to highlight the role of a single reward.
The results are shown in Fig. 12. The reward function of the variant DMoSwR-DRL-r1 focuses only on the difficulty level of each exercise, aiming to keep the difficulty of the recommended exercises close to the desired learning goal. The limitation of this variant is that specific exercises appear repeatedly during the learning process, so it gives unsatisfying results. The reward function of DMoSwR-DRL-r2 focuses only on the novelty of exercises, aiming to select concepts that a given student has never answered or has poorly mastered. To a certain extent, this variant limits the range of exercise selection, which may hinder the improvement of recommendation diversity. DMoSwR-DRL-r3 focuses on the coverage of the recommended exercises, but this variant is less sensitive than the others: the gap between 1/2 and 2/3 is not as big as the gap between −1 and 1, so it is not as outstanding in optimization performance. The variant DMoSwR-DRL-r4 aims to recommend non-repetitive exercises for students, while also providing positive guidance for improving the novelty and coverage of learning items.
Different from the other datasets, the ASSISTments2017 dataset has the characteristic that each exercise contains only one knowledge concept. Therefore, when only the coverage factor r3 is considered, the novelty factor r2 also improves along with the coverage, because an improvement in coverage also indicates the addition of a new knowledge concept. In addition, when only the diversity factor is considered, we find that the trend is almost the same as when all indexes are considered, which is again due to the one-concept-per-exercise characteristic.
In general, we observe that the four domain-specific rewards all benefit the sequence-wise recommendation task, and the DMoSwR-DRL method achieves significant performance improvements by considering them simultaneously.

Hyper-parameter sensitivity (RQ4)
To make convincing comparisons, we implement a series of sensitivity experiments to investigate the influence of different parameters on recommendation performance, including the hidden dimensionality, the size of the action space, and the sequence length.

Effect of hidden dimensionality: For DMoSwR-DRL, we set the hidden dimensionality d to [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] to investigate how model performance changes with the hidden dimensionality, and report the cumulative reward of DMoSwR-DRL in Fig. 13. Within a certain range, a larger hidden dimensionality greatly improves the performance of the model. However, once the hidden dimensionality reaches a certain value, the recommendation quality decreases even if the dimensionality continues to grow. The best results are obtained when d is 70 on ASSISTments0910 and 50 on Algebra0506, Statics2011, and ASSISTments2017.
Effect of recommended length: To find out how the recommended length affects model performance, we experimented with lengths of 10, 15, 20, 25, 30, 35, and 40, setting the maximum number of steps per episode equal to the recommended length. We calculate the average reward returned at each step over the entire episode. As seen from Fig. 14, DMoSwR-DRL obtains the highest mean reward when the recommended length is set to 20. When the recommended length is too long, the noise becomes too large, so good performance cannot be achieved; accordingly, we find that increasing the recommendation length further brings no performance improvement. As a result, we set the recommended length to 20 in the following experiments.

Effect of the size of action space: We further investigate how the size of the action space K affects performance, with the results reported in Fig. 15; the size of the exercise bank directly determines the size of the action space K. As depicted in Fig. 15, within a certain range, the cumulative reward of our proposal shows an increasing trend as the size of the action space increases, whereas with further increments of K, the performance of our model slightly decreases. More specifically, when the exercise bank is small, the recommender agent cannot generate exercises that meet the multiple educational domain-specific objectives, i.e., the exploration is insufficient. On the four datasets, as the size of the exercise bank increases, the recommender agent can explore adequately, so performance improves. However, when the size of the exercise bank exceeds a certain value, the oversized action space may involve more unrelated exercises, leading to degenerated performance. The results are best when the size of the action space is 10,000 on ASSISTments0910, 800 on Algebra0506, 300 on Statics2011, and 1800 on ASSISTments2017.

Conclusion
To break the limitations of conventional sequence-wise recommendation, we design a novel and effective exercise recommendation approach, named DMoSwR-DRL, which mines the vast amount of online learning logs accumulated on education platforms to recommend, step by step, the exercises with the highest benefit. Unlike prior studies, our approach integrates multiple types of beneficial objectives, i.e., difficulty, novelty, coverage, and diversity, to construct the reward function. To explore the long-term dependency of the student's state, we employ two independent GRU networks to capture the whole exercising history. By applying domain-objective rewards to the field of recommendation, DMoSwR-DRL provides students with a trade-off sequence-wise recommendation. We have conducted an extensive experimental study comparing against 6 competing methods of four different kinds on four real-world datasets. Extensive experiments indicate that DMoSwR-DRL outperforms the competing models on the four datasets and can trade off high novelty, coverage, and diversity on the basis of difficulty.
There are two directions in which this work can be extended. First, the existing methods solely exploit "exercise to knowledge concept" relationships and ignore the prerequisites between concepts; we would therefore like to investigate the precedence relations between learning contents. Moreover, we plan to further improve the RL technique: leveraging existing high-exploration methods, such as hierarchical RL [37] and random network distillation (RND) [55], is another promising direction to improve the performance of sequential recommendation.