Fully adaptive recommendation paradigm: top-enhanced recommender distillation for intelligent education systems

Top-N recommendation has received great attention for providing students with personalized learning guidance on a required subject or domain. Existing approaches mainly aim to maximize the overall accuracy of the recommendation list while ignoring the accuracy of highly ranked recommended exercises, which can seriously dampen students' learning enthusiasm. Motivated by the Knowledge Distillation (KD) technique, we design a fully adaptive recommendation paradigm named the Top-enhanced Recommender Distillation framework (TERD) to improve the recommendation effect at the top positions. Specifically, the proposed TERD transfers the knowledge of an arbitrary recommender (the teacher network) and injects it into a well-designed student network. The prior knowledge provided by the teacher network, including student-exercise embeddings and candidate exercise subsets, is further utilized to define the state and action space of the student network (i.e., a DDQN). In addition, the student network introduces a well-designed state representation scheme and an effective individual ability tracing model to enhance the recommendation accuracy of the top positions. The developed TERD follows a flexible model-agnostic paradigm that not only simplifies the action space of the student network, but also promotes the recommendation accuracy of the top positions, thus enhancing students' motivation and engagement in e-learning environments. Experimental evaluation on three publicly available, well-established datasets shows that our proposed TERD scheme effectively resolves the Top-enhanced recommendation issue.


Introduction
With the growing development of intelligent tutoring systems, advances in artificial intelligence and other emerging technologies have far-ranging consequences for online personalized learning. Technology-supported online learning, such as the Recommendation System (RS), has been widely used to bring about a new teaching model based on the ethics of participation, openness, and collaboration [1].
As an advanced technology, Top-N recommendation algorithms [2,3] can provide learners with rich personalized learning experiences. Recently, the Deep Reinforcement Learning (DRL) technique [4] has emerged as an effective way to deal with the Top-N recommendation issue. As a distinguished direction, many efforts have been dedicated to introducing more advanced RL techniques to maximize the expected sum of long-term rewards [5][6][7].

Corresponding author: Kun Liang (liangkun@tust.edu.cn), College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, China.
In this paper, we revisit existing RL-based Top-N recommendation models and find that they still have some inherent deficiencies. First, existing RL recommendation approaches are committed to maximizing long-term returns while ignoring the significance of top-position recommendations, i.e., they seldom address the Top-enhanced issue. In practice, if the recommendation system does not provide appropriate exercises at the top positions to fulfill learners' needs, they may not have enough patience to keep learning. As illustrated in Fig. 1a, both Bob and Merry were discouraged and demotivated because the recommended exercises were not suitable for them at the beginning of their study, which may result in dropout. Second, RL models usually face extremely large candidate action spaces when performing recommendation tasks, which results in high computational and representational complexity. Moreover, in the early stage of learning, the selected actions are usually random. These problems may have a severely adverse impact on the exercise recommendation task. Third, none of the existing recommenders can be directly applied to the Top-enhanced issue. We could design specific mechanisms to solve this problem, but since existing recommenders are based on different neural network architectures, we would have to redesign a different exercise selection strategy for each model, i.e., such a scheme still suffers from poor scalability. As a result, a general framework to accommodate the transition from Top-N to Top-enhanced is required. Lastly, some data mining researchers [8,9] have focused on recommending non-mastered exercises to each student by manually assigning difficulty labels. However, in practical scenarios, this solution inevitably leads to label bias. As suggested by [10], students' knowledge construction process is not static, but evolves over time.
Along this line, an advanced deep learning model is needed that can track students' evolving mastery states of specific knowledge concepts.
We illustrate these complex interactions more clearly by a toy example in Fig. 1.
To overcome these obstacles, we develop a Top-enhanced Recommender Distillation framework (TERD) to handle exercise recommendation tasks flexibly. Specifically, we investigate: (1) how to design general solutions that effectively improve the recommendation performance at the top positions; (2) how to construct a good state representation and accurately measure student proficiency to help the recommender agent perform effective reinforcement learning.
The main novelties and contributions of the proposed TERD framework can be summarized in the following four aspects: (1) To the best of our knowledge, this is the first work to apply the distillation technique to Top-enhanced exercise recommendation. It extracts the knowledge of an arbitrary recommender and injects it into a student network for more effective Top-enhanced recommendations. (2) Different from the action selection strategy of traditional DDQN-based methods, our proposed framework absorbs the essence of well-trained recommenders, thus largely reducing the action selection space in DRL. (3) We design an effective student network (i.e., a DDQN) by introducing a new state representation scheme and a flexible individual ability tracing method. Considering the long-term dynamic nature of student learning, we incorporate a stacked GRU network into the state representation scheme. Different from previous works that manually assign difficulty labels to exercises, our proposed TERD is able to take the perspective of each student and gain insight into their mastery of specific knowledge concepts. (4) We perform comprehensive experiments on three large-scale benchmark datasets to demonstrate the effectiveness of the TERD model. The experimental results also show that the student network outperforms the teacher recommender on Top-enhanced recommendation tasks.
The remainder of this paper is structured as follows. In Related work, we briefly review recent related research. The Preliminaries section elaborates the groundwork of the TERD model. The technical details of the proposed TERD framework are described step by step in Proposed methodologies. Then, we present the experimental results in Experiment. Finally, the Conclusion summarizes this paper, analyzes the study's limitations, and provides future directions.

Recommender system
Among various recommendation models, the collaborative filtering (CF) algorithm [11,12] has attracted increasing attention from researchers, who have proposed many classical CF algorithms. Motivated by the excellent performance of Deep Learning (DL) techniques, researchers have developed various Top-N recommendation methods based on deep learning. In the latest relevant research, Wu et al. [13] developed a two-stage neural recommendation model named KCP-ER, which consists of a knowledge concept prediction module and an exercise set filtering module. As a promising direction, many variants have been developed by introducing more advanced DL techniques, constructing a many-objective optimization model [14], and capturing user-item latent structures [15]. Recently, Deep Reinforcement Learning (DRL) techniques have been widely applied to various scientific problems and, in several tasks, perform better than humans. Lin et al. [6] presented hierarchical reinforcement learning with a dynamic recurrent mechanism for course recommender systems. The authors of [7] also designed a DRL method based on the Actor-Critic (AC) framework for knowledge recommendation. In contrast, we do not focus on the Top-N recommendation issue, but instead apply model distillation to our proposed TERD to reinforce the recommendation effect at the top positions.

Knowledge distillation
Knowledge Distillation (KD) [16], a highly promising technique for transferring knowledge from a large well-trained model (a.k.a. a teacher network) to a relatively lightweight model (a.k.a. a student network), has exhibited state-of-the-art performance in various fields. Surveying the recent literature, many KD methods have achieved significant improvements in computer vision [17][18][19], natural language processing [20,21], and graph data mining [22].
Recently, some researchers have introduced the KD technique to generate small but powerful recommenders. Wang et al. [23] proposed a novel knowledge distillation model with probabilistic rank-aware sampling, termed collaborative distillation, which adopts improved student network training strategies to promote Top-N recommendation performance. Moreover, to alleviate the high dimensionality and sparsity of tag information in practical scenarios, the authors of [24] developed two novel heterogeneous knowledge distillation methods at the feature level and label level to build relations between a user-oriented autoencoder and an item-oriented autoencoder. In the recent literature, several works proposing self-distillation [25,26] have also emerged. The work in [27] introduced a weighting mechanism to dynamically put less weight on uncertain samples and showed promising results. Huang et al. [28] introduced the self-distillation concept into GCN-based recommendation and proposed a two-phase knowledge distillation model that improves recommendation performance.
However, little effort has been made to tackle the Top-enhanced recommendation challenge. Through exhaustive analysis, the recommendation algorithm closest to our idea is dedicated to maximizing the effect of ranking distillation in RS. The work in [29] developed a novel rank distillation model named ranking distillation (RD), in which the teacher network captures ranking patterns to guide the student network in ranking unlabeled documents, revealing that a knowledge distillation model can help extract more useful features via a large teacher network. Motivated by this approach, Kang et al. [30] extended KD with a regularization mechanism to enhance the student network's performance.

Reinforcement learning in education
RL is a well-known paradigm for enabling autonomous learning. In educational psychology scenarios, there has been a series of successful applications of deep RL methods. Online course recommendation has attracted widespread attention in the area of intelligent education. Along this line, Lin et al. [6] presented hierarchical reinforcement learning with a dynamic recurrent mechanism for course recommender systems, which designs a profile constructor to efficiently trace a learner's preferences for personalized course recommendation. By treating the learning path recommendation task as a Markov Decision Process, Liu et al. [7] developed a cognitive structure enhanced framework using the actor-critic algorithm that can generate a suitable learning path for different learners. The work in [31] proposed a personalized recommendation method based on standard Q-learning, in which a Q-table is constructed to store the Q-values of all state-action pairs. Extending this idea, the authors of [32] used several fully connected layers in place of the exercise Q-table to approximate the Q-values.
Despite these productive works, our method differs from these efforts in that we focus more on the recommendation accuracy of the top positions, rather than the most common top-N recommendation tasks.

Problem statement
In this article's recommendation scenario, four components are typically contained in learning systems: students, exercises, knowledge concepts, and response scores. A folksonomy is a tuple (U, E, K, Y), where U, E, K, Y indicate the student set, exercise set, knowledge concept set, and response score set, respectively. We also denote the association between exercises and knowledge concepts as a binary relation, and Z as the set of historical learning records. For convenience of representation, the historical learning record of a certain student can be formulated as a 3-tuple z_(u_m, e_i, y_i) ∈ Z, which denotes the exercise e_i practiced by student u_m at her exercising step t together with the corresponding score y_(u_m,i) ∈ {0, 1}: y_(u_m,i) = 1 when the student answers the exercise correctly, and y_(u_m,i) = 0 otherwise.
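The learning-record 3-tuple above can be sketched as a simple data structure; the names below are illustrative, not taken from the paper's implementation:

```python
from typing import NamedTuple

class LearningRecord(NamedTuple):
    """One record z = (u_m, e_i, y_i): student, exercise, binary score."""
    student: str
    exercise: str
    score: int  # 1 = answered correctly, 0 = otherwise

log = [
    LearningRecord("u1", "e3", 1),
    LearningRecord("u1", "e7", 0),
]
# Non-mastered exercises (answered incorrectly) are recommendation targets:
non_mastered = [r.exercise for r in log if r.score == 0]
```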

General RL structure
In essence, the recommendation task can be formalized as a Markov Decision Process (MDP). The overall RL framework for recommendation is presented in Fig. 2. More formally, the elements of an MDP (S, A, R, P) can be characterized as:

• State S: At each decision point t, the current state s_t ∈ S denotes the preceding exercising history of a student, in which each exercise record (e, k, y) is also considered.
• Action A: An action a is a vector. Based on state s_t, taking action a_t ∈ A is defined as selecting an exercise e_t for a certain student, after which the agent enters a new state s_{t+1}.
• Reward R: An immediate reward r_t is a scalar value obtained from the environment. When an exercise e is selected, we receive the reward r(s_t, a_t) according to the feedback on the various objectives.
• Transitions P: Once the test's feedback is collected, the agent enters the next state according to the transition probability p(s_{t+1} | s_t, a_t).
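The MDP interaction of Fig. 2 can be sketched as a minimal rollout loop; `ToyEnv` and its feedback rule below are purely illustrative stand-ins, not the paper's environment:

```python
class ToyEnv:
    """Stand-in environment: state is the step index; the (placeholder)
    feedback rule rewards even-numbered exercise ids with 1.0."""
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1                                # transition s_t -> s_{t+1}
        reward = 1.0 if action % 2 == 0 else 0.0   # immediate reward r_t
        return self.t, reward

def run_episode(env, policy, n_steps):
    """Roll out one student-agent episode: s_t -> a_t -> (r_t, s_{t+1})."""
    state = env.reset()
    total = 0.0
    for _ in range(n_steps):
        action = policy(state)            # a_t: select an exercise
        state, reward = env.step(action)  # environment feedback
        total += reward
    return total

total = run_episode(ToyEnv(), policy=lambda s: s, n_steps=4)
# states 0,1,2,3 -> actions 0,1,2,3 -> rewards 1,0,1,0 -> total 2.0
```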

Remarks
To simulate the interaction between the recommender agent and a student in a given environment, the agent should follow the four hypotheses below.
(1) Each student u_i has a total of N rounds of interaction with his/her personal recommender agent.
(2) u_i can respond to at most one exercise in L* at each time step t.
(3) At the t-th time step, the agent first selects the exercise e_t with the highest Q-value, then deletes it from the exercise subset L and adds it into the Top-enhanced list L* for the particular student.
(4) Given a student's learning record z^t_(u_m, e_i, y_i), if y_i = 1, there is no hit exercise, and the recommender agent receives negative feedback.

Model description
The architecture of the proposed TERD model is depicted in Fig. 3. The TERD model takes students' response logs as input and adaptively recommends suitable exercises to students. Specifically, two sub-modules inside the model are used to achieve this task. One is the teacher network, a well-trained network that is used to learn student-exercise embeddings from historical interactions and to generate the candidate exercise subset. The final output of the teacher network is distilled and transferred to the student network to further guide the agent in performing effective Top-enhanced recommendation. The other is the student network, which incorporates the essence of well-trained methods. Benefiting from the teacher network, distilled knowledge (i.e., student-exercise embeddings and the candidate exercise subset) that helps promote the recommendation accuracy of the top positions is transferred to the student network. In addition, two mechanisms are introduced in the student network: (i) a well-designed state representation scheme to capture the long-term dynamic nature of student learning; (ii) an efficient individual ability tracing model that is used to estimate the mastery probability of a student on each concept.

The teacher network
The main purpose of the teacher network is to transfer powerful distilled knowledge that guides the student network's recommendation process. Because knowledge is transferred from a more heavyweight and powerful teacher network, the performance of the final (student) model relies heavily on the strength of the teacher network; in other words, the stronger the teacher, the stronger the student. In this work, six advanced exercise recommendation methods are selected as the teacher networks:

• ER-LOAF [33]: It designs a hybrid many-objective framework to recommend suitable exercises that accord with learners' mastery levels and knowledge concept coverage.
• HB-DeepCF [34]: It embeds the students and exercises into a low-dimensional continuous vector space via auto-encoder techniques, and then integrates both the recommender component and the auto-encoder component into a new hybrid recommendation model for adaptively recommending exercises to each student.
• DKVMN-RL [35]: It first acquires students' mastery levels of skills using an improved Dynamic Key-Value Memory Network (DKVMN), and a Q-learning algorithm is then used to learn an exercise recommendation policy.
• LSTMCQP [36]: It uses a personalized LSTM approach to trace and model students' knowledge mastery states and further designs a "recommend non-mastered exercises" recommendation strategy.
• KCP-ER [13]: It develops a knowledge concept prediction-based Top-N recommendation model for finding a set of recommendation lists that trade off accuracy, coverage, novelty, and diversity.
• TP-GNN [37]: It applies a graph neural network to the Top-N recommendation task, in which aggregate functions and an attention mechanism are employed together to generate a high-quality ranking list.
Technically, the proposed TERD is flexible, since any existing recommendation technique can be used as the teacher network without considering the detailed mechanisms behind it. Benefiting from the model-agnostic strategy of knowledge distillation, we do not need to redesign the recommendation strategy of the student network when the teacher network is modified.
The embedding vectors of students and exercises from the teacher network are transferred into the state representation component of the student network, which helps capture the long-term dynamics of students' past learning trajectories when building the state representation. Moreover, the candidate exercise subset output by the teacher network is given as input to the student network, which effectively narrows the action space of the DDQN. In this way, the distilled knowledge gives the student network higher accuracy and faster convergence in top-position recommendation.

Network structure of the proposed DDQN agents
The goal of the student network is to adjust and optimize the candidate exercise subset L, thus achieving a performance improvement at the top positions. Specifically, we employ a deep reinforcement learning algorithm and design novel task-specific reward functions for adaptively generating recommendation lists for students during the learning process. Under this architecture, the Double Deep Q-Network (DDQN) [38] with the experience replay mechanism is adopted as the student network. We would like to emphasize that the major focus of this study is the Top-enhanced issue, rather than exploring the best DRL approach in the context of exercise recommendation. The interaction between the agent and the environment in the DDQN-based TERD is depicted in Fig. 4.
Two Q-networks with the same architecture but different parameters are introduced in the standard DDQN, namely the online network Q parameterized by ϕ and the target network Q̂ parameterized by ϕ⁻. The target Q-network estimates the target Q-value of the agent taking an action in the next state, and its parameters are updated from the online Q-network after a certain number of iterations. Remarkably, DDQN decouples the selection of actions from their evaluation, greatly reducing the overestimation of Q-values. The loss function of the online Q-network is calculated through the temporal difference (TD) error [39], as shown in Eq. (1):

L(ϕ) = E_{(s,a,r,s′)∼D} [ ( r + γ Q̂(s′, argmax_{a′} Q(s′, a′; ϕ); ϕ⁻) − Q(s, a; ϕ) )² ]   (1)
The experiences generated in the interaction with the environment are stored in the experience replay buffer D in the form of (s, a, r, s′), and the training samples are randomly drawn from D.
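The DDQN target and the replay buffer described above can be sketched as follows; a tabular Q representation is used purely for illustration (the paper's Q-functions are neural networks):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Experience replay buffer D storing (s, a, r, s') transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out
    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))
    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def ddqn_target(q_online, q_target, reward, next_state, gamma=0.9):
    """Double-DQN target: the online network selects the best next action,
    the target network evaluates it (decoupling selection from evaluation).
    The TD error is then (target - Q(s, a; phi)), squared for the loss."""
    best_action = int(np.argmax(q_online[next_state]))
    return reward + gamma * q_target[next_state, best_action]

q_online = np.array([[0.0, 1.0], [2.0, 0.0]])
q_target = np.array([[0.5, 0.7], [0.2, 0.9]])
target = ddqn_target(q_online, q_target, reward=1.0, next_state=0, gamma=0.9)
# online argmax at state 0 is action 1, so target = 1.0 + 0.9 * 0.7 = 1.63
```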

Definition of reinforcement learning components
The state, action and reward for the student network are defined as follows.
State. The student and exercise embeddings learnt by the teacher network are employed to build the state of the student network. In detail, at time step t, the embedded knowledge output by the teacher network can be defined as x = [u_m; e_{t−n}, ..., e_{t−1}], where u_m, e_{t−i} ∈ R^d represent the student and exercise embeddings obtained from the teacher network, and [e_{t−n}, ..., e_{t−1}] indicates the latest n exercise embeddings. Thus x ∈ R^{d×(n+1)} is the concatenation of {u_m; e_{t−n}, ..., e_{t−1}}. With the embedded vector sequence x described above as input, we incorporate a stacked GRU network (SGRU), which is more capable of modeling the student's whole exercising trajectory. Specifically, the first component of the stacked GRU layer applies a GRU network to generate the hidden representation h^(1)_i = GRU(x_i, h^(1)_{i−1}), where h^(1)_i denotes the hidden state at time step i. The second component of the stacked GRU layer has a similar structure to the first, h^(2)_i = GRU(h^(1)_i, h^(2)_{i−1}), where h^(2)_i denotes the hidden state in the second layer. Stacking multiple layers of neurons may also bring some potential problems, e.g., overfitting and harder training. Inspired by previous works [40], the residual connection technique is introduced between the two layers to alleviate these limitations.
Then, we obtain state representation s i via an activation function as follows.
Specifically, in the t-th exercising step, if the recommender agent correctly selects one exercise e t for the student, s t will be updated as s t+1 ; otherwise s t+1 = s t .
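The stacked-GRU state encoder can be sketched in NumPy as below. The gate equations are the standard GRU; the exact placement of the residual connection and the final tanh activation are our assumptions, since Eqs. (2)-(4) are not reproduced here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal NumPy GRU cell (illustrative; any standard GRU would do)."""
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.Wz = rng.normal(0, 0.1, (d_hid, d_in + d_hid))  # update gate
        self.Wr = rng.normal(0, 0.1, (d_hid, d_in + d_hid))  # reset gate
        self.Wh = rng.normal(0, 0.1, (d_hid, d_in + d_hid))  # candidate
    def __call__(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def sgru_state(seq, d_hid=8):
    """Two stacked GRU layers with a residual connection between them;
    the final state s_t is squashed by an activation (assumed tanh)."""
    d_in = seq.shape[1]
    gru1 = GRUCell(d_in, d_hid, seed=1)
    gru2 = GRUCell(d_hid, d_hid, seed=2)
    h1 = h2 = np.zeros(d_hid)
    for x in seq:                 # seq = [u_m; e_{t-n}, ..., e_{t-1}]
        h1 = gru1(x, h1)          # first GRU layer
        h2 = gru2(h1, h2) + h1    # second layer + residual connection
    return np.tanh(h2)            # state representation s_t
```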
Action. Taking action a_t based on state s_t refers to selecting the recommended exercise e_t ∈ Z_t for the student. Specifically, we select an exercise by sampling from the distribution π(a | s_t; ϕ), where ϕ is the set of model parameters. Meanwhile, it should be noted that the action space A_t is defined based on the candidate exercise subset L.
Reward. After the agent selects an action (i.e., an exercise) from the exercise subset, a reward signal r is received, and the exercise is added to the Top-enhanced recommendation list L*. Considering that the design of the reward function directly affects the agent's action optimization strategy, we carefully design a reward function with ranking quality characteristics. As mentioned before, our goal is to rank the exercises that students answer incorrectly at the top of the recommendation list. Therefore, the reward function follows the strategy that the higher the ranking of a correctly recommended exercise, the stronger the stimulus. As a result, we redefine the standard NDCG metric as the reward function. At each exercising step t, the reward R(s_t, a_t) for the state-action pair (s_t, a_t) is defined in Eq. (5), where f ∈ [0, 1] is a flexible feedback factor representing the probability that student u_m responds to exercise e_i incorrectly. Formally, we design the feedback factor f in Eq. (6) based on y^t_{u_m,e_i}, the performance score of the exercise e_i practiced by student u_m at her exercising step t. Specifically, we set g = 0 in Eq. (6), so that TERD follows the common strategy of recommending non-mastered exercises, as many existing works do. Note that g is a flexible factor that can be adjusted if we want the agent to focus more on recommending exercises of a desired difficulty.
To this end, we implement a system simulator with a knowledge tracing technique to simulate the performance score y^t_{u_m,e_i} according to the feedback of the corresponding exercising. Specifically, Deep Knowledge Tracing (DKT) [41] is applied to acquire the implicit knowledge mastery level y^t_{u_m,e_i}. The input of the DKT model is the historical learning sequence Z_{u_m} of student u_m, while the output y^t_{u_m,e_i} is a predicted vector representing the probability of the student answering each exercise correctly. The probability of student u_m answering exercise e_i correctly is obtained from Eq. (7),
where θ_{u_m} denotes the trainable parameters of the DKT model.
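A hedged sketch of the reward components described above. Since Eqs. (5)-(6) are not reproduced here, both the feedback factor (with goal g) and the NDCG-style position discount below are assumed forms consistent with the text, not the paper's exact formulas:

```python
import math

def feedback_factor(y, g=0.0):
    """ASSUMED form of the feedback factor f: high when the predicted
    correctness y (from the DKT simulator) is close to the learning goal g.
    With g = 0, f = 1 - y recovers 'recommend non-mastered exercises'."""
    return 1.0 - abs(y - g)

def position_reward(f, rank):
    """ASSUMED NDCG-style discount: the same feedback f earns more reward
    at earlier ranks, pushing accuracy toward the top of the list."""
    return (2.0 ** f - 1.0) / math.log2(rank + 1)
```

Whatever the exact formula, the key property is monotonicity in rank: a correct recommendation at position 1 must be rewarded more than the same recommendation at position 3.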

Training mechanism
Training Procedure. To make the framework easier to understand, Algorithm 1 summarizes the major steps of the training process. The training procedure contains two closely related phases, i.e., acquiring prior knowledge (lines 1-2 in Algorithm 1) and student network training (lines 3-16 in Algorithm 1). In the first phase, we distill the exercise subset from the well-trained teacher network as a type of distilled knowledge to be incorporated into the student network (line 2 in Algorithm 1).
In the second phase, the agent starts its learning process from an initial state s_0 (line 4 in Algorithm 1). At each exercising step t, the agent acquires the student's state s_t, estimates the Q-values of the exercises in the candidate exercise subset L, and takes action a_t. Subsequently, the agent receives the reward r_{t+1} based on the student's response score y_i (lines 5-10 in Algorithm 1). On lines 11 to 16, the agent transits to a new state s_{t+1} and the action space is adjusted accordingly. Here, the experience replay buffer D_buffer is used to store the transitions that drive the recommender policy. Note that we also employ the widely used target network [18] with the soft replace technique to smooth the learning and avoid oscillation or divergence in the parameters (lines 17-20 in Algorithm 1).
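The soft-replace step can be sketched as a Polyak update; the mixing rate `tau` below is an illustrative hyper-parameter, not a value from the paper:

```python
def soft_update(target_params, online_params, tau=0.01):
    """Polyak ('soft replace') update: target <- tau*online + (1-tau)*target.
    Moving the target network slowly toward the online network smooths
    learning and avoids oscillation or divergence in the parameters."""
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]
```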
Testing Procedure. Algorithm 2 describes TERD's testing procedure. Utilizing the candidate exercise subset L from the teacher as distilled knowledge, the student network generates, for each student u_i in the testing data, a high-quality Top-enhanced list L* after T time steps.

Algorithm 1 Training Algorithm of TERD Framework
Require: ϕ, ϕ⁻, d, T, λ, η_1, η_2, D_buffer, D_train
1: for u_i ∈ D_train do
2:   Generate U, E and L through the pre-trained teacher network;
3:   Observe the initial state s_0 according to the offline log;
4:   for time step t = 1 to T do
5:     Observe the current state s_t = SGRU(x, s_{t−1}, θ_s) based on Eq. (4);
6:     Calculate the Q-values Q(s_t, a_t; ϕ) for all the exercises in A_t;
7:     Execute the action a_t from A_t with the highest Q-value;
8:     Add the corresponding exercise e_t into L* and remove it from A_t;
9:     Calculate the reward r_t based on Eq. (5);
10:    if the agent correctly recommends an exercise e_t then
11:      Update to a new state s_{t+1};
12:    else
13:      Set s_{t+1} = s_t;
14:    end if
15:    Adjust the action space A_t → A_{t+1};
16:    Store the transition (s_t, a_t, r_t, s_{t+1}) into D_buffer;
17:    Sample a mini-batch from D_buffer and update the online network ϕ by minimizing Eq. (1);
18:    Softly replace the target network parameters ϕ⁻;
19:  end for
20: end for

Algorithm 2 Testing Algorithm of TERD Framework
1: for each u_i in the testing data do
2:   Generate U, E and L through the pre-trained teacher network;
3:   Observe the initial state s_0 according to the offline log;
4:   for time step t = 1 to T do
5:     Observe state s_t based on Eq. (4);
6:     Calculate Q(s_t, a_t; ϕ) for the exercises in A_t;
7:     Take action a_t from A_t;
8:     Add the corresponding exercise e_t into L* and remove it from A_t;
9:     Update state s_t → s_{t+1} and A_t → A_{t+1};
10:  end for
11: end for

Theoretical analysis of TERD
In this section, we provide a theoretical analysis of the TERD algorithm with reference to [42][43][44].

Theorem 1 For the reward function given in Eq. (5), if the TD error in Eq. (1) is minimized, the proposed TERD algorithm outperforms the corresponding teacher recommender in maximizing the ranking accuracy of the recommendation list.

Proof of Theorem 1 The goal is to learn a policy π that maps each state to an action, so the value of any state s_t is the maximum expected return received from time step t onward. The state value function and the state-action value function for a policy π are defined as V^π(s_t) = E_π[Σ_{k≥0} γ^k r_{t+k} | s_t] and Q^π(s_t, a_t) = E_π[Σ_{k≥0} γ^k r_{t+k} | s_t, a_t]. Besides, the DDQN network is updated by the temporal difference learning approach, and the updated policy is denoted as π_ϕ. According to the policy improvement theorem [44], Q^π(s_t, π_ϕ(s_t)) ≥ V^π(s_t) for every state s_t. Applying this theorem repeatedly, we obtain V^{π_ϕ}(s_t) ≥ V^π(s_t). This demonstrates that it is beneficial to use the updated policy π_ϕ to generate a high-quality recommendation list.

Experiment
In this section, we successively report the dataset descriptions, the parameter settings of the teacher and student networks, and the evaluation protocols. Finally, we conduct extensive experiments on three datasets to evaluate the Top-enhanced recommendation performance, aiming to answer the following research questions:

- RQ 1: Does the student network play a critical role in advancing the performance of Top-enhanced recommendation?
- RQ 2: Compared with existing well-known learning-to-rank models and RL-based exercise recommendation techniques, how does our proposed TERD perform when K takes different values?
- RQ 3: How does TERD perform in terms of model efficiency compared to other state-of-the-art methods?
- RQ 4: How do the key hyper-parameter settings affect TERD?
- RQ 5: How can TERD be interpreted in the Top-enhanced recommendation scenario?
We ask RQ 1 to evaluate whether applying the student network to the six advanced teacher networks is effective. We ask RQ 2 to compare the proposed TERD framework with two advanced learning-to-rank algorithms, i.e., DeepRank [15] and SQL-Rank [45], and four RL-based exercise recommendation methods, i.e., DQN [46], MOOCERS [47], DDQN [38], and DDPG [48]. For RQ 3, we compare the efficiency of all methods on the three datasets. Then, we conduct a series of parameter experiments to test the influence of the action space size, hidden dimensionality, and batch size. Finally, we visualize the exercising processes of two students to evaluate the proposed TERD model's ability to solve the Top-enhanced task.

Dataset descriptions
The experiments are carried out on three real-world datasets: ASSISTments0910 [49], Algebra2005 [50] and Statics2011 [51]. The basic statistical information for all the datasets is shown in Table 1. The detailed descriptions are as follows. ASSISTments0910: This dataset was provided by the online intelligent tutoring platform ASSISTments. Notably, it was gathered from the "skill-Builder" question sets. Among other things, it embeds two heterogeneous features, hint counts and attempt counts, into the embedding of online learning. During preprocessing, we removed students with no skills or fewer than three records on the "skill-Builder" dataset.
Algebra2005: This dataset was part of the KDD Cup 2010 EDM Challenge dataset. Again, we removed students with fewer than three transactions. After preprocessing, there are 574 students, 436 knowledge concepts, 1,084 exercises and 607,025 interaction records.
Statics2011: This dataset was collected from an OLI Engineering Statics course in Fall 2011 and contains 45,002 interactions on 87 concepts by 335 students. Note that it is the densest of the three datasets.
All the experiments are conducted on a server with an NVIDIA RTX 3080 GPU with 10 GB of video memory. We implement our model using PyTorch, a popular deep learning framework. For the experimental setup, each dataset is divided into 70%/10%/20% partitions, used as the training, validation, and testing sets, respectively.

Teacher and student settings
Teacher: The parameters of the training algorithms (learning rate η_1, layers l, discount factor β_1, the number of neighbors n, the initial temperature parameter J, reduction factor c, learning goal g, experience replay memory D, the depth of propagation p, and difficulty range dr) are elaborately set by preliminary experiments, as shown in Table 2. Student: the learning rate η_2 = 0.01; greedy policy p = 0.2; discount factor β_2 = 0.9; size of the experience replay memory D_buffer = 100; and the parameters of the target network are updated from the online network only every 100 steps.

Evaluation protocols
We select three widely used metrics, namely Precision@N, MAP@N, and NDCG@N, to evaluate the top-enhanced recommendation performance of TERD.
(1) Precision@N measures the accuracy of the recommendation results: the proportion of non-mastered exercises (recommended and answered incorrectly) over the total number of exercises in the recommendation list. Precision@N can be simply computed as:

Precision@N = (1/N) Σ_{k=1}^{N} r_k,

where r_k = 1 if the k-th recommended exercise is non-mastered and r_k = 0 otherwise. (2) MAP@N considers the precision at every position of the recommended list. First, AP@N is defined as:

AP@N = (1 / min(m, N)) Σ_{k=1}^{N} Precision@k · r_k,

where m is the number of non-mastered exercises, and r_k = 1 if the k-th recommendation is a non-mastered exercise and r_k = 0 otherwise. MAP@N is then computed as the mean of AP@N over all students.
(3) NDCG@N assigns higher scores to correct recommendations at higher ranks in the final recommendation list:

NDCG@N = Z_N Σ_{k=1}^{N} (2^{r_k} − 1) / log2(k + 1),

where Z_N denotes the normalization term computed over the ideal DCG (iDCG), so that a perfect list scores 1. We perform experiments with N = {2, 5, 10}.
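For concreteness, the three metrics can be sketched as follows, treating "relevant" as the set of non-mastered exercises. This is a generic implementation of the standard definitions, not the authors' code; the AP normalization by min(m, N) is one common convention:

```python
import math

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n recommendations that are relevant
    (here: exercises the student has not yet mastered)."""
    return sum(1 for e in ranked[:n] if e in relevant) / n

def ap_at_n(ranked, relevant, n):
    """Average precision over the top-n positions (r_k = 1 if relevant)."""
    hits, score = 0, 0.0
    for k, e in enumerate(ranked[:n], start=1):
        if e in relevant:
            hits += 1
            score += hits / k          # Precision@k at each hit position
    return score / min(len(relevant), n) if relevant else 0.0

def ndcg_at_n(ranked, relevant, n):
    """DCG of the list normalized by the ideal DCG (binary gains)."""
    dcg = sum(1.0 / math.log2(k + 1)
              for k, e in enumerate(ranked[:n], start=1) if e in relevant)
    ideal = sum(1.0 / math.log2(k + 1)
                for k in range(1, min(len(relevant), n) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```

MAP@N is then the mean of `ap_at_n` over all students.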
DeepRank [15]: a neural network-based ranking approach that combines matrix factorization and a deep neural network for learning to rank.
DQN [46]: a DQN-based recommendation method that uses a deep Q-network to select the optimal exercise at each step.
MOOCERS [47]: the first attempt to use the actor-critic framework of reinforcement learning to support exercise recommendation services. In addition, it designs a flexible reward function that takes into account three objectives: Review, Difficulty, and Learn.
DDQN [38]: an extension of DQN that proposes a new way to compute the training target.
DDPG [48]: a method that uses the deep deterministic policy gradient (DDPG) algorithm to select the highest-ranking score for recommendation.
The implementations of SQL-Rank, DeepRank, DQN, MOOCERS, DDQN, and DDPG follow the original papers, with some fine-tuning to fit our task. For the reward functions of the RL-based recommendation techniques, this study adopts exactly the same design as TERD. For a fair comparison, the six competing methods adopt the same set of important parameters (i.e., hidden dimensionality, training batch size, and learning rate) as TERD.
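The DDQN training target referred to above can be illustrated with a minimal sketch: the online network selects the next action while the target network evaluates it, which reduces the Q-value overestimation of plain DQN. Function and argument names are ours:

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.9, done=False):
    """DDQN training target: the online network *selects* the next action,
    the target network *evaluates* it (van Hasselt et al.'s decoupling)."""
    if done:
        return reward                  # no bootstrapping at episode end
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

Plain DQN would instead bootstrap with `max(next_q_target)`, letting one network both select and evaluate.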

TERD evaluation results and analysis (RQ1)
In this section, we compare TERD with six well-trained teacher networks. Tables 3, 4, and 5 report the results of the compared methods on the three datasets, respectively. We observe several interesting conclusions:
• Comparing the results before and after removing the student network, we find that the proposed TERD achieves comparable or considerably improved performance in the top positions. It should be emphasized that TERD without the student network degenerates into a general model focused on top-N recommendation. In contrast, the student network utilizes the knowledge transferred from the teacher network to effectively reduce the action space in DRL, which ultimately improves recommendation performance. This clearly demonstrates the positive effect of absorbing the essence of well-trained recommenders for Top-enhanced recommendation.
• We notice that, compared with the results on the other two datasets, the improvements of the three metrics on Statics2011 are more significant. This indicates that TERD performs especially well on dense datasets. Nevertheless, on the sparsest dataset, ASSISTments0910, the TERD framework also achieves a notably large performance improvement.
• The performance of the pure KCP-ER method is the closest to ours among all benchmark models, as it carefully designs four flexible optimization goals. In particular, the difficulty goal emphasized by KCP-ER shows advantages in promoting the algorithm's performance.
• The results also reveal that Top-N recommendation models based on cognitive diagnosis, such as KCP-ER and LSTMCQP, achieve superior performance to representative recommendation methods such as ER-LOAF and HB-DeepCF. This is because the representative methods focus only on explicit student-exercise interaction information, while the cognitive-diagnosis-based methods (i.e., KCP-ER and LSTMCQP) aim to provide exercises that cohere with a student's proficiency level.
• All of the above evidence indicates that TERD can generate excellent Top-N recommendations while allowing the teacher network to be replaced without redesigning the strategy. This is the strongest validation of the advantage of being fully adaptive.
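The action-space reduction discussed above can be sketched as follows: the teacher's scores over all exercises are pruned to a top-k candidate subset, which then serves as the student network's action space. This is an illustrative reconstruction, not the authors' code, and `k` is a hypothetical parameter:

```python
def candidate_actions(teacher_scores, k=20):
    """Keep only the teacher's top-k scored exercises as the student
    network's action space, shrinking it from |all exercises| to k."""
    ranked = sorted(teacher_scores, key=teacher_scores.get, reverse=True)
    return ranked[:k]
```

The student's DDQN then only has to choose among `k` candidates per step instead of the full exercise bank.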

Comparative results (RQ2)
In this subsection, we compare TERD with two well-known learning-to-rank algorithms and four RL-based exercise recommendation techniques. From Sect. 5.2 and Table 6, we can draw conclusions from two aspects. On the one hand, TERD consistently outperforms the state-of-the-art models by a considerable margin, which is strong evidence of the effectiveness of the proposed TERD model. On the other hand, the RL-based recommendation techniques, which lack explicit ranking mechanisms, are inferior to the learning-to-rank models. The reason is that the RL-based exercise recommendation techniques contain a large number of candidate exercises in the action space, which makes it difficult to perform the recommendation task in such a complex space. Another merit of TERD is its ability to stand in the perspective of different students and gain insight into their mastery of specific knowledge concepts. Overall, the numerical experiments confirm the effectiveness of introducing the distillation technique.

Comparisons of efficiency (RQ3)
In real-world large-scale educational scenarios, reducing deployment costs and improving model efficiency is a fundamental but meaningful task. Accordingly, we measure and compare the training cost (running time per epoch), the number of parameters, and the testing cost as criteria for judging efficiency, to explore whether TERD outperforms the baseline models. Table 7 presents the experimental results of TERD and the baselines on the three datasets. Due to space limitations, we only keep the results of TERD based on the ER-LOAF model; the others show the same trend. From Table 7, we observe that the time efficiency of TERD significantly outperforms the baselines, because TERD is equipped with more efficient distillation techniques that greatly reduce the search space of the student network. We also find that all methods are extremely time-consuming on the ASSISTments0910 dataset, as its number of exercises is significantly larger than in the other datasets; as a result, the time costs of all methods on ASSISTments0910 are larger than on the other datasets. Moreover, all methods have much faster computation and fewer parameters when processing the smaller datasets (i.e., Algebra2005 and Statics2011). Remarkably, the RL-based exercise recommendation methods require at least twice as many parameters as the proposed one. In sum, our proposed TERD saves execution time while also achieving exceptional outcomes. Overall, this observation strongly confirms the model's advantage in balancing effectiveness and efficiency.
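A minimal sketch of how such efficiency criteria could be measured (wall-clock time per epoch and parameter count); these are generic helpers of our own, not the evaluation code used in the paper. In PyTorch, the parameter count would instead be obtained with `sum(p.numel() for p in model.parameters())`:

```python
import time

def time_per_epoch(train_fn, epochs=3):
    """Average wall-clock seconds per training epoch."""
    start = time.perf_counter()
    for _ in range(epochs):
        train_fn()
    return (time.perf_counter() - start) / epochs

def count_parameters(shapes):
    """Total parameter count from a list of weight-tensor shapes."""
    total = 0
    for shape in shapes:
        n = 1
        for dim in shape:
            n *= dim
        total += n
    return total
```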

Hyper-parameter analysis (RQ4)
To further verify the potential impact of hyper-parameters on performance, we explore the performance on the three datasets for three metrics: Precision, MAP, and NDCG. Due to space limitations, in the following studies we only show the Top-5 results.

Parameter analysis on the action space
The size of the action space is an important parameter of TERD, which has a direct and crucial influence on recommendation performance. The results are shown in Fig. 5, from which we can see: (i) in most cases, with the increase in the size of

Parameter analysis on the hidden dimensionality
The hidden dimensionality is another important hyper-parameter of the model used in this study. Figure 6 shows the results of the six baseline methods with varying hidden dimensionality; as the dimensionality grows, performance on Precision@5, MAP@5, and NDCG@5 improves. It is worth noting that the performance of TERD degrades significantly after reaching its peak. From these results, we also find that the traditional recommendation models do not perform well on the three metrics. Therefore, we conclude that a sensible hidden dimensionality is indeed helpful for improving the model.

Parameter analysis on the batch size
Here we explore the impact of the batch size N on TERD by tuning N in the range {8, 16, 32, 64}. Figure 7 reports the results of all models trained on all datasets using the Precision@5, MAP@5, and NDCG@5 metrics. In our experiments on the ASSISTments0910, Algebra2005, and Statics2011 datasets, the best performance is achieved at N = 64, N = 16, and N = 32, respectively. In summary, a properly small batch size promotes the training and convergence of the TERD algorithm, whereas a batch size that is too large occupies too much memory and leads to performance decline.

Case study (RQ5)
Fig. 6 Recommendation performance of TERD with hidden dimensionality on three datasets
Besides improving performance, another important ability of TERD is to generate intuitive and easily understandable recommendation explanations. To analyze this claim in depth, we randomly selected two student samples from ASSISTments0910 and Statics2011 (user_id: 79063 and Stu_72da98f3bbf369da59be0b3451a45051) and visualized the change of their performance scores generated by the teacher and student networks. Figure 8 shows the comparison results of the six baselines with and without the student network. From the figure, we can see that all TERD models perform better at the top positions of the exercising process. In addition, we find that the two learning-to-rank algorithms and the four RL-based recommendation methods are significantly less effective, indicating that the introduction of the distillation technique leads to better ranking performance. With the visualization results, instructors can see how well students have mastered certain exercises and then assign targeted exercises. In summary, the heatmap-based explanations of TERD are intuitive, persuasive, and satisfactory.
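The heatmap explanation described here presupposes a (concept × step) grid of predicted mastery scores. A hypothetical helper for assembling such a grid (the trace format, a list of per-step score dictionaries, is our assumption) might be:

```python
def mastery_matrix(trace, concepts):
    """Arrange predicted mastery scores into a (concept x step) grid,
    ready to be drawn as a heatmap (e.g. with matplotlib's imshow)."""
    return [[step_scores.get(c, 0.0) for step_scores in trace]
            for c in concepts]
```

Each row then traces one knowledge concept's predicted mastery across the exercising process, which is what an instructor would read off the heatmap.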

Conclusion
Reinforcement learning and knowledge distillation are both widely used in various recommendation scenarios. In theory, the two techniques are usually considered mutually exclusive, so most existing recommendation algorithms make use of only a single-policy algorithm. This work proposes an advanced distillation recommender, TERD, which brings a synergy effect between the two. Specifically, this work further reinforces the Top-enhanced recommendation performance of the DRL-based student network by absorbing valuable prior knowledge from the well-trained teacher network. Benefiting from the above innovations, prior knowledge that helps narrow the DRL action-selection space is distilled into the student network, so that both the recommendation performance and the efficiency of the student network improve. The experimental evaluation on three datasets shows that our TERD framework indeed resolves the top-enhanced issue. However, TERD also has some limitations. (1) The proposed method estimates students' learning states only from the exercise representations and students' responses, while ignoring other educational characteristics (e.g., slipping, guessing, exercise texts). We plan to exploit the slipping and guessing factors from the semantic representation of the exercise texts; intuitively, we can use two single-layer neural networks to model the slipping and guessing factors, respectively. (2) It is difficult for the proposed recommender to make decent recommendations for new students and new exercises that appear after the recommendation model is trained. To address this problem, we can extend TERD into a cross-domain recommender system that leverages data from external domains as prior knowledge to support the learning of the target recommendation model.
Fig. 8 The visualization of students' exercising process
TERD, a promising tool for practical recommendation tasks, provides a new perspective on KD models. There are three potential improvements for future work. First, we will introduce state-of-the-art DRL techniques to further promote the recommendation accuracy in the top positions. Second, knowledge incorrectly predicted by the teacher network can hardly help the student network generate excellent recommendations; we would therefore like to employ an additional professor model to assist in training a more expressive teacher. Finally, to achieve multiple objectives of Top-enhanced recommendation (such as novelty and diversity), we intend to apply multi-objective optimization algorithms to redesign the reward functions.

Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.