Learning Style Integrated Deep Reinforcement Learning Framework for Programming Problem Recommendation in Online Judge System

Exercise recommendation is an integral part of enabling personalized learning, since giving appropriate exercises can facilitate learning. Programming problem recommendation is a specific application of exercise recommendation. Therefore, an innovative recommendation framework for programming problems that integrates learners’ learning styles is proposed. This framework must address several difficulties, such as quantifying learning behavior, representing programming problems, and quantifying learning strategies. To quantify learning behavior and learning strategies, a programming problem recommendation algorithm based on deep reinforcement learning (DRLP) is proposed. DRLP includes the specific design of the action space, the action-value Q-network, and the reward function. Learning style is embedded into DRLP through the action space to make recommendations more personalized. To represent the programming problem in DRLP, a multi-dimensional integrated programming problem representation model is proposed to quantify the difficulty feature, knowledge point feature, text description, input description, and output description of programming problems. In particular, Bi-GRU is introduced to learn the contextual semantic association information of the texts from both the forward and backward directions. Finally, a simulation experiment is carried out with the actual learning behavior data of 47,147 learners in the LUOGU Online Judge system. Compared with the optimal baseline model, the recommendation effect of DRLP has improved (HR, MRR, and Novelty have increased by 4.35%, 1.15%, and 1.1%), which supports the rationality of the programming problem representation model and the action-value Q-network.


Introduction
With the rapid development of online education, learners are prone to 'information trek' when faced with the many learning resources accumulated on a website. They constantly face the questions of 'what to learn next' and 'what to do after finishing this problem', which reduces the adaptive advantage of online education [1]. Therefore, how to provide learners with personalized learning services and reduce the effort they spend choosing learning resources is an important research direction. Adaptive exercise recommendation is indispensable for providing adaptive learning sequences based on learners' capabilities and situational factors, thereby improving individual learning efficiency and optimizing learning outcomes [2]. The recommendation system provides an open learning environment for learners. Learners can adaptively interact with the system, as shown in Fig. 1. Each learning process can be viewed as a set of interactive actions.
The workflow of the recommendation system is as follows. At some moment, the learner logs into the system, and the system randomly recommends an exercise to the learner. After learning from the interaction between the learner and the exercise, the system recommends the next exercise. Generally, to recommend a reasonable and personalized exercise, the recommendation system uses machine learning (including deep learning) to analyze a large number of learning behaviors, thereby uncovering each learner's unique learning preferences [3]. This paper focuses on the recommendation of programming problems. An Online Judge (OJ) system offers many programming problems for learners to practice. Because programming problems often involve multiple, loosely delineated knowledge points, new learners often cannot judge the difficulty of problems correctly and thus fall into the 'sea of problems'. Based on psychological research, learners with different learning styles often have different learning strategies [4]. To give learners a more personalized learning experience, the exercise recommendation system plays an important role here. It can use algorithms to analyze learners' learning processes and recommend problems that suit them, reducing the difficulty of selecting problems and improving their learning efficiency [5].
To achieve adaptive exercise recommendation, many methods have been proposed in recent years, such as genetic algorithms [6], neural networks [7], and graph theory-based models [8]. Other scholars have proposed exercise recommendation algorithms from the perspective of educational psychology, such as cognitive diagnosis [9]. Analyzing these studies, together with those reviewed in the next section (Related Work), shows that current adaptive learning recommendation systems fail to take into account classic educational theories, such as memory curves and learning styles [10,11]. They cannot intelligently combine the personalized characteristics of the learner with the implicit information of the exercise resources. In addition, existing exercise recommendation work is applied primarily to English and mathematics problems [11,12]; there are few recommendation algorithms for programming problems. Traditional recommendation algorithms do not consider learners' different learning needs and take only whether the learner answers correctly as the recommendation goal [5,13,14]. To solve these problems, this paper introduces a deep reinforcement learning framework to achieve accurate perception of learning strategies and rapid decision-making. The model not only combines the learner's learning style to design a personalized action space but also designs a corresponding reward function for learners' multiple learning needs, thereby moving from a single-objective reward to a multi-objective reward.
This work is mainly divided into three parts. First, a multi-dimensional integrated programming problem representation model (MDPR) is proposed to address the difficulty of representing programming problems. It quantifies five parts of a programming problem (difficulty feature, knowledge point feature, text description, input description, and output description) to complete the vectorized representation. In particular, a Bi-GRU model based on the Attention mechanism is introduced to learn the contextual semantic correlation information of the programming problem text from both the forward and backward directions, so that the model can better understand the implicit information in the text. Second, a programming problem recommendation framework based on deep reinforcement learning (DRLP) is proposed. It designs the action space, the evaluation Q-network, and the reward function specifically for the programming problem recommendation scenario. Unique action spaces are created according to different learning styles, which incorporates learners' learning style characteristics into the recommendation algorithm and makes the recommendation more personalized. A recurrent sequence-based estimated Q-network (PQN) is proposed to enable the model to track the dynamics of the interaction between the learner and the system. A multi-objective cumulative reward function is designed to meet the different learning needs of learners: Discover and Review, Knowledge Points Smoothness, and Difficulty Smoothness. In this way, a closed-loop personalized programming problem recommendation model is constructed by comprehensively considering the learner's learning style, learning status, and programming problem information. Finally, simulation experiments are conducted on a real dataset obtained from the LUOGU OJ system.
Recommendation experiments verify the effectiveness of the proposed estimated Q-network (PQN). In addition, simulation experiments show that the text representation model based on Bi-GRU (BGAM) can better learn the contextual representation of programming problem texts. At the same time, the rationality of the different reward functions in the reward strategy is explained from the perspective of sufficient data analysis.
Compared with other research, the main contributions of this paper are summarized as follows. (1) A multi-dimensional integrated programming problem representation model is proposed, which allows the representation of a programming problem to contain knowledge point features and difficulty features, as well as rich programming problem text information. The model can capture the implicit information in the text. (2) Because the learning state of learners evolves continuously, a PQN network based on recurrent sequences is proposed to track the dynamic characteristics of the interaction between learners and the system. (3) A deep reinforcement learning framework that integrates learning style and is suitable for programming problem recommendation is proposed. It not only builds a personalized action space for learners with different learning styles but also considers the diversity of learning needs when designing the reward function. In particular, the cumulative reward function is obtained by integrating three learning requirements (Discover & Review, Knowledge Point Smoothness, and Difficulty Smoothness).
The remaining sections are organized as follows: Sect. 2 illustrates related work on problem recommendation models and the theoretical underpinnings of Kolb's learning styles. Section 3 presents a multi-dimensional programming problem representation model from three aspects: difficulty representation, knowledge point representation, and programming problem text representation. Section 4 introduces the programming problem recommendation model integrating learning style from five parts: optimization objective, action space construction, design of agent, reward function design, and training process. Section 5 conducts extensive experiments on the framework of programming problem recommendation. Section 6 discusses the limitations of this work and future research directions. The last section concludes the work of this paper.

Related Work
This section briefly introduces learning style theory and the related studies on the personalized exercise recommendation system. By analyzing these studies and the lack of content, this section also shows the advantages of the programming problem recommendation model based on the deep reinforcement learning framework.

Kolb's Learning Style Theory
Learning style is a concept proposed by Herbert Thelen in 1954. It refers to the personal characteristics that learners exhibit when they approach learning tasks. Since then, many related theories and models have emerged [15,16], which view learners' learning styles as different learning preferences within the same class, age, and cultural background. Keefe [17] divides learning styles into 32 categories based on learners' cognitive, sensory, and physiological factors. Felder-Silverman [18] analyzes learners' learning styles from four perspectives: information processing, information perception, information input, and content understanding. There are many learning style models. However, the Kolb learning style model, which is based on experiential learning theory, is widely cited and occupies a dominant position in empirical research [19]. Therefore, this work takes Kolb's learning style as the research object. Kolb's learning style model divides learning styles into four types through the two dimensions of concrete experience/abstract conceptualization and active experimentation/reflective observation: Accommodating, Diverging, Assimilating, and Converging [20,21]. Two methods are commonly used in research to obtain the Kolb learning style of learners: the rule-based scale measurement method and the automatic identification method [22][23][24]. The focus of this work is not on the identification of learning styles but on the recommendation of exercises. Therefore, a simplified simulation experiment is used to obtain the learners' learning styles.

Recommendation Algorithm
Exercise recommendation belongs to personalized recommendation, which uses algorithms to mine massive data and provide users with decisions or services that meet their individual needs. The research fields of traditional recommendation algorithms generally include news recommendation, commodity recommendation, music recommendation, and so on [46]. Yera and Martinez et al. [50] studied fuzzy recommendation systems based on content-based and memory-based collaborative filtering; according to their survey, most of the existing work focuses only on the attributes of the items rather than considering user preferences. Many researchers have begun to study recommendation algorithms in new fields. For example, the work in [47] developed a new hybrid food recommendation system to overcome some problems of traditional recommendation methods from the perspectives of content and users. Forouzandeh et al. [48] proposed a new tourism recommendation algorithm by combining the Artificial Bee Colony (ABC) algorithm and Fuzzy TOPSIS. In the field of exercise recommendation, Yera and Martinez et al. [49] also proposed a recommendation method based on collaborative filtering to solve the problem of information overload on POJ according to the ability of users. Compared with traditional recommendation scenarios, personalized exercise recommendation scenarios require recommendation activities to be more real-time, personalized, and explanatory.

Exercise Recommendation System
There are multiple approaches to personalized exercise recommendation modeling, including deep learning, reinforcement learning, and knowledge graphs [25][26][27]. In typical research scenarios, the exercise recommendation model uses static learning resource data and dynamic learning behavior data as input. The model then uses deep learning algorithms to perceive the intrinsic connection between learning resources and learners and recommends personalized exercises. Zhou et al. [28] introduced a full-path recommendation model for personalized learning based on LSTM. This model only quantifies the feature similarity between different learners and does not consider the features of the exercise. Shu et al. [25] adopted a content-based recommendation method for exercise recommendation. By mining the textual information of resources in an online learning system, they proposed a convolutional neural network (CNN) to recommend learning resources. Although the algorithm can recommend appropriate resources to the corresponding learners, it does not consider the influence of the learner's learning status on problem-solving. Wu et al. [26] used a recurrent neural network (RNN) to predict the coverage of knowledge concepts. They predicted students' mastery of knowledge concepts through deep knowledge tracing, thus contributing to exercise recommendation. However, they did not consider the impact of learning styles on the problem-solving of different learners. In addition, deep learning algorithms boost performance by adding deep structures to capture complex student-exercise relationships [27,28]. None of the above studies considers the influence of learning style on exercise recommendation. They all have only one recommendation target (whether learners answer the exercises correctly) and do not consider the different learning needs of learners.

Reinforcement Learning in Education
A promising direction for enriching personalized exercise recommendation models is to leverage reinforcement learning. The decision process of reinforcement learning is similar to humans extrapolating from experience when faced with practical problems [29]. Lei et al. [30] proposed a reinforcement learning method based on deep Q learning of specific users, which used the potential state of the constructed specific users to estimate the optimal strategy to give a multi-step interactive recommendation. Tang et al. [31] introduced a reinforcement learning method within the framework of Markov decision-making, which can strike a balance between making the best advice on current knowledge and exploring new learning trajectories that may be rewarded. In comparison to previously used techniques, reinforcement learning algorithms have made significant progress in strategy selection theory. However, most effective reinforcement learning strategies can only rely on manual feature extraction in the case of high dimensions and a large quantity of data. The quality of feature extraction directly affects the effect of model learning [29]. Deep reinforcement learning has a strong decision-making ability, which corresponds to the exercise selection decision-making process. Lecun et al. [32] believed that deep reinforcement learning is the critical direction for the gradual development of deep learning in the future. Therefore, designing a reinforcement learning framework that can work together is a complex problem in exercise recommendation.
Most existing exercise recommendation algorithms aim at only one recommendation goal, namely whether the learner answers the exercises correctly. They rarely consider the influence of the learner's learning status and learning style, do not consider the learner's personalized learning needs, and therefore do not keep up with the concept of personalized education advocated today. Personalized exercise recommendation thus needs to be explored in order to carry out reasonable exercise recommendation activities.
This study focuses on a deep reinforcement learning framework developed to address learners' difficulties in deciding what to practice. In this work, we not only design a personalized action space based on learners' learning styles but also design corresponding reward functions for learners' various learning needs, so as to move from a single-goal reward to a multi-goal reward. As a result, the model can accurately perceive learners' learning strategies and make quick decisions.

Problem Definition
In an Online Judge (OJ) system with $|S|$ learners and $|E|$ programming problems, a learner's online learning behavior sequence is defined as $s = \{e_1, f_1, e_2, f_2, \ldots, e_T, f_T\}$, $s \in S$, where $e_t \in E$ is the programming problem attempted by learner $s$ at time $t$ and $f_t$ is the corresponding score feedback. In addition, a programming problem can be expressed as a quintuple $e = \{d, k, c_d, c_{in}, c_{out}\}$, where $d$ is the difficulty attribute of the programming problem and $k \in K$ denotes the knowledge points contained in the programming problem. The number of knowledge points differs from problem to problem; examples of knowledge points are 'String', 'Recursion', and 'Dynamic Programming'. $c_d$, $c_{in}$, and $c_{out}$ all belong to the content text of the programming problem and represent the text description, input description, and output description, respectively. Each description is expressed as a continuous sequence of words, $c_i = \{w_1, w_2, \ldots, w_n\}$.
The programming problem recommendation task in the OJ system is modeled as a Markov Decision Process (MDP), which is composed of the student behavior sequence and the system environment. The MDP is defined as a quadruple $\{S, A, R, S'\}$:
• State set $S$: $S$ represents the state space of the learner's behavior. The state $s_t$ at time $t$ represents the learner's historical behavior record. Here, the learner's score feedback $f_t$ and the programming problem $e = \{d, k, c_d, c_{in}, c_{out}\}$ need to be jointly modeled.
• Action set $A$: $A$ represents the action space composed of programming problems in the OJ system. Based on the state $s_t$ at time $t$, the agent takes action $a_t \in A$, which corresponds to recommending the programming problem $e_{t+1}$ to the learner. This work designs personalized action spaces for learners with different learning styles.
• Reward function $R$: $R$ represents the reward function given by the state-action pair $(S, A)$. After the agent takes action $a_t$ in state $s_t$, the learner completes the given programming problem $e_{t+1}$, and the agent calculates the corresponding cumulative reward $r(s_t, a_t)$ based on the learner's feedback. The cumulative reward function comprehensively considers a variety of learning needs.
• State transition matrix $S'$: $S'$ represents the mapping from the state-action pair $(S, A)$ at time $t$ to the new state $s_{t+1}$.
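The definitions above can be mirrored in a small data model. The following is a minimal sketch rather than the authors' implementation; the class and field names (`Problem`, `Interaction`, `State`, `transition`) are hypothetical and only illustrate how the quintuple $e$ and the behavior record behind $s_t$ might be held in code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Problem:
    """Quintuple e = {d, k, c_d, c_in, c_out}; field names are illustrative."""
    difficulty: float                   # d
    knowledge_points: Tuple[str, ...]   # k
    description: str                    # c_d
    input_description: str              # c_in
    output_description: str             # c_out


@dataclass(frozen=True)
class Interaction:
    """One step (e_t, f_t): the attempted problem and its score feedback."""
    problem: Problem
    feedback: float


@dataclass
class State:
    """State s_t: the learner's historical behavior record up to time t."""
    history: List[Interaction] = field(default_factory=list)

    def transition(self, problem: Problem, feedback: float) -> "State":
        """Return s_{t+1} after the learner attempts the recommended problem."""
        return State(self.history + [Interaction(problem, feedback)])


# Usage: an empty record grows by one interaction per recommendation.
p = Problem(0.4, ("Recursion",), "description", "input", "output")
s0 = State()
s1 = s0.transition(p, 1.0)
```

Returning a new `State` instead of mutating the old one keeps each $s_t$ available for later use, for example when storing transitions for training.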
The work of this paper is inspired by the research of Huang et al. [33] and Tang et al. [31]. The differences are: (1) The application field is programming problem recommendation, and rich programming problem text information is integrated into the problem representation. (2) A personalized action space is constructed for learners with different learning styles, thus realizing the integration of the learner's learning style into the deep reinforcement learning framework. This work aims to construct the optimal recommendation strategy $\pi$, that is, to learn through continuous training the recommendation activities that maximize the target reward.

Multi-dimensional Integrated Programming Problem Representation Model
A programming problem consists of five parts: difficulty feature, knowledge point feature, text description, input description, and output description. These are quantified to form a multi-dimensional integrated programming problem representation model (MDPR). The overall structure of the model is shown in Fig. 2. Specifically, for a given programming problem, we first express the difficulty attribute of the problem according to Eq. 3. Then, we use a one-hot vector representation for the knowledge points it contains and introduce a parameter matrix to solve the sparsity problem caused by one-hot encoding. Next, we feed all the texts (text description, input description, and output description) into BGAM for training to obtain their respective vectors. Finally, we concatenate the five feature vectors to get the final programming problem representation vector. In this section, the Bi-GRU model is introduced to learn the contextual semantic information in the text from both the forward and backward directions, so that the model can better understand the implicit information in the text. Next, we introduce the design details of MDPR, including the difficulty feature, knowledge point feature, and text feature.

Difficulty Feature
The difficulty feature of a programming problem can usually be obtained by expert annotation or from the pass rate of the problem. The expert annotation method is commonly used in OJ systems. For example, the difficulty feature is distinguished by the Rating level in CodeForces, and difficulty labels of programming problems are directly given in LUOGU (levels 1-7). Although the reliability of the expert annotation method is high, it ignores 'small traps' in some programming problems of the same difficulty level, which significantly reduce the pass rate of the problems and thus increase their actual difficulty. Small traps become a stumbling block for learners solving programming problems. In addition, simply expressing the difficulty feature by the pass rate of the problems often ignores the learners' knowledge level. For example, suppose Jack and Oliver have a very high level of knowledge. They enjoy attempting complex programming problems and can quickly complete the code. For a programming problem of a certain difficulty level, only they have attempted it, and both solved it successfully. If the difficulty of the programming problem is derived from the pass rate alone, the difficulty of the problem is 0 (the pass rate is 100%), which is unreasonable. Therefore, this work holds that the difficulty feature of programming problems needs to comprehensively consider the results annotated by experts and the feedback from different learners on programming problems of the same difficulty.
Objective problems use 1 and 0 to indicate whether the exercise is correct, which is a general representation method for various exercise recommendation algorithm research. However, subjective problems and programming problems only use 1 and 0 to indicate whether the learners are correct or not, which is one-sided. Therefore, this work considers eight different submission states of programming problems to judge the scores of those who do the same level of problems. The specific status and assessment criteria are shown in Table 1.
Accepted means that the program passes all the test points (the learner answers the question correctly), and the learner gets a score of 1. Limit Exceeded means that the program can be run, but the correctness is unknown, so the learner gets a [...] Answer and Error belong to the wrong answer, and the program cannot run, so the learner fails to answer correctly and gets a score of 0. Finally, the formula for the scoring rate of each programming problem is as follows:

where $n$ is the number of records for problem $i$; $x_{Accept}$, $x_{TLE}$, $x_{MLE}$, and $x_{OLE}$ are the records whose status is Accepted, Time Limit Exceeded, Memory Limit Exceeded, and Output Limit Exceeded, respectively; and $p_i$ is the scoring rate of problem $i$. In particular, the data collected in this work do not include the number of submission states, and only include the 'total number of submissions' and 'total number of passes'. If the specific passing status of all learners on the test points of each programming problem can be obtained, the scoring rate is calculated as follows:

where $TP_{i,j}$ is the number of test points passed in the $j$-th record of problem $i$, and $TP_{i,j,total}$ is the total number of test points of the $j$-th record of problem $i$. The scoring rate and the original difficulty label are fused with the following formula:

Please note that $l_j \in [1, l_{max}]$, $l_{max} \ge 1$, is the difficulty level annotated by experts. For example, the highest difficulty level in the LUOGU OJ system is 7, so the value range of $l_j$ is 1-7 ($l_{max} = 7$). In addition, since $p_i \in [0, 1]$, Eq. 3 controls the value range of $d_i$ between 0 and 1. Difficulty feature is finally expressed as

Knowledge Point Feature
The knowledge points of programming problems usually include multiple knowledge concepts $k \in K$. For example, the problem P1020 'missile interception' in LUOGU contains three knowledge concepts ('dynamic programming', 'greed', and 'binary search'). This work adopts the one-hot representation method with the parameter matrix $W_k$. If only the one-hot method is used for representation, the entire matrix space will be very sparse, resulting in subsequent modeling difficulties. The specific representation steps are as follows:
1. Count the knowledge points of all programming problems, and deduplicate them to obtain the knowledge point space $K$.
2. Represent the knowledge points in each programming problem by one-hot encoding, denoted as $k$.
3. Use the parameter matrix $W_k$ to convert $k$ into a low-dimensional vector $v_k \in \mathbb{R}^{d_k}$, denoted as $v_k = W_k^T k$, so as to extract the core semantics from the knowledge point space.
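The three representation steps can be sketched as follows. This is a minimal illustration in dependency-free Python, assuming that a problem with several knowledge points is encoded as a single 0/1 vector with one entry set per knowledge point, and that $W_k$ is randomly initialized here (in the model it would be a learned parameter). The helper names and the value of `d_k` are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def build_knowledge_space(problems: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Step 1: deduplicate knowledge points and assign each an index in K."""
    space = sorted({kp for kps in problems for kp in kps})
    return {kp: i for i, kp in enumerate(space)}


def encode(kps: Sequence[str], index: Dict[str, int]) -> List[float]:
    """Step 2: 0/1 vector k over the knowledge point space K."""
    k = [0.0] * len(index)
    for kp in kps:
        k[index[kp]] = 1.0
    return k


def project(k: Sequence[float], w_k: Sequence[Sequence[float]]) -> List[float]:
    """Step 3: v_k = W_k^T k, where W_k has shape |K| x d_k."""
    d_k = len(w_k[0])
    return [sum(w_k[i][j] * k[i] for i in range(len(k))) for j in range(d_k)]


problems = [
    ["dynamic programming", "greed", "binary search"],
    ["recursion", "dynamic programming"],
]
index = build_knowledge_space(problems)
rng = random.Random(0)
d_k = 2  # illustrative embedding size, not taken from the paper
W_k = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in index]
v_k = project(encode(problems[0], index), W_k)
```

Because $k$ contains only zeros and ones, the projection simply sums the rows of $W_k$ that correspond to the problem's knowledge points, which is why the dense vector $v_k$ avoids the sparsity of the raw encoding.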

Text Feature
In this work, three types of texts (text description, input description, and output description) are regarded as three groups of natural sentences with specific meanings that contain contextual semantic relations. After a unified understanding of the three types of texts, the logical relations between the earlier and later parts of the programming problem texts are deeply mined. Because the logical relationships in programming problem texts are difficult to reason about, the context dependencies span long distances. CNN models such as TextCNN can only model local dependencies within a given range, so they have limitations [34]. In addition, when an RNN processes word sequences, later word nodes perceive earlier nodes less well, so the vanishing gradient problem is prone to occur [35]. This work adopts the bidirectional GRU (Bi-GRU) model, which uses the 'gate mechanism' to preserve context dependencies of variable-length spans in the text and the memory of previous nodes, thereby alleviating the vanishing gradient problem. Then, the model uses average pooling to extract the core information representation from the text information with fewer dimensions. Figure 3 shows the modeling process of the word vectors of the programming problem text. First, BGAM obtains all the text information $E = \{e_1, e_2, \ldots, e_m\}$ of an OJ system. Second, the model uses NLTK to segment the text at the sentence level (jieba is used for Chinese text) and removes stop words according to a stop word list. This yields the segmented corpus of programming problems. Then, BGAM uses a pre-trained word2vec model to convert each word in each programming problem text $e \in E$ into a $d_0$-dimensional word embedding vector. Please note that the text $e = \{c_d, c_{in}, c_{out}\}$ can be represented together during the experiment, where $c_i = \{w_1, w_2, \ldots, w_N\}$, $i \in \{d, in, out\}$.
Then, Bi-GRU

where $v_n$ is the semantic vector of the word $w_n$, and $\overrightarrow{v_n}$ and $\overleftarrow{v_n}$ are its forward and backward semantic vectors, respectively; they are concatenated by $\oplus$ to capture the semantic information of the word as completely as possible. In addition, the forward modeling process of the word sequence of a single text $c_i$ is as follows:

where $Z_t$ and $R_t$ are the update gate and reset gate of the GRU; they jointly control the calculation of the hidden state from $h_{t-1}$ to $h_t$. The value of $Z_t$ ranges from 0 to 1 and determines how much of $h_{t-1}$ is passed to the next state. $R_t$ controls the importance of $h_{t-1}$ to the next state, strengthening or weakening the previous memory according to the correlation between the previous and current memory states. $w_t$ is the current input, $\sigma(\cdot)$ is the sigmoid function, and $W_z$, $W_r$, and $W$ are the parameters (the weight matrices to be updated) of the update gate, reset gate, and candidate state. $h_t$ is the output of the model at time $t$.
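To make the roles of the two gates concrete, the sketch below implements one GRU step and a bidirectional pass in dependency-free Python. It follows one common GRU formulation and is not taken from the paper: it assumes the update rule $h_t = (1 - Z_t) \odot h_{t-1} + Z_t \odot \tilde{h}_t$ (some formulations swap the roles of $Z_t$ and $1 - Z_t$), omits bias terms, and uses hypothetical function names.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Matrix-vector product for row-major nested sequences."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gru_step(h_prev: Vector, w_t: Vector,
             W_z: Matrix, W_r: Matrix, W: Matrix) -> Vector:
    """One GRU step; each weight matrix has shape H x (H + D), no biases."""
    x = h_prev + w_t                                    # [h_{t-1}; w_t]
    z = [sigmoid(a) for a in matvec(W_z, x)]            # update gate Z_t
    r = [sigmoid(a) for a in matvec(W_r, x)]            # reset gate R_t
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + w_t
    h_cand = [math.tanh(a) for a in matvec(W, gated)]   # candidate state
    return [(1 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, h_cand)]


def bi_gru(words: Sequence[Vector],
           fwd: Tuple[Matrix, Matrix, Matrix],
           bwd: Tuple[Matrix, Matrix, Matrix],
           hidden: int) -> List[Vector]:
    """Run forward and backward passes and concatenate the per-word states."""
    def run(seq: Sequence[Vector], params) -> List[Vector]:
        h, out = [0.0] * hidden, []
        for w in seq:
            h = gru_step(h, list(w), *params)
            out.append(h)
        return out

    forward = run(words, fwd)
    backward = run(list(reversed(words)), bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

With all-zero weights both gates equal 0.5 and the candidate state is 0, so a step simply halves the previous hidden state; this makes a convenient sanity check for the gate arithmetic.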
After obtaining the contextual semantic representation of the word sequence, the model uses the Attention mechanism to calculate the weight probability of each word vector for the $N$ words in each subtext $c_i$, $i \in \{d, in, out\}$, of the text $e$. This method can filter unimportant information and focus on a small amount of important information, thereby strengthening the extraction of core information in word sequences. The implementation process of the attention mechanism is as follows:

where $h_t$ is the semantic vector output by the Bi-GRU layer at time $t$; $g_w$ is the weight matrix, which is updated during training after random initialization; $b_w$ is the bias term; $D_w$ is the randomly initialized attention matrix; and $v_c$ is the output value of the attention layer. Note that the Attention mechanism normalizes the product of the attention weight matrix (probability weight matrix) and the state of each hidden layer through the softmax function. Finally, the focus information in the sequence is extracted by Eq. 13. The model then uses an average pooling operation to merge the $N$ word vectors to reduce the dimensionality further.
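The attention and pooling stage can be illustrated with a short sketch. It uses a common additive attention form, which is an assumption rather than the paper's exact equations: each state is scored as $D_w \cdot \tanh(g_w h_t + b_w)$ with $D_w$ treated as a vector, the scores are normalized by softmax, and the attention-weighted states are then averaged over the $N$ positions. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Softmax with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(states: Sequence[Sequence[float]],
                      g_w: Sequence[Sequence[float]],
                      b_w: Sequence[float],
                      d_w: Sequence[float]) -> List[float]:
    """Score each hidden state h_t and normalize the scores to probabilities."""
    scores = []
    for h in states:
        u = [math.tanh(sum(g * x for g, x in zip(row, h)) + b)
             for row, b in zip(g_w, b_w)]
        scores.append(sum(d * ui for d, ui in zip(d_w, u)))
    return softmax(scores)


def attend_and_pool(states: Sequence[Sequence[float]],
                    g_w: Sequence[Sequence[float]],
                    b_w: Sequence[float],
                    d_w: Sequence[float]) -> List[float]:
    """Weight each state by its attention probability, then average-pool."""
    alpha = attention_weights(states, g_w, b_w, d_w)
    weighted = [[a * x for x in h] for a, h in zip(alpha, states)]
    n = len(states)
    return [sum(column) / n for column in zip(*weighted)]


# Usage: with zero attention parameters every position gets the same weight.
states = [[1.0, 2.0], [3.0, 4.0]]
zeros_g = [[0.0, 0.0], [0.0, 0.0]]
pooled = attend_and_pool(states, zeros_g, [0.0, 0.0], [0.0, 0.0])
```

In a trained model the parameters would not be zero, so informative words would receive larger weights and dominate the pooled vector.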
according to different texts (text description, input description, and output description), and this part does not require any manual annotation. Through the model, it can learn the core semantic information implicit in the text of programming problems and effectively distinguish the characteristics of different texts. In summary, the vectorized representation of the programming problem integrates the difficulty feature, the knowledge point feature, and the three kinds of text features, which are concatenated as follows:

Programming Problem Recommendation Algorithm Based on Deep Reinforcement Learning
The workflow of the recommendation task in the reinforcement learning framework is as follows (see Fig. 4). First, the model clarifies the environmental state of the programming problem recommendation system: the state $s_t$ at time $t$ is composed of the programming problem information $e_t$ and the score feedback $f_t$. Based on the state $s_t$, the agent randomly selects an action $a_t$ from the action space $LS \leftarrow A$ of the specified learning style $LS$ (that is, a programming problem is recommended to the learner). Next, the learner in the environment changes state by answering the recommended programming problem, and the system enters the next time step $t + 1$. The system receives the feedback $f_t$ from the learner and calculates the reward $r_t$ obtained by the current action $a_t$ based on the cumulative reward function. Finally, the agent enters a new state $s_{t+1} = S(s_t, a_t)$. This section introduces the design details of the programming problem recommendation framework DRLP, including the optimization objective, environment state, Q-network, reward strategy, and model training process.

Optimization Objective
The overall goal of applying a reinforcement learning algorithm to programming problem recommendation is to find the optimal recommendation strategy $\pi$. Based on this goal, the algorithm should learn the optimal action-value function $Q^*(s_t, a_t)$ among all action values at time $t$ (taking action $a_t$ in state $s_t$ maximizes the expected return). On the Markov reward chain, rewards obtained in the time steps after $t$ are discounted as time increases, where the discount factor $\gamma \in (0, 1)$ reflects the value of future rewards at the current moment. Assuming that the recommendation activity stops at time $T$, the reward calculation function is as follows:

$$R_t = r_t + \gamma r_{t+1} + \cdots = \sum_{k=0}^{T} \gamma^k r_{t+k}. \tag{15}$$

At this point, $Q^*(s_t, a_t) = \max_{\pi} \mathbb{E}[R_t \mid s_t, a_t, \pi]$, where $\pi$ is the strategy for mapping state $s_t$ to action $a_t$. Usually, the iterative update process of the action-value function follows the Bellman equation [36]. For all feasible actions $a'$ in the next state $s'$, if the corresponding maximum value $Q^*(s', a')$ is known, the current optimal strategy is to select an action $a'$ that maximizes the target action-value.

However, the recommendation task in the OJ system cannot use the classic Q table of reinforcement learning to save and dynamically update the optimal actions in all states. The reasons are as follows: (1) Each OJ system has many programming problems; for example, CodeForces had 7405 programming problems and more than 10 million records of learners' behavior as of December 6, 2021, so the Q table cannot calculate and store all the state-action pairs. (2) The state-action pairs in the Q table are updated step by step. This strategy requires learners to complete almost all programming problems on the OJ system: if a learner has not done one of the recommended programming problems, the Q table cannot predict the following action based on the current feedback. In reality, the number of programming problems that learners have done is much smaller than the number of programming problems in the OJ system.
To solve the problem of Q table failure in reinforcement learning, this framework uses a nonlinear function approximator to approximate the optimal action-value $Q^*(s, a)$, that is, $Q(s, a \mid \theta) \approx Q^*(s, a)$, where $\theta$ denotes the parameters of the nonlinear function approximator (the neural network). PQN is the action-value estimation network in this framework. Training PQN requires minimizing the loss function $L_i(\theta_i)$ at iteration $i$. Note that the target of iteration $i$ is $y_i = \mathbb{E}[r + \gamma \max_{a'} Q(s', a' \mid \theta_{i-1}) \mid s, a]$, $\rho(s, a)$ is the probability distribution over states $s$ and actions $a$, and $L_i(\theta_i)$ gradually stabilizes during optimization. The gradient is then obtained by fixing the parameters $\theta_{i-1}$ of the previous iteration and taking the partial derivative of $L_i(\theta_i)$ with respect to $\theta_i$. This framework uses the Adam optimizer to optimize the loss function, which reduces the high computational cost of evaluating the full expectations in this gradient.
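As a small numerical illustration of Eq. 15 and of the one-step target used when training a Q-network, the following sketch uses plain Python. The function names and the squared-error form of the loss follow the standard DQN setup and are assumptions rather than the authors' code.

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k} for the rewards observed from time t on."""
    total = 0.0
    # Accumulate from the last reward backwards: R = r + gamma * R_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def td_target(reward, next_q_values, gamma):
    """One-step target r + gamma * max_a' Q(s', a').

    next_q_values holds Q(s', a') for every feasible action a'; an empty
    list is treated as the end of the recommendation activity.
    """
    if not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_loss(q_value, target):
    """Squared difference between the target and the current estimate."""
    return (target - q_value) ** 2
```

With `gamma = 0.5`, three consecutive rewards of 1 give a return of 1 + 0.5 + 0.25 = 1.75, which shows how later rewards contribute less to the current value.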

Action Space
After investigation, it was found that learners with different learning styles have different learning strategies when learning languages [37]. Focusing on the learning process of programming problems, this work believes that different learning styles will also affect learners' preferences.
Continuing previous learning style modeling work, this work uses Kolb's learning style theory as the basic theory to build personalized action spaces for different learners. The specific explanations are as follows. Accommodative learners are good at setting specific learning goals and exploring multiple methods to solve problems. They tend to adopt a problem-solving strategy in which difficulty increases level by level within a single type of knowledge point. Divergent learners are good at reflecting on and observing specific experiences to generate innovative ideas. They tend to choose problems with low-to-medium difficulty and a single type of knowledge point for sufficient training and thinking to find innovative breakthroughs. Assimilation learners believe that logic is more important than practical value. They tend to choose problems with medium-to-high difficulty and rich knowledge points to broaden their knowledge. Convergent learners think that trial-and-error learning is the key to improving their ability. They tend to try more challenging problems with rich knowledge points to find their deficiencies through trial and error. Based on these viewpoints, the action space can be constructed.
Based on the difficulty feature of each problem in the programming problem set $E$, all programming problems are sorted from high to low to obtain $e_1^d, e_2^d, \ldots, e_n^d$, where $e_1^d$ is the most difficult programming problem and $e_n^d$ is the least difficult. Taking the LUOGU OJ system as an example, the difficulty of programming problems is divided into seven major levels (entry, popular−, popular/improved−, popular+/improved, improved+/provincial selection−, provincial selection/NOI−, NOI/NOI+/CTSC). According to the weighted difficulty ranking calculated by Eq. Based on the number of knowledge points contained in each problem in the programming problem set $E$, the problems are sorted from high to low to obtain $e_1^k, e_2^k, \ldots, e_n^k$, where $e_1^k$ is the programming problem with the largest number of knowledge points (problems with the same number of knowledge points are ranked in no particular order) and $e_n^k$ is the programming problem with the smallest number of knowledge points. Two kinds of action spaces are designed according to the number of knowledge points ($S = \{e_1^k, e_2^k, \ldots, e_S^k\}$ with the number of knowledge point types greater than or equal to 2, $S = \{e_{S+1}^k, e_{S+2}^k, \ldots, e_R^k\}$ with the number of knowledge point types less than 2). $S + R$ is the total number of programming problems. Table 2 summarizes the optional actions for learners of the four learning styles. Note that during training the proposed model does not recommend multiple problems to a learner at one time; instead, it randomly selects one programming problem (a specific action) from the corresponding four action spaces for a learner with a specific learning style. The optional action spaces under the same learning style type are merged in the form of intersection to construct a personalized action space $LS \leftarrow A$, where $LS$ represents the learning style.

Details of PQN
According to the Markov decision process, the current learning state $s_t$ of the learner depends only on the state $s_{t-1}$ and not on any earlier moment. Based on this, the learner state $s_t = (e_t, f_t)$ can be established, where $f_t$ is the score feedback of the learner on the programming problem $e_t$ at time $t$. In practice, however, learning is a continuous process in which past learning states affect the current one, so PQN is constructed on this basis. The long short-term memory network (LSTM) can recurrently model the learning state sequence and effectively capture the long-term dependence of the learner's behavior. The current state is learned from the complete problem-solving record, so the learner state in PQN can be expressed as $s_t = (e_1, f_1, e_2, f_2, \ldots, e_t, f_t)$ (all programming problems that the learner has done from time 1 to time $t$ and their scores). To make PQN more general, the score here covers not only the scoring method in Table 1 but also the test point score representation (Eq. 2), so the score space can be denoted as $f_t \in [0, 1]$. After MDPR characterizes the programming problem $e_t$, the programming problem representation vector $y_t$ is obtained. The learner state is obtained by fusing $y_t$ and $f_t$, where $\mathbf{0} = (0, 0, \ldots, 0)$ is the zero vector of the same dimension as $y_t$, and $\dot{\mathbf{0}}$ is the zero vector of half the dimension of $\mathbf{0}$. In particular, when the dimension of $\mathbf{0}$ is odd, the front $\dot{\mathbf{0}}$ has one more dimension than the rear $\dot{\mathbf{0}}$. This work assumes that the state of the agent (PQN) in the reinforcement learning framework is consistent with the learner's state. Figure 5 shows the network structure of PQN. Although the number of programming problems on the OJ system is enormous, it is still finite. LSTM has one more gating unit than GRU and has better long-term memory performance.
LSTM updates the learner state $b_i$ at each moment from the state at the previous moment, which is recorded as $b_{i-1}$. The learner's current state is represented by the state of the LSTM at the last moment ($s_t = b_t$). Then, based on the learning style $LS$ of the current learner, the personalized action space $LS \leftarrow A$ is selected, denoted as $A_{LS}$. Next, for the state $s_t$ and each candidate action $a_j \in A_{LS}$, the state-action feature $h_t$ is learned by an $n$-layer fully connected neural network. Finally, the state-action estimate $Q(s_t, a_j)$ is calculated by the sigmoid function.

Reward Strategy
Training the agent aims to obtain the maximum reward through the optimal policy when taking action $a$ based on state $s$. The reward here means the cumulative reward. Designing a rational and practical reward function in the deep reinforcement learning model is essential for the agent's state transition. Traditional problem recommendation algorithms mostly use a single reward mechanism, such as recommending problems with low scores [38] or recommending similar problems [39]. In the scenario of programming problem recommendation, however, the learning needs of learners change dynamically, and different learning needs require different types of problems (different knowledge points and different difficulty levels). Therefore, this work proposes a reward function that integrates multiple needs (a cumulative reward obtained by combining rewards for multiple learning needs). The learning needs in the programming problem scenario include Review and Discover, Knowledge Point Smoothness, and Difficulty Smoothness.

Table 2 (recoverable entries): $H$: problems of high difficulty; $3F$: problems of low, moderate, and high difficulty ($F$ problems of each type); $R$: problems of rich knowledge point type.

Review and Discover
According to the Ebbinghaus forgetting curve [40], forgetting is fast at first and then slows down. Only through regular repeated practice can short-term memory become long-term memory. Therefore, while acquiring new knowledge, learners need to regularly review programming problems that they have not mastered. The factor design is as follows. If the learner gets the Accepted feedback after completing a specific problem and the system recommends a problem that does not contain this knowledge point, the Agent is given a high reward coefficient in the range (0, 1) (Discover). If the system instead recommends a problem containing this knowledge point, the Agent is given a low reward coefficient in the range (0, 0.3) (Review). If the learner does not get the Accepted feedback after completing a problem and the system recommends a problem that does not contain the knowledge point, the Agent is given a penalty coefficient in the range (−1, 0) (inappropriate recommendation). If the system instead recommends a problem containing this knowledge point, no reward or punishment is given to the Agent (normal recommendation). The reward function $r_1$ is built from these cases, with the value ranges of the three coefficients as given above. The specific values of the coefficients should be set flexibly in experiments to adapt to different requirements. For example, if the Discover coefficient is set larger and the Review coefficient is set smaller, the Agent will get more Discover rewards than Review rewards. $k_t$ is the knowledge point vector of the problem at the current time $t$, and $k_{t+1}$ is the knowledge point vector of the problem recommended at the next time. Feedback such as Limit Exceeded is not treated as a separate case, because it is also a situation in which the learner fails to answer correctly.
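The four cases above can be written as a short function. This is a sketch under assumptions: the default coefficient values are hypothetical picks inside the stated ranges, and "contains this knowledge point" is interpreted as the two problems sharing at least one knowledge point.

```python
def review_discover_reward(accepted, k_t, k_next,
                           discover=0.8, review=0.2, penalty=-0.5):
    """Reward r_1 for recommending a problem with knowledge points k_next
    after the learner attempted a problem with knowledge points k_t.

    accepted: True if the learner got the Accepted feedback.
    discover, review, penalty: hypothetical values chosen inside the
    ranges (0, 1), (0, 0.3) and (-1, 0) given in the text.
    """
    shares_knowledge = bool(set(k_t) & set(k_next))
    if accepted:
        # Mastered: new knowledge earns Discover, repetition earns Review.
        return review if shares_knowledge else discover
    # Not mastered: moving away is penalized, staying is neutral.
    return 0.0 if shares_knowledge else penalty
```

Setting `discover` well above `review` pushes the Agent toward new knowledge points once a problem is accepted, matching the example given in the paragraph.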

Knowledge Point Smoothness
Learning programming problems requires repeated practice on the same type of knowledge point in order to master it and to generalize from one case to others, which is different from the exercise mode of introductory education courses, such as mathematics, English, and Chinese. The jump between knowledge point types in the recommendation task should not be too large. Instead, the difficulty of problems should be gradually increased within a specific type of knowledge point before other knowledge points are recommended. Note that since the knowledge points of programming problems are not closely related enough to build a knowledge point network, this work designs this indicator only from the number of knowledge points. The basic assumption is as follows: the problem $e_t$ at time $t$ and the problem $e_{t+1}$ recommended at the next time contain knowledge points $k_t$ and $k_{t+1}$, respectively. If the absolute value of the difference between the numbers of knowledge points in $k_t$ and $k_{t+1}$ exceeds 1, the recommendation is punished. In the reward function $r_2$, $count(*)$ is the number of elements in a vector, and $n_{k_{t+1}, k_t}$ is the absolute value of the difference between the numbers of elements of the two knowledge point vectors. If $k_{t+1}$ and $k_t$ have common knowledge points and satisfy the basic assumption, the function returns a reward coefficient in the range (0, 1). If the absolute value of the difference between $k_t$ and $k_{t+1}$ exceeds 1, the function returns a penalty coefficient in the range (−1, 0). No reward or penalty is given in other cases.
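A minimal sketch of this rule is shown below. The default coefficient values are hypothetical picks inside the stated ranges, and the function name is introduced here for illustration only.

```python
def knowledge_smoothness_reward(k_t, k_next, reward=0.5, penalty=-0.5):
    """Reward r_2 based on the number of knowledge points in consecutive problems.

    reward and penalty are hypothetical values inside (0, 1) and (-1, 0).
    """
    # n_{k_{t+1}, k_t}: absolute difference of the knowledge point counts.
    gap = abs(len(k_next) - len(k_t))
    if gap > 1:
        return penalty  # jump in the number of knowledge points is too large
    if set(k_t) & set(k_next):
        return reward   # smooth change with shared knowledge points
    return 0.0          # smooth change but nothing in common
```

For instance, moving from a problem tagged only with 'Greedy' to one tagged with three knowledge points is penalized, while adding a single knowledge point to 'Greedy' is rewarded.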

Difficulty Smoothness
The learner's knowledge acquisition is a continuous process, so the difficulty should not change too much from one problem to the next [41]. If the difficulty changes too much, the learner's sense of participation is reduced, which directly causes the learner to lose interest in continuing to study. Inspired by the research of Huang et al. [33], this part improves their difficulty smoothness formula so that the value range of the reward function is $(-1, 1]$. The reward function is $r_3 = \frac{4}{\pi}\left(\arctan(L(*, *)) + \frac{\pi}{4}\right)$, where $L(*, *)$ is the negative of the squared difficulty difference: the closer the difficulty of the two problems, the larger the value. If the square of the difficulty difference is greater than 1, the difficulty change is too large and not smooth enough, and the Agent receives a penalty value. If the square of the difficulty difference is less than 1, the difficulty change is appropriate, and the Agent receives a reward value. For example, when $L(*, *) = -1$ there is no reward, since $r_3 = \frac{4}{\pi}\left(\arctan(-1) + \frac{\pi}{4}\right) = 0$. Finally, based on the three reward functions, weight coefficients are used to fuse them into the final cumulative reward function $r$. The weights can be set freely by researchers or developers, which offers high flexibility. If the learner pays more attention to difficulty smoothness, the weight of $r_3$ can be increased and the weights of $r_1$ and $r_2$ decreased accordingly. Other situations are similar.
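The difficulty smoothness reward and the fusion step can be checked numerically with the sketch below. The arctan form follows the worked example in the text; the weighted sum used for fusion and the default weights are assumptions made for illustration.

```python
import math


def difficulty_smoothness_reward(d_t, d_next):
    """r_3 = (4 / pi) * (arctan(L) + pi / 4), with L = -(d_t - d_next)^2."""
    neg_squared_gap = -((d_t - d_next) ** 2)
    return (4.0 / math.pi) * (math.atan(neg_squared_gap) + math.pi / 4.0)


def cumulative_reward(r1, r2, r3, weights=(1.0, 1.0, 1.0)):
    """Fuse the three rewards with weight coefficients (assumed weighted sum)."""
    w1, w2, w3 = weights
    return w1 * r1 + w2 * r2 + w3 * r3
```

Equal difficulties give the maximum reward of 1, a squared gap of exactly 1 gives 0, and larger gaps give negative values that approach −1.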

Training Process
Ideally, to apply the recommendation framework DRLP to programming problems, it is necessary to build a prototype of an OJ recommendation system that meets the data flow requirements. The system's design needs to meet three conditions: (1) When learners log in to the system, the system needs to automatically obtain their learning style, which can be done through the scale. (2) There are enough programming problems in the system, and their attributes meet the requirements of this work. (3) There are enough learners in the system to leave their learning behaviors. However, building such a system is beyond the scope of this paper. Instead, the proposed recommendation strategy is optimized by combining actual data with simulation experiments. The training method of DQN [42] is adopted in DRLP: two networks with the same structure are set up (the target network $Q(s, a \mid \theta)$ and the evaluation network $Q(s, a \mid \theta^-)$). After every $M$ steps, the target network parameters are updated to make $\theta = \theta^-$. The algorithm's stability is improved by making the two networks independent of each other (line 14 of the algorithm). At the same time, the state transition sequence $x_t = (s_t, a_t, r_t, s_{t+1})$ of each time step $t$ is stored in $D$ by introducing the experience replay pool (line 12 of the algorithm). In addition, the most significant difference between DRLP and the classical DQN training process is the introduction of the personalized action space based on learning style.
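The three training ingredients mentioned above (the experience pool $D$, action selection restricted to the personalized action space, and the periodic parameter copy every $M$ steps) can be sketched in plain Python. All names here (`ReplayPool`, `best_action`, `sync_target`) are hypothetical, parameters are represented as plain dictionaries, and the sketch is not the authors' algorithm listing.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-size pool D of transitions x_t = (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions.
        size = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), size)


def best_action(q_fn, state, action_space):
    """Greedy choice restricted to the learner's personalized action space."""
    return max(action_space, key=lambda action: q_fn(state, action))


def sync_target(step, every_m_steps, evaluation_params, target_params):
    """Copy the evaluation parameters into the target network every M steps."""
    if step > 0 and step % every_m_steps == 0:
        target_params.clear()
        target_params.update(evaluation_params)
        return True
    return False
```

Restricting `best_action` to the action space of the learner's style is the only place where this loop differs from a classical DQN loop, which mirrors the difference stated in the paragraph.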

Experiments and Results
This section conducts data analysis and evaluates the effectiveness of DRLP by collecting a large amount of programming learning behavior data on the LUOGU OJ system.

Dataset
The experiment uses the learners' learning behavior data and the programming problem data of the LUOGU OJ system. This system provides a convenient and refreshing programming experience for contestants participating in programming competitions, such as NOIP, NOI, and ACM, and learners can train their programming skills independently. The system allows learners to submit code repeatedly until they obtain the full evaluation score. Therefore, the learning behavior of learners every time they attempt a specific programming problem has research value. For each problem, the learner's final feedback score should comprehensively reflect the evaluation scores obtained during the learning process rather than a simple 'pass' or 'fail'. The dataset was acquired in three steps: (1) Cache the relevant URLs of LUOGU's learning behavior records and programming problem data through Redis. (2) Initiate requests to the URLs and parse the returned JSON data. (3) Store the data in a MongoDB database. The initial dataset contains 6236 programming problems and 7,897,618 learning behavior records. To ensure the continuity of the research, the selected online learning behavior data cover the period from October 17, 2020 to April 27, 2021. Learners with fewer than 32 learning records are deleted. The dataset contains 183 knowledge points, such as 'Extended Euclidean', 'Greedy', and 'greatest common divisor'. Table 3 shows the information of the dataset.

Explore Implicit Information in Data
The learning records of different learning periods should not be regarded as continuous learning records. For example, a learner generates two consecutive learning records at '2021-02-20 02:20:22' and '2021-02-22 04:25:22'. The time difference is about 50 h, so they belong to two learning sessions in different periods. Therefore, for each learner, two consecutive learning records with a time difference of more than 8 h are assigned to two different sessions. According to statistics, the average number of sessions per learner is 21.705, and the total number of sessions is 1,023,341.
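The session rule can be expressed in a few lines of Python. The function name `split_sessions` and the timestamp format string are assumptions chosen to match the example timestamps above.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, max_gap_hours=8):
    """Group one learner's record timestamps into sessions.

    A new session starts whenever the gap to the previous record
    is more than max_gap_hours.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    times = sorted(datetime.strptime(t, fmt) for t in timestamps)
    max_gap = timedelta(hours=max_gap_hours)

    sessions = []
    for current in times:
        if sessions and current - sessions[-1][-1] <= max_gap:
            sessions[-1].append(current)   # continue the current session
        else:
            sessions.append([current])     # gap too large: open a new session
    return sessions
```

Applied to the two example records, the gap of about 50 h exceeds 8 h, so the function returns two sessions with one record each.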
The statistics show that the number of learning records differs from session to session. Exploring the multidimensional characteristics of the learning records in a session can help verify the rationality of the model hypothesis. This section explores the difficulty of the programming problems done by learners in long and short sessions, the distribution of knowledge points, and the number of learners in different sessions. The specific results are as follows. The left graph in Fig. 6 shows the relationship between the number of learners and the number of sessions over 6 months. The graph shows that learners with between 1 and 20 sessions are the overwhelming majority. Only 1.3% of learners have more than 100 sessions, which reflects that most learners usually spend a lot of time training on programming problems.
We then explored the relationship between session length and different learning record features, where session length means the number of learning records contained in a session.
The right graph in Fig. 6 shows the average test scores of the learning records in different sessions. The average test scores of the learners decrease gradually as the session length increases, which means that learners usually spend much time in long sessions to overcome complex problems. Appropriate difficulty and smoothness of knowledge points should be provided for such learners.

Figure 7 shows the average number of knowledge points contained in the problems involved in different sessions. When the session length is less than 150, the number of knowledge points in a long session is larger than that in a short session, indicating that the learner is willing to try more new problems in a period. When the session length is greater than 150, the number of knowledge points in the session remains at a high level (greater than 9). This shows that it is necessary not only to add new knowledge points to meet the learning need of 'Discover' but also to consider the smoothness of knowledge points.

Figure 8 shows the average difficulty difference between two consecutive problems in each session. The difficulty difference of the problems solved by learners in short sessions is minimal. In contrast, the difficulty difference fluctuates significantly in long sessions, which means that some short-session learners selectively do problems of a specified difficulty on the LUOGU system. (There are situations in which the same problem is selected multiple times within a session.) However, the difficulty preference of some long-session learners is not fixed. Therefore, Difficulty Smoothness should be considered to meet the individual needs of more learners.

Figure 9 shows the average difficulty of problems in each session. The average difficulty of short sessions is generally between 0.4 and 0.5, while the average difficulty of long sessions is mainly in the high difficulty area, with a small part in the low difficulty area. This means that learners have individual preferences in selecting problems, and the effort they put in varies between complex and simple problems. Therefore, Discover and Review can help learners better grasp the problems.

Parameter Settings
All models in this experiment are implemented in PyTorch. The Adam optimizer is used for model optimization with $\beta = (0.9, 0.999)$ and $\epsilon = 1e{-}8$. The learning rate and the other hyperparameter settings are shown in Table 4.

Experiments with MDPR
The programming problem representation model MDPR involves the representation of knowledge points, the representation of difficulty, and the representation of the problem text. This section shows the experimental results of exploring BGAM in the process of problem text representation. To start the experiment, it is assumed that the problem text has a specific relationship with knowledge points and difficulty. The initial labels are obtained by the following rules:
• Obtain the median of the number of knowledge points per problem among all programming problems.
• For a programming problem, if the difficulty is greater than 0.33 or the number of knowledge points is greater than the median, it is labeled A. Otherwise, it is labeled B.
Here, A and B are the category labels of the problems. As a result, the ratio of the two types of problems is A : B = 0.564 : 0.436, and the task becomes a binary classification task. After model training, the vector representation of a problem can be obtained by outputting the result of the last layer. In the experiment, texts are divided into a training set and a test set at a ratio of 7:3. The classification result is obtained by adding the logsigmoid function to the last layer of BGAM. The final prediction accuracy of the model is 0.614 (the hyperparameter settings of BGAM are shown in Table 4). To explore whether different word embedding dimensions can represent different degrees of richness for words in the text, this section conducts experiments with different Embedding dims. The results are as follows (Table 5). When the Embedding dim does not exceed 128, the evaluation indicators increase with the Embedding dim. Beyond 128, further increasing the Embedding dim harms the model. Therefore, different Embedding dims can indeed express different degrees of richness of words. However, too high a dimension may lead to an over-representation of words, resulting in a decrease in the prediction effect of the model. The low overall prediction accuracy indicates that the relationship between the text of a programming problem and the assumed features is not clear. This is because there is no positive relationship between the complexity of the text description and its difficulty. For example, the text description of a programming problem may be complex and challenging to understand while the data structure knowledge points it actually contains are clear. Therefore, the implicit information contained in the text cannot be replaced by difficulty and knowledge points, and the programming problem representation needs to include the implicit information in the text.
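The labeling rule used to obtain the initial A/B labels for this experiment can be sketched as follows. The function name `label_problems` and the input layout (a list of difficulty and knowledge point count pairs) are assumptions made for illustration.

```python
from statistics import median


def label_problems(problems, difficulty_threshold=0.33):
    """Assign label 'A' or 'B' to each problem.

    problems: list of (difficulty, number_of_knowledge_points) pairs,
    with difficulty assumed to be normalized to [0, 1].
    """
    # Median of the number of knowledge points per problem.
    median_count = median(count for _, count in problems)
    return [
        "A" if difficulty > difficulty_threshold or count > median_count else "B"
        for difficulty, count in problems
    ]
```

A problem is labeled B only when it is both easy (difficulty at most 0.33) and at or below the median number of knowledge points.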

Experiments with PQN
In this section, simulation experiments are constructed to evaluate the effectiveness of the state-action estimation network PQN. Before model training, an action space based on learning style must be built for the learners. According to the theory of the 'zone of proximal development' [43], this work considers a problem too difficult for the learner if its difficulty is greater than 0.67 and too easy for the learner if its difficulty is less than 0.33. The rules are outlined as follows:
• Problems with a difficulty less than 0.33 are regarded as low difficulty, problems with a difficulty of 0.33 to 0.67 are regarded as moderate difficulty, and problems with a difficulty above 0.67 are regarded as high difficulty.
• According to the classification criteria for knowledge point types (in the Action Space section), problems are divided into those with a single type of knowledge point and those with rich knowledge points.
Thus, the numbers of problems in the action spaces are set as shown in Fig. 10.
The learner's Kolb learning style can be obtained through the scale, but learners in the LUOGU OJ system cannot be asked to fill in the scale, so their most accurate learning style is unavailable. To make the simulation experiment possible, this part designs the following rules to divide the learning styles of learners according to the correspondence between learning style and action space (Table 2):
• Learners who have not done problems with higher difficulty are classified as Diverging. Learners who have not done problems with lower difficulty are classified as Assimilating.
• In all other cases, if the number of problems with lower difficulty is less than the number of problems with higher difficulty, the learner is classified as Accommodating; otherwise, the learner is classified as Converging.
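The rules above can be sketched as a small classifier. It assumes that 'higher difficulty' and 'lower difficulty' refer to the high (above 0.67) and low (below 0.33) tiers defined earlier, and that the Diverging rule is checked first when a learner has done neither; both points are assumptions, as are the function names.

```python
def difficulty_level(difficulty, low=0.33, high=0.67):
    """Map a normalized difficulty to 'low', 'moderate' or 'high'."""
    if difficulty < low:
        return "low"
    if difficulty > high:
        return "high"
    return "moderate"


def assign_learning_style(difficulties):
    """Infer a simulated Kolb learning style from the difficulties of
    the problems a learner has done."""
    levels = [difficulty_level(d) for d in difficulties]
    n_low, n_high = levels.count("low"), levels.count("high")
    if n_high == 0:
        return "Diverging"      # never attempted high-difficulty problems
    if n_low == 0:
        return "Assimilating"   # never attempted low-difficulty problems
    return "Accommodating" if n_low < n_high else "Converging"
```

The inferred style is only used to select the action space for a learner in the simulation, so a misclassified learner changes which candidate problems are considered rather than how they are scored.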
After dividing the learning styles of learners according to the above rules, the distribution of learning styles among the 47,147 learners is 23.6% Accommodating, 58.5% Diverging, 3.0% Assimilating, and 14.9% Converging (the imbalance of learning style samples will not affect the recommendation results and is only used to select the action space). After deduplicating the items, we get that the Diverging, Accommodating, Assimilating, and Converging action spaces contain 3643 items, 4349 items, 4705 items, and 5727 items (the action space of different learning styles has repeated items).
The recommendation experiment in this section recommends a unique list of problems to each learner at a specified time. The problems in the list are sorted according to the algorithm, and the number of times the learners select the recommended problems is then compared.

In the simulation experiment, the first 80% of each learner's learning behavior records are used as training data, and the last 20% are used as test data. The algorithm learns the interactive behavior of each learner from the problems that have already been attempted and, at the last moment of training, lists the recommended problems for the current state (sorted by Q value from high to low). To focus on evaluating the accuracy of PQN's recommendations rather than the impact of the reward strategy, this work only considers the problems that each group of learners has not answered correctly as recommendation targets. The ground truth of the training data is 0 (a problem that has been done correctly is recommended) or 1 (a problem that has not been done correctly is recommended). The evaluation metrics of the model are HR@K, MRR@K, and Novelty@K [26], which are explained as follows:

1. HR@K: Hit Rate is the ratio of the recommended K programming problems that learners have learned in the test set (learned 1, unlearned 0). The formula is as follows:

$$\mathrm{HR@K} = \frac{\text{Num of Hit@K}}{|D_{test}|}, \tag{26}$$

where $|D_{test}|$ is the number of programming problems in the test set.

2. MRR@K: Mean Reciprocal Rank measures the importance of the correct ranking of the recommended K programming problems. In its formula, $p_i$ is the rank, in the recommended results, of the problems that the learners actually learned in the test set. If the recommended $i$-th programming problem does not appear in the test set, then $p_i = \infty$.

3. Novelty@K: Novelty measures the novelty of the recommended problems, that is, the rate at which the recommended problems contain knowledge points that the learner answered incorrectly or did not answer. In its formula, $dist(e_i)$ indicates whether $e_i$ is a novel problem: if the recommended $i$-th programming problem contains knowledge points that have not been learned or correctly answered, then $dist(e_i) = 1$; otherwise $dist(e_i) = 0$.
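The three metrics can be sketched for a single learner as follows. HR@K follows Eq. 26; the normalization of MRR@K and Novelty@K by K is an assumption, and the function names and data layout are hypothetical.

```python
def hit_rate_at_k(recommended, test_set, k):
    """HR@K = (number of top-K recommendations found in the test set) / |D_test|."""
    hits = sum(1 for problem in recommended[:k] if problem in test_set)
    return hits / len(test_set)


def mrr_at_k(recommended, test_set, k):
    """Average of 1 / p_i over the top-K list; a miss contributes 0 (p_i = inf)."""
    reciprocal_ranks = (
        1.0 / rank
        for rank, problem in enumerate(recommended[:k], start=1)
        if problem in test_set
    )
    return sum(reciprocal_ranks) / k


def novelty_at_k(recommended, problem_knowledge, mastered_knowledge, k):
    """Share of top-K problems containing a knowledge point not yet mastered."""
    mastered = set(mastered_knowledge)
    novel = sum(
        1 for problem in recommended[:k]
        if set(problem_knowledge[problem]) - mastered   # dist(e_i) = 1
    )
    return novel / k
```

For a recommended list `["p1", "p2", "p3", "p4"]` with test set `{"p2", "p4", "p9"}` and K = 3, only `p2` is a hit at rank 2, so HR@3 is 1/3 and MRR@3 is 0.5/3 under these assumptions.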
In particular, learners with fewer than 75 learning records are deleted. Afterward, five rounds of recommendation experiments are performed on the remaining learners, and the average of all the results is taken as the final result. The baseline methods for experimental comparison are as follows (parameter settings are the same as PQN):
• IRT [9]: Item Response Theory is a classic method for cognitive diagnostic tasks. It builds a model of the learner's knowledge level based on logistic regression to judge whether the learner can do new exercises correctly. The experiment treats the problems that the learner is predicted to do wrong as the recommended problems.
• DKT [27]: The Deep Knowledge Tracing model predicts the grades that learners can achieve on problems by evaluating their knowledge level with a Recurrent Neural Network (RNN), and recommends problems according to the predicted grades. LSTM is selected as the basic model of DKT in this experiment.
• DQN: Deep Q-network is a classic network used for state-action estimation in deep reinforcement learning. The network uses a multilayer fully connected network to learn input features and give recommended action estimates.
The results are shown in Table 6. PQN assumes that past learning states affect the current learning state, and it achieves the best performance on the experimental data, indicating that it can learn the optimal interaction strategy between learners and programming problems. Compared with the deep learning method (DKT) and the reinforcement learning methods (DQN, PQN), the traditional method (IRT) performs significantly worse. A likely reason is that the traditional method treats each learning record as a static, isolated fragment and does not consider the dynamic relationship between consecutive learning records. Therefore, the PQN proposed in this paper can jointly consider the characteristics of learners' behavior and of the problems, track the learner's state during the learning process, and make accurate recommendations.

Discussion
This section briefly discusses the limitations of this work and directions for future work.

Limitations
Some problems in LUOGU OJ's programming problem database are imprecise (remarks on these problems indicate that their design is unreasonable). This study only performed a rough screening of the problems without investigating their quality. High-quality programming problems can improve the accuracy of the recommendation algorithm. Therefore, follow-up research can further improve the quality and standardization of the problem database. In addition, learners' learning styles may change dynamically while they solve problems. This work does not consider this dynamic characteristic when constructing the action space, which may adversely affect the recommendation performance of DRLP. Finally, the study uses online programming as a specific recommendation scenario. Although the method's feasibility is verified in this scenario, the proposed recommendation algorithm cannot cover all educational recommendation settings because data conditions differ across disciplines. Therefore, to make DRLP more general, the algorithm needs to be adapted to the characteristics of different disciplines (such as mathematics and English).

Future Direction
Obtaining the learning styles required for this work in a real scenario requires the cooperation of the OJ system, whether by administering a scale or by developing a model that dynamically predicts a learner's learning style from other learning behavior data. In the experimental part of this work, however, learners' learning styles could only be simulated, so it is necessary to develop an OJ recommendation system prototype that supports online training of DRLP. Learners could then give real-time feedback on the recommended problems so that the learning process of the model forms a true closed loop. In addition, how to divide the action space in a more personalized way according to learners' learning styles, cognitive styles, and other characteristics, and how to prove the effectiveness of such a division, is worthy of future research. Moreover, we intend to design a more rational reward function according to the various learning needs of learners. Finding a suitable automatic parameter tuning mechanism for the reward function is a focus of future work. For example, the weights of the three reward functions could be adaptively adjusted according to the learner's problem-solving state.

Conclusion
Current adaptive learning recommendation systems give little consideration to classic theories of pedagogy, such as the memory curve and learning style. They cannot intelligently combine the unique attributes of the learner with the implicit information contained in the exercise resources to make recommendations. To address this, this paper introduces a deep reinforcement learning framework. By designing a personalized action space based on the learner's learning style and designing the corresponding reward function according to the learner's various learning needs, the model can guide the learner to learn knowledge points in the 'zone of proximal development'. According to the characteristics of the learning mechanism in the OJ system, this paper proposes a programming problem recommendation algorithm (DRLP) based on deep reinforcement learning, which can effectively recommend programming problems to learners with different learning styles. Specifically, a multi-dimensional fusion programming problem representation model (MDPR) is proposed to vectorize the difficulty feature, knowledge point feature, text description, input description, and output description of programming problems. By introducing the Bi-GRU model to learn the contextual semantic information in the text in both the forward and backward directions, the model can better capture the implicit information in the text. In addition, this paper proposes specific designs for the action space, Q-network, and reward function of the deep reinforcement learning framework that better fit the programming problem recommendation scenario, as follows.
(1) The action space is defined according to different learning styles, making programming problem recommendations more personalized.
(2) The recurrent sequence-based PQN is defined by treating the learner's learning state as a continuous process, so that it can track the dynamic characteristics of the interaction between the learner and the system.
Finally, the experiments use the real dataset obtained from the LUOGU OJ system. The detailed parameters of each model are given in Sect. 6.3. First, the rationality of the model hypotheses is verified by exploring the multi-dimensional characteristics of learning records in conversation from five perspectives, which in particular enhances the interpretability of the different reward functions in the reward strategy. Then, the experiment on BGAM yields a vector representation that better captures the textual context of the programming problem. Next, the comparison experiment shows that the recommendation effect of PQN improves over the best baseline model (HR, MRR, and Novelty increase by 4.35%, 1.15%, and 1.1%, respectively), indicating that PQN can track the dynamic characteristics of learners in the programming process. This paper demonstrates the rationality and feasibility of DRLP integrating learning styles through theoretical discussion and simulation experiments.