KPRLN: deep knowledge preference-aware reinforcement learning network for recommendation

User preference information plays an important role in knowledge graph-based recommender systems, which is reflected in users having different preferences for each entity–relation pair in the knowledge graph. Existing approaches have not modeled this fine-grained user preference feature well, which limits the performance of recommender systems. In this paper, we propose a deep knowledge preference-aware reinforcement learning network (KPRLN) for recommendation, which builds paths between a user's historical interaction items in the knowledge graph, learns the preference features of each user–entity–relation, and generates a weighted knowledge graph with fine-grained preference features. First, we propose a hierarchical propagation path construction method to address the problems of pendant entities and long path exploration in the knowledge graph. The method expands outward to form clusters centered on items and uses them to represent the starting and target states in reinforcement learning. As the clusters iterate, we can better learn pendant entity preferences and explore farther paths. In addition, we design an attention graph convolutional network, which focuses on more influential entity–relation pairs, to aggregate higher order user and item representations that contain fine-grained preference features. Finally, extensive experiments on two real-world datasets demonstrate that our method outperforms state-of-the-art baselines.


Introduction
With the rapid development of the Internet, recommendation systems are dedicated to enhancing the user experience in various online applications. Collaborative filtering (CF) analyzes the user's historical behavior to make predictions based on user-user and user-item similarities [1]. However, sparsity and cold start problems plague CF-based methods. Graph-structured data (e.g., social networks) that contain the connections between entities can more accurately reflect real-world circumstances. The Knowledge Graph (KG) is a heterogeneous graph in which entities are linked by various relations, making the KG rich in semantic information [2,3]. Therefore, the KG is usually used as external information to improve the performance of recommender systems. As the example in Fig. 1 shows, movie items are linked to other entities by different relations. In addition, the KG can enhance the explainability of recommendation [4].
It is essential to obtain the user's interest preferences in the knowledge graph. As shown on the left of Fig. 1, a movie item links to other entities by different relations (e.g., stars, directors, writers, etc.). However, the reasons why users choose to watch movies are complex. The reason why $u_1$ chooses this movie is that the actress in the film is Zhang Ziyi. For $u_2$, however, the director of the film, Ang Lee, is the reason why he likes this movie. This shows that different users have different behavioral motivations. On the right of Fig. 1, the user has tagged two movies (Forrest Gump and Crouching Tiger, Hidden Dragon). The reasons why this user likes the two movies are different, which shows that the same user has different preferences when choosing different items. Therefore, we cannot simply calculate the user preference for the type of relationship but must learn the preference features based on the user-entity-relation. Usually, the user preference can be reflected by the weight of the edge. Therefore, the personalized weighted knowledge graph of users can improve the performance of the recommender system.

Fig. 1 An example of movie recommendation. In the left image, $u_1$ and $u_2$ like the same movie (Crouching Tiger, Hidden Dragon), but for different reasons: $u_1$ because of the movie's director, and $u_2$ because of the movie's actor. In the right image, the same user has different reasons for liking different movies: Forrest Gump because it stars Tom Hanks, and Crouching Tiger, Hidden Dragon because of its director, Ang Lee.
Many works have shown that edge weights play an important role in the feature learning of graphs. GAT [5] computes the weights of edges based on the similarity of head and tail vertices through graph attention. KGAT [6] introduces this method into recommender systems. KGCN [7] and KGNN-LS [8] use a trainable and personalized relation scoring function to learn the weight for the relation between entities and have achieved good results. However, KGAT cannot distinguish the preferences of different users for relations, and KGCN, which uses the score function to calculate the weight, cannot distinguish the user's preferences for the same relation on different entities. Personalized preference information influences the performance of recommendations [9,10]. Currently, most KG-based methods cannot perform fine-grained feature learning for each user-entity-relation. Users' preferences are fine-grained, and directly calculating the preference for each user-entity-relation not only increases the training burden of the model but also leads to overfitting, which degrades performance.
In recent years, the successful application of deep reinforcement learning (RL) to graph structures has sparked great interest. Therefore, some research works combine reinforcement learning and knowledge graphs in recommender systems. However, learning fine-grained user preference features in the knowledge graph by reinforcement learning faces the following challenges: (1) Data sparsity. The user interaction data are very sparse compared to the knowledge graph, and the distribution of user history interaction items in the knowledge graph may be scattered or clustered. It is difficult to design a unified method that explores all user interaction items. However, we need to model each user, so it is necessary to learn personalized information about users quickly and efficiently. (2) Pendant entities in the knowledge graph. A pendant entity is an entity with just one adjacent neighbor in the KG, which is not employed as the starting or target state in RL. It is not included in any path between items, so only negative feedback is available for it during reinforcement learning training. However, pendant entities may contain useful information about user preferences. (3) Long path exploration. Reinforcement learning tends to repeatedly explore shorter paths, because they lead to higher feedback rewards. Although we can design higher feedback rewards for long paths, this increases the training burden of the model. In addition, we believe that closer items reflect more user preference features. Therefore, it is necessary to design the exploration policy in RL carefully.
Considering the limitations of the existing methods, we propose a personalized Knowledge Preference-aware recommender system combined with a Reinforcement Learning Network (KPRLN). Specifically, we describe learning the user's preference for different entity-relation pairs as the process of building interaction paths in the graph. Constructing the path network between the user's historical interaction items in the KG can be expressed as a Markov decision process [11]. Unlike previous work, to learn the user's preference features on the KG more completely, we replace the single node with a cluster of entities as the starting and target states. We iteratively extend user history interaction items along the links in the KG to their neighbors. The starting and target states are represented as item-centric clusters, which reduces the complexity of the state space and action space. The cluster absorbs pendant entities when expanding, and after a few iterations, it absorbs the short paths between items, making reinforcement learning explore more distant paths. Furthermore, we design hierarchical propagation paths to assign rewards, so that each path construction gives the model more feedback. Meanwhile, we design two different reward mechanisms to give KPRLN stable performance in recommendation tasks. Based on the expected payoff estimates for item-relation pairs, RL computes the preference for each user-entity-relation and generates the user's preference-weighted knowledge graph. In user and item representation learning, we design an attention mechanism to propagate the user's preference interests in their preference-weighted KGs, making KPRLN focus on influential entities and relations when aggregating item embedding representations. Extensive experiments on two real datasets show that our method performs well.
Our contributions are summarized as follows:
1. We propose a personalized recommendation method, KPRLN, which uses only the topological information to learn the fine-grained features of each user-entity-relation.
2. We propose a hierarchical propagation path construction method, which can explore more complex states and increases the efficiency of the model.
3. Extensive experiments on two real-world datasets demonstrate that KPRLN outperforms state-of-the-art baselines. Furthermore, it also performs well in reducing the effect of noise and providing explainability for the recommendation.

Knowledge-aware recommendation method
The knowledge graph assists recommender systems through multi-dimensional dense associations and rich semantic information between entities, providing a new perspective for enhancing recommender systems. Currently, KG-based recommender systems are mainly divided into three categories: embedding-based methods, path-based methods, and unified methods. Embedding-based methods (e.g., TransE [12], TransH [13], MuPR [14], DihEdral [15], etc.) learn the embeddings of entities and relations in the KG and keep the original structural information of the KG as much as possible.
For example, [16,17] unify the structural knowledge and other side information in a unified CF framework. These methods focus on learning semantic associations between knowledge graph entities but ignore the connectivity patterns of information in the KG and the high-order relationships between entities. Therefore, they lack interpretability in the recommendation process. Path-based methods (e.g., [4,18,19]) regard the KG as a heterogeneous information network, constructing and extracting latent features based on meta-paths/meta-graphs between users and items. The performance of these methods depends on manually designed meta-paths and meta-graphs, which makes it difficult to achieve optimal performance in practice and leads to information loss when dealing with KGs with complex relationships. Unified methods combine the ideas of embedding-based and path-based methods. RippleNet [20] and KGCN [7] enrich the representation of users or items by aggregating the target entity and its multi-hop neighbors. These methods mine the information in the knowledge graph more comprehensively and perform well in recommender systems. However, RippleNet and KGCN do not learn the user preferences for each user-entity-relation. Existing methods thus do not model the fine-grained preferences of users in the KG well. We combine deep reinforcement learning with the knowledge graph to learn users' fine-grained preference features and apply them to recommendation.

Reinforcement learning for recommendation
Reinforcement learning is a machine learning methodology in which an agent, through interaction with its environment, learns to maximize a numerical reward. The Markov Decision Process (MDP), proposed by Bellman, is the most common way of formalizing reinforcement learning. After that, Q-learning further extended the application of reinforcement learning [21]. DQN [22] introduces deep learning into reinforcement learning, which significantly broadens its application scenarios. A series of applications of deep reinforcement learning methods (e.g., AlphaGo [23], autonomous driving [24], etc.) have demonstrated its powerful potential.
In recommender systems, user interactions are sequential [25]. The recommendation process can be regarded as a sequential decision process [26], which is formulated as an MDP. In reinforcement learning, the agent interacts with the recommender system, which allows it to learn dynamic features effectively and improves recommendation explainability. With the development of deep learning, deep reinforcement learning in recommender systems has aroused great interest in recent years. DRN [27] proposed a news recommender system based on the deep reinforcement learning framework, and [28] made non-sequential ranking recommendations by deep reinforcement learning. Meanwhile, some recommendation methods based on other deep learning models have been successfully applied [29].

Reinforcement learning on knowledge graph
Knowledge graphs are often used as external information to improve the performance of recommender systems, and reinforcement learning increases the explainability of the recommendation because the agent interacts with the recommender system. Many works have explored the application of reinforcement learning in the knowledge graph. GRL [30] designed a generative adversarial net (GAN)-based reinforcement learning model for knowledge graph completion. DeepPath [31] applies reinforcement learning to knowledge graph reasoning. Specifically, DeepPath finds reliable multi-hop paths between entity pairs in the KG. [32] explored the application of RL to question-answering tasks in the KG environment, [33] was devoted to solving the problems of reinforcement learning-based path-finding methods in question-answering applications, PGPR [34] uses RL to find explainable paths between users and potential items, and [35] proposed a multimodal knowledge-aware reinforcement learning network dedicated to achieving interpretable causal reasoning procedures.
The above methods use reinforcement learning to find user-item or item-item paths in the knowledge graph. However, the challenges of KG-based reinforcement learning, such as the efficiency of model training, complete exploration of the knowledge graph, and long path exploration, have not been well solved. Therefore, in this paper, we design a novel path construction method to address these issues.

Proposed method
To address the above problems and challenges, we propose a knowledge preference-aware reinforcement learning network named KPRLN, which extracts fine-grained user interest preference features in the knowledge graph. The overall framework is shown in Fig. 2. The KPRLN model is divided into the preference-weighted knowledge graph generation layer and the recommendation prediction layer. In the user preference-weighted knowledge graph generation layer, we construct the path network of user historical interaction items in the knowledge graph based on deep reinforcement learning. The deep reinforcement learning model explores the knowledge graph by cluster expansion and designs feedback rewards based on hierarchical propagation paths. Meanwhile, the deep reinforcement learning agent globally updates the weights of edges in the knowledge graph by the expected returns for each link. In the recommendation prediction layer, we design an attention mechanism to propagate the higher order interest of users in the knowledge graph and aggregate user and item representations for prediction. The notations and descriptions used in this paper are shown in Table 1.

Framework
We first introduce our general idea and the overall structure of our model. The knowledge graph is defined as $G = (E, R)$, where $E$ represents the entity (node) set in the KG, and $R$ represents the relation (edge) set in the KG. The KG is represented in the form of triples as $G = \{(e, r, e') \mid e, e' \in E, r \in R\}$, where each triple states that entity $e$ and entity $e'$ are connected by the relation $r$. In our deep reinforcement learning model, we learn the user's higher order interest preferences through the user's historical interaction items. The user's historical interaction item set is represented by $E_u$, and each $e_u \in E_u$ reflects the user's preference features and relates to the others. However, these items are sparsely distributed in the knowledge graph. Therefore, we build a network of paths among them to learn their connections. Initially, the RL agent randomly selects an $e_u$ as the starting state $s_{t=0}$ to construct a path. We select the neighbors of the last entity in the path as the action range. When another user interaction item is added to the path, the agent receives a positive reward and starts a new walking process. Otherwise, it receives a negative reward and continues to walk. Each path $p$ is recorded together with its rewards in $S$, the path sequence-reward set. We do not add duplicate nodes to $p$, so $p$ never contains a cycle or a loop.
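The walking process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `walk_path`, the adjacency-dict graph format, the reward values `pos` and `neg`, and the step limit are all hypothetical, and a uniform random choice stands in for the learned policy.

```python
import random

def walk_path(graph, user_items, start, max_steps=10, pos=1.0, neg=-0.1, rng=None):
    """Walk from `start` until another user interaction item joins the path.

    graph: dict mapping an entity to the list of its neighbors (assumed format).
    A random policy replaces the Q-network here, purely for illustration.
    """
    rng = rng or random.Random(0)
    path, rewards = [start], []
    for _ in range(max_steps):
        # Action range: neighbors of the last entity, minus nodes already in the path
        candidates = [n for n in graph.get(path[-1], []) if n not in path]
        if not candidates:
            break
        nxt = rng.choice(candidates)
        path.append(nxt)
        if nxt in user_items:
            rewards.append(pos)  # reached another interaction item: positive reward
            break
        rewards.append(neg)      # otherwise: negative reward, keep walking
    return path, rewards
```

Because visited nodes are removed from the action range, the returned path never revisits an entity, which matches the no-cycle condition on $p$.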
Then, we extend the representation of user history interaction items to their neighbors, so that the starting and target states each represent a node cluster. Following the walking process described above, the RL model constructs cluster-to-cluster paths. When a cluster-to-cluster path is found, we backpropagate to the relevant nodes in the starting cluster and link them to all the nodes in the target cluster. The rewards are returned based on the hierarchical propagation paths. We set the number of extensions based on the size of the knowledge graph and the number of historical interaction items of the users. When the RL model is sufficiently trained, we globally generate the weighted graph $G_u$ based on the local paths.
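As a rough illustration of the cluster expansion, the sketch below grows an item-centered cluster hop by hop with a breadth-first search. The function name `expand_cluster` and the adjacency-dict graph format are assumptions made for this example, not part of the original method description.

```python
def expand_cluster(graph, item, hops):
    """Return {entity: hop distance} for all entities within `hops` links of `item`."""
    cluster = {item: 0}
    frontier = [item]
    for h in range(1, hops + 1):
        next_frontier = []
        for node in frontier:
            for nb in graph.get(node, []):
                if nb not in cluster:  # first time this entity is reached
                    cluster[nb] = h
                    next_frontier.append(nb)
        frontier = next_frontier
    return cluster
```

With `hops=1`, a pendant entity attached to the item is already absorbed into the cluster, so it can take part in the starting or target state.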
Finally, we propagate user preferences on $G_u$ and aggregate the item embedding representation and user embedding representation by GNNs for CTR prediction and Top-K recommendation.

Reinforcement learning guides weighted graph generation
To illustrate the detailed design of our deep reinforcement learning model, we first introduce the detailed design of state, action, and reward.
State: The state consists of the topology information of all entities in the current path, and $s_t$ represents a general description of the current path sequence $p$ at step $t$. We use Node2vec [36] to obtain the embedding representations of entities in the knowledge graph as the inputs of the state representations in deep reinforcement learning. The embedding representation of entity $e_i$ is $f_i$. For $p = (e_1, e_2, \ldots, e_t)$, $s_t$ is represented as follows:

$$s_t = [f_1; f_2; \ldots; f_t],$$

where $[;]$ represents vector concatenation.
We utilize pooling to simplify the state input and enhance the efficiency of the reinforcement learning model. In path $p$, the last node determines the action range. Therefore, we pool the path except for the last node. The state $s_t = [f_1; f_2; \ldots; f_t]$ is pooled as

$$s_t = [\text{max-pooling}\{f_1, f_2, \ldots, f_{t-1}\}; f_t],$$

where $\text{max-pooling}\{\cdot\}$ is the pooled representation of $p$ after removing the last node, which is concatenated with the embedding of the last node.
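The pooled state can be illustrated as follows, assuming element-wise max-pooling over the embeddings of all nodes except the last one. The function name `pooled_state` is hypothetical, and the zero vector used for a single-node path is an assumption, since the text does not say how that case is handled.

```python
def pooled_state(embeddings):
    """Build the pooled state from the path embeddings f_1..f_t.

    embeddings: list of equal-length vectors (lists of floats), one per path node.
    """
    *prefix, last = embeddings
    if prefix:
        # Element-wise max over f_1..f_{t-1}
        pooled = [max(column) for column in zip(*prefix)]
    else:
        # Assumption: a path with a single node pools to a zero vector
        pooled = [0.0] * len(last)
    # Concatenate the pooled prefix with the embedding of the last node
    return pooled + list(last)
```

The state length is then fixed at twice the embedding dimension regardless of the path length, which is what keeps the Q-network input size constant.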

Action:
The action is the next node to join path $p$. We define $a_t$ as the action at time $t$, where $a_t$ is the embedding representation of the entity that is added to $p$. The action set (the neighbors of the last node) excludes the nodes already present in $p$ to make sure that $p$ is a simple path. The RL agent selects an action based on its expected reward $Q(s_t, a_t)$ and updates the state $s_t$ to $s_{t+1}$. $Q(s_t, a_t)$ is the return reward value predicted by the Q-network for action $a_t$. We will introduce the design of the Q-network later.
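A minimal sketch of this action selection is shown below. It assumes a purely greedy choice over the predicted Q-values; the function name `select_action` and the `q_value(path, node)` callable are hypothetical stand-ins for the Q-network, and any exploration strategy used during training is left out.

```python
def select_action(path, graph, q_value):
    """Pick the unvisited neighbor of the last path node with the highest Q-value.

    q_value: callable (path, candidate) -> float, a stand-in for the Q-network.
    Returns None when every neighbor is already in the path.
    """
    candidates = [n for n in graph.get(path[-1], []) if n not in path]
    if not candidates:
        return None
    return max(candidates, key=lambda n: q_value(path, n))
```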

Reward:
The reward provides the feedback that guides the training of the deep reinforcement learning model. In our model, it consists of two parts: an immediate reward for model training and a delayed feedback reward for balancing the immediate reward. We define the immediate reward as $r_i$, which is obtained by constructing the path network. The delayed feedback reward is defined as $r_{feedback}$, which is determined by the current weighted knowledge graph.
1. Immediate rewards: Our model aims to build the path network among the user's historical interaction items. Therefore, when another historical interaction item of the user is added to $p$, the model returns a positive reward; otherwise, it returns a negative reward. The immediate reward is parameterized by a constant $d$ and a balancing hyperparameter $\zeta$. KPRLN extends user history interaction items along the links in the knowledge graph to their neighbors. When it finds a cluster-to-cluster path, the model backpropagates to the relevant nodes in the starting cluster, finds all potential paths within the starting cluster based on the extended hops, and links these potential paths with the cluster-to-cluster path. In the target cluster, these paths spread outward around the user interaction item at the center. The hierarchical propagation path reward depends on $h$, the number of hops: in KPRLN, the reward is halved for each additional hop compared to the original immediate reward.
2. Delayed feedback reward: We want the weighted knowledge graph to work well in the recommender system. Therefore, we design a delayed feedback reward based on the recommendation task. We divide the whole training process into multiple epochs. In each epoch, we sample users' historical interactions and make predictions on the current weighted knowledge graph. According to the prediction performance, $r_{feedback}$ is defined in terms of $scores(\cdot)$, the estimate of the user weighted graph model, which is calculated based on the recommendation task performance; $Z(\cdot)$, a normalization function; and $\beta$, a balance hyperparameter.
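The halving rule for the hierarchical propagation path reward can be written as a one-line function. This is a sketch under an assumption: hop 0 (the path through the item at the cluster center) is taken to receive the full immediate reward, which the text does not state explicitly. The function name is hypothetical.

```python
def hierarchical_reward(immediate_reward, hops):
    """Halve the immediate reward once for each additional hop.

    Assumption: hops = 0 corresponds to the original immediate reward.
    """
    return immediate_reward / (2 ** hops)
```

For example, with an immediate reward of 1.0, nodes one and two hops away from the center would receive 0.5 and 0.25, so closer entities are rewarded more strongly.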
The design of the Q-network is shown in Fig. 3. The current path state $s$ and the next action $a$ are input to the Q-network. After two ReLU layers, the output $Q(s, a)$ represents the expected value of the action $a$ in the state $s$:

$$Q(s, a) = f_\theta(s, a),$$

where $f_\theta(\cdot)$ is the deep neural network shown in Fig. 3. Experience replay enables the Q-network to update its parameters with recent experience stored in the replay memory, thus stabilizing the training process. However, it may lead to overestimation and local optima, as paths with large q-values are found repeatedly. Therefore, we use DDQN [37] as our RL framework. Our model first finds the action corresponding to the maximum q-value in the original network. Then, it calculates the target q-value of that action in the target network. Decoupling the choice of the target action from the calculation of the target Q-value eliminates the problem of overestimation:

$$y_t = r_t + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta); \theta'\big),$$

where $r_t$ is the reward received at step $t$, $\gamma$ is the discount factor, $\theta$ is the parameter of the original network, and $\theta'$ is the parameter of the target network. Backpropagation updates the parameters of the Q-network with the mean squared loss function

$$L(\theta) = \frac{1}{|D|} \sum_{(s_t, a_t) \in D} \big(y_t - Q(s_t, a_t; \theta)\big)^2,$$

where $|D|$ represents the number of samples collected in the experience replay pool.
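The decoupling used by DDQN can be made concrete with a small sketch. It follows the standard double Q-learning target: the original (online) network selects the action and the target network evaluates it. The function names, the callables `q_online` and `q_target`, and the default `gamma` are illustrative assumptions rather than the authors' implementation.

```python
def ddqn_target(reward, next_actions, q_online, q_target, gamma=0.9, done=False):
    """Double DQN target for one transition.

    q_online / q_target: callables mapping a next action to its Q-value in the
    next state, standing in for the original and target networks.
    """
    if done or not next_actions:
        return reward
    best = max(next_actions, key=q_online)  # action chosen by the original network
    return reward + gamma * q_target(best)  # value taken from the target network

def mse_loss(targets, predictions):
    """Mean squared error between target and predicted Q-values."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)
```

Note that the target uses the target network's value of the selected action even when the target network itself would rank another action higher, which is what counteracts overestimation.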

Preference knowledge-aware recommendation
In the deep reinforcement learning layer, the preference-weighted KG $G_u$ is generated for each user based on their historical interactions. We propagate users' interests on $G_u$ to get high-order preference representations of items and users.

Item representation
First, we propagate user preferences along the relations in $G_u$. To learn more semantic information in the knowledge graph while taking the size of the knowledge graph into account, we design an attention graph convolutional network based on [38].
As shown in Fig. 4, we sample the neighbors of the item sequentially according to the values of the edge weights in $G_u$ and aggregate the multi-hop neighbors of items on this basis. Then, we aggregate the item representation with the attention graph convolutional network. In this aggregation, $n$ represents the number of samples, and $l$ is the number of layers in the graph convolution, which represents the number of propagation hops. $N(i)$ represents the neighbors of node $i$, $c_{ji}$ is the product of the square roots of the node degrees (i.e., $c_{ji} = \sqrt{|N(j)|}\sqrt{|N(i)|}$), $\sigma$ is an activation function, and $e_{ji}$ is the scalar weight from node $j$ to node $i$.

Fig. 3 The structure of the Q-network in our framework. The state s is the embedding of the vertices of the path sequence, and the action a is the neighbor of the last vertex. After pooling, they are used as the input to the Q-network

Fig. 4 Item feature aggregation in user weight graph
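To make the roles of $e_{ji}$ and $c_{ji}$ concrete, the sketch below implements one propagation step of a standard edge-weighted graph convolution, in which each neighbor embedding is scaled by $e_{ji} / c_{ji}$ before summation. It is an assumption that the aggregation takes this form; the learnable weight matrix and bias are omitted, ReLU is used as the activation, and the function names and data layout are hypothetical.

```python
import math

def top_n_neighbors(node, neighbors, edge_weight, n):
    """Keep the n neighbors of `node` with the largest edge weights."""
    ranked = sorted(neighbors[node], key=lambda j: edge_weight[(j, node)], reverse=True)
    return ranked[:n]

def aggregate_layer(h, neighbors, edge_weight, degree):
    """One propagation step: h_i <- ReLU(sum_j (e_ji / c_ji) * h_j).

    h: {node: embedding list}; neighbors: {node: list of sampled neighbors};
    edge_weight: {(j, i): scalar weight}; degree: {node: node degree}.
    """
    out = {}
    for i, nbrs in neighbors.items():
        acc = [0.0] * len(h[i])
        for j in nbrs:
            c_ji = math.sqrt(degree[j]) * math.sqrt(degree[i])
            coef = edge_weight[(j, i)] / c_ji
            acc = [a + coef * x for a, x in zip(acc, h[j])]
        out[i] = [max(0.0, a) for a in acc]  # ReLU activation
    return out
```

Stacking this step $l$ times would propagate information over $l$ hops, with higher-weighted edges contributing more to the item representation.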

User representation
Considering that the item embedding already contains the user's preference features, we associate the users with their historical interaction items. Specifically, we build the user-item bipartite graph and aggregate the features of the user's interaction items to get the user embedding representation. The user embedding is obtained by applying $f_{agg}(\cdot)$, a function for aggregating the user embedding representation, to $i_u$, the interaction item embeddings of user $u$, which are aggregated in $G_u$.

Learning algorithm
We predict the interaction probability between the user and the item based on the user embedding $u$ and the item embedding $v$. In our recommender system, we iterate over all possible user-item pairs by a negative sampling strategy. The loss combines three terms: $J(\cdot)$, the cross-entropy loss; the item aggregation loss; and the L2-regularization term $\|\cdot\|_2^2$, weighted by the balance hyperparameter $\lambda$.
The process of KPRLN is described in Algorithm 1. It mainly consists of two parts: (1) generating the user preference-weighted knowledge graph; (2) aggregating users' higher order interest preferences under the GNN framework.
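A simplified version of the prediction and loss can be sketched as follows. Two assumptions are made here: the prediction function is taken to be a sigmoid over the inner product of the two embeddings, which the text does not specify, and the item aggregation loss term is omitted because its form is not given. Function names and the default `lam` are hypothetical.

```python
import math

def predict(u, v):
    """Interaction probability from user embedding u and item embedding v.

    Assumption: sigmoid of the inner product.
    """
    score = sum(a * b for a, b in zip(u, v))
    return 1.0 / (1.0 + math.exp(-score))

def loss(labels, probs, params, lam=1e-4):
    """Mean cross-entropy plus an L2 penalty on `params` weighted by `lam`.

    The item aggregation loss from the text is left out of this sketch.
    """
    eps = 1e-12  # guards against log(0)
    ce = -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for y, p in zip(labels, probs)
    ) / len(labels)
    l2 = sum(w * w for w in params)
    return ce + lam * l2
```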

Experiments
In this section, we show the performance of KPRLN. We evaluate our model on two real-world scenarios, Movielens-1M and Last.FM, and compare it with state-of-the-art methods. First, we introduce the experimental setup, including datasets and baselines. Second, we compare KPRLN with other baselines and model variants under the same scenario. Then, we discuss the impact of hyperparameters on model performance. Finally, we show a case on the movie dataset, demonstrating that KPRLN can provide reasonable explanations for users' preferences on recommendations.

Datasets
We use datasets based on real scenarios as follows:
1. Movielens-1M is a widely used movie dataset. It is smaller than Movielens-20M and contains about 1 million ratings.
2. Last.FM is a widely used music dataset that contains information from over 2000 users of the Last.FM online music system.
Since these datasets provide explicit feedback, we convert them to implicit feedback by setting a rating threshold: all entries larger than the threshold are marked as 1, indicating that the user is satisfied, and for each user a matching set of unsatisfactory items is sampled and marked as 0. We also removed users who have no positive implicit feedback.
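The conversion to implicit feedback could look like the sketch below. It assumes that the negatives for each user are drawn at random from items the user has not interacted with, in the same number as the positives; the text only says that a matching set is sampled, so this is one plausible reading. The function name, data layout, and seed are hypothetical.

```python
import random

def to_implicit(ratings, all_items, threshold, seed=0):
    """Convert explicit ratings to 0/1 labels.

    ratings: {user: {item: rating}}. Returns {user: {item: label}}.
    Assumption: negatives are sampled from unrated items, one per positive.
    """
    rng = random.Random(seed)
    out = {}
    for user, rated in ratings.items():
        positives = [i for i, r in rated.items() if r > threshold]
        if not positives:
            continue  # drop users without positive implicit feedback
        unseen = sorted(set(all_items) - set(rated))
        negatives = rng.sample(unseen, min(len(positives), len(unseen)))
        labels = {i: 1 for i in positives}
        labels.update({i: 0 for i in negatives})
        out[user] = labels
    return out
```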
Movielens-1M includes 6036 users and 753,772 interactions, and its knowledge graph contains 2347 items, 6729 entities, and 20,195 triples. Last.FM includes 1872 users and 42,346 interactions, and its knowledge graph contains 3846 items, 9366 entities, and 15,518 triples. The basic statistics of the two datasets are shown in Table 2.
The knowledge graph of Last.FM is published by [7], and the knowledge graph of Movielens-1M is published by [39].

Baselines
We use the following state-of-the-art baselines for comparison with KPRLN.
1. LibFM [40] is a feature-based factorization model for CTR scenarios. We concatenate the user ID and item ID as input for LibFM.
2. PER [18] captures connections between users and items by extracting meta-path-based features in heterogeneous networks. We use the properties of items as features to build the meta-paths between the user and the item.
3. CKE [16] is an embedding-based method that combines collaborative filtering (CF) with structural information, textual information, and visual information in a unified recommendation framework. In this paper, CF is used in conjunction with the structural knowledge module to implement CKE.
4. RippleNet [20] propagates along the links in the knowledge graph in a manner similar to ripples spreading on water, expanding users' potential interests through multiple links so that users' interests are reflected more comprehensively in the recommender system.
5. KGCN [7] is an end-to-end framework that effectively captures inter-item correlations by mining relevant attributes on the knowledge graph. It calculates scores between users and relations and uses the links of the item to propagate the user's potential interest on the knowledge graph.
6. HAGERec [41] emphasizes the importance of characterizing the semantic information of relations. It explores users' potential preferences from the high-order connectivity structure of the heterogeneous knowledge graph, combining graph convolutional networks for explainable recommendation.

Experiments setup
The hyperparameter statistics of our experiments are shown in Table 3. The hyperparameters are as follows: $d$ represents the embedding dimension, $H$ represents the number of item propagation hops, $N$ represents the number of aggregation domains, $\lambda$ represents the L2-regularization weight, and $\eta$ represents the learning rate.
The ratio of the training, evaluation, and test sets for each dataset is 8:1:1. Each experiment was repeated three times, and the average performance was reported. We evaluate model performance using the following two experimental scenarios: (1) CTR prediction. We use the model to predict click probabilities for items in the test set. We use ACC (Accuracy), AUC (Area Under Curve), and F1 to evaluate the performance of CTR prediction. (2) Top-K recommendation. We select the K items with the highest predicted click probability for users in the test set and then use Precision@K and Recall@K to evaluate the recommended set. We use the Adam algorithm to optimize all training parameters. The code for KPRLN is implemented under Python 3.7, TensorFlow 1.14.0, and NumPy 1.21.5.
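For reference, the two Top-K metrics can be computed for a single user as in the sketch below, which uses the usual definitions of Precision@K and Recall@K. The function name and the dictionary-based input format are assumptions for illustration.

```python
def precision_recall_at_k(scores, relevant, k):
    """Precision@K and Recall@K for one user.

    scores: {item: predicted click probability}; relevant: set of test-set
    items the user actually interacted with.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for item in ranked if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

The reported values would then be averages of these per-user results over all users in the test set.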

Performance comparisons with baselines
We present the results of CTR prediction and top-K recommendation of KPRLN and other baselines in Table 4 and Figs. 5 and 6, respectively, and draw the following conclusions:
1. In general, KPRLN has the best performance in the recommendation scenarios of the two datasets. As shown in Table 4, in Movielens-1M, the average improvement in AUC, ACC, and F1 is 7.5%, 6.55%, and 6.37%, respectively. In Last.FM, the average improvement in AUC, ACC, and F1 is 6.8%, 6.22%, and 5.77%. Furthermore, KPRLN also performs well in Precision@K and Recall@K, as shown in Figs. 5 and 6, demonstrating the efficacy of KPRLN in learning users' high-order interest preferences.
2. PER does not perform well. The meta-paths we designed are difficult to make optimal in movie and music recommendation scenarios, because designing meta-paths requires a lot of expertise. This makes it difficult for PER to achieve optimal results. Compared with other baselines, CKE performs relatively poorly, which may be because the original CKE model also learns image features and text features, while our implementation uses only the knowledge structure features.
3. RippleNet and KGCN are unified methods that integrate the semantic representation of entities and relations and the connectivity information based on GNN. However, neither of them is designed to learn the user's fine-grained preference interest for each user-item-relation triple. Therefore, they do not perform as well as KPRLN.
4. HAGERec performs the best among all baselines. It uses the attention mechanism to filter aggregated neighbors and designs an interaction signals unit to make GCN characterize more passed information from the network.

Ablation study
We conduct ablation experiments on KPRLN to analyze the effect of different components. To demonstrate how the generated weighted knowledge graph improves the performance of the recommendation system, we compare the weighted knowledge graph generated by KPRLN with the unweighted knowledge graph (average aggregation of neighbors), and the result is shown in Table 5. Furthermore, we verify the impact of the hierarchical propagation paths in deep reinforcement learning training on the model's performance.
The results are shown in Figs. 7 and 8, and the following conclusions are drawn:
1. As shown in Table 5, KPRLN performs better than the average aggregation method, which shows that user preference information can improve the performance of recommender systems and that KPRLN can effectively learn user preferences.
2. As shown in Fig. 7, we train on each dataset for 10,000 iterations to ensure model convergence and use the comprehensive indicators of AUC, ACC, and F1 to determine the performance of the recommender system. We find that the performance of KPRLN increases with the number of training iterations, and the model is stable in the late training period.
Fig. 8 The impact of attention mechanism on the model

Fig. 9 The results of noise experiment

the knowledge graph based on the attention mechanism. The hyperparameter N represents the number of neighbors we sampled. Therefore, we need to discuss the impact of N on KPRLN.
• The results are shown in Fig. 10. KPRLN shows the best performance at N = 4 on MovieLens-1M and achieves the best result on Last.FM when N = 8. If N is too small, the sampled neighborhood does not contain enough neighbor information; if N is too large, the model performance is susceptible to noise. Note that some items may have fewer than N neighbors, in which case we select all neighboring entities.
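The neighbor selection rule described above, including the fallback for items with fewer than N neighbors, can be sketched as follows. The function name and entity IDs are hypothetical, and choosing the N highest-weighted neighbors deterministically is an assumption; the paper only states that more influential neighbors are sampled based on an attention mechanism.

```python
from typing import List, Tuple


def sample_neighbors(weighted_neighbors: List[Tuple[int, float]], n: int) -> List[int]:
    """Select at most n neighbor entity IDs, preferring higher attention weights.

    weighted_neighbors holds (entity_id, attention_weight) pairs.
    """
    if len(weighted_neighbors) <= n:
        # Fewer than N neighbors available: keep all of them, as stated in the text.
        return [entity for entity, _ in weighted_neighbors]
    # Assumption: keep the top-n neighbors by weight instead of sampling stochastically.
    ranked = sorted(weighted_neighbors, key=lambda pair: pair[1], reverse=True)
    return [entity for entity, _ in ranked[:n]]


# Hypothetical neighbors of one item with their attention weights.
neighbors = [(101, 0.1), (102, 0.7), (103, 0.4), (104, 0.9), (105, 0.2)]
top4 = sample_neighbors(neighbors, 4)
few = sample_neighbors(neighbors[:2], 4)
print(top4, few)
```

A stochastic variant could instead draw neighbors with probability proportional to their weights; the fallback for small neighborhoods would stay the same.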
(2) The performance on different H. The number of neighbor propagation hops H is also critical, because it determines the range over which entity information propagates. Therefore, it is important to choose an appropriate number of propagation hops.
• The results are shown in Fig. 11. On MovieLens-1M, KPRLN achieves the best performance when H = 2, whereas on Last.FM the best performance is obtained when H = 1. The number of entities aggregated to the item increases exponentially with H, which makes the model more sensitive to H than to N. On MovieLens-1M, longer relation chains provide more information, while Last.FM is relatively sparse, so an overly large H brings more noise to the model. In addition, the performance of KPRLN is more stable on MovieLens-1M than on Last.FM.
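The exponential growth with H noted above can be made concrete with a small calculation. The sketch assumes that every entity contributes at most N sampled neighbors at each hop, so the figures are upper bounds rather than measured counts; the function name is hypothetical.

```python
def max_aggregated_entities(n: int, h: int) -> int:
    """Upper bound on entities reached within h hops when each entity
    keeps at most n sampled neighbors: n + n^2 + ... + n^h."""
    return sum(n ** hop for hop in range(1, h + 1))


# Best-performing settings reported in the text, plus one extra hop for comparison.
for n, h in [(4, 2), (4, 3), (8, 1), (8, 2)]:
    print(f"N={n}, H={h}: at most {max_aggregated_entities(n, h)} entities")
```

Under this assumption, adding one hop multiplies the outermost layer by N, whereas raising N by one only widens each layer, which is consistent with H being the more sensitive hyperparameter.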

Case study
We select a real example from MovieLens-1M to intuitively demonstrate the effectiveness of KPRLN. We randomly select a user-item pair from the test dataset, and the item (e_694) is treated as the target item to be recommended to the user. Then, KPRLN generates the user's preference-weighted knowledge graph based on the user's interaction items. The movies used for model training are Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981), Star Wars: Episode I-The Phantom Menace (1999), and American Werewolf in London, An (1981). The movie used for prediction is Star Wars: Episode V-The Empire Strikes Back (1980). As shown in Fig. 12, the weights of the edges in the graph represent the user's preferences, and the edges that lie on paths between the interaction items receive higher weights. Therefore, the model can learn more useful information when aggregating the representation of e_694. The edge weight between entity e_5280 and the predicted item e_694 is relatively low, because e_5280 is not associated with the user's historical interaction items.

Conclusions and future work
This paper proposes a knowledge graph recommender system based on deep reinforcement learning (KPRLN). In the deep reinforcement learning model, we design hierarchical propagation paths to establish associations between users' historical interaction items and to learn the features of users' preferences for the entities and relations of the KG. At the same time, coordinated by different reward mechanisms, a preference-weighted KG is generated for each user. Then, more influential neighbors are sampled based on an attention mechanism to propagate users' preferences on the KG, and their information is aggregated to obtain embedding representations of items and users. Rather than learning users' preferences for relations at a macro level, our method learns each user's preferences for specific entity-relation-entity combinations. It demonstrates excellent performance on widely used real-world datasets, achieving significant improvements over several state-of-the-art baselines.
In future work, we intend to evaluate the effectiveness of our model on more real-world data.

Fig. 5 The results of Precision@K in top-K recommendation

Table 1
Notations and their descriptions used in this paper

Table 2
Detailed statistics of the three datasets