Path-guided intelligent switching over knowledge graphs with deep reinforcement learning for recommendation

Online recommendation systems process large amounts of information to make personalized recommendations. There has been some progress in research on incorporating knowledge graphs into reinforcement learning for recommendation; however, several challenges remain. First, in these approaches, an agent cannot switch paths intelligently and therefore cannot cope with the multiple entities and relations in knowledge graphs. Second, these methods have no predefined targets and thus cannot discover items closely related to user-interacted items or latent rich semantic relationships. Third, contemporary methods do not consider long rational paths in knowledge graphs. To address these problems, we propose a deep knowledge reinforcement learning (DKRL) framework, in which path-guided intelligent switching is implemented over knowledge graphs incorporating reinforcement learning; the model integrates predefined targets and long logic paths over knowledge graphs for recommendation systems. Specifically, the designed novel path-guided intelligent switching algorithm with a predefined target enables an agent to switch paths intelligently among multiple entities and relations over knowledge graphs. In addition, the weight of each path is calculated, and the agent switches paths between entities according to these weights. Furthermore, long logic paths yield better recommendation performance and interpretability. Extensive experiments on real-world data demonstrate that our work improves upon existing methods: DKRL improved upon the baselines on NDCG@10 by 3.7%, 9.3%, and 4.7%; on HR@10 by 12.39%, 20.8%, and 13.86%; on Prec@10 by 5.17%, 3.57%, and 6.2%; and on Recall@10 by 3.01%, 4.2%, and 3.37%. The DKRL model achieved more effective recommendation performance on several large benchmark datasets than other advanced methods.


Introduction
With the explosive growth of online information, users face numerous choices; the extensive online content and services can overwhelm them. Personalized online service recommendation can guide users in discovering services or products better suited to their personal interests. Researchers have proposed several approaches to optimize online personalized recommendation, such as collaborative filtering (CF) [1,2], matrix factorization (MF) [3], and MF-based models [4]. Recently, new state-of-the-art methods, such as deep learning models [5,6], knowledge graphs, reinforcement learning [7,8], and reinforcement learning incorporating knowledge graphs, have become popular owing to their ability to model complex user-item interactions and provide recommendations.
However, these newer methods do not effectively address three challenges. First, although existing recommendation methods combine knowledge graphs with deep reinforcement learning, they have no predefined targets and thus can discover neither the items most similar to user-interacted items nor the latent rich semantic relationships among entities. Knowledge graphs are heterogeneous graphs with rich contextual semantic relations, and their entities are connected by attributes. Therefore, for any item in a knowledge graph, there are multiple paths to a predefined target. In particular, the recommended items should be consistent with the predefined targets, so the recommendation is expected to lead to a predefined target. For example, Fig. 1 shows a user who watched a movie titled The Wandering Earth; its attributes are director Fan Guo, actors Jin Wu, Xiaoran Chu, and Jingmai Zhao, and genre fiction. After a while, the user decides to watch Avatar; thus, from The Wandering Earth to Avatar, there is a path (dashed red line) linking relevant movies from the starting node to the target. Based on their multiple attributes, there are strong connections among these items, and so, moving from the starting node to the predefined target, the items most similar to user-interacted items can be explored, and rich semantic relations between entities can be discovered. Many methods [9,10] do not consider predefined targets, and so rich relationships between entities cannot be discovered; moreover, because of the uncertainty of the target item, some paths may not exist.
Second, in existing recommendation methods, intelligent path-switching between entities is challenging, especially when there are multiple entities and relations over knowledge graphs. Contemporary studies on integrating reinforcement learning and knowledge graphs include reasoning paths [10], the knowledge-guided reinforcement learning for sequential recommendation proposed by Wang [11], and the knowledge graph-enhanced reinforcement learning proposed by Zhou [12]. However, these methods cannot deal with the large action spaces in reinforcement learning, and so they use truncation strategies [10,13] to decrease the number of entities, which may lose important attributes of a knowledge graph. Furthermore, their agents do not switch paths intelligently among entities, which makes it impossible to explore all the paths in large-scale knowledge graphs.
Third, contemporary methods do not consider the length of multi-hop paths, and there are no long rational paths for analyzing knowledge graphs. Many studies on recommendation with knowledge graphs consider only short paths, such as lengths of 2 or 3 hops, without considering the recommendation performance of long paths. For example, Li [14,15] and Xia [16] considered path lengths of 2 and 3 hops, respectively. Few studies have considered recommendation based on long logical paths. Therefore, analyzing the impact of long logical paths on recommendation performance in knowledge graphs is a crucial research topic.
In view of these three challenges, and inspired by the widely successful application of knowledge graphs and reinforcement learning, we propose a framework called deep knowledge reinforcement learning (DKRL), which combines deep reinforcement learning and knowledge graphs with predefined targets and long rational paths to improve recommendation performance. Several existing approaches [10-12, 17] cannot address the multiple entities and relations in real-world knowledge graphs. In our model, we design a novel long path-guided intelligent switching (PGIS) algorithm with a predefined target, in which weights are assigned to paths; this enables an agent to switch intelligently among multiple entities and relations without truncating the attributes of knowledge graphs, thus preserving their important properties. Our approach has three advantages. First, our model considers a predefined target with a long path of multiple hops from the start node to the target node. This enables the discovery of nodes most similar to user-interacted items and provides users with a variety of recommended items. Second, the weight of each path is calculated, and path-switching is carried out according to these weights, realizing intelligent switching over the multiple entities and relations in a knowledge graph. Third, we consider long rational paths over knowledge graphs, which improve recommendation performance and enable better interpretability than short paths. Furthermore, we use reinforcement learning [18] to better model the dynamic nature of items and personal user preferences. In addition, rather than considering only the user's feedback rating [6,19], we consider multiple user-item interactions as feedback.
The contributions of this study are highlighted as follows:
• We design a novel PGIS algorithm and calculate the weights of paths.
• Our agents can deal with multiple entities and relations and switch paths intelligently.
• Our system considers long rational paths, which ensures better recommendation performance and interpretability.
• Our method has a definite terminal goal that enables the agent to learn the optimal path.
The remainder of this paper is organized as follows: the next section reviews related work, the third section presents the proposed methods, the fourth section reports the experiments, and the last section concludes the paper.

Reinforcement learning for recommendation systems
Reinforcement learning has attracted substantial attention and achieved successful application in many scenarios. A series of models using the Markov decision process (MDP) have been proposed for recommendation tasks. MDP-based methods model the recommendation procedure as a sequential interaction between users and items [22]. In reality, practical recommendation systems record millions of discrete actions, which makes reinforcement learning-based models inefficient because they cannot scale to such large datasets. Other methods have achieved better results [7,8,23,24], but these approaches are hard to apply in practice without incorporating a knowledge base. Reinforcement learning over knowledge graphs has also been explored for tasks such as question answering (QA) [25-27] and explainable recommendation [10]. Our method differs from these methods [10-12] in its proposal of intelligent path-switching among multiple entities and relations, which enables agents to explore paths effectively over knowledge graphs. Briefly, given a predefined target item, we want the agent to be able to explore the path that connects user-interacted items with the target to ensure better performance. In this paper, we consider a predefined target to be a possible item that a user wants to visit in the future (e.g., a movie that the user wants to watch or a book that the user wants to read) and denote it as s_T. A recommendation starts from an initial state, usually randomly picked from the user-interacted items. At each step, when the agent takes an action, it has access to the history path comprising a series of states, s_1:t = s_1, ..., s_t. After receiving feedback from the external environment (the knowledge graph), the agent moves from the current state s_t to the next state s_{t+1}. It aims to (1) switch paths intelligently among the many action spaces and states in reinforcement learning and (2) follow multi-hop long rational paths connecting user-interacted items with the predefined target. Based on multi-hop paths, PGIS aggregates the entity correlations between a user and target items by explicitly considering the user's different interests through the rich contextual semantics of a knowledge graph. Furthermore, long logical paths give users a more reasonable explanation of recommended items and improve recommendation performance. In addition, other more complex and meaningful measures can be considered for specific practical applications. The task of path-based recommendation over a knowledge graph can then be formulated as follows.

Applying knowledge graphs in recommendation
Task description: Given the user-item interactions and the knowledge graph G, our task of long-path recommendation with a predefined target over the knowledge graph is to learn a function that can predict how likely a user is to adopt an item.

Methodology
In this section, we present our proposed DKRL model in detail. We first introduce the overall framework of DKRL and then discuss the processes of embedding the knowledge graph, obtaining optimal policies through deep reinforcement learning, and performing intelligent path-switching according to path weights over the knowledge graph.

Framework
The DKRL framework is illustrated in Fig. 2. The knowledge graph constitutes the external environment and interacts with the agent. In particular, according to the user-interacted items, relevant terms are extracted to construct a knowledge graph [37], and this knowledge graph becomes the external environment. Reinforcement learning regards path exploration as a trial-and-error process; it can capture dynamic changes over items and maximize long-term cumulative rewards. In this process, the agent chooses an action from the action space and uses the PGIS algorithm to explore paths from the starting node to the predefined target over the knowledge graph. The agent switches paths intelligently according to the path weights. Finally, the long rational path with the highest weight is selected, and the ranked paths of these diverse items are recommended to users.
The notations and descriptions used in this paper are listed in Table 7.

Knowledge graph embedding
We construct knowledge sub-graphs from Douban, a popular web service in China, by extracting relevant entities and relations from the open Chinese repository CN-DBpedia [37], which is similar to DBpedia. In general, an item represents an entity, and an item's neighbors are its attributes, such as directors, actors, genres, and countries; these attributes represent the relationships between entities. A knowledge graph is generally recognized as a heterogeneous information network involving diverse nodes and relations between entities [31,32]. Multiple attributes shared between entities are typically represented as multiple paths connecting the entities in the knowledge graph. Figure 3 illustrates an example of multiple attributes shared between entities. The graph provides multiple paths containing rich semantic cues that are highly useful in representing the relationships between entities. For example, the movies Red Sorghum and To Live are the most similar because they share the same director, actor, genre, and country attributes, whereas To Live does not resemble Farewell My Concubine because they share only one attribute (actor). Note that we allow reverse edges in the knowledge graph.
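The structure described above can be sketched with toy triples; the titles and attribute values below are illustrative stand-ins, not the actual CN-DBpedia data:

```python
from collections import defaultdict

# Toy triples in the style described above: (head entity, relation, tail entity).
triples = [
    ("Red Sorghum", "directed_by", "Yimou Zhang"),
    ("To Live", "directed_by", "Yimou Zhang"),
    ("Red Sorghum", "starring", "You Ge"),
    ("To Live", "starring", "You Ge"),
    ("Farewell My Concubine", "starring", "You Ge"),
    ("Red Sorghum", "genre", "drama"),
    ("To Live", "genre", "drama"),
]

def build_graph(triples):
    """Adjacency map: entity -> set of (relation, neighbor), with reverse edges."""
    graph = defaultdict(set)
    for h, r, t in triples:
        graph[h].add((r, t))
        graph[t].add((r + "_inv", h))  # reverse edge, as allowed in the paper
    return graph

def shared_attributes(graph, e1, e2):
    """Attribute nodes reachable from both entities; more shared paths
    indicate greater similarity between the two entities."""
    n1 = {t for _, t in graph[e1]}
    n2 = {t for _, t in graph[e2]}
    return n1 & n2

g = build_graph(triples)
print(len(shared_attributes(g, "Red Sorghum", "To Live")))            # 3 shared attributes
print(len(shared_attributes(g, "To Live", "Farewell My Concubine")))  # 1 shared attribute
```

The two counts mirror the example in the text: the more attributes two items share, the more connecting paths exist between them in the graph.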
Within the knowledge graph, two similar entities can be connected through multiple paths traversing the multiple attributes shared between them; a larger number of paths between entities indicates greater similarity. Although the knowledge graph effectively represents structured data, the underlying symbolic nature of triplets makes it difficult for deep neural networks to manipulate entities and relations. To tackle this issue, it is necessary to represent the knowledge graph with an embedding, the goal of which is to preserve the proximity between an entity and its neighbors in the original knowledge graph. Given the extracted knowledge graph, we use Metapath2Vec [38], a state-of-the-art algorithm for heterogeneous information network embedding, as the knowledge graph embedding algorithm, in which each node and its neighbors are represented in a continuous low-dimensional vector space. For two entities that are correlated structurally and semantically, this algorithm embeds them close together in the low-dimensional vector space. The multiple attributes of an entity are usually closely related to it structurally and semantically; to help learn a latent embedding for each entity, the extraction of additional contextual information can complement the identifiability of the entity. In addition to an entity's own embedding, we also include its attributes, such as directors, actors, genres, and countries. The context embedding of an entity e is calculated as the average of the embeddings of its multiple attributes, ē = (1/|N(e)|) Σ_{i∈N(e)} e_i, where N(e) is the set of neighbors of entity e in the knowledge graph and e_i, i ∈ N(e), is the embedding of neighbor e_i.
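The averaging of attribute embeddings can be sketched as follows; the 4-dimensional vectors are hypothetical placeholders for what Metapath2Vec would actually produce:

```python
import numpy as np

def context_embedding(entity, neighbors, embed):
    """Context embedding of an entity: the average of the embeddings of its
    attribute neighbors N(e), complementing the entity's own embedding."""
    vecs = [embed[n] for n in neighbors[entity]]
    return np.mean(vecs, axis=0)

# Hypothetical low-dimensional attribute embeddings for illustration.
embed = {
    "Yimou Zhang": np.array([1.0, 0.0, 0.0, 0.0]),
    "You Ge":      np.array([0.0, 1.0, 0.0, 0.0]),
    "drama":       np.array([0.0, 0.0, 1.0, 1.0]),
}
neighbors = {"To Live": ["Yimou Zhang", "You Ge", "drama"]}
ctx = context_embedding("To Live", neighbors, embed)
print(ctx)  # each coordinate is the mean over the three attribute vectors
```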

Deep reinforcement learning
We model the recommendation task as a Markov decision process (MDP), defined as a tuple (S, A, R, P), where S denotes the states, A the actions, R the reward function, and P the transition function. A state s ∈ S is defined as a tuple (u, e_t), where u ∈ U is the starting user and U is the set of users. The current state at step t is denoted s_t, the initial state s_0, and the terminal state s_T.
We define the set of actions A as all the relations in the knowledge graph; an action is denoted a ∈ A.
We also define the set of rewards R. For each current state s, there is a reward value r. The cumulative future reward is obtained by multiplying each reward after the current state by the corresponding discount factor γ, giving the final return R = Σ_{k≥0} γ^k r_{t+k+1}. For any user, there is a known target s_T at which the terminal reward is received. Reinforcement learning learns good policies from sequential actions by optimizing the cumulative future reward. Considering the aforementioned dynamic nature of item recommendation and the need to maximize future reward, we apply DQN [39] to produce the recommendation list for a user. DQN is a multi-layered neural network mapping a state space with n dimensions to an action space with m actions. The DQN algorithm uses a target network with parameters θ′_t, so the target can be modeled as Eq. (3):

y_t = r_{t+1} + γ max_a Q(s_{t+1}, a; θ′_t),    (3)

where the state s is represented by entity and user features, the action a is represented by the relations of entities, r_{t+1} denotes the reward for the current state, max_a Q(s_{t+1}, a; θ′_t) is the maximum value of the agent's future reward, and γ is a discount factor that adjusts the relative importance of current and future rewards.
To avoid overestimated values and overoptimistic value estimates in DQN, DDQN [18] decomposes the max operation between a target network and an evaluation network. We use the DDQN target to obtain the cumulative reward of taking action a, as given in Eq. (4):

y_t = r_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ′_t),    (4)

where r_{t+1} represents the current reward for taking action a.
Here, θ_t and θ′_t are two different sets of DDQN parameters. In this formulation, a given action a is selected from the set of actions, and the agent reaches the next state s_{t+1} by interacting with the knowledge graph. Based on the parameters θ_t, we obtain the action with the maximum future reward. An optimal policy is easily obtained from the optimal values by selecting the action with the maximum Q value in each state as the agent interacts with the knowledge graph. The future reward is calculated with the parameters θ′_t of the evaluation network, whose weights are updated from the weights of the target network θ_t after a few iterations. DDQN has been proven to prevent the overoptimistic value estimates of DQN. We feed the knowledge graph embedding into the two networks. When the agent takes an action a at a given state s, the knowledge graph feeds back the reward r and the next state s_{t+1} to the agent. Algorithm 1 presents the combination of the knowledge graph and reinforcement learning with the path-guided intelligent switching algorithm.
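A minimal sketch of the double-Q target computation described above, assuming toy Q-values for the next state (the neural networks themselves are omitted, and variable names are our own):

```python
import numpy as np

def ddqn_target(r_next, q_select_next, q_eval_next, gamma=0.95, done=False):
    """Double-Q target: one network's Q-values (q_select_next) pick the argmax
    action, and the other network's Q-values (q_eval_next) evaluate it,
    avoiding the overoptimistic max of vanilla DQN."""
    if done:
        return r_next
    a_star = int(np.argmax(q_select_next))       # action selection
    return r_next + gamma * q_eval_next[a_star]  # action evaluation

# Toy Q-values over 3 actions (relations) at the next state s_{t+1}.
q_select_next = np.array([0.2, 0.9, 0.4])  # selection network output
q_eval_next = np.array([0.5, 0.3, 0.8])    # evaluation network output
y = ddqn_target(1.0, q_select_next, q_eval_next)
print(y)  # 1.0 + 0.95 * 0.3 = 1.285
```

Note how the evaluated value (0.3) is smaller than the naive max over the evaluation network (0.8), which is exactly the overestimation the decomposition is meant to curb.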

Intelligent switching
When fusing knowledge graphs and reinforcement learning, there are large numbers of entities and relations, and there are few studies on switching paths over knowledge graphs. Previous studies [10,13] have not considered the large action spaces in reinforcement learning and thus had to use truncation strategies to decrease the number of entities; however, such methods may lose important attribute properties of knowledge graphs. To address this gap, in this section we design a new path-guided intelligent switching algorithm over many entities and relations from a starting node to the target item. In real-world applications, one node has a large number of neighbors, and there are multiple edges between nodes. Based on this, we introduce the concept of the multi-hop path.
Definition (Multi-hop path) A multi-hop path consists of n entities connected by k relations, {e_0, r_0, e_1, r_1, ..., r_k, e_n}, where the intermediate entities {e_1, e_2, ..., e_{n−1}} are those selected along the path with the maximum Q value.
Each time the user requests an item, given the initial state (entity) s_0, the agent explores the knowledge graph and selects an action (relation) r from R. When traveling to the next state, it checks whether that state is the target state; if so, it returns the reward R and the next state s_{t+1} and terminates the task. If the exploration continues, it passes through large numbers of entities and relations, with the agent switching paths intelligently from the current state s_t to the next state s_{t+1}. In the knowledge graph, each entity has a large number of neighbors, so from the current state s_t to the next state s_{t+1}, the agent has to choose from the action space, and a multi-hop path is formed until the target is reached.
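The episode loop described above can be sketched as follows; random action choice stands in for the learned policy, the tiny chain graph is hypothetical, and the binary terminal reward is an illustrative assumption:

```python
import random

def explore(graph, s0, target, max_hops=10, seed=0):
    """Sketch of one episode: starting from s0, repeatedly pick a relation
    edge, move to the next entity, and stop with a terminal reward when the
    predefined target s_T is reached."""
    random.seed(seed)
    state, path = s0, [s0]
    for _ in range(max_hops):
        if state == target:
            return path, 1.0           # terminal reward at s_T (assumed binary)
        actions = sorted(graph.get(state, set()))
        if not actions:
            break                      # dead end: no outgoing relations
        _, state = random.choice(actions)  # stand-in for the learned policy
        path.append(state)
    return path, 1.0 if state == target else 0.0

# Tiny chain graph: s0 -(r1)-> h1 -(r2)-> sT.
graph = {"s0": {("r1", "h1")}, "h1": {("r2", "sT")}}
path, reward = explore(graph, "s0", "sT")
print(path, reward)  # ['s0', 'h1', 'sT'] 1.0
```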

The path-switching based on weight
Although an agent can intelligently switch paths between multiple entities and relations, this switching alone cannot truly reflect the closeness between the recommended items and the user-interacted items. Because each entity has many neighbors, we must identify the items closest to the user-interacted items among the many neighbors on multi-hop paths. Therefore, we calculate path weights to identify the item closest to the user-interacted item. On the right in Fig. 2, the recommendation problem is summarized in the form of multi-hop paths between the user and the predefined target with path weights. In our model, u_info represents user information, for example, click counts or watch time; e_info represents item information, for example, for a movie, its name, actors, directors, and release time; and r_e represents the context information of entity e. There is a link between the starting state and the predefined target (indicated by the dashed red line), which indicates the final relationship between a typical user interest and the target. We can further form multi-hop paths from the starting state to the target. The first hop represents interest in a user-item pair, and the remaining hops represent semantic relationships between user-interacted items and the predefined target. Different items have different attractiveness to users, so users pay different amounts of attention to items. Users' interests are used to assign weights to items, for example, through attributes, ratings, and counts; eventually, the agent selects the node whose path has the highest weight as the next state. The agent's recommendation step can be formulated as

s_{t+1} = argmax_{e ∈ N(e_t)} w(e_t, e).    (5)

According to Eq. (5), the basic workflow of PGIS can be formulated as

score(s_0, s_T) = AGG(f_path(hop_1), ..., f_path(hop_k)),    (6)

where f_path is a function determining the weight of one hop, path_T represents the total weight of the multi-hop path, and AGG is the scoring function that obtains the final score between the initial state and the predefined target by summing the weights of the multi-hop path. The weight of the first (user-item) hop is

f_path(hop_1) = n_{ue_h} / Σ_{e_h ∈ N(e)} n_{ue_h},    (7)

where n_{ue_h} is the count of user u's interactions with item e_h and Σ_{e_h ∈ N(e)} n_{ue_h} is the user's total number of interactions. The total weight path_T over all hops is written as Eq. (8):

path_T = Σ_{h=1}^{k} f_path(hop_h).    (8)

According to the context information in the knowledge graph, the weight of each subsequent path is calculated as

w(e_t, e_{t+1}) = exp(e_t · e_{t+1}) / Σ_{e′ ∈ N(e_t)} exp(e_t · e′),    (9)

where N(e_t) is the context (neighborhood) of entity e_t. Algorithm 2 presents the intelligent path-switching algorithm based on path weights, in which h_1, h_2, h_3, h_4 comprise the set of states corresponding to each action. Each intermediate state on the path is selected using the PGIS algorithm, which helps refine the recommended items for the user. Therefore, according to the rules of PGIS, the agent automatically selects the node with the highest weight as the next state; that is, intelligent path-switching based on path weights is realized, as shown in Fig. 4, where s_0 is the initial state, s_n is the target state, {h_1, h_2, h_3, h_4} are entities, and r_n represents the weight of each path.
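A minimal sketch of the weight-based path switching described above, assuming hypothetical entity embeddings; the softmax over embedding similarities with the neighbors stands in for the paper's context-based path-weight calculation:

```python
import numpy as np

def path_weights(e_t, candidates, embed):
    """Weight of each candidate one-hop path out of the current entity e_t:
    a softmax over embedding similarities with the context neighbors."""
    sims = np.array([embed[e_t] @ embed[n] for n in candidates])
    exp = np.exp(sims - sims.max())  # numerically stable softmax
    return exp / exp.sum()

def switch(e_t, candidates, embed):
    """PGIS step: the agent switches to the neighbor whose path has the
    highest weight."""
    w = path_weights(e_t, candidates, embed)
    return candidates[int(np.argmax(w))], w

# Hypothetical 2-d embeddings: h2 is most similar to the current state s0.
embed = {
    "s0": np.array([1.0, 0.0]),
    "h1": np.array([0.2, 0.9]),
    "h2": np.array([0.9, 0.1]),
    "h3": np.array([-0.5, 0.5]),
}
nxt, w = switch("s0", ["h1", "h2", "h3"], embed)
print(nxt)  # h2
```

Because the weights are normalized over the neighborhood, they sum to one, and the argmax selection realizes the highest-weight switching rule.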

Experiments
In this section, we present our experiments and the corresponding results, including dataset analysis, baseline methods, a comparison of different models, an ablation study, the influence of different action sizes, and the influence of the k-hop path. We also provide a case study of agents switching paths over the entities and relations of the knowledge graph.

Dataset description
All compared models are evaluated on the following three realistic datasets.
KKBOX: This dataset is from the popular music service KKBOX and includes historical records of many users listening to music. The musical attributes used in our experiments include genre, artist, language, and composer.

Movie: This dataset is published at https://doi.org/10.7910/DVN/WCXPQA. The movie attributes include rating, actors, directors, and genre.

Book: This dataset is published at https://doi.org/10.7910/DVN/WCXPQA. The book attributes used in our experiments include title, author, publication, and publisher.
Each of the three datasets is mapped to CN-DBpedia to build the corresponding sub-knowledge graph [37]. Similar to DBpedia, CN-DBpedia can be accessed through the API of the Knowledge Works website.

Evaluation protocols
We used four popular evaluation metrics to evaluate the recommendation performance of the tested models:
• NDCG: The most frequently used list evaluation measure, which takes into account the positions of correctly recommended items. NDCG is averaged across all testing users.
• HR: Hit ratio, the percentage of users who have at least one correctly recommended item in their list.
• Precision: The percentage of correctly recommended items in a user's recommendation list, averaged across all testing users.
• Recall: The percentage of purchased items that are actually recommended in the list, averaged across all testing users.
We provide a top-N recommendation list for each user in the testing set, where N = 10 is taken to report the numbers and compare the different algorithms.
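The four metrics at N = 10 can be sketched per user as follows; the ranked list and relevant set are toy data, and the averaging across users is omitted:

```python
import numpy as np

def metrics_at_n(ranked, relevant, n=10):
    """Per-user NDCG@N, HR@N, Prec@N, and Recall@N from a ranked
    recommendation list and the set of ground-truth items."""
    top = ranked[:n]
    hits = [1.0 if item in relevant else 0.0 for item in top]
    dcg = sum(h / np.log2(i + 2) for i, h in enumerate(hits))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), n)))
    ndcg = dcg / ideal if ideal > 0 else 0.0
    hr = 1.0 if any(hits) else 0.0       # at least one hit in the list
    prec = sum(hits) / n                 # fraction of the list that is correct
    rec = sum(hits) / len(relevant) if relevant else 0.0
    return ndcg, hr, prec, rec

ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
relevant = {"b", "e", "z"}               # "z" was never recommended
ndcg, hr, prec, rec = metrics_at_n(ranked, relevant)
print(hr, prec, round(rec, 3))  # 1.0 0.2 0.667
```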

Baselines
We compared our approach with the following methods:
• User-based CF: This method predicts the rating a user will give an item based on the aggregated ratings of similar users.
We further note that the user-based CF method employs traditional collaborative filtering, LFM employs a factorization model, and NCF, DQN, PGPR, and PDN are all state-of-the-art deep learning models. The program code implementing our framework is published at https://github.com/shaohuatao/DKRN.

Parameter settings
We implemented DKRL and all baselines in TensorFlow and carefully tuned the key parameters. For a fair comparison, we fixed the embedding size to 50 for all models and the number of episode steps to 3000. We optimized our method with Adam and set the batch size to 64, the learning rate to 0.01, the memory size to 2000, the discount factor to 0.95, and the decay parameter to 0.995. For DeepCF and NCF, following the original papers, the model parameters were randomly initialized with a Gaussian distribution, and negative instances were uniformly sampled from unobserved interactions. For PGPR and PDN, the path depth was set to 3 on the two datasets according to the original papers. For the DQN method, we set the learning rate to 0.01, the memory size to 2000, the discount factor to 0.95, the decay parameter to 0.995, and the batch size to 64. For MetaPath, DeepWalk, and LFM, the loss coefficients were set according to the original papers on the two datasets. We reproduced the one-hot and user-based CF methods according to their original settings.
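For reference, the settings above can be collected in one configuration sketch; the key names are our own, and the interpretation of the decay parameter as an exploration (ε) decay is an assumption:

```python
# Hyperparameters for DKRL as reported in the text; key names are illustrative.
DKRL_CONFIG = {
    "embedding_size": 50,
    "episode_steps": 3000,
    "optimizer": "adam",
    "batch_size": 64,
    "learning_rate": 0.01,
    "memory_size": 2000,
    "discount_factor": 0.95,  # gamma in the DQN/DDQN targets
    "decay": 0.995,           # decay parameter (assumed to be epsilon decay)
}

assert 0.0 < DKRL_CONFIG["discount_factor"] < 1.0
print(DKRL_CONFIG["learning_rate"])  # 0.01
```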

Performance comparison
In this section, we present the results comparing the different baselines as well as the variants of our framework. Table 3 shows the results of the different models, from which we observe the following:
• User-based CF had the second-worst performance among all the methods because it is a traditional CF-based method with low efficiency.
• DeepCF and NCF had better results than the user-based CF and LFM methods, which suggests that deep models are effective in capturing non-linear relations and improving recommendation performance.
• One-hot encoding had the worst performance of all the methods, and the MetaPath embedding performed better than one-hot encoding, because one-hot encoded data are too sparse.
• DQN performed better than the embedding-based methods because reinforcement learning models complex dynamic user-item interactions.
• PGPR performed better than DQN because PGPR uses a knowledge graph in reinforcement learning.
• Finally, DKRL's advantage over user-based CF and LFM shows that knowledge graph embedding outperformed similar-user rating and factorization models. LFM and CF do not consider the context information between nodes, and information between nodes is not shared; in contrast, our model makes full use of the rich contextual semantic information of the knowledge graph, with which it is easy to find similarities between nodes, so recommendation performance is improved. DKRL's advantage over DeepWalk, MetaPath, and DQN shows that applying knowledge graphs in reinforcement learning improves recommendation accuracy. DeepWalk and MetaPath use node embeddings to make recommendations and randomly select the next node; without interaction with the external environment, the expected maximum reward of a node cannot be obtained. DQN considers only reinforcement learning for recommendation, without external knowledge as auxiliary information, so its recommendation performance is relatively low. Our model fully combines the knowledge graph with reinforcement learning: as an external feedback environment, the knowledge graph interacts with agents and provides information feedback, so agents can obtain the best recommendation performance through constant trial and error. DKRL outperforms PGPR because PGPR has no predefined target items, so some paths may not exist and its recommendations are poorer than DKRL's; with our model's predefined targets, each time the agent explores, it explores an appropriate path. PDN uses only two-hop paths, while DKRL uses multi-hop paths of up to 10 hops; therefore, DKRL performs better than PDN. This also shows that the recommendation performance of long paths (up to 10 hops) exceeds that of short paths.

Ablation study
Our DKRL model makes several important extensions to integrate knowledge graphs into reinforcement learning for recommendation. In this section, we conduct ablation experiments to analyze their impact.

Influence of action sizes
In this experiment, we evaluated how the recommendation performance of our model varied with different action sizes. The knowledge graphs contained a large number of attributes, which we pruned according to the users' attention to each attribute: attributes attracting more attention from users, such as a movie's director and actors, were more likely to be preserved. In other words, a larger action space contained more attributes of less interest to users. In general, a user's focus on item attributes ranged narrowly over 2 to 4 attributes, so we preserved at most four attended attributes in the experiment. In the work of Xian et al. [10], the action space ranged from 100 to 500 in the knowledge graphs, which is inappropriate and impractical from the user's perspective.
We experimented on the two datasets using the default settings and varied the action size from 2 to 10, as shown in Table 4, which also reports the baseline method PGPR for comparison. The results show that our model performed better than PGPR, which uses pruned action sizes: the NDCG@10 and Prec@10 of our method were consistently above those of PGPR for all action sizes between 2 and 4. The results further demonstrate that our model was effective compared with the other baselines. On both datasets, an action size of 10 led to the best performance. This suggests that entities do not have excessively many relations, making it easier to reach the target from the initial state, and so recommendation performance is better.

Influence of long rational paths
In this experiment, we studied how the path length, with a predefined target, influences the recommendation performance of our model.
According to Table 3, larger action spaces were more likely to yield better recommendation performance. This experiment examined whether the path length influences recommendation performance. We ran the experiments on the movie and book datasets using the parameter settings given previously; the results for the two datasets are plotted in Figs. 5 and 6.
We make several observations about these results. First, the knowledge graphs have many attributes, which results in closer connections between entities; when the path length is greater than or equal to 2 hops, recommendation performance improves. Second, long rational paths with a predefined target can discover the items most similar to user-interacted items and can fully uncover rich semantic contexts among entities. Third, according to our statistics in the experiment, path lengths of 2 to 10 hops accounted for 90% of the total, indicating that recommendation performance is best when the path length is between 2 and 10 hops. In an actual large-scale knowledge graph, such path lengths are easy to reach and traverse, and diverse items can accordingly be recommended to users. In studies [10, 11], the maximum path length was only 3 hops; these studies considered neither long rational paths nor predefined targets in knowledge graphs, and consequently could not reveal deep relationships between entities or provide diverse recommendations.
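The statistic above (paths of 2 to 10 hops dominating) can be reproduced in miniature with a bounded-depth search between a user and a predefined target item. The adjacency format and function name are our assumptions; this is a sketch, not the paper's traversal algorithm.

```python
from collections import deque

# Sketch: breadth-first enumeration of paths of at most `max_hops` hops
# from a start entity to a predefined target entity.
def paths_within(adj, start, target, max_hops=10):
    found = []
    queue = deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        if node == target and len(path) > 1:
            found.append(path)
            continue
        if len(path) - 1 >= max_hops:  # hop count = edges traversed
            continue
        for nxt in adj.get(node, []):
            if nxt not in path:        # avoid revisiting entities on one path
                queue.append((nxt, path + [nxt]))
    return found

adj = {"u": ["m1"], "m1": ["a1", "d1"], "a1": ["m2"], "d1": ["m2"]}
paths = paths_within(adj, "u", "m2", max_hops=3)
```

Here two distinct 3-hop paths reach the target through different attribute entities, which is the kind of multiplicity that makes longer paths yield more diverse recommendations.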

Influence of the path-guided intelligent switching
In this section, we analyze the influence of the PGIS algorithm on path length and recommendation performance by comparing results obtained with and without it.
According to Table 5, in the movie dataset, for action sizes of 2, 3, 4, and 10, the path lengths lie in the ranges (2-130), (2-79), (2-59), and (2-25), respectively. In the book dataset, for action sizes of 2, 3, 4, and 10, the path lengths lie in the ranges (2-140), (2-132), , and (2-30), respectively. Thus, the PGIS algorithm can effectively switch among multi-entities and multi-relations according to path weights and identify a path from the starting state to the target state. Furthermore, for an action size of 10, the NDCG and Prec. are higher than those for action sizes 2, 3, and 4. This shows that in real-world knowledge graphs with large action sizes, PGIS can switch paths efficiently according to path weight, improving recommendation performance and achieving accurate recommendations for users. Tables 5 and 6 demonstrate that recommendation performance is better with the PGIS algorithm than without it, and the longest path length with PGIS is shorter than the longest path length without PGIS; this demonstrates that search paths can be identified more efficiently with PGIS. Knowledge graphs contain diverse heterogeneous structured information, and the PGIS algorithm can identify relationships between connected and unconnected entities on a path, thereby improving the diversity and performance of recommendation.
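The switching step itself reduces to comparing candidate branches by weight. The sketch below assumes precomputed path weights (how PGIS computes them is not shown here) and hypothetical entity names; it only illustrates the selection rule.

```python
# Sketch: weight-based switching among candidate path branches, in the
# spirit of PGIS (weights are assumed to be given).
def switch_by_weight(candidates):
    """candidates: list of (path, weight); return the highest-weight path."""
    return max(candidates, key=lambda c: c[1])[0]

candidates = [
    (["m1", "actor", "m2"], 0.35),
    (["m1", "director", "m3"], 0.50),
    (["m1", "genre", "m4"], 0.15),
]
chosen = switch_by_weight(candidates)
```

Repeating this comparison at every hop prunes low-weight branches early, which is consistent with the shorter maximum path lengths reported with PGIS enabled.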

Case study
To demonstrate the efficacy of the knowledge graph and deep reinforcement learning more intuitively, we randomly sampled a user and selected a movie the user had watched, The Cradle of Life. Its attributes were: director Jan De Bont, genre Action/Adventure, and main actors Angelina Jolie and Gerard Butler. Here, the title was the state, with the three attributes of director, genre, and actors. We first studied Ê, R, and the path in this case. Each action corresponded to a set Ê of entities of the same type. About 20 movies in the movie set Ê share an actor with The Cradle of Life, of which we selected only some, such as {Kung Fu Panda, Girl, Interrupted, Kung Fu Panda 2, Kung Fu Panda 3}. Similarly, about 10 movies share the same director, and more than 20 movies share the adventure attribute. According to the attribute weight values, the path explored by the agent is shown in Fig. 7: the initial state is The Cradle of Life, the next state is Speed 2, then Maleficent, and the final state is The Great Wall. This case shows three advantages of our model. First, the agent switched paths intelligently according to path weights. Second, there is a long logical path from the starting node to the predefined item. Third, the selected actions {actor, director, genre, actor, director, actor, actor} show that our model implements the epsilon-greedy algorithm in reinforcement learning, balancing exploration and exploitation so that each action is as likely as possible to be selected. This case also shows that our model improves recommendation performance and diversity.
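The epsilon-greedy rule mentioned in this case can be sketched as follows; the action values are hypothetical illustrations, not values learned by DKRL.

```python
import random

# Sketch: epsilon-greedy action selection over attribute actions.
# With probability epsilon the agent explores a random action;
# otherwise it exploits the action with the highest estimated value.
def epsilon_greedy(values, epsilon, rng):
    """values: action -> estimated value; returns a chosen action."""
    if rng.random() < epsilon:
        return rng.choice(sorted(values))      # explore uniformly
    return max(values, key=values.get)         # exploit the best action

rng = random.Random(0)
values = {"actor": 0.9, "director": 0.7, "genre": 0.4}
greedy_choice = epsilon_greedy(values, epsilon=0.0, rng=rng)  # pure exploitation
```

With epsilon greater than zero, lower-valued actions such as genre are still occasionally chosen, which matches the mixed action sequence {actor, director, genre, ...} observed in the case study.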

Conclusion
In this study, we propose a method of path-based intelligent switching over knowledge graphs incorporating deep reinforcement learning for improved and diverse personalized recommendation. Compared with existing methods, our model integrates knowledge graphs with reinforcement learning and considers the impact of long logical paths on recommendation performance, thereby providing diverse recommendations to users. The designed novel PGIS algorithm can switch among multi-entities and multi-relations over knowledge graphs. Furthermore, we calculate path weights so that an agent can switch paths accordingly. In addition, we consider a predefined target: from the initial state to the predefined target, there are multi-hop paths with lengths of 10 or more. We demonstrate that long paths yield better recommendation performance and interpretability. Experiments also showed that our model improves accuracy and diversity. The proposed framework can be applied to many other recommendation platforms.
In the future, we will design models for intermediate nodes when an agent interacts with the knowledge graph. The findings of this study provide further insights into combining knowledge graphs and deep reinforcement learning.

Fig. 4 Illustration of the intelligent switching path based on weight

Fig. 5 Path length for different action sizes on the movie dataset
Fig. 6 Path length for different action sizes on the book dataset
Fig. 7 Real case of a long path: the agent interacts with the external environment knowledge graph, and the long path from the initial starting state to the target state is discovered

8. Zheng G-J, Zhang F-Z, Zheng Z-H et al (2018) DRN: a deep reinforcement learning framework for news recommendation. In: Proceedings of the 2018 world wide web conference, pp 126-137
9. Zou L-X, Xia L, Ding Z-Y, Song J-X et al (2019) Reinforcement learning to optimize long-term user engagement in recommender system. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining, pp 2810-2818
10. Xian Y-K, Fu Z-H, Muthukrishnan S, de Melo G et al (2019) Reinforcement knowledge graph reasoning for explainable recommendation. In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp 285-294
11. Wang P-F, Fan Y, Xia L, Zhao W-X et al (2020) KERL: a knowledge-guided reinforcement learning model for sequential recommendation. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, pp 209-218
12. Zhou S-J, Dai X-Y, Chen H-K, Zhang W-N et al (2020) Interactive recommender system via knowledge graph-enhanced reinforcement learning. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, pp 179-188
13. Wang X, Xu Y-K, He X-N, Cao Y-X et al (2020) Reinforced negative sampling over knowledge graph for recommendation. In: Proceedings of the web conference 2020, pp 99-109
14. Li H-Y, Chen Z-H, Li C-L, Xiao R et al (2021) Path-based deep network for candidate item matching in recommenders. arXiv:2105.0824
15. Wang X, Huang T-L, Wang D-X, Liu Z-G et al (2021) Learning intents behind interactions with knowledge graph for recommendation. In: Proceedings of the international world wide web conference committee 2021
16. Xia L-H, Huang C, Xu Y-X, Dai P et al (2020) Knowledge-enhanced hierarchical graph transformer network for multi-behavior recommendation. In: Proceedings of the 34th association for the advancement of artificial intelligence, pp 4486-4493
17. Wang X, Wang D-X, Xu C-R, He X-N et al (2018) Explainable reasoning over knowledge graphs for recommendation. arXiv:1811.04540
18. Hasselt H-V, Guez A, Silver D (2015) Deep reinforcement learning with double q-learning. arXiv:1509.06461
19. Deng Z-H, Huang L, Wang C-D, Lai J-H et al (2019) DeepCF: a unified framework of representation learning and matching function learning in recommender system. arXiv:1901.0470
20. Wang H, Wang N-Y, Yeung D (2015) Collaborative deep learning for recommender systems. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining, pp 1235-1244
21. Guo H-F, Tang R-M, Ye Y-M, Li Z-G et al (2017) DeepFM: a factorization machine based neural network for CTR prediction. In: IJCAI, pp 1725-1731
22. Lu Z-Q, Yang Q (2016) Partially observable Markov decision process for recommender systems. arXiv:1608.07793
23. Theocharous G, Thomas P-S, Ghavamzadeh M (2015) Personalized ad recommendation systems for life-time value optimization with guarantees. In: IJCAI, pp 1806-1812
24. Wang X-T, Chen Y-R, Yang J, Wu L et al (2018) A reinforcement learning framework for explainable recommendation. In: IEEE international conference on data mining, pp 587-596
25. Xiong W-H, Hoang T, Wang W-Y (2017) DeepPath: a reinforcement learning method for knowledge graph reasoning. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp 564-573
26. Das R, Dhuliawala S, Zaheer M, Vilnis L et al (2017) Go for a walk and arrive at the answer: reasoning over paths in knowledge bases with reinforcement learning. arXiv:1711.05851
27. Lin X-V, Socher R, Xiong C (2018) Multi-hop knowledge graph reasoning with reward shaping. arXiv:1808.10568
28. Yang B-S, Mitchell T (2019) Leveraging knowledge bases in LSTMs for improving machine reading. arXiv:1902.09091
29. Tao S-H, Qiu R-H, Ping Y, Ma H (2021) Multi-modal knowledge-aware reinforcement learning network for explainable recommendation. Knowl Based Syst 227:107217
30. Cao Y-X, Wang X, He X-N, Hu Z-K et al (2019) Unifying knowledge graph learning and recommendation: towards a better understanding of user preferences. In: Proceedings of the web conference 2019, pp 151-161
31. Yang D-Q, Guo Z-K, Wang Z-Y, Jiang J-Y et al (2018) A knowledge-enhanced deep recommendation framework incorporating GAN-based models. In: Proceedings of IEEE international conference on data mining, pp 1368-1373
32. Wang H-W, Zhang F-Z, Xie X, Guo M-Y (2018) DKN: deep knowledge-aware network for news recommendation. arXiv:1801.08284
33. Huang J, Zhao W-X, Dou H-J, Wen J-R et al (2018) Improving sequential recommendation with knowledge-enhanced memory networks. In: Proceedings of the 41st international ACM SIGIR conference on research and development in information retrieval, pp 505-514
34. Ai Q-Y, Azizi V, Chen X, Zhang Y-F (2018) Learning heterogeneous knowledge base embeddings for explainable recommendation. Algorithms 11:137-153
35. Zheng J-Y, Ma J-Y, Wen Y-L (2022) Explainable session-based recommendation with meta-path guided instances and self-attention mechanism. In: Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, pp 2555-2559
36. Huang R-R, Han C-Q, Cui L (2021) Entity-aware collaborative relation network with knowledge graph for recommendation. In: Proceedings of CIKM 2021, pp 3098-3102
37. Wang H-W, Zhang F-Z, Guo M-Y (2018) CN-DBpedia: a never-ending Chinese knowledge extraction system. In: Proceedings of the international conference on industrial, engineering and other applications of applied intelligent systems, pp 428-438
38. Dong Y-X, Chawla N-V, Swami A (2017) metapath2vec: scalable representation learning for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 135-144
39. Mnih V, Kavukcuoglu K, Silver D et al (2015) Human-level control through deep reinforcement learning. Nature 518:529-533
40. Rendle S (2012) Factorization machines with libFM. ACM Trans Intell Syst Technol 3:57
41. Perozzi B, Al-Rfou R, Skiena S (2014) DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD conference on knowledge discovery and data mining, pp 701-710
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A knowledge graph G comprises an entity set E and a relation set R, and is defined as G = {(e_h, r, e_t) | e_h, e_t ∈ E, r ∈ R}, where e_h, r, and e_t represent the head, relation, and tail of a triple, respectively. We define two types of edges within G. The first type is a reverse edge: if (e_h, r, e_t) ∈ G, then (e_t, r, e_h) ∈ G. The second type is a self-loop edge associated with the no-operation relation: if e_h ∈ E, then (e_h, r_noop, e_h) ∈ G. We integrate the knowledge graph with reinforcement learning: the entities and relations of the knowledge graph form the state and action spaces, respectively.
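This definition translates directly into code. A minimal sketch follows, where the set-of-triples representation and the relation label "noop" for r_noop are our assumptions:

```python
# Sketch: build G from triples, then add a reverse edge for every
# (e_h, r, e_t) and a no-operation self-loop for every entity,
# following the two edge types defined above.
def build_graph(triples):
    graph = set(triples)
    entities = set()
    for head, rel, tail in triples:
        entities.update((head, tail))
        graph.add((tail, rel, head))   # reverse edge, as in the definition
    for e in entities:
        graph.add((e, "noop", e))      # self-loop with the no-operation relation
    return graph

G = build_graph([("movieA", "directed_by", "dirX")])
```

Reverse edges let the agent walk back along any relation, and the noop self-loops let it stay in place, so every entity always has at least one available action.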
For example, Yang et al. [31] proposed a knowledge-enhanced deep recommendation framework incorporating GAN-based models. Wang et al. [32] employed a deep knowledge-aware network incorporating a knowledge graph representation for news recommendation. Huang et al. [33] adopted a memory network using knowledge graph embeddings. Other studies have explored the entity and path information in the knowledge graph to make reasoning decisions. For example, Ai et al. [34] incorporated the entity and relation embeddings of the knowledge graph to explain recommendations. Xian et al. [10]

Table 2 Statistics of experimental datasets

Table 4 Comparison of different action sizes

Table 5 Influence