1 Introduction

The information explosion brings with it the problem of information overload. Recommendation systems have therefore emerged to quickly find, among the massive data on the Internet, the information that matches users' interests. Recommendation systems are essentially information filtering systems: they capture users' latent interests and provide personalized recommendations, especially when users face large amounts of data without specific requirements.

Traditional recommendation algorithms include content-based methods, collaborative filtering (CF)-based methods, and hybrid methods. CF-based methods make use of historical interactions and have achieved notable success. In addition, soft computing techniques, such as Bayesian and fuzzy approaches, are combined with traditional recommendation algorithms to manage uncertainty, improving the performance of recommender systems [1]. However, CF-based methods usually suffer from the sparsity of user-item interactions and the cold-start problem [2,3,4]. To address these limitations, researchers incorporate auxiliary information into CF to optimize and enrich the representations of users or items [5, 6], such as item attributes [7], social networks [8, 9], texts [10,11,12], and images [10, 13]. With a clear structure and abundant resources, knowledge graphs (KGs) are becoming the preferred choice of auxiliary information. KGs are semantic networks that reveal the relationships between entities: the nodes of a KG represent entities, and the edges represent relationships. KGs are essentially directed heterogeneous information networks.

KG-based recommendation systems effectively alleviate the problems of data sparsity and cold-start, improving recommendation quality in terms of accuracy, diversity, and interpretability [5]. KG-based methods can be roughly divided into three categories: embedding-based methods, path-based methods, and neighborhood-based methods. Embedding-based methods embed the nodes and edges of the KG into a low-dimensional space. The resulting low-dimensional vectors retain the inherent properties of the graph but ignore the semantic relationships along paths. Take a movie knowledge graph as an example: Better Days stars Zhou Dongyu and Jackson Yee. Zhou Dongyu also starred in Soulmate, whose genre is drama/love/youth; this is represented by the path \(Better Days {\mathop {\longrightarrow }\limits ^{Starring}} Zhou Dongyu {\mathop {\longrightarrow }\limits ^{Starring}} Soulmate {\mathop {\longrightarrow }\limits ^{Genre}} drama/love/youth\). The genre of Better Days is also drama/love/youth, i.e., \(Better Days {\mathop {\longrightarrow }\limits ^{Genre}} drama/love/youth\). Embedding-based methods usually only learn embedding vectors for the entities Better Days and drama/love/youth and the relation Genre; the path \(Better Days {\mathop {\longrightarrow }\limits ^{Starring}} Zhou Dongyu {\mathop {\longrightarrow }\limits ^{Starring}} Soulmate\), i.e., the semantic relationship between Better Days and Soulmate, is ignored. Path-based methods explore various connections among items in the KG, thereby providing additional guidance for recommendations. However, they rely heavily on manually designed features to express path semantics, and path semantics in turn rely on domain knowledge [5]. In addition, manually designed features are usually incomplete and cannot cover all possible entity relationships, hindering the improvement of recommendation quality. To make up for the deficiencies of the above methods, neighborhood-based methods exploit the structure of the KG in a center-neighborhood manner, taking full advantage of the network structure of the knowledge graph; their vector form is also convenient for the numerical modeling of entities and relationships [14]. Neighborhood-based methods are usually based on graph neural networks [15,16,17] and combine the advantages of both embedding-based and path-based methods: they use not only the semantic representations of KGs but also the connectivity structure among items. However, most existing neighborhood-based methods ignore the diversity of item information (e.g., texts) and the feature interactions between neighboring nodes at the same level of the KG during propagation and aggregation.

Although many existing methods have proposed effective models, we argue that two problems need further study. (1) Missing Text Semantics. Ignoring the diversity of item information (e.g., texts) causes a certain amount of information loss, resulting in incomplete item representations. Texts contain rich information and have a great impact on users' preferences and potential interests. For example, when a user wants to watch a movie, he may first browse the related introductions and reviews; such textual information strongly influences people's final decision-making behavior. Therefore, it is necessary to introduce textual information into the recommendation model to enrich the feature representations of items. (2) Ignoring Feature Interactions between Neighboring Nodes. Without feature interactions, we sometimes not only lose information but may also draw wrong conclusions, i.e., Simpson's paradox. Existing methods usually use a graph neural network (GNN) and a linear collector (LC) to aggregate neighbor information. However, these methods ignore the interactions between the neighboring nodes of a particular node at the same level of the KG, as shown in Fig. 1. A feature interaction combines several features into a joint signal: the product of two features forms a simple second-order interaction feature. This multiplicative relationship is analogous to the logical operator "AND" and can represent a conjunctive condition such as "Female" AND "Emotional Drama". Suppose there are three independent variables \(x_1\), \(x_2\), \(x_3\); their second-order feature interactions can be expressed as \(y=w_1 x_1 x_2+w_2 x_1 x_3+w_3 x_2 x_3\). When making recommendations, combinations of features may be more meaningful than individual features and better reflect the preferences of certain groups: for example, "young" "female" users may prefer "trendy dramas", whereas "young" "male" users may prefer "police films". We are inspired by the excellent results of the Factorization Machine (FM) [18], whose success proves that feature interactions can mine users' potential preferences. Existing neighborhood-based methods, however, rarely consider feature interactions between neighboring nodes. Consequently, when incorporating entities from knowledge graphs for recommendations, the second-order feature interactions among nodes should be considered.
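
To make this concrete, the following minimal sketch (ours, for illustration; the function name and the weights are hypothetical) computes the weighted second-order interactions defined above:

```python
import numpy as np

def second_order_interactions(x, w):
    """Weighted sum of all pairwise products x_i * x_j with i < j.

    For x = (x1, x2, x3) this reproduces
    y = w1*x1*x2 + w2*x1*x3 + w3*x2*x3 from the text.
    """
    pairs = [x[i] * x[j] for i in range(len(x)) for j in range(i + 1, len(x))]
    return float(np.dot(w, pairs))

# Binary features standing in for "Young", "Female", "likes police films":
x = np.array([1.0, 1.0, 0.0])
w = np.array([0.8, 0.1, 0.1])           # illustrative interaction weights
print(second_order_interactions(x, w))  # 0.8: the Young-AND-Female pair fires
```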

Fig. 1 High-order connectivity and feature interactions in knowledge graph

To address these issues, we introduce item text semantics and second-order feature interactions between neighboring nodes into the model. Specifically, we introduce an item knowledge representation module and an item text representation module, and propose the Bilinear Knowledge-aware Graph Neural Network Fusing Text Information (BKGNN-TI). There are two focuses here. (1) Item Knowledge Representation Module. The information in the KG is extracted by the item knowledge representation module, in which we introduce a Bilinear Collector (BC) to capture the second-order feature interactions. The challenge of this module is how to build the BC so as to better capture the interaction information between the features of neighboring nodes in the KG. The BC explicitly models the interaction between two nodes, i.e., it explicitly encodes the second-order interactions between neighboring nodes at the same level of the knowledge graph and enhances the representation of the central node. (2) Item Text Representation Module. Normally, node representations in a KG are randomly initialized, so item representations are computed from random initializations when only the KG is used as auxiliary information. We introduce textual information to give specific semantics to item representations, so that items with similar semantics are closer to each other. We extract the textual information of items with the item text representation module. The challenge of this module lies in choosing a network structure that extracts text features accurately. Inspired by the application of ALBERT in Natural Language Processing (NLP), BKGNN-TI builds an item text representation module based on ALBERT, Bi-LSTM, and an Attention mechanism to better exploit the contextual information of the texts and further enrich the feature representations of items.

Our contributions in this paper are summarized as follows:

1. We propose a Bilinear Knowledge-aware Graph Neural Network Fusing Text Information (BKGNN-TI). In addition to the existing item representation information, we introduce the second-order feature interactions between neighboring nodes and the item text semantics, which yield more complete item representations with less information loss.

2. In addition to the publicly available Movielens-20M dataset, we build our own IPTV dataset, which consists of two parts: the IPTV Movie Knowledge Graph and the IPTV interaction data. Existing models usually use only English datasets, so we constructed this Chinese dataset as a contrast.

3. Experiments on two real-world recommendation scenarios prove the efficacy of BKGNN-TI over several state-of-the-art baselines. In addition, since Movielens-20M is an English dataset and our IPTV is a Chinese dataset, the experiments show that BKGNN-TI generalizes to both Chinese and English datasets.

The remainder of this paper is organized as follows. Section 2 introduces related work, covering recommendations with knowledge graphs and recommendations using textual features. We define the problem in Section 3 and present the proposed method BKGNN-TI in detail in Section 4. After that, we conduct experiments in Section 5. Finally, we conclude the paper in Section 6.

2 Related Work

2.1 Recommendations with Knowledge Graphs

Knowledge graph-based recommendation algorithms can be split into three main categories: embedding-based methods, path-based methods, and neighborhood-based methods.

Embedding-based methods learn vector representations of entities/relations in the KG for recommendations by utilizing Knowledge Graph Embedding (KGE) [19]. Specifically, the low-rank embeddings of the KG are obtained by applying KGE, and the KG is introduced into the recommendation system in the form of embedding vectors. CKE [10] divides knowledge information into three sections: structured knowledge based on the knowledge graph, text knowledge based on texts, and visual knowledge based on images. CKE utilizes multiple approaches to extract and fuse the three types of knowledge separately to improve the quality of the recommendation system. Ai et al. [20] proposed a knowledge base representation learning framework; their model is a collaborative filtering model constructed on a user-item knowledge graph. Using TransE [21], user vectors, relation vectors, and item vectors are generated from the user-item knowledge graph. Embedding-based methods take advantage only of the semantic information in the KG and do not make use of the semantic relationships along multi-hop paths between entities; this kind of method therefore lacks interpretability.

Path-based methods are also known as recommendation algorithms based on Heterogeneous Information Networks (HIN) [22]. This type of method regards KGs as heterogeneous information networks and constructs meta-path [22] or meta-graph [23, 24] based features between items. This methodology explores connectivity patterns among items in the KG, naturally bringing interpretability to the recommendation. HeteRec [25] is an implicit feedback recommendation model based on item-item similarity. The model first uses PathSim [22] to calculate the similarity of item pairs along different paths in the KG, then combines it with matrix factorization to obtain the latent feature vectors of users and items, and finally uses Bayesian ranking optimization to learn the model weights and obtain the recommendation results. Yu et al. [26] further optimize HeteRec [25] by directly using the user-item similarity to represent the user diffusion scores, which utilizes the paths of the knowledge graph more fully and improves the recommendation effect. Although path-based methods are interpretable, they rely heavily on manually designed features to express path semantics, and path semantics in turn rely on domain knowledge. For example, a designed meta-path "movie-directed by-director" expresses the director information of a movie.

Neighborhood-based methods were developed to complement the two methods above, each of which exploits only one aspect of the information in the KG. The idea of embedding propagation, i.e., using multi-hop neighbors in the KG, is utilized in neighborhood-based methods to enrich the representations of users or items, so that the potential information of the KG can be fully mined. RippleNet [5] proposes an end-to-end model that extracts user-centric neighborhood information and automatically, iteratively expands users' latent interests along the links in the KG. KGCN [15] mines item neighborhoods, sampling from the neighborhood of each entity in the KG and regarding the samples as its receptive field. The model then combines the neighborhood information to obtain the feature representation of a given entity and automatically discovers the high-order structure and semantic representation of the KG. Similar to KGCN [15], KGNN-LS [16] extends the GNN architecture to knowledge graphs to simultaneously capture semantic relationships between items as well as personalized user preferences, using a user-specific relation scoring function. Meanwhile, the model introduces label smoothness [27] to regularize the edge weights and prevent overfitting. KGAT [17] explicitly models the high-order interactions in the KG in an end-to-end manner. Neighborhood-based methods benefit from both the semantic representations of KGs and their connectivity patterns, leading to better recommendations; moreover, they are able to interpret the recommendation results. Although neighborhood-based methods have made great progress by exploiting both the semantic representations of KGs and the connectivity structure between items, existing neighborhood-based methods ignore the diversity of item information and the second-order feature interactions.

2.2 Recommendations with Text Features

The text features of items contain rich information, which helps to further infer users' preferences and potential interests. Introducing texts can enrich the feature representations of users and items, improving recommendation accuracy. For example, when a user decides whether to watch a certain movie, he may first browse the introductions and reviews; item texts can therefore be used to enrich the feature representations of items. TriRank [28] indicates that, aside from users' ratings, their affiliated reviews often provide the rationale for their ratings and identify which aspects of the items they care most about. DeepCoNN [12] learns item properties and user behaviors jointly from review texts with two parallel neural networks. RMR [29] utilizes review texts with a topic model. CoA-CAMN [30] combines visual and textual information for Twitter hashtag recommendation, using a cross-attention mechanism to collectively consider the textual content, the historical behavior of tweeters, and the latent interests of candidate users. With the rapid development of convolutional neural networks (CNN), their outstanding performance in NLP tasks has gradually drawn the attention of many researchers. Compared with traditional word embedding models, CNNs can better capture contextual features. DAN [31] is a deep attention network for news recommendation; it utilizes two parallel convolutional neural networks (PCNN) to capture the features of news titles and summaries, and then combines them with an attention mechanism that calculates the different influences of a given user's historical behaviors on candidate news. The deep knowledge-aware network (DKN) [32] adds knowledge graph representations to the overall CNN structure. DKN aligns words and entities and inputs them into a knowledge-aware convolutional neural network (KCNN), fuses the semantic-level and knowledge-graph-level representations to obtain the embedding vectors of the texts, and then combines them with the dynamic features of the user to obtain click predictions. Wang et al. [33] propose a feature representation method based on short texts: a short text is first conceptualized to obtain a concept vector, the relevant concepts of the words and the short text are combined to construct an embedding vector, and the result is finally input into a CNN for training.

As discussed above, to refine the item feature representations, both the feature interactions between neighboring nodes and the item text semantics need to be considered. For the first point, two types of interaction information are involved in KG-based recommendation systems: the interactions between nodes at different levels and those between neighboring nodes at the same level. BKGNN-TI emphasizes the importance of second-order feature interactions between neighboring nodes while capturing the high-order connectivity of the knowledge graph. More specifically, the second-order feature interaction mentioned here refers to the interaction information between same-level neighbor nodes in the knowledge graph. For the second point, since the diversity of item information is not considered in existing neighborhood-based methods, the textual information of items is used for recommendations in addition to the KG. Inspired by natural language processing, BKGNN-TI builds an item text representation module based on ALBERT [34], Bi-LSTM [35], and Attention [36] to process the item text information and further enrich the feature representations of items.

3 Problem Formulation

We formulate the recommendation task in this paper as follows. Given the user-item interaction matrix Y, the item texts T, and the knowledge graph G, our goal is to predict whether user u will engage with an item v with which he has not interacted before, and to recommend high-probability items to users. The schematic diagram of the input information is shown in Fig. 2, which includes part of the interaction information, the knowledge graph, and the text information. Specifically, we aim to learn the prediction function \(\hat{y}_{uv}=F(u,v\mid \Theta ,Y,T,G)\), where \(\hat{y}_{uv}\) denotes the prediction score, i.e., the probability that user u will interact with item v. The prediction function F is our recommendation algorithm BKGNN-TI, and \(\Theta\) denotes the model parameters of F.

Fig. 2 Diagram of input information

The specific meanings of the variables mentioned above are as follows. We have a set of M users \(U=\{u_1,u_2,\ldots ,u_M\}\) and a set of N items \(V=\{v_1,v_2,\ldots ,v_N\}\). The set of texts corresponding to these N items is \(T=\{t_1,t_2,\ldots ,t_N\}\). The interaction matrix \(Y\in {R^{M\times N}}\) between users and items is obtained from the users' historical interactions, which may come from explicit feedback (e.g., previous ratings) or implicit feedback (e.g., clicking, watching, browsing, or purchasing history). We utilize implicit feedback for recommendations and define each entry \(y_{uv}\) of the interaction matrix as follows:

$$\begin{aligned} y_{uv} = \left\{ \begin{array}{ll} 1, & \text{if user } u \text{ engaged with item } v \\ 0, & \text{otherwise} \end{array} \right. \end{aligned}$$
(1)

The knowledge graph G is represented in the form of entity-relation-entity triples (h, r, t). Here \(h\in E\), \(r\in R\), and \(t\in E\) denote the head entity, relation, and tail entity of a knowledge triple, and E and R are the sets of entities and relations in the KG, respectively. For example, the triple (Jackson Yee, actor, Better Days) describes the fact that Jackson Yee is an actor of the film "Better Days". The triple (h, r, t) means that node h is directly connected to node t through relation r, so node t is defined as a first-order neighbor of node h. If the KG contains two triples \((h_1,r_1,t_1)\) and \((t_1,r_2,t_2)\), node \(h_1\) can be connected to node \(t_2\) through relation \(r_1\), node \(t_1\), and relation \(r_2\), so node \(t_2\) is a second-order neighbor of node \(h_1\). By analogy, node \(t_n\) is defined as a high-order neighbor of node \(h_1\) if it can be connected to node \(h_1\) by more than two nodes. In addition, if the KG contains two triples \((h_1,r_1,t_1)\) and \((h_1,r_2,t_2)\), node \(t_2\) is defined as a same-level neighbor of node \(t_1\). In the recommendation system, an item \(v\in V\) can be aligned with an entity \(e\in E\) in the KG; for example, the movie "Better Days" also appears as an entity of the same name in the knowledge graph. This correspondence makes it possible to introduce the knowledge graph into the recommendation system.
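
The following toy snippet (our illustration; the triples and their edge directions are hypothetical) shows how the first-order and same-level neighbor definitions above follow from a triple list:

```python
from collections import defaultdict

# Hypothetical (h, r, t) triples; edge direction is for illustration only.
triples = [
    ("Better Days", "actor", "Jackson Yee"),
    ("Better Days", "actor", "Zhou Dongyu"),
    ("Better Days", "genre", "drama/love/youth"),
    ("Soulmate", "actor", "Zhou Dongyu"),
]

N = defaultdict(list)  # first-order neighbors N(h)
R = defaultdict(list)  # relations R(h) to those neighbors
for h, r, t in triples:
    N[h].append(t)     # t is a first-order neighbor of h
    R[h].append(r)

# Tails sharing the same head are same-level neighbors of each other:
print(N["Better Days"])  # ['Jackson Yee', 'Zhou Dongyu', 'drama/love/youth']
```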

4 The Proposed Method

In this section, we introduce the bilinear collector (BC) and the item text representation module to build the Bilinear Knowledge-aware Graph Neural Network Fusing Text Information (BKGNN-TI). BKGNN-TI obtains richer item features and thus improves the performance of the recommendation system. Figure 3 shows the framework of BKGNN-TI, which mainly consists of three parts: the item knowledge representation module (green sections in Fig. 3), the item text representation module (orange sections in Fig. 3), and the prediction module. The knowledge feature representations are extracted from the KG in the item knowledge representation module, while the text feature representations are extracted from the item texts in the item text representation module. Based on the similarity between items and users, the probability of a user interacting with an item is calculated in the prediction module.

Fig. 3 The framework of BKGNN-TI

4.1 Item Knowledge Representation Module

In the item knowledge representation module, the KG, user information, and interactions are imported to construct a unique item-v-centric knowledge graph for each user. Item knowledge is extracted from the KG comprehensively by iteratively aggregating the linear neighbor information and the bilinear feature interaction information, capturing both the high-order connectivity of the KG and the second-order feature interactions between neighbor nodes. As shown in Fig. 3, the module can be divided into an embedding layer and an information propagation layer.

4.1.1 Embedding Layer

The knowledge graph G is input into the item knowledge representation module as (entity, relationship, entity) triples. The neighbor entities and neighbor relationships of item v are first extracted from the KG. N(v) denotes the set of first-order neighbor entities of item v, and R(v) denotes the set of relationships between item v and its first-order neighbors. Since the number of neighbors can be large, a fixed number H of neighbor entities is randomly selected to construct the set of neighbor entities S(v) and the set of neighbor relations Q(v), where \(S(v)\triangleq \{e \mid e\sim N(v) \}\) with \(\mid S(v)\mid =H\), and \(Q(v)\triangleq \{r \mid r\sim R(v) \}\) with \(\mid Q(v)\mid =H\).

With the embedding technique, user IDs, item IDs, entity IDs, and relationship IDs are converted into embedding vectors in the embedding layer, where the user vector \(u\in R^d\), item vector \(v\in R^d\), entity vector \(e\in R^d\), and relationship vector \(r\in R^d\), and d is the dimension of the embedding vectors.
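
A minimal sketch of this layer (ours; the table sizes and IDs are made up, while H = 8 and d = 32 follow Sec. 5.2) might look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 32, 8                          # embedding dim and neighbor count (Sec. 5.2)
n_entities, n_relations = 1000, 20    # hypothetical table sizes

entity_emb = rng.normal(size=(n_entities, d))     # randomly initialized tables
relation_emb = rng.normal(size=(n_relations, d))

def sample_neighbors(neighbor_ids, relation_ids, H, rng):
    """Draw a fixed number H of (entity, relation) neighbor pairs; sampling
    with replacement keeps |S(v)| = H even for items with few neighbors."""
    idx = rng.choice(len(neighbor_ids), size=H, replace=True)
    return [neighbor_ids[i] for i in idx], [relation_ids[i] for i in idx]

S_v, Q_v = sample_neighbors([5, 9, 42], [1, 1, 3], H, rng)  # toy neighbor lists
S_emb = entity_emb[S_v]     # (H, d) neighbor entity vectors e
Q_emb = relation_emb[Q_v]   # (H, d) corresponding relation vectors r
```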

4.1.2 Information Propagation Layer

The information propagation layer can be further subdivided into two parts: information collection and information fusion. In the former, the high-order connectivity of the KG is captured by a linear collector built on a graph convolutional network (GCN), and the second-order feature interactions between neighboring nodes are captured by a bilinear collector. In the latter, the two aspects are fused to obtain more complete item knowledge and generate the item knowledge vectors.

The linear collector (LC) aggregates neighbor vectors by a linear weighted sum to generate linear vectors, as shown in Fig. 4. Different users pay different attention to homogeneous relationships. Taking movie recommendation as an example, some users prefer to watch programs of the same genre, while others prefer programs produced in a certain country. Therefore, the "user preference for the relationship" is defined as the weight for the linear aggregation of item neighbors, calculated as follows:

$$\begin{aligned} \pi _r^u = g_u(r) = g(u,r) \end{aligned}$$
(2)

where u is the user embedding vector, r is the relationship embedding vector, and \(g: {R^d}\times {R^d} \rightarrow R\) is a function measuring the similarity between the user and the relationship, such as the inner product. The linear aggregation weights of the item neighbors are normalized by the following formula:

$$\begin{aligned} \tilde{\pi }_r^u = \frac{exp(\pi _r^u)}{\sum _{e\in S(v)}exp(\pi _r^u)} \end{aligned}$$
(3)

where S(v) is the set of neighbor entities of item v; in the denominator, r denotes the relationship connecting item v to each neighbor e.

The linear aggregation vector of item v with the normalized weights can be expressed as follows:

$$\begin{aligned} LC(v_{S(v)}^u) = \sum _{e\in S(v)}{\tilde{\pi }_r^u}e \end{aligned}$$
(4)

where entity e is a first-order neighbor of item v, and r is the relationship between item v and entity e.
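
A compact sketch of the linear collector (ours, with random stand-in vectors) that follows Eqs. (2)-(4):

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 32, 8
u = rng.normal(size=d)             # user embedding
S_emb = rng.normal(size=(H, d))    # neighbor entity vectors e in S(v)
Q_emb = rng.normal(size=(H, d))    # relation vectors r linking v to each e

def linear_collector(u, S_emb, Q_emb):
    """LC: inner-product scores pi_r^u (Eq. 2), softmax normalization
    (Eq. 3), then the weighted sum of neighbor vectors (Eq. 4)."""
    scores = Q_emb @ u                      # g(u, r) as the inner product
    w = np.exp(scores - scores.max())       # numerically stable softmax
    w /= w.sum()
    return w @ S_emb                        # LC(v_{S(v)}^u), shape (d,)

lc_vec = linear_collector(u, S_emb, Q_emb)
```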

Fig. 4 Illustration of linear collector

The bilinear collector (BC) interacts pairs of adjacent nodes in the KG and captures the second-order feature interactions of neighboring nodes through polynomial interaction. BKGNN-TI is inspired by, but different from, KG-BGAT [37]. Although KG-BGAT designs a bilinear collector based on the principle of FM, its input is a Collaborative Knowledge Graph (CKG), so it blurs the boundary between users, items, and item attributes when performing feature interactions, which departs from FM. BKGNN-TI is instead based on the item knowledge graph: this KG does not include user information, but only item and item-attribute information. The model performs feature interactions between item attributes, which is in line with the principle of FM. The bilinear aggregation vectors are calculated as follows:

$$\begin{aligned} BC(v_{\tilde{S}(v)}^u) = \frac{1}{C^2_{\tilde{S}(v)}}{\sum _{e_i\in {\tilde{S}(v)}}}\;{\sum _{e_j\in {\tilde{S}(v)},\, j>i}}e_i\odot e_j \end{aligned}$$
(5)

where \(\tilde{S}(v)\) is the set of neighbor entities together with v itself; \(C^2_{\tilde{S}(v)}\) is the number of combinations of any two nodes in \(\tilde{S}(v)\); \(e_i\) and \(e_j\) are both nodes in \(\tilde{S}(v)\) with \(i<j\); and \(\odot\) denotes the Hadamard product, i.e., the element-wise multiplication of two vectors. With the bilinear collector, a strong signal shared by two nodes on a feature is enhanced and a weak signal is weakened; thus, the interaction information of the features is captured.
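
The pairwise Hadamard aggregation of Eq. (5) can be sketched as follows (our code; \(\tilde{S}(v)\) is a random stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
S_tilde = rng.normal(size=(9, d))   # neighbors of v plus v itself, |S~(v)| = H+1

def bilinear_collector(S_tilde):
    """BC (Eq. 5): average of Hadamard products e_i * e_j over all
    unordered node pairs with i < j."""
    n, d = S_tilde.shape
    acc = np.zeros(d)
    for i in range(n):
        for j in range(i + 1, n):
            acc += S_tilde[i] * S_tilde[j]   # element-wise (Hadamard) product
    return acc / (n * (n - 1) / 2)           # divide by C(n, 2) combinations

bc_vec = bilinear_collector(S_tilde)
```

As in FM, the double loop can also be computed in linear time via the identity \(\sum _{i<j} e_i\odot e_j = \frac{1}{2}\big ((\sum _i e_i)\odot (\sum _i e_i) - \sum _i e_i\odot e_i\big )\).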

Both the high-order connectivity and the second-order feature interactions in the KG are captured in the information fusion layer, as shown in Fig. 5. First, the information from the linear and bilinear collectors within a single layer is integrated to form a single-layer neighbor aggregation vector; then the vector is iteratively propagated using GCNs. The neighbor aggregation vector of item v is defined as follows:

$$\begin{aligned} v_{S(v)}^u = (1-\alpha )\cdot LC(v_{S(v)}^u)+ \alpha \cdot BC(v_{\tilde{S}(v)}^u) \end{aligned}$$
(6)

where \(\alpha \in \left[ 0,1\right]\) is a hyperparameter that controls the proportion of the bilinear collector. An aggregator is then used to fuse the item vector v with the neighbor aggregation vector \(v_{S(v)}^u\).

Fig. 5 Schematic overview of item knowledge representation module (take the information aggregation of item v as an example: the set of neighbor entities of item v is \(\{e_1,e_2,e_3\}\), and the set of neighbor entities of \(e_1\) is \(\{e_4,e_5,e_6\}\); similarly, the sets of neighbor entities of \(e_2\) and \(e_3\) are known. In the first layer of the GCN, the linear collector aggregates \(e_4\), \(e_5\), and \(e_6\) to \(e_1\) to obtain \(e_1^u[1]_{LC}\); similarly, \(e_2^u[1]_{LC}\) and \(e_3^u[1]_{LC}\) are obtained, and aggregating \(e_1\), \(e_2\), and \(e_3\) to item v gives \(v^u[1]_{LC}\). The bilinear collector computes \(e_1^u[1]_{BC}\) by aggregating \(e_1\odot e_4\), \(e_1\odot e_5\), \(e_1\odot e_6\), \(e_4\odot e_5\), \(e_4\odot e_6\), and \(e_5\odot e_6\); similarly, \(e_2^u[1]_{BC}\), \(e_3^u[1]_{BC}\), and \(v^u[1]_{BC}\) are obtained. The weighted sum of \(e_1^u[1]_{LC}\) and \(e_1^u[1]_{BC}\) gives \(e_1^u[1]\), and then \(e_1\) and \(e_1^u[1]\) are aggregated into \(e_{1,kg}^u [1]\) using an aggregator; similarly, \(e_2^u[1]\), \(e_3^u[1]\), and \(v^u[1]\) of the first layer are obtained. In the second layer of the GCN, the operations of the first layer are repeated to obtain the final \(v^u[2]\).)

In this paper, three aggregators, namely the sum aggregator (sum), the concat aggregator (concat), and the neighbor aggregator (neighbor), are used to fuse the vectors; their aggregation effects are compared in the subsequent experiments.

Sum Aggregator: The knowledge representation vector is obtained by adding the neighbor aggregation vector to the item vector and then applying a nonlinear transformation, as follows:

$$\begin{aligned} v_{kg}^u=\sigma (W \cdot (v+v_{S(v)}^u)+b) \end{aligned}$$
(7)

where W is the weight matrix, b is the bias, and \(\sigma\) is a nonlinear activation function, such as tanh or ReLU.

Concat Aggregator: The knowledge representation vector is obtained by concatenating the item vector with the neighbor aggregation vector and then applying the activation function, as follows:

$$\begin{aligned} v_{kg}^u=\sigma (W \cdot (v, v_{S(v)}^u)+b) \end{aligned}$$
(8)

Neighbor Aggregator: The knowledge representation vector is obtained by directly applying a nonlinear transformation to the neighbor aggregation vector, as follows:

$$\begin{aligned} v_{kg}^u=\sigma (W \cdot v_{S(v)}^u+b) \end{aligned}$$
(9)
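
Hedged sketches of the three aggregators (ours; the weights are random stand-ins, and ReLU is one of the activations named above). The input v_agg corresponds to the LC/BC mixture of Eq. (6):

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 32, 0.6
W, b = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
W2 = rng.normal(size=(d, 2 * d)) * 0.1         # concat doubles the input dim
relu = lambda x: np.maximum(x, 0.0)

v = rng.normal(size=d)                         # item vector
lc_vec, bc_vec = rng.normal(size=d), rng.normal(size=d)
v_agg = (1 - alpha) * lc_vec + alpha * bc_vec  # Eq. 6: mix LC and BC outputs

def sum_aggregator(v, v_agg):                  # Eq. 7
    return relu(W @ (v + v_agg) + b)

def concat_aggregator(v, v_agg):               # Eq. 8
    return relu(W2 @ np.concatenate([v, v_agg]) + b)

def neighbor_aggregator(v, v_agg):             # Eq. 9: ignores v itself
    return relu(W @ v_agg + b)
```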

The information fusion layer gradually aggregates the outermost information toward item v by repeating the above entity aggregation several times, thus achieving high-order propagation of information. The iterative update is as follows:

$$\begin{aligned} v_{kg}^u \left[ h \right] =f(v_{kg}^u\left[ h-1 \right] , v_{S(v)}^u\left[ h-1 \right] ) \end{aligned}$$
(10)

where \(v_{kg}^u \left[ h \right]\) is the h-order knowledge representation vector, \(f(\cdot )\) is the aggregator, \(v_{kg}^u\left[ h-1 \right]\) is the \((h-1)\)-order knowledge representation vector, and \(v_{S(v)}^u\left[ h-1 \right]\) is the \((h-1)\)-order neighborhood representation vector obtained from the previous aggregation.
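
The iterative update of Eq. (10) then amounts to a simple loop over GCN layers; the sketch below (ours) stubs out the per-layer neighborhood collection:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 32, 2

def collect_neighborhood(h):
    """Stand-in for v_{S(v)}^u[h-1]: a real implementation would run the
    LC/BC mixture of Eq. 6 over the (h-1)-order representations."""
    return rng.normal(size=d)

def aggregator(v_kg, v_neigh):
    """Stand-in aggregator f (any of Eqs. 7-9); weights omitted."""
    return np.tanh(v_kg + v_neigh)

v_kg = rng.normal(size=d)                 # 0-order item representation
for h in range(1, n_layers + 1):          # Eq. 10, one pass per GCN layer
    v_kg = aggregator(v_kg, collect_neighborhood(h))
```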

4.2 Item Text Representation Module

The inputs of the item text representation module are the item texts, and the outputs are the item text vectors. In principle, the inputs can be any kind of text; taking movie recommendation as an example, the texts include program titles and user comments. However, to avoid the influence of other people's subjective factors, the more objective program titles and program introductions, rather than the user comments, are taken as the inputs of the text module. This design is inspired by Hierarchical Attention Networks (HAN) [38], which are based on the idea of generating sentence vectors from word vectors.

The framework of the item text representation module is shown in Fig. 6. First, ALBERT is used to extract the word vectors; then a Bidirectional Long Short-Term Memory (Bi-LSTM) network extracts features from the word vectors, owing to its ability to capture bidirectional semantic dependencies; finally, an Attention mechanism captures the importance of each word in the sentence to generate the final sentence vectors. The text vectors are thus obtained by an ALBERT + Bi-LSTM + Attention network.

Fig. 6 The framework of item text representation module

Before importing the text data into ALBERT, the data must be padded. To unify the text lengths of all titles (which are generally short) and introductions (which contain more words) by truncating or padding them, the maximum length of the titles and the average length of the introductions are taken as the standards. Suppose the number of programs is \(n_v\), the maximum length of the titles is \(L_{Tmax}\), and the average length of the introductions is \(L_{Davg}\); then the dimension of the title input matrix is \([n_v,L_{Tmax}]\) and the dimension of the introduction input matrix is \([n_v,L_{Davg}]\).
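
A minimal padding/truncation helper (ours; the pad token and lengths are illustrative):

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Unify sequence length: truncate long texts, pad short ones.

    max_len is L_Tmax for titles and L_Davg for introductions."""
    return token_ids[:max_len] + [pad_id] * max(0, max_len - len(token_ids))

print(pad_or_truncate([101, 2198, 3899], 6))      # [101, 2198, 3899, 0, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 6))  # [1, 2, 3, 4, 5, 6]
```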

To mine the deep semantic features of the texts, the vectors output by ALBERT need to be further word-encoded or sentence-encoded. Bi-LSTM, as an encoder that fully considers contextual information, can accurately capture bidirectional semantic dependencies. Since BKGNN-TI eventually fuses the output vectors of the item knowledge representation module and the item text representation module, the dimensions of the item text vectors and the item knowledge vectors should be consistent. Assuming the dimension of the item knowledge vector is d, the dimensions of the item text vector and of each Bi-LSTM hidden layer should be set to d and d/2, respectively.

To assign different weights to words according to their importance, Attention networks are connected after the Bi-LSTM when generating sentence vectors. First, the input vector \(h_i\) is converted into a key vector, a value vector, and a query vector in three different ways: the key vector \(k_i\) is generated from \(h_i\) by an MLP layer, the value vector is \(h_i\) itself, and the query vector is a randomly initialized weight vector q. The calculations are as follows:

$$\begin{aligned} k_i&=\tanh (Wh_i+b) \end{aligned}$$
(11)
$$\begin{aligned} \alpha _i&=\frac{exp(k_i^Tq)}{\sum _{i}exp(k_i^Tq)} \end{aligned}$$
(12)
$$\begin{aligned} s&=\sum _{i}\alpha _i h_i \end{aligned}$$
(13)

where \(\alpha _i\) is the attention score, s is the final output vector, and W and b are the weights and biases of the one-layer MLP, respectively.
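
The attention pooling of Eqs. (11)-(13) can be sketched as follows (our code; the Bi-LSTM outputs are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d = 10, 32
H_out = rng.normal(size=(n_words, d))  # Bi-LSTM outputs h_i (d/2 per direction)
W = rng.normal(size=(d, d)) * 0.1      # one-layer MLP weights
b = np.zeros(d)
q = rng.normal(size=d)                 # randomly initialized query vector

K = np.tanh(H_out @ W.T + b)           # Eq. 11: keys k_i
scores = K @ q                         # k_i^T q
alpha = np.exp(scores - scores.max())  # Eq. 12: softmax attention weights
alpha /= alpha.sum()
s = alpha @ H_out                      # Eq. 13: sentence vector
```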

There are two ways to derive sentence vectors with the ALBERT + Bi-LSTM + Attention network: one is to obtain the sentence vectors directly from ALBERT and then further extract their feature representations with Bi-LSTM; the other is to extract word vectors with the ALBERT + Bi-LSTM network and then aggregate the word vectors into sentence vectors by weighted summation through the Attention mechanism. Relatively speaking, the full ALBERT + Bi-LSTM + Attention structure better captures contextual information and assigns larger weights to important words. However, because the program introductions are long, the generated ALBERT word-vector matrix would be too large to be read into the subsequent network, causing memory overflow. Therefore, the former way is used for the program introductions and the latter for the program titles. In the item text representation module, the program titles are encoded as the title vector \(v_{title}\) and the program introductions as the introduction vector \(v_{dpt}\); the final text vector of the program is \(v_{text}=v_{title}+v_{dpt}\) with dimension d.

4.3 Prediction Module

In the prediction module, the item text vector and the item knowledge vector generated in the above steps are first fused into the item vector v. Then the similarity function \(f(\cdot )\) is used to calculate the relevance between the user and the item, and finally the sigmoid function restricts the similarity to the range (0, 1) as the prediction score, which indicates the probability of interaction between the user and the item. The equation is as follows:

$$\begin{aligned} \hat{y}=sigmoid(f(u,v)) \end{aligned}$$
(14)

where \(f(\cdot )\) is the similarity function, u is the user vector, and v is the item vector. In this paper, \(f(\cdot )\) is set to the inner product.
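
A one-function sketch of the prediction step (ours; how the knowledge and text vectors are fused into v is not fully specified here, so the additive fusion below is an assumption mirroring \(v_{text}=v_{title}+v_{dpt}\) above):

```python
import numpy as np

def predict(u, v_kg, v_text):
    """Eq. 14: sigmoid of the user-item inner product."""
    v = v_kg + v_text                    # fused item vector (assumption)
    return 1.0 / (1.0 + np.exp(-u @ v))  # sigmoid(f(u, v)), f = inner product

rng = np.random.default_rng(0)
d = 32
print(predict(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)))
```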

4.4 Learning Algorithm

BKGNN-TI uses cross-entropy as the loss function and introduces L2 regularization of the weight parameters into the objective function to constrain the model. Note that in a traditional GNN, the edge weights of the graph are fixed and do not change with the parameters, whereas the edge weights in BKGNN-TI depend on the users, which increases the number of parameters in the network and the risk of overfitting. To address this problem, similar to KGNN-LS [16], Label Smoothness Regularization (LSR) is introduced into the objective function, i.e., the loss of the Label Smoothness (LS) task is added as a regularization term to the final objective of the model. To define the label smoothness loss, the concept of an adjacency matrix is introduced. The adjacency matrix of a user is defined as \(A_u\in R^{|E|\times |E|}\), with entries \(A_u^{ij}=g(u,r_{e_i,e_j})\), where \(g(\cdot )\) is the above-mentioned function measuring the similarity between the user and the relationship, and \(r_{e_i,e_j}\) denotes the relationship between entities \(e_i\) and \(e_j\). In particular, if there is no relationship between \(e_i\) and \(e_j\), then \(A_u^{ij}=0\). Let \(\hat{l}_u(v)\) denote the prediction of the label smoothness task; the label smoothness loss can then be expressed as follows:

$$\begin{aligned} R(A)=\sum _{u}R(A_u)=\sum _{u}\sum _{v}J(y_{uv},\hat{l}_u(v)) \end{aligned}$$
(15)

where J is the cross-entropy loss, u is the user, v is the item, and \(y_{uv}\) is the true label.

With the introduction of the label smoothness regularization, the objective function of BKGNN-TI can be expressed as follows:

$$\begin{aligned} \min _{W,A}L=\min _{W,A}\sum _{u,v}J(y_{uv},\hat{y}_{uv})+\lambda R(A)+\gamma \Vert F\Vert _2^2 \end{aligned}$$
(16)

where W is the weight matrix of GNN, \(\Vert F\Vert _2^2\) is the L2-regularizer, which contains the relevant parameters of the aggregator and encoder, and \(\lambda\) and \(\gamma\) are hyperparameters to adjust the ratio of the label smoothness regularization and L2-regularizer in the objective function. The first term of this objective function is the cross-entropy loss, the second term is the label smoothness regularization, and the last term is the L2-regularizer.
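
Collecting the three terms, the objective of Eq. (16) can be sketched as follows (our code; the label-smoothness predictions \(\hat{l}_u(v)\) come from label propagation in the full model and are passed in here as a placeholder):

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-12):
    """Cross-entropy loss J for binary labels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def objective(y_true, y_pred, ls_pred, params, lam=1.0, gamma=1e-7):
    """Eq. 16: CTR cross-entropy + lambda * R(A) (Eq. 15) + gamma * L2.

    lam and gamma default to the values reported in Sec. 5.2."""
    ctr_loss = bce(y_true, y_pred)           # first term
    ls_loss = bce(y_true, ls_pred)           # label smoothness term R(A)
    l2 = sum(np.sum(p ** 2) for p in params)
    return ctr_loss + lam * ls_loss + gamma * l2
```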

5 Experiments

In this section, the datasets, experimental settings, baselines, and experimental results are described in turn.

5.1 Datasets

To validate the recommendation effect of the proposed BKGNN-TI, extensive experiments are conducted on two datasets: Movielens-20M and IPTV. The statistics of the two datasets are shown in Table 1.

MovieLens-20M is a widely used movie recommendation dataset containing about 20 million explicit ratings (ranging from 1 to 5) from the MovieLens website. The Movielens-20M used in this paper is the public dataset provided by KGNN-LS [16], which is constructed from both the Movielens-20M movie dataset and the Satori knowledge graph and mainly includes two parts: user-item interactions and the knowledge graph. The program texts required by BKGNN-TI consist of two parts, i.e., program titles and program introductions. However, Movielens-20M provides only the title information of the movies, so the corresponding program introductions are crawled from the IMDb movie website. Introductions that could not be crawled were supplemented by manual queries to complete the program texts.

IPTV is our self-built dataset, which includes 37,095,369 implicit interaction records collected from an IPTV platform in one province. The dataset is mainly composed of two parts: the IPTV movie knowledge graph and the IPTV interaction data, both derived from the user behavior data and program data collected from the IPTV system. For the knowledge graph, this work builds an IPTV movie knowledge graph based on the program data of the provincial IPTV platform, following the standard knowledge graph construction process and incorporating OwnThink. OwnThink is a third-party Chinese knowledge graph that contains more than twenty-five million entities with billions of entity-attribute relationships and thousands of movie-related attributes. It provides a full set of interfaces with relatively high knowledge quality and comprehensively supplements the IPTV program data.

After determining the knowledge sources, the IPTV movie triples are extracted from the IPTV program data and the movie knowledge graph subgraphs are extracted from the third-party knowledge graph. Then the IPTV movie triples and movie knowledge graph subgraphs are fused through the processes of entity cleaning, entity linking, attribute fusion, value normalization, and data merging to form the IPTV movie knowledge graph.

For the interaction data, the historical data of the IPTV platform over a period of time were pre-processed with data cleaning and negative sampling. The final IPTV dataset is obtained by fusing the knowledge graph with the interaction data. Within the specified period, the "watched or not" field is set to "1" for programs clicked by users and to "0" otherwise; in this way, each user's interaction history is obtained from the user behavior log file. The user-item interaction data were cleaned of duplicate and missing values, and noisy records (interactions with viewing time under 60 s) were eliminated. Since the interaction data are very sparse, low-frequency users and low-frequency movies (fewer than 20 interactions) were removed to alleviate the sparsity of IPTV. The program texts contain the program titles and introductions; they were cleaned, and meaningless special symbols (such as "&amp;") were removed or replaced with the corresponding characters.

Table 1 Basic statistics of the two datasets

5.2 Experimental Settings

In this paper, we focus on two experimental scenarios, i.e., CTR prediction and Top-K recommendation. In the CTR scenario, the trained model is applied to each sample in the test set and outputs the predicted click probability; AUC and F1 are used as evaluation metrics. In the Top-K recommendation scenario, the learned model selects the K items with the highest prediction probabilities as the recommendation results for each user; Precision@K and Recall@K are used as metrics (abbreviated below as P@K and R@K).

The ratio of the training, validation, and test sets is 6:2:2. The number of neighbors H for constructing the set of neighboring entities S(v) is set to 8; tests with different values of H show that more neighbors do not bring better results, and \(H=8\) is the most suitable value. The dimension d of the embedding layer in BKGNN-TI is 32, the number of GCN layers is 1, and the ratio of the bilinear collector \(\alpha\) is 0.6. The learning rate during training is \(l=2\times 10^{-2}\), the patience \(s_{stop}\) for early stopping is set to 5, and the number of training epochs \(n_{epoch}\) is 10. In the loss function, the coefficient of the label smoothness regularization \(\lambda\) is set to 1, and the coefficient of the L2-regularizer \(\gamma\) is \(1\times 10^{-7}\). These values are the optimal parameters obtained by tuning on the validation set.
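
For reference, the settings above can be collected in a single configuration (a convenience dict of ours, not the authors' code; the key names are invented):

```python
CONFIG = {
    "split_ratio": (0.6, 0.2, 0.2),  # train / validation / test
    "n_neighbors_H": 8,
    "embedding_dim_d": 32,
    "gcn_layers": 1,
    "bc_ratio_alpha": 0.6,
    "learning_rate": 2e-2,
    "early_stop_patience": 5,
    "n_epochs": 10,
    "ls_reg_lambda": 1.0,
    "l2_reg_gamma": 1e-7,
}
```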

5.3 Baselines

Multiple models are used as baselines for comparison experiments.

SVD [39] is a classical implicit factor model based on collaborative filtering in recommendation systems.

LibFM [40] is a feature-based factorization model for CTR prediction.

LibFM + TransE extends the LibFM method, utilizing the entity features learned by TransE [21] to introduce the KG to LibFM.

PER [26] is a representative path-based method. In this experiment, all "item-attribute-item" meta-paths are used as features of PER (e.g., "movie-director-movie").

CKE [10] is a representative of embedding-based method. In this experiment, CKE is implemented as a collaborative filtering plus a structural knowledge module.

RippleNet [5] is a representative of neighborhood-based method. It propagates user preferences over the knowledge graph for recommendation.

KGNN-LS [16] utilizes the idea of inward aggregation to aggregate the neighbors of the KG inward layer by layer, thus aggregating the high-order neighbor information into the item vector. In addition, the model introduces label smoothness regularization to prevent overfitting.

5.4 Results and Discussion

5.4.1 The Effect of BKGNN-TI on Movielens-20M

To comprehensively and objectively verify the recommendation effect of BKGNN-TI, comparison experiments for CTR prediction and Top-K recommendation were conducted on Movielens-20M, comparing BKGNN-TI with the various baselines. Table 2 shows the comparison results for CTR prediction, together with the improvement ratios \(Improve-AUC\) and \(Improve-F1\) of BKGNN-TI relative to each baseline, calculated as follows:

$$\begin{aligned} Improve-AUC&=\frac{AUC_{BKGNN-TI}-AUC_{baseline}}{AUC_{baseline}} \end{aligned}$$
(17)
$$\begin{aligned} Improve-F1&=\frac{F1_{BKGNN-TI}-F1_{baseline}}{F1_{baseline}} \end{aligned}$$
(18)

Table 3 and Fig. 7 show the comparison results of BKGNN-TI and the baselines for Top-K recommendation and give the improvement ratio Improve of BKGNN-TI relative to the best-performing baseline. The improvement ratio Improve is calculated as follows:

$$\begin{aligned} Improve =\frac{Recall@K_{BKGNN-TI}-Recall@K_{best_{baseline}}}{Recall@K_{best_{baseline}}} \end{aligned}$$
(19)

From Tables 2 and 3, it can be seen that BKGNN-TI outperforms the baselines, especially for Top-K recommendation. In Top-K recommendation, the improvement ratio Improve gradually decreases as the length K of the recommendation list increases: the absolute improvement in Recall@K of BKGNN-TI stays around 0.02 across different K, but Recall@K itself grows with K, so Improve gradually decreases.

Table 2 Comparative experiments of BKGNN-TI based on Movielens-20M for CTR prediction
Table 3 Comparative experiments of BKGNN-TI and baselines based on Movielens-20M for Top-K recommendation
Fig. 7 Comparative experiments of BKGNN-TI and baselines based on Movielens-20M for Top-K recommendation

5.4.2 The Effect of BKGNN-TI on IPTV

The same experiments as on Movielens-20M are conducted on IPTV for CTR prediction and Top-K recommendation. The experimental results are shown in Table 4.

Table 4 Comparative experiments of BKGNN-TI based on IPTV dataset

On the IPTV dataset, the AUCs of both KGNN-LS and BKGNN-TI reach above 0.9, which proves the validity and reasonableness of the IPTV dataset constructed in this paper. In addition, compared with KGNN-LS, BKGNN-TI has advantages on all metrics, which again verifies its excellent recommendation capability. From the overall analysis of the experimental results on Movielens-20M and IPTV, we find that BKGNN-TI performs well on both Chinese and English datasets.

5.5 Ablation Experiments

BKGNN-TI contains three important modules: the knowledge graph module, the bilinear collector module, and the item text representation module. To verify the effectiveness of each module, ablation experiments are conducted by appending the modules to the model in turn. The experimental results are shown in Table 5, where the KG module refers to the knowledge graph, the BC module to the bilinear collector, and the Text module to the item text representation module. The mark "\(\surd\)" means that the corresponding module is included in the experiment; otherwise, the module is not included.

Table 5 Ablation experiment of BKGNN-TI

From Table 5, it can be seen that the values of each evaluation metric increase as modules are added, for both CTR prediction and Top-K recommendation. This indicates that the recommendation effect improves, although the execution time also gradually increases. The effectiveness of each module is discussed as follows:

  • (1) Effectiveness of the knowledge graph. The knowledge graph introduces auxiliary information about items for recommendations, including item attributes and the relationships between entities. Take the movie "Better Days" as an example: it stars Zhou Dongyu and Jackson Yee, is directed by Derek Tsang, and its genre is drama/love/youth. "Better Days" is a node in the movie knowledge graph. By introducing the KG into the model, we can further exploit the information about the entities related to an item (such as Zhou Dongyu, Jackson Yee, Derek Tsang, and drama/love/youth), as well as the relationships between entities (such as actor, director, and category). The GCN aggregates each order of neighboring nodes in the KG into the item vector, effectively capturing the high-order connectivity of the KG and enriching the node representations.

  • (2) Effectiveness of the bilinear collector. The bilinear collector interacts the neighboring nodes in the KG in pairs, i.e., it performs feature combination over item neighbors. Feature combinations enable more precise mining of the reasons why users are interested in the current item and take multiple possibilities into account. For example, if two users like the movie "A Little Red Flower", user A may like it because of the actor "Jackson Yee" and the genre "campus drama", while user B may like it because of the actor "Haocun Liu" and the genre "warm drama". These possibilities can be directly and effectively captured by the feature combinations of the bilinear collector: BC explicitly models the interactions between the actor "Jackson Yee" and the genre "campus drama", and between the actor "Haocun Liu" and the genre "warm drama", enhancing the representation of the central node "A Little Red Flower". The potential interests of users can therefore be explored, and the accuracy of the model is improved.

  • (3) Effectiveness of the text feature module. To further consider the user's interest in program content, the text information of items is introduced alongside the knowledge graph as auxiliary information. The knowledge graph includes only the attribute tags of a program, not its specific content; however, the content of a program is usually the decisive factor in whether a user watches it. Therefore, introducing item texts further enriches the item representations.

5.6 Parameter Sensitivity

In the training stage of BKGNN-TI, some hyperparameters influence the training results and thus the final recommendation effect. Three parameters are selected to explore their effects on model performance: the number of samples in each training batch (\(batch\_size\)), the dimension d of the embedding layer in the item knowledge representation module, and the aggregator used in the item knowledge representation module. All parameter sensitivity experiments are performed on Movielens-20M. The test environments and versions of the model are shown in Table 6.

Table 6 Test environments and versions of the model

5.6.1 Impact of \(batch\_size\)

The sample size of each batch is denoted \(batch\_size\); it affects the optimization quality and speed of the model. Table 7 and Fig. 8 show the experimental results of BKGNN-TI with different values of \(batch\_size\). From Table 7, it can be seen that the training time tends to shorten as \(batch\_size\) increases. The accuracy initially increases until \(batch\_size\) reaches 8192, after which it starts to decrease. Therefore, \(batch\_size\) should be set to 8192 to balance the performance and speed of the model.

Table 7 Parameter sensitivity of BKGNN-TI w.r.t. \(batch\_size\)
Fig. 8 Parameter sensitivity of BKGNN-TI w.r.t. \(batch\_size\)

5.6.2 Impact of Embedded Dimensions

In the Top-K scenario, only the dimension of the embedding vector is varied to test the recommendation effect of BKGNN-TI. The experimental results are shown in Table 8 and Fig. 9. The accuracy improves as the dimension d increases when the dimension is small; however, beyond a certain threshold, the Recall starts to decrease. This is probably because when d is too small, the item and user vectors contain too little information to make effective recommendations, whereas when d is too large, they contain too much noise, which harms recommendation quality. It is therefore necessary to select, through experiments, a suitable dimension carrying a moderate amount of information, so as to achieve an acceptable balance between recommendation accuracy and efficiency.

Table 8 Parameter sensitivity of BKGNN-TI w.r.t. dimension of embedding
Fig. 9 Parameter sensitivity of BKGNN-TI w.r.t. dimension of embedding

5.6.3 Impact of Aggregators

To investigate the effects of different aggregators on the recommendation performance of BKGNN-TI, experiments were conducted with the sum aggregator (sum), the concat aggregator (concat), and the neighbor aggregator (neighbor). The results are shown in Table 9. From Table 9, it is clear that BKGNN-TI generally works better with the neighbor aggregator. One possible reason is that both the sum aggregator and the concat aggregator fuse the neighbors of an item with the item itself; due to the dual action of the GNN and the linear collector, the node information is then summed too many times over multiple iterations, producing redundant information, so the effect is not as good as with the neighbor aggregator.

Table 9 Parameter sensitivity of BKGNN-TI w.r.t. aggregator

5.6.4 Cold-Start Experiments with the BKGNN-TI

One of the main purposes of using knowledge graphs in recommender systems is to alleviate the cold-start problem of traditional recommender systems. To verify the performance of BKGNN-TI in the cold-start scenario, the experiments follow KGNN-LS and gradually reduce the size of the training set \(train\_ratio\) from 100\(\%\) to 20\(\%\) while fixing the validation and test sets. The experimental results are shown in Table 10 and Fig. 10. From Table 10, it can be seen that as \(train\_ratio\) decreases, the recommendation accuracy of each model decreases to different degrees. Comparing the results at a \(train\_ratio\) of 20\(\%\) with those on the full training data (i.e., a \(train\_ratio\) of 100\(\%\)), the AUCs of the six baseline models drop by 5.9\(\%\), 5.4\(\%\), 3.6\(\%\), 2.8\(\%\), 4.1\(\%\), and 1.8\(\%\), respectively, while the performance of BKGNN-TI drops by only 1.6\(\%\). This indicates that BKGNN-TI maintains better predictive performance under cold-start conditions, for the following reasons: (1) the KG provides information about item attributes and inter-item relationships, which introduces semantic relevance among items and enhances the item representations; (2) the KG, as a heterogeneous graph, contains various types of relationships, which helps explore user interests; (3) our model introduces program texts that provide specific semantics for the item representations. Introducing both kinds of auxiliary information makes the item representations more complete, and the various complex relationships in the KG strengthen the connections between items. Therefore, when part of the user-item interactions is missing, our model performs better than the baselines.

Table 10 Cold-start experiments with the BKGNN-TI
Fig. 10 Cold-start experiments with the BKGNN-TI

5.7 Explainability of BKGNN-TI

The main goal of recommendation systems is to improve the accuracy of results. However, in real-world scenarios, users may not understand the decisions of the recommender system and may therefore have low trust in it, so recommendation systems need to explain their results [41, 42]. The knowledge graph associates a user's history with the recommendation list, making the recommendation results interpretable. We give an example of the interpretability of the model results using a post hoc interpretation approach [41, 43].

In the IPTV dataset, a real user A was selected who watched a total of 50 movies, including Hotel Transylvania, Battle of Memories, Reborn, and so on. Among these movies, the "Category" attribute labels cover 23 categories, mainly "Action Films", "Comedy", "Storyline", "Crime Film", "Suspense Film", "Adventure Film", "Romantic Movie", and so on. For the 20 movies in the recommendation list, the category labels involve 22 categories, mainly "Romantic Movie", "Adventure Film", "Comedy", etc.; 17 categories overlap between the two parts. The other attribute tags of the movies overlap similarly.

A new user who clicked on the movies "Transformers: Revenge of the Fallen", "Ashes of Time", and "The Grandmaster" is then selected, i.e., the user's interaction history consists of these three movies. The movies recommended to this user by our model include "Buddies in India", "Monster Hunt 2", "Police Story 3", and so on. The knowledge graph with movie attributes is shown in Fig. 11, with green indicating the movies the user has seen, yellow indicating the movies recommended to the user, and blue indicating the movie attribute values; the attribute types include movie category, director, starring actor, distribution country, content provider (abbreviated as ContentProvider), the paid package of the IPTV platform (abbreviated as PackageName), etc. The interpretability of the recommendations is illustrated in Fig. 12. As shown there, Transformers: Revenge of the Fallen, which the user has watched, is an adventure/action movie and belongs to the Basic on-demand service; the recommended movie Buddies in India is also an adventure/action movie in the Basic on-demand service. Monster Hunt 2 shares the starring actor Tony Leung Chiu Wai with The Grandmaster and belongs to the same Basic on-demand service as Transformers: Revenge of the Fallen. All three movies the user watched are action movies; Police Story 3 is also an action movie and shares the lead actress Maggie Cheung with Ashes of Time. The knowledge graph connects items and item attributes, giving the recommendation results a degree of interpretability.

Fig. 11 Film knowledge graph subgraph

Fig. 12 Diagram of recommended interpretability

6 Conclusion

A novel recommendation algorithm termed BKGNN-TI is proposed in this paper. BKGNN-TI bridges two gaps in existing neighborhood-based methods: (1) they use only a single type of item information; (2) they capture no feature interactions between neighboring nodes at the same level, resulting in information loss. BKGNN-TI further enriches the feature representations of items by introducing text semantics and second-order feature interactions between neighboring nodes at the same level of the KG. Specifically, BKGNN-TI emphasizes the importance of second-order feature interactions between neighboring nodes while obtaining the high-order connectivity of the knowledge graph. Through extensive experiments on the supplemented Movielens-20M dataset and the constructed IPTV dataset, BKGNN-TI is shown to outperform the baselines, and the efficacy of the knowledge graph, the bilinear collector, and the item text representation module is demonstrated. Higher-order interactions of item neighbors, beyond second-order feature interactions, should be considered in the future.