1 Introduction

The last decades have witnessed the flourishing of the World Wide Web, facilitating the development of online recommender systems (a.k.a., recommendation). Recommender systems have become essential components of internet applications (e.g., micro-video [1], E-commerce [2], and P2P lending [3]), discovering latent user interests and selecting items of interest for users accurately and in a timely manner based on a user-item historical interaction network. To alleviate the inherent sparsity and cold-start problems of recommender systems, an increasing amount of cutting-edge research has focused on recommendation methods that incorporate auxiliary information, including social networks [4,5,6], tags [7], and multi-modal information [8], to capture deeper features and improve recommendation performance. Knowledge Graphs (KGs) are beneficial for enhancing item features with a large amount of structured attribute information. Unlike classic user-item bipartite graphs and user-user social networks, KGs are composed of a set of triplets, i.e., <head entity, relation, tail entity>. Figure 1 illustrates an example of a music KG, which can describe not only node attributes, such as <Music1, Style, Popular>, but also node relationships, such as <Singer, Friend, Arranger>. Recently, several large-scale KGs have been released, such as Satori, Freebase, and Google’s Knowledge Graph. These KGs can benefit recommender systems by introducing relatedness among entities, which makes it convenient to build KGs for recommendation, enrich entity information, and produce explainability. The enriched information in KGs can also supplement the relational modelling between users and items. Therefore, recommendation methods incorporating KGs are of interest to researchers and can effectively improve recommendation performance.

Fig. 1 A schematic diagram of the music KGs recommendation system

Indeed, research dedicated to knowledge graph recommendation (a.k.a., KGR) has been ongoing for some time. Previous KGs-based recommendation studies aim to obtain high-quality item and KGs entity embeddings by leveraging pre-trained models; classical Knowledge Graph Embedding (KGE) approaches, including TransR [9], TransE [10], and RotatE [11], are remarkable examples. Unfortunately, KGE-based KGR approaches neglect high-order connectivity and collaborative signals, leading to poor performance.

Subsequently, to resolve the natural shortcomings of the KGE approach, researchers have attempted to probe the sophisticated high-order connectivity between items and entities. There are primarily two lines of endeavour: 1) Meta-path-based methods. These methods mainly leverage meta-paths to capture item-entity long-distance connections and entity affiliations to augment the item/user representation. Nevertheless, they heavily rely on hand-designed meta-paths requiring expert knowledge and are therefore difficult to optimize in a practical training process. 2) Graph Neural Networks (GNNs)-based methods. These approaches iteratively propagate over the whole KGs to gather side information for recommendation. GNNs provide a widespread neighbour aggregation strategy used to integrate multi-hop KGs entity node features into the target user/item representation. We must acknowledge that existing GNNs-based KGR methods achieve excellent performance. However, can current GNNs-based KGR methods achieve adequate awareness of high-order feature interaction signals and users’ latent intent information?

Motivated by the above questions, we rethink the shortcomings and improvement goals of existing GNNs-based KGR methods. We believe that an effective direction for GNNs-based KGR methods is to couple high-order feature interaction and user intent modelling as a whole, based on which the knowledge-enhanced method can be fully investigated. Nevertheless, fully exploiting high-order feature interaction signals and modelling latent intent information is by no means an easy task. To build an end-to-end KGR framework via high-order feature interaction and latent intent modelling, two issues inevitably need to be tackled:

  • How can an effective high-order feature interaction paradigm be designed? In existing studies, high-order features from GNNs aggregation are generally combined via concatenation, pooling, or summation without explicitly modelling their interactions. Such high-order aggregation mechanisms easily lead to the over-smoothing issue. Besides, no further valuable feature information can be encoded, which significantly limits model performance. Intuitively, adequately modelling the fine-grained feature interaction signals among high-order features has profound implications for enriching node representation learning.

  • How can a user’s latent intent signal be fully captured? In the real world, users’ intents are sophisticated and diversified, driving users to consume different items. The intent behind a user-item interaction offers a deep understanding of user preferences. Existing KGR studies rarely consider underlying user intent modelling, which makes the trained models uninterpretable and leads to unsatisfactory model performance.

Consequently, to solve the above two issues, we propose a novel Knowledge-enhanced Recommendation with Feature Interaction and Intent-aware Attention Networks (FIRE) to address the latent intent modelling and high-order feature interaction deficiencies ignored by existing KGR methods. Initially, we adopt a GNNs-based knowledge-aware backbone network to generate the user/item prototype representations. Next, to combat the first issue raised above, we innovatively use Convolutional Neural Networks (CNNs) that perform vertical convolutions (a.k.a., bit-level convolutions) and horizontal convolutions (a.k.a., vector-level convolutions) to model multi-granular high-order feature interactions and enhance item-side representation learning. For the second issue, we use a two-level attention mechanism (i.e., a node-level attention mechanism and an intent-level attention mechanism) to model the latent intent embedding and enhance user-side representation learning. Finally, all user-side/item-side representations are integrated, and inner product operations are performed to output prediction scores.

Overall, the contributions of our FIRE framework are three-fold:

  • A novel high-order feature interaction paradigm: To the best of our knowledge, our work is the first attempt to incorporate high-order feature interaction techniques into the knowledge-aware recommendation task. Concretely, to enhance item-side representations, we highlight the critical importance of explicitly exploiting feature interaction methods in KGs-based GNNs recommendation. We propose a novel CNNs-based high-order feature interaction strategy to extract fine-grained interaction information, which enriches the item-side node representation learning capability.

  • Comprehensive modelling of users’ latent intent signals: To enhance user-side representations, we propose a new approach that models a user’s latent intent by leveraging a two-level attention mechanism, enriching the node representation learning capability on the user side.

  • Extensive experiments: We prepare three real-world datasets to evaluate our model. The empirical results demonstrate the effectiveness of our FIRE framework for KGs-based recommendation and show its superiority over the current state-of-the-art baselines. Besides, numerous ablation and model studies demystify the working mechanism and elucidate the plausibility of our FIRE model.

The rest of our paper is organized as follows. We summarize the related work in Section 2. Then, we briefly outline our task in Section 3. Based on this, we give a detailed description of our method in Section 4. In addition, the proposed model is analyzed and discussed in depth. In Section 5, a series of experiments on real-world KGs-based recommendation data are conducted, and the results are discussed in detail. Finally, a brief conclusion and future work are given in Section 6.

2 Related work

In this section, we review the work most relevant to the proposed approach from three technical perspectives: 1) KGs-based recommendation, 2) Feature interaction methods in recommendation, and 3) Disentangled representation learning (intent modelling) methods in recommendation. We summarize each line of work at the end of its subsection and briefly explain how it differs from our proposed method.

2.1 KGs-based recommendation methods

In the early stages of KGR research, related work focused on embedding-based techniques. Entities and relations in KGs are used as supplementary information for users and items in recommendation tasks. To fully utilize the KGs information, the Knowledge Graph Embedding (KGE) technique is used to encode the entities and relations of the KGs as low-rank embeddings. Mainstream KGE algorithms are based on translation models (e.g., TransE, TransR, and RotatE). Based on the above methods, a global graph representation of the user/item can be obtained. For instance, CKE [12] utilizes multimodal information as item-side supplemental information and learns item representations via TransR. DKN [13] treats contextual and word embedding information in news as side information and employs a multi-channel approach to generate news representations via TransD. However, embedding-based KGs recommendation approaches neglect high-order connectivity and fail to adequately capture item-side high-order attribute embeddings.

Next, to pay more attention to the high-order connectivity problem in KGR, researchers have advanced path-based recommendation methods. The path-based approach aims at path representation utilizing high-order entity connectivity patterns in heterogeneous information networks. A meta-path is a relation sequence (i.e., \(\mathcal {P}=A_{0} \overset {{R}_{1}}{\rightarrow } A_{1} \overset {{R}_{2}}{\rightarrow } {\ldots } \overset {{R}_{k}}{\rightarrow } A_{k}\)) connecting object pairs in a Heterogeneous Information Network (HIN). It can be used to extract connectivity features in the graph, accounting for long-range connectivity by extracting paths that connect the target user and item nodes via KGs entities, as in PER [14], McRec [15], and HERec [16]. However, mainstream path-based methods suffer from some inherent limitations: 1) Brute-force searches tend to lead to labour-intensive and time-consuming feature engineering when large-scale graphs are involved; 2) Experts are needed to define domain knowledge. Path-based methods therefore often result in difficult-to-train models and poor performance.

Furthermore, the rise of GNNs-based technology offers the possibility of exploring long-range connectivity in recommendation. GNNs iteratively execute a propagation mechanism to capture the high-order semantic information of target nodes on KGs, thus updating the high-order embeddings of the target nodes. For instance, KGAT [17] proposes a Collaborative Knowledge Graph (CKG) that combines users, items, and entities, and recursively performs propagation over the CKG via Graph Attention Networks (GATs [18]) to enrich entity embeddings. KGCN [19] recursively performs propagation over KGs via GNNs to enhance item-side entity embeddings. CKAN [20] utilizes a heterogeneous propagation strategy, which enables simultaneous augmentation of user- and item-side representations via GATs. KGIN [21] utilizes intent- and relation-aware mechanisms to model user/item representations in KGR. Nevertheless, mainstream GNNs-based methods suffer from two inherent limitations: 1) Neglecting fine-grained feature interactions and user intent modelling; 2) An unavoidable over-smoothing phenomenon.

Summary

As mentioned in the introduction, existing GNNs-based KGR models show strong dominance. Yet, the lack of fine-grained feature interactions and the absence of an explicit intent modelling mechanism limit recommendation performance to some extent. Our work adopts a GNNs-based model as a backbone network to explore the issues of high-order feature interactions and intent-aware modelling in KGR models, which is the primary focus of our model.

2.2 Feature interaction technique in recommendations

Feature interaction (a.k.a., feature combination) techniques have been successful in the field of Click-Through Rate (CTR) prediction, as they can fully extract explicit as well as implicit interactions between high-order features. Combining feature interactions with deep neural networks in end-to-end models has become a mainstream approach for CTR, which enhances the nonlinear capability of models and is significant for increasing prediction accuracy. DeepFM [22] is a feature interaction model based on deep neural networks and factorization machines, which can effectively model the interactions of low-order and high-order features. xDeepFM [23] proposes a Compressed Interaction Network (CIN), which aims to learn arbitrary low-order and high-order feature interactions. FINT [24] proposes a feature interaction model that performs high-order feature interaction while preserving semantic information at the field level. CAN [25] proposes a method for modelling feature interactions utilizing a co-action network.

Afterwards, with the rise of GNNs technology, researchers have attempted to utilize GNNs to deal with feature interactions. Fi-GNN [26] proposes a graph-based feature interaction method that models high-order features as nodes on a graph to simulate complex high-order feature interactions. L0-SIGN [27] is a graph-based feature interaction model that proposes an L0-regularization-based method to preserve useful feature interactions and filter irrelevant ones in the feature graph.

Summary

In this paper, inspired by the great progress made in the feature interaction paradigm in the CTR task as well as in the sequence model [28], we introduce high-order feature interaction techniques into the KGR task to model fine-grained feature interactions. To the best of our knowledge, our work is the first attempt to incorporate high-order feature interaction techniques into the KGR task. This is one of the contributions and novelties of our work.

2.3 Disentangled representation learning (intent modelling) technique in recommendations

In the real world, the construction of graphs (e.g., social networks and user-item bipartite graphs) often results from highly complex factors. Despite the boom in graph machine learning, deep learning techniques often ignore the latent factors behind these interactions. Indirectly, this leads to poor model robustness, neglected interpretability, and unsatisfactory performance. The idea of disentangled representation learning originated in the Capsule Network, whose core algorithm is the neighbour routing mechanism. DGCF [29] combines disentangled representation learning with collaborative filtering. IPREC [30] adopts a novel package recommendation framework that considers user latent intent modelling via an attention mechanism. GNUD [31] is a news recommendation framework that combines disentangled representation learning with news recommendation, where a neighbour routing algorithm is applied. MIDGN [32] proposes a multi-view intent-disentangled GNNs-based bundle recommendation model. IDS4NR [33] proposes a novel intent-disentangled recommendation model based on item popularity and user preference perspectives. DisenHAN [34] proposes a recommendation method for disentangling user intent in heterogeneous information networks. However, little research has explored the great potential of intent modelling for KGR.

Summary

In this paper, benefiting from the success of disentangled representation learning and intent modelling, we consider the introduction of an intent-aware technique based on a two-level attention mechanism for modelling fine-grained user intents in KGR tasks. This is another of the contributions and novelties of our work.

3 Problem formulation

We have the following definition and description of the KGs-based recommendation task.

User-Item interaction data

In a classic recommendation scenario, we have a set of M users \(U=\left \{u_{1}, u_{2}, \ldots , u_{M}\right \}\) and a set of N items \(V=\left \{v_{1}, v_{2}, \ldots , v_{N}\right \}\). The user-item interaction matrix Y is defined according to user-item (u, v) implicit feedback as follows:

$$ y_{u v}=\left\{\begin{array}{ll} 1 & \text{if interaction } (u, v) \text{ is observed} \\ 0 & \text{otherwise.} \end{array}\right. $$
(1)

Knowledge graph

We have a knowledge graph \(G=(\mathcal {E}, \mathcal {R})\), which is an undirected graph composed of entity-relation-entity triples (h,r,t), where \(h, t \in \mathcal {E}\) and \(r \in \mathcal {R}\) denote the head entity, tail entity, and KGs-relation of a knowledge triple, and \(\mathcal {E}\) and \(\mathcal {R}\) are the sets of entities and relations in the KGs, respectively. In addition, we define an item-entity alignment set \(A=\{(v, e) \mid v \in V, e \in \mathcal {E}\}\) that is designed to align items between the user-item interaction matrix and the KGs. Ultimately, we aim to learn a matching function \(\tilde {y}_{u v}=\mathcal {F}(u, v \mid {\Theta }, \mathbf {Y}, G)\), where \(\tilde {y}_{u v}\) denotes the probability that user u will match with item v, and Θ is the set of model parameters (a.k.a., configuration).
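To make the formulation concrete, the minimal Python sketch below shows one way the inputs defined above could be represented; the variable names and toy values are our own illustrations, not part of the paper.

```python
import numpy as np

# Observed implicit-feedback pairs (u, v) for M users and N items (toy data).
interactions = {(0, 2), (0, 5), (1, 3)}
M, N = 2, 6

# Eq. (1): the binary user-item interaction matrix Y.
Y = np.zeros((M, N), dtype=np.int64)
for u, v in interactions:
    Y[u, v] = 1

# The KG as (head, relation, tail) triples, plus the item-entity
# alignment set A = {(v, e)} stored here as a mapping.
kg_triples = [(7, 0, 8), (8, 1, 9)]   # entity/relation ids
alignment = {2: 7, 3: 8, 5: 9}        # item id -> aligned entity id
```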

Task description

We formalize the recommendation task of this paper as follows.

  • Input: User-Item interaction matrix Y, knowledge graph G, and the model parameter set Θ.

  • Output: The probability that a user interacts with the item \(\tilde {y}_{u v}\).

In addition, the important symbols involved in this paper are listed in Table 1.

Table 1 The important symbols and definitions in this paper

4 Method

In this section, we introduce the proposed FIRE. The framework is shown in Fig. 2. Precisely, FIRE consists of three modules: 1) the Attentive Propagation Layer, which generates high-order propagation embeddings of the target user and target item; 2) the Feature Interaction (Bi-CNNs) module, which integrates the high-order representations computed for the target item into a 2-D matrix and fuses and interacts features with convolutional kernels of two granularities, thus obtaining a feature-enhanced item embedding; 3) the Intent Disentangled Module, which utilizes two-level attention mechanisms to model the latent intent behind user-item interactions and obtain an intent-enhanced user embedding.

Fig. 2 The framework of the proposed FIRE. The framework is composed of three crucial components: 1) Knowledge-aware attention network (backbone network), which yields the user/item embedding eu/v via the GNNs-based attentive propagation layer. 2) Feature interaction module (Bi-CNNs), which extracts local and global feature interaction signals from the high-order feature matrix (see Fig. 3 for Bi-CNNs details). 3) Intent Disentangled module, which extracts fine-grained latent intent factors via node- and intent-level attention mechanisms. Best viewed in color

4.1 Backbone network

FIRE relies on GATs to capture high-order neighbour information, following CKAN [20] and RippleNet [35]. The receptive field R is defined as follows:

$$ \mathrm{R}_{u}^{0}=\left\{e \mid v \in\left\{v \mid y_{u v}=1\right\} \wedge(v, e) \in A\right\} $$
(2)
$$ \mathrm{R}_{u}^{l}=\left\{t \mid(h, r, t) \in G \wedge h \in \mathrm{R}_{u}^{l-1}\right\} $$
(3)
$$ \mathrm{R}_{v}^{0}=\left\{e \mid v^{*} \in\left\{v^{*} \mid \exists u \in U, y_{u v^{*}}=1 \wedge y_{u v}=1\right\} \wedge\left( v^{*}, e\right) \in A\right\} $$
(4)

Here, l={1,2,…,L}. We describe the user representation learning process in detail; because item representation learning is a dual process, we omit it for brevity. In addition, we need to define the set of l-th-order knowledge affiliation triples, taking the user as an example. The size of the triple set \(\mathcal {T}\) directly determines the number of associated high-order entities.

$$ \mathcal{T}_{u}^{l}=\left\{(h, r, t) \mid(h, r, t) \in \mathcal{G} \text { and } h \in \mathrm{R}_{u}^{l-1}\right\} $$
(5)
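To make (2)-(5) concrete, the sketch below computes a user's seed entities and layer-wise triple sets by breadth-first expansion over the KG. The names (Y_pos mapping each user to interacted items, alignment mapping items to entities, build_kg_index) are our own illustrative choices, not the paper's code.

```python
from collections import defaultdict

def build_kg_index(kg_triples):
    # Index the KG triples by head entity for fast neighbourhood lookup.
    index = defaultdict(list)
    for h, r, t in kg_triples:
        index[h].append((h, r, t))
    return index

def user_receptive_fields(user, Y_pos, alignment, kg_index, L):
    """Eqs. (2), (3), (5): seed set R_u^0 and the l-hop triple sets T_u^l."""
    # R_u^0: entities aligned with the items the user has interacted with.
    seeds = {alignment[v] for v in Y_pos[user] if v in alignment}
    triple_sets, frontier = [], seeds
    for _ in range(L):
        # T_u^l: all triples whose head lies in the previous receptive field.
        triples = [trip for h in frontier for trip in kg_index[h]]
        triple_sets.append(triples)
        frontier = {t for _, _, t in triples}  # R_u^l: the new tail entities
    return seeds, triple_sets
```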

Next, first-order propagation on the KGs is used as a case to demonstrate in detail the calculation of high-order embeddings via the knowledge propagation attention mechanism as follows:

$$ \pi_{i}=\text{softmax}\left( f_{\text{neural}}\left( \boldsymbol{e}_{i}^{h} \oplus \boldsymbol{e}_{i}^{r}\right)\right) $$
(6)

Here, \(i=1,2,\ldots ,\left \vert \mathcal {T}_{u}^{l}\right \vert\), \(f_{\text{neural}}(\cdot)\) is a 3-layer feedforward neural network, and ⊕ is the vector concatenation operation; the softmax is taken over all triples in the set. Then, we can obtain the user embedding after first-layer propagation as follows:

$$ \boldsymbol{e}_{u}^{1}=\sum\limits_{i=1}^{\left\vert\mathcal{T}_{u}^{1}\right\vert} \pi_{i} \cdot \boldsymbol{e}_{i}^{t} $$
(7)

In addition, we introduce the initial propagation embeddings \(\boldsymbol {e}_{u}^{0}, \boldsymbol {e}_{v}^{0}\) and the item initial embedding \(\boldsymbol {e}_{v}^{i n i t}\) as follows:

$$ \boldsymbol{e}_{u}^{0}=\frac{1}{\left\vert\mathrm{R}_{u}^{0}\right\vert} \sum\limits_{e \in \mathrm{R}_{u}^{0}} \boldsymbol{e}, \boldsymbol{e}_{v}^{0}=\frac{1}{\left\vert\mathrm{R}_{v}^{0}\right\vert} \sum\limits_{e \in \mathrm{R}_{v}^{0}} \boldsymbol{e} $$
(8)
$$ \boldsymbol{e}_{v}^{i n i t}=\frac{1}{\vert\{e \mid(v, e) \in A\}\vert} \sum\limits_{\{e \mid(v, e) \in A\}} \boldsymbol{e} $$
(9)

By analogy, we can obtain both user and item high-order embeddings:

$$ \left\{\boldsymbol{e}_{u}^{0}, \boldsymbol{e}_{u}^{1}, \ldots, \boldsymbol{e}_{u}^{L}\right\},\left\{\boldsymbol{e}_{v}^{i n i t}, \boldsymbol{e}_{v}^{0}, \boldsymbol{e}_{v}^{1}, \ldots, \boldsymbol{e}_{v}^{L}\right\} $$
(10)
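The following PyTorch sketch shows one reading of a single propagation layer: (8)-(9) average the seed-entity embeddings, and (6)-(7) attentively aggregate a triple set. The class and tensor names are ours, and the triple embeddings are assumed to be pre-gathered into [num_triples, dim] tensors; this is an illustration, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def initial_embedding(seed_entities, entity_emb):
    # Eqs. (8)-(9): the layer-0 embedding is the mean of seed-entity embeddings.
    return entity_emb[list(seed_entities)].mean(dim=0)

class AttentivePropagation(nn.Module):
    """Sketch of one knowledge-propagation layer (Eqs. (6)-(7))."""
    def __init__(self, dim):
        super().__init__()
        # f_neural: a 3-layer feedforward network scoring each (h, r) pair.
        self.f_neural = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, e_h, e_r, e_t):
        # e_h, e_r, e_t: [num_triples, dim] embeddings of one triple set.
        logits = self.f_neural(torch.cat([e_h, e_r], dim=-1))  # Eq. (6)
        pi = F.softmax(logits, dim=0)                          # over the set
        return (pi * e_t).sum(dim=0)                           # Eq. (7)
```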

To aggregate high-order embeddings, we implement three aggregators: sum, concat, and maxpooling. Thus, we obtain the user and item integrated representations eu, ev. For convenience, we replace the user/item with the uniform symbol o.

$$ agg_{\text{sum}}=\boldsymbol{e}_{o}^{0}+\boldsymbol{e}_{o}^{1}+\ldots+\boldsymbol{e}_{o}^{L} $$
(11)
$$ agg_{\text{concat}}=\boldsymbol{e}_{o}^{0}\oplus\boldsymbol{e}_{o}^{1}\oplus\ldots\oplus\boldsymbol{e}_{o}^{L} $$
(12)
$$ agg_{\text{maxpooling}}=\text{maxpooling}\left\{\boldsymbol{e}_{o}^{0}, \boldsymbol{e}_{o}^{1}, \ldots, \boldsymbol{e}_{o}^{L}\right\} $$
(13)

where + is the vector summation operator, ⊕ is the vector concatenation operator, and maxpooling{⋅} is the vector maxpooling function.
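A small sketch of the three aggregators in (11)-(13), under our own function name:

```python
import torch

def aggregate(layer_embs, mode="sum"):
    """Eqs. (11)-(13): combine the per-layer embeddings {e_o^0, ..., e_o^L}."""
    stacked = torch.stack(layer_embs, dim=0)   # [L+1, d]
    if mode == "sum":
        return stacked.sum(dim=0)              # Eq. (11)
    if mode == "concat":
        return torch.cat(layer_embs, dim=-1)   # Eq. (12)
    if mode == "maxpooling":
        return stacked.max(dim=0).values       # Eq. (13), element-wise max
    raise ValueError(f"unknown aggregator: {mode}")
```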

4.2 Feature interaction

Classic GNNs-based KGs recommendation frameworks, e.g., KGCN, RippleNet, and KGIN, merely aggregate high-order embeddings by concat, mean, sum, or maxpooling, while fine-grained feature interactions are neglected. We believe that ignoring high-order feature interactions is a serious omission, as they are crucial for recommendation. Convolutional Neural Networks (CNNs) and related variants have recently witnessed breakthroughs in computer vision and natural language processing and have proven successful at extracting both local and global features. We innovatively employ CNNs as feature aggregators to process the 2-D high-order embedding matrix \(\textit {\textbf {M}}_{v} \in \mathbb {R}^{(L+2) \times \mathrm {\textit {d}}}\). On the item side, we adopt two convolutional kernels (a horizontal convolutional kernel and a vertical convolutional kernel) to extract feature interaction signals, and the module is named Bi-CNNs.

More precisely, the vertical convolutional kernel \(\boldsymbol {V}^{t} \in \mathbb {R}^{(L+2) \times 1}\) slides column-wise over matrix Mv to extract fixed-dimension interactions among high-order features, called the bit-level feature interaction mode. Similarly, a horizontal convolutional kernel \(\boldsymbol {H}^{t} \in \mathbb {R}^{h \times d}\) slides row-wise over matrix Mv to extract interaction signals between neighbouring high-order features, called the vector-level feature interaction mode. In addition, to extract richer feature interaction information, we follow the concept of the multi-head mechanism [18] in GATs and stack several convolutional kernels for both convolutional modes.

Initially, the high-order embeddings are stacked into a 2-D matrix, formalized as follows.

$$ \boldsymbol{M}_{v}=\left[\begin{array}{c} \boldsymbol{e}_{v}^{i n i t} \\ \boldsymbol{e}_{v}^{0} \\ {\cdots} \\ \boldsymbol{e}_{v}^{L} \end{array}\right]_{\mathbb{R}^{(L+2) \times d}} $$
(14)

Vector-level feature interaction mode

As shown in Fig. 3, the upper part of the Bi-CNNs module depicts the working mechanism of the horizontal convolutional kernel, denoted as \(\boldsymbol {H}^{t} \in \mathbb {R}^{h \times d}\), for extracting neighbouring-order feature interaction signals. As mentioned previously, following the multi-head mechanism, several convolutional kernels are required to extract more feature interaction signals. Hence, \(t \in [1, \tilde {n}]\), and h ∈ {1,2,…,(L + 2)} is the height of the horizontal convolutional kernel. The i-th convolutional value \(\tilde {\boldsymbol {c}}_{i}^{t}\) is computed as follows:

$$ \tilde{\boldsymbol{c}}_{i}^{t}=\text{ReLU}\left( \boldsymbol{M}_{v}[i: i+h-1,:] \odot \boldsymbol{H}^{t}\right) $$
(15)

where ⊙ denotes the inner product operator. Thus, the horizontal convolutional result \(\tilde {\boldsymbol {c}}^{t} \in \mathbb {R}^{(L+2)-h+1}\) is:

$$ \tilde{\boldsymbol{c}}^{t}=\left[\tilde{c}_{1}^{t}, \tilde{c}_{2}^{t}, \ldots, \tilde{c}_{i}^{t},\ldots,\tilde{c}_{((L+2)-h+1)}^{t}\right] $$
(16)

The vector-level feature interactions produced by the horizontal convolutional kernels inevitably contain overlapping and redundant interaction information. Hence, we apply a maxpooling operation, and the output vector \(\boldsymbol {o}_{h} \in \mathbb {R}^{\tilde {n} \times 1}\) for the \(\tilde {n}\) kernels is denoted as:

$$ \boldsymbol{o}_{h}=\left[\max \left( \tilde{\boldsymbol{c}}^{1}\right), \max \left( \tilde{\boldsymbol{c}}^{2}\right), \ldots, \max \left( \tilde{\boldsymbol{c}}^{\tilde{n}}\right)\right] $$
(17)
Fig. 3 Bi-CNNs employ two kinds of convolutional kernels (horizontal and vertical) to process high-order feature interactions for local and global feature interactions. Specifically, the order L = 2, the number of horizontal convolutional kernels \(\tilde {n}\) = 2, the number of vertical convolutional kernels n = 2, and the embedding dimension d = 3. Best viewed in color

Bit-level feature interaction mode

To extract significant feature interactions from a fixed-dimension viewpoint, we employ a vertical convolutional kernel \(\boldsymbol {V}^{t} \in \mathbb {R}^{(L+2) \times 1}\) to extract bit-level feature interactions. As shown in Fig. 3, the lower part of the Bi-CNNs layer depicts the working mechanism of the vertical convolutional kernel. The vertical convolutional kernel Vt covers the 2-D high-order feature matrix Mv and slides along the embedding dimension.

Similar to the horizontal convolutional kernel, the i-th convolutional value \(\boldsymbol {c}_{i}^{t}\) is denoted as:

$$ \boldsymbol{c}_{i}^{t}=\text{ReLU}\left( \boldsymbol{M}_{v}[:, i] \odot \boldsymbol{V}^{t}\right) $$
(18)

where \(\boldsymbol {V}^{t} \in \mathbb {R}^{(L+2) \times 1}\) and \(t \in [1, n]\). Thus, the vertical convolutional result \(\boldsymbol {c}^{t} \in \mathbb {R}^{d}\) is:

$$ \boldsymbol{c}^{t}=\left[c_{1}^{t}, {c_{2}^{t}}, \ldots, {c_{i}^{t}},\ldots, {c_{d}^{t}}\right] $$
(19)

where d denotes the embedding dimension.

In particular, the vertical convolutional interaction result equals the weighted sum over the (L + 2) rows of Mv, weighted by the vertical convolutional kernel Vt:

$$ \boldsymbol{c}^{t}=\sum\limits_{l=1}^{L+2} \boldsymbol{V}^{\textit{t}}[l,:] \cdot \boldsymbol{M}_{v}[l,:] $$
(20)

Here, l ∈ [1,(L + 2)] and ⋅ denotes scalar multiplication.

We stack n vertical convolutional kernels. In contrast to the horizontal convolutional kernel processing, we aim to maximally retain the bit-level feature interaction information in each dimension. Hence, we concatenate the outputs of the n vertical convolutional kernels in sequence. The output vector \(\boldsymbol {o}_{v} \in \mathbb {R}^{dn}\) is denoted as:

$$ \boldsymbol{o}_{v}=\left[\boldsymbol{c}^{1}\oplus \boldsymbol{c}^{2}\oplus \ldots\oplus \boldsymbol{c}^{n}\right] $$
(21)

where ⊕ denotes the concatenation operator.

Dense layer

We concatenate the above two convolutional output vectors, feed them into the dense layer to extract global interaction features, and output the feature-enhanced item convolutional embedding zv as follows:

$$ \mathbf{z}_{v}=\varphi\left( \mathbf{W} \cdot \left[\begin{array}{l} \boldsymbol{o}_{v} \\ \boldsymbol{o}_{h} \end{array}\right]\right) $$
(22)

where \(\mathbf {W}\in \mathbb {R}^{d \times (\tilde {n}+dn)}\) is the transformation matrix, the convolutional embedding \(\mathbf {z}_{v} \in \mathbb {R}^{d \times 1}\), and φ(⋅) is the sigmoid function. Then the final representation \(\tilde {\boldsymbol {e}}_{v}\) is:

$$ \tilde{\boldsymbol{e}}_{v}=\boldsymbol{e}_{v}+\mathbf{z}_{v} $$
(23)

In summary, we innovatively employ CNNs and dense networks (layers) to extract feature interaction signals, with the following advantages: 1) The horizontal convolutional mode primarily extracts feature interaction signals between adjacent high-order features, i.e., overall feature interactions between vectors. 2) The vertical convolutional mode extracts fine-grained feature interaction signals in each dimension of all features. 3) Similar to the multi-head mechanism, several convolutional kernels of each type are used to extract more feature interaction signals. 4) The local feature interaction signals generated by the two convolutional modes are recombined and fed into a dense network to learn advanced global feature interaction signals. A code sketch of the module is given below.
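Putting (14)-(23) together, the following PyTorch sketch shows one way the Bi-CNNs module could be realized. The names (BiCNNs, n_h for the \(\tilde{n}\) horizontal kernels, n_v for the n vertical kernels, h for the horizontal kernel height) are our own, the default sizes are merely example values, and the sketch processes a single item for clarity; it illustrates our reading of the equations, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiCNNs(nn.Module):
    """Sketch of the Bi-CNNs feature interaction module (Eqs. (14)-(23))."""
    def __init__(self, L, d, n_h=4, n_v=2, h=2):
        super().__init__()
        rows = L + 2
        # Horizontal kernels H^t in R^{h x d}: vector-level interactions.
        self.horizontal = nn.Conv2d(1, n_h, kernel_size=(h, d))
        # Vertical kernels V^t in R^{(L+2) x 1}: bit-level interactions.
        self.vertical = nn.Conv2d(1, n_v, kernel_size=(rows, 1))
        # Dense layer W in R^{d x (n_h + d * n_v)} used in Eq. (22).
        self.dense = nn.Linear(n_h + d * n_v, d, bias=False)

    def forward(self, M_v):
        # M_v: [L+2, d] stacked high-order embeddings of one item (Eq. (14)).
        x = M_v.unsqueeze(0).unsqueeze(0)             # [1, 1, L+2, d]
        # Eqs. (15)-(17): convolve, then maxpool each horizontal feature map.
        c_h = F.relu(self.horizontal(x)).squeeze(-1)  # [1, n_h, L+2-h+1]
        o_h = c_h.max(dim=-1).values.flatten()        # o_h in R^{n_h}
        # Eqs. (18)-(21): vertical maps are kept whole and concatenated.
        c_v = F.relu(self.vertical(x)).squeeze(2)     # [1, n_v, d]
        o_v = c_v.flatten()                           # o_v in R^{d * n_v}
        # Eq. (22): dense layer + sigmoid yields the convolutional embedding.
        z_v = torch.sigmoid(self.dense(torch.cat([o_v, o_h])))
        return z_v                                    # added to e_v in Eq. (23)
```

For example, with L = 2 and d = 16, BiCNNs(L=2, d=16)(torch.randn(4, 16)) returns a 16-dimensional z_v that is then added to e_v as in (23).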

4.3 Intent disentangled module

The construction of a real-world user-item interaction graph often results from a highly complex interplay of many latent factors. Existing deep learning techniques consider graph interactions holistically and rarely consider the entanglement of latent factors, causing the learned embeddings to be flawed for downstream tasks and uninterpretable. Figure 4 illustrates how latent intents drive user decisions and motivations. Thus, we first disentangle the user and item representations into different spaces and then let the user interact with the item in the same space to model the complex intent behind the user-item interaction:

$$ \boldsymbol{e}_{u^{s}}=\mathbf{W}^{s} \boldsymbol{e}_{u}, \quad \boldsymbol{e}_{i_{x}^{s}}=\mathbf{W}^{s} \boldsymbol{e}_{i_{x}} $$
(24)

where \(\boldsymbol {e}_{u^{s}}\) and \(\boldsymbol {e}_{i_{x}^{s}}\) represent the corresponding embeddings in the s-th disentangled space, s ∈ [1,S], and Ws denotes the disentangling matrix. \(O^{history}=\left \{i_{1}, i_{2}, \ldots , i_{x}\right \}\) is defined as the set of items with which the target user has interacted. Afterwards, we integrate the embeddings for each intent via the node-level attention mechanism as follows:

$$ \boldsymbol{f}^{s}=\sum\limits_{i_{x} \in O^{history}} {\alpha_{x}^{s}} \boldsymbol{e}_{{i_{x}^{s}}} $$
(25)
$$ {\alpha_{x}^{s}}=\frac{\exp \left( \mathbf{q}^{\mathrm{T}} \cdot \tanh \left( \mathbf{W}\left[\boldsymbol{e}_{u^{s}} \oplus \boldsymbol{e}_{i_{x}^{s}}\right]\right)\right)}{{\sum}_{i_{x^{\prime}} \in O^{history}} \exp \left( \mathbf{q}^{\mathrm{T}} \cdot \tanh \left( \mathbf{W}\left[\boldsymbol{e}_{u^{s}} \oplus \boldsymbol{e}_{i_{x^{\prime}}^{s}}\right]\right)\right)} $$
(26)

where \(\mathbf {q} \in \mathbb {R}^{d}\) and \(\mathbf {W} \in \mathbb {R}^{d \times 2 d}\) are the trainable parameters of the attention mechanism. In essence, the attention weight \({\alpha _{x}^{s}}\) captures the target user's preference for an item in a particular intent disentangled space. Intuitively, user preferences vary across intents, motivating us to further combine the influence of the S disentangled spaces with an intent-level attention mechanism:

$$ \boldsymbol{f}=\sum\limits_{s=1}^{S} \beta^{s} \boldsymbol{f}^{s} $$
(27)
$$ \beta^{s}=\frac{\exp \left( \mathbf{q}^{\mathrm{T}} \cdot \tanh \left( \mathbf{W}\left[\boldsymbol{e}_{u} \oplus \boldsymbol{f}^{s}\right]\right)\right)}{{\sum}_{s^{\prime}=1}^{S} \exp \left( \mathbf{q}^{\mathrm{T}} \cdot \tanh \left( \mathbf{W}\left[\boldsymbol{e}_{u} \oplus \boldsymbol{f}^{s^{\prime}}\right]\right)\right)} $$
(28)

where \(\boldsymbol {f} \in \mathbb {R}^{d}\) is the final embedding encoding the complex intent influence for the user. Then the final representation \(\tilde {\boldsymbol {e}}_{u}\) is:

$$ \tilde{\boldsymbol{e}}_{u}=\boldsymbol{e}_{u}+\boldsymbol{f} $$
(29)
Fig. 4 There are complex and diverse intents behind user-item interactions, and specific users display different intents and interests when confronted with a particular item. Best viewed in color
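A PyTorch sketch of the intent module ((24)-(29)) follows. We assume a shared projection matrix W^s per intent space and a shared attention scorer (q, W) across the two levels, as the shared symbols in (26) and (28) suggest; all module and tensor names are ours, and the code handles one user for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntentDisentangled(nn.Module):
    """Sketch of the two-level intent-aware attention (Eqs. (24)-(29))."""
    def __init__(self, d, S):
        super().__init__()
        self.S = S
        self.W_s = nn.Parameter(torch.randn(S, d, d) * 0.01)  # Eq. (24)
        self.att = nn.Linear(2 * d, d, bias=False)            # W in (26)/(28)
        self.q = nn.Parameter(torch.randn(d))                 # q in (26)/(28)

    def score(self, left, right):
        # q^T tanh(W [left || right]) for a batch of concatenated pairs.
        return torch.tanh(self.att(torch.cat([left, right], dim=-1))) @ self.q

    def forward(self, e_u, hist_items):
        # e_u: [d]; hist_items: [X, d] embeddings of interacted items.
        per_intent = []
        for s in range(self.S):
            u_s = self.W_s[s] @ e_u                  # project user into space s
            i_s = hist_items @ self.W_s[s].T         # project history items
            alpha = F.softmax(self.score(u_s.expand_as(i_s), i_s), dim=0)
            per_intent.append((alpha.unsqueeze(-1) * i_s).sum(0))  # Eq. (25)
        f_s = torch.stack(per_intent)                # [S, d]
        beta = F.softmax(self.score(e_u.expand_as(f_s), f_s), dim=0)  # Eq. (28)
        f = (beta.unsqueeze(-1) * f_s).sum(0)        # Eq. (27)
        return e_u + f                               # Eq. (29)
```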

4.4 Model prediction and optimization

Once the user/item prototype representations eu, ev are obtained via the knowledge-aware attention networks (cf. (11)-(13)), we endow the model with high-order feature interaction (cf. (23)) and intent-awareness (cf. (29)).

Next, the interaction probability (matching score) \(\tilde {y}_{u v}\) between the target user u and the target item v is calculated by the inner product as follows:

$$ \tilde{y}_{u v}=\sigma\left( \tilde{\boldsymbol{e}}_{u}^{\mathrm{T}} \cdot \tilde{\boldsymbol{e}}_{v}\right) $$
(30)

Here, σ(⋅) is the sigmoid function.

To ensure the effectiveness of model training and to improve training efficiency, we adopt the negative sampling strategy, sampling the same number of negative samples for each user. Ultimately, the loss function of FIRE is defined as follows.

$$ L=\sum\limits_{u \in U}\left( \sum\limits_{v \in\left\{v \mid (u, v) \in \gamma^{+}\right\}} {\Gamma}\left( y_{u v}, \tilde{y}_{u v}\right) - \sum\limits_{v \in\left\{v \mid (u, v) \in \gamma^{-}\right\}} {\Gamma}\left( y_{u v}, \tilde{y}_{u v}\right)\right) + \lambda\|{\Theta}\|_{2}^{2} $$
(31)

where Γ(⋅) is the cross-entropy loss function, γ+ denotes the set of positive user-item pairs, and γ− denotes the set of negative user-item pairs. λ is the L2-regularization coefficient for reducing overfitting, and Θ denotes the model parameter set (model configuration). Finally, we compute the loss and adopt Adam optimization [36] to optimize our model parameters.
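Since Γ is the cross-entropy loss and training uses BCE with 1:1 negative sampling (cf. Section 5.1.2), (30)-(31) can be sketched as below. The function names are ours, and we read the negative term of (31) as the standard label-0 BCE contribution of the sampled negatives.

```python
import torch
import torch.nn.functional as F

def predict(e_u, e_v):
    # Eq. (30): inner product of the final representations, then sigmoid.
    return torch.sigmoid((e_u * e_v).sum(dim=-1))

def fire_loss(pos_scores, neg_scores, params, lam=1e-5):
    """Sketch of Eq. (31): BCE over positive and sampled negative pairs."""
    bce = F.binary_cross_entropy(pos_scores, torch.ones_like(pos_scores)) \
        + F.binary_cross_entropy(neg_scores, torch.zeros_like(neg_scores))
    l2 = sum((p ** 2).sum() for p in params)  # L2 regularizer on Theta
    return bce + lam * l2
```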

To make the overall framework of FIRE readable, we present the pseudo-code for the overall prediction method in Algorithm 1.

Algorithm 1 The overall prediction algorithm of FIRE (u, v, Θ).

4.5 Model analysis and discussion

In this subsection, we conduct an in-depth analysis and discussion of the relation between FIRE and existing GNNs-based KGR models [17, 19, 20, 39]. Two main aspects are developed: 1) Novelty and differences, and 2) Relation with the state-of-the-art approach (KGIN [21]).

  • Novelty and differences. For the knowledge graph recommendation task, previous work [17, 19, 20, 39] has focused on extracting long-range attribute (knowledge) information using GNNs techniques, leading to better representation learning of the user/item. Due to the inherent sparsity of recommender systems, we believe high-order feature interactions are crucial for recommendations to achieve feature enhancement. To the best of our knowledge, this is the first attempt to utilize CNNs for feature interaction on high-order attributes generated by knowledge-aware networks. In addition, previous KGR models rarely considered the user intent modelling process. We strongly believe that the knowledge-aware backbone network, item-side feature interaction, and user-side intent modelling jointly make FIRE more effective than previous work.

  • Relation with KGIN. KGIN [21] is the state-of-the-art model for KGs-based recommendation. Its main highlights are 1) setting multiple latent intent factors to describe the intent associations behind user-item interactions, and 2) proposing a relation-aware mechanism to extract relational dependency signals in long-range connections. Although KGIN has an intent-aware design similar to FIRE's, there are major distinctions: 1) KGIN's intent modelling process is naive, setting the latent intent factor as a trainable parameter, which inevitably leads to insufficient extraction of latent intent signals and unsatisfactory results. In FIRE, we couple the user's historical interaction behaviour with the intent-aware process and design a more sophisticated two-level attention mechanism to capture latent intent-aware signals. 2) Despite the significant progress of KGIN's path-aware mechanism, it still does not consider the feature interaction module that is crucial in the KGR task. In summary, we believe that a strong KGR model should fully extract both the feature interaction signals and the latent intent information to achieve satisfactory performance.

5 Experiment

In this section, we perform experiments on three real-world scenario datasets to evaluate our method and answer the following four research questions:

  • RQ1: What is the performance of the proposed FIRE framework compared to state-of-the-art KGs-based recommendation models?

  • RQ2: How do the key hyperparameters influence the performance of the proposed FIRE?

  • RQ3: How do different components affect FIRE?

  • RQ4: What is the time efficiency of FIRE in model training?

5.1 Experimental setting

5.1.1 Dataset description

To evaluate the effectiveness of our method, we conduct a series of experiments on three different recommendation scenario datasets: 1) Last.FM, 2) Dianping-Food, and 3) MovieLens-1M, which are all openly accessible.

  • Last.FM is a widely utilized benchmark for music KGs-based recommendation, which includes the listening information of approximately 2000 users from the Last.FM website.

  • Dianping-Food is a restaurant recommendation (POI recommendation) dataset provided by Meituan Dianping, which contains more than 10 million interactions between approximately 2 million users and 1000 restaurants.

  • MovieLens-1M is a widely utilized benchmark for movie KGs-based recommendation, which contains 1 million ratings (ranging from 1 to 5) on a total of 2445 items from 6036 users.

For the construction of the datasets, we follow the treatment of previous work [20, 35, 39]. First, since the interactions in MovieLens-1M and Last.FM are both explicit feedback, they are converted to implicit feedback, where 1 indicates a positive sample (for MovieLens-1M, the threshold for a rating to be considered positive is 4; no threshold is set for Last.FM due to its sparsity). We randomly sample negative samples per user with a negative sampling rate of 1. Second, in addition to the construction of the U-I interaction data, we construct item-side sub-KGs for each dataset. The sub-KGs for MovieLens-1M and Last.FM are constructed using Microsoft Satori. For the Dianping-Food dataset, we use the KGs provided by Meituan to construct the sub-KGs. For all items in the datasets (music, movies, restaurants), the IDs can be matched in the corresponding sub-KGs. In addition, to filter out noise, we filter out items that match multiple entities and items that do not match any entity. Table 2 summarizes the detailed statistics of the three datasets: Last.FM (music), MovieLens-1M (movies), and Dianping-Food (restaurants).
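A minimal sketch of this preprocessing, with our own function name, converts explicit ratings to implicit feedback and pairs each positive with one sampled negative:

```python
import random

def to_implicit(ratings, num_items, threshold=4):
    """Explicit (u, v, rating) triples -> implicit samples (u, v, label)."""
    positives = {(u, v) for u, v, r in ratings if r >= threshold}
    interacted = {}
    for u, v in positives:
        interacted.setdefault(u, set()).add(v)
    samples = [(u, v, 1) for u, v in positives]
    for u, items in interacted.items():
        # One unseen item per positive (negative sampling rate: 1).
        for _ in range(len(items)):
            v_neg = random.randrange(num_items)
            while v_neg in items:
                v_neg = random.randrange(num_items)
            samples.append((u, v_neg, 0))
    return samples
```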

Table 2 Statistics and hyper-parameter settings for the three datasets (d: embedding dimension, s: intent number, λ: L2 regularization coefficient, L: depth of GNNs layer)

5.1.2 Experimental settings

We utilize PyTorch to implement our method and deploy it on a server with a Quadro RTX 6000 GPU with 24 GB of video memory. For each dataset, the ratio of the training, validation, and test sets is 6:2:2. For the BCE loss, we construct the training set by randomly sampling 1 negative item for each positive item (i.e., a negative sampling rate of 1).

5.1.3 Evaluation metrics

We perform experiments in two prototypical recommendation scenarios. For the top-K recommendation task, we adopt a widely-used evaluation protocol to evaluate the effectiveness of our proposed method: Recall@K, and we set K={50, 100}. For the CTR prediction task, we utilize Area Under the ROC Curve (AUC) and F1-score for the evaluation protocol [40].

$$ \text{Recall}@K(u)=\frac{\left\vert R^{1: K}(u) \cap T(u)\right\vert}{\vert T(u) \vert} $$
(32)
$$ \text{Precision} =\frac{T P}{T P+F P} $$
(33)
$$ \text{F1-score} =\frac{2 \times \text{Precision} \times \text{Recall}} {\text{Precision} + \text{Recall}} $$
(34)
$$ \text{AUC}(u)=\frac{{\sum}_{i \in T(u)} {\sum}_{j \in I \backslash T(u)} P\left( \tilde{y}_{u i} \geq \tilde{y}_{u j}\right)}{\vert T(u)\vert \cdot \vert I \backslash T(u)\vert} $$
(35)

where P(⋅) is the indicator function, T(u) denotes the ground-truth item set of user u, I denotes the full item set, and R1:K(u) denotes the top-K recommended item list.
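For reference, straightforward per-user implementations of Recall@K (32) and AUC (35) could look as follows; the function names and argument layout are our own choices.

```python
def recall_at_k(ranked, truth, K):
    # Eq. (32): fraction of ground-truth items appearing in the top-K list.
    return len(set(ranked[:K]) & truth) / len(truth)

def auc_per_user(scores, truth, all_items):
    # Eq. (35): probability that a positive item outscores a negative one.
    pos = [scores[i] for i in truth]
    neg = [scores[j] for j in all_items - truth]
    hits = sum(p >= n for p in pos for n in neg)
    return hits / (len(pos) * len(neg))
```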

5.1.4 Implementation details

We train the model by optimizing the BCE loss with the Adam [36] optimizer and Xavier [37] initializer. We train the model for 50 epochs, and an early-stopping strategy is applied to prevent overfitting. We perform a grid search over hyperparameters: the embedding dimension in {8, 16, 32, 64, 128, 256}, the L2 regularization factor in {0, 1e-6, 1e-5, …, 1e-3}, and the learning rate in {5e-2, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4}. During KGs propagation, the depth of the GNNs layers is adjusted among {1, 2, 3, 4}. We select the user- and item-triple-set sizes from {8, 16, 32, 64}. For user intent modelling, the number of intents is searched within {2, 4, 6, 8, 10}. For Bi-CNNs, we search the numbers of the two convolutional kernels in {2, 4, 8, 16}. Three aggregators (sum, concat, and maxpooling) are employed for aggregating high-order representations, and the default Dropout [38] rate is set to Dr = 0.5 in both the Bi-CNNs and all attention networks.

5.1.5 Baselines

To illustrate the effectiveness of our model, we choose eight baselines, as follows:

  • CKE [12]: It incorporates multi-modal knowledge to enhance item embeddings for collaborative filtering.

  • PER [14]: It treats the KGs as a heterogeneous information network (HIN) and explores high-order path information in a meta-path-based way.

  • RippleNet [35]: It is a propagation-based KGs recommendation method that models the user representation with the set of entities related to the user's historically interacted items and treats user preferences as ripples propagating over the KGs to capture high-order attributes.

  • KGCN [19]: It is a GNNs-based KGR method that effectively captures user-specific preferences for items in the KGs.

  • KGNN-LS [39]: It is a GNNs-based KGR method that applies label smoothing in the information aggregation phase to generate user-specific item representations.

  • KGAT [17]: It is a GNNs-based KGR method that adopts an attention mechanism and relies on GATs to capture high-order neighbour information and node-level feature interaction.

  • CKAN [20]: It is a GNNs-based KGR method that utilizes an attention mechanism and relies on GATs to capture high-order neighbour information.

  • KGIN [21]: It is a state-of-the-art GNNs-based KGR method that disentangles the intent factors behind user-item interactions and utilizes a relation-aware mechanism to obtain user/item representations.

5.2 Performance comparison (RQ1)

To answer the first research question, we compare the performance of all the baselines in Tables 3 and 4. From the results, we make the following observations:

  • PER shows the worst performance of all baselines on the three datasets. This indicates that the path-based method requires domain knowledge to define meta-paths and that path selection is hard to optimize during training, which limits the performance of PER.

  • CKE achieves a performance improvement over PER because it introduces the embedding method. This indicates that knowledge graph embedding technology helps improve recommendation performance. Nevertheless, CKE cannot effectively capture high-order features on the KGs, which results in insufficient performance. In addition, CKE's weak performance might also be caused by a lack of multimodal information.

  • RippleNet shows a significant performance improvement over path-based and embedding-based methods. This demonstrates the importance of exploring high-order attribute information on the KGs. However, RippleNet constructs user and item representations asymmetrically and ignores high-order connectivity, which can lead to unsatisfactory performance.

  • Compared with the shallow models, the performance of KGCN, KGNN-LS, KGAT, and CKAN confirms that incorporating high-order connectivity and attention mechanisms can improve the recommendation effect. However, these KGs-based GNNs models neglect fine-grained feature interaction and intent disentanglement, and struggle to avoid over-smoothing of high-order features, which leads to unsatisfactory performance.

  • KGIN is currently the strongest baseline model for KGR and is highly related to our model. Its strengths lie in fine-grained intent modelling and path-aware techniques. However, it still lacks the high-order feature interactions of our model, resulting in inferior performance. This reflects the advantage of our model's fine-grained feature interactions.

  • FIRE outperforms all the baseline methods, which demonstrates that FIRE can effectively explore user latent intent and item fine-grained high-order feature interactions via a two-level attention mechanism and Bi-CNNs module.

Table 3 Overall performance comparison of AUC and F1
Table 4 Overall performance comparison of Recall@K

5.3 Study of FIRE (RQ2)

To answer the second research question, we perform a crucial hyperparameter analysis of FIRE. From the results, we have the following conclusions.

5.3.1 Impact of embedding size

Figure 5 reports the effect of embedding size on the AUC performance in the three recommendation scenarios. The best performance is achieved when the embedding size is set to 128, 64, and 64 for the three datasets, respectively. We observe that as the embedding size increases from 8 to 256, the recommendation performance first improves due to the stronger representation feature space. However, a larger embedding size does not always yield stronger model representation ability; the degradation is caused by model overfitting as well as the encoding of irrelevant feature information.

Fig. 5 AUC results of FIRE with different embedding dimensions

5.3.2 Impact of different user intent numbers

To analyze the effect of the number of intents, we adjust it in the range {2, 4, 6, 8, 10} and illustrate the performance change on the three datasets in Fig. 6. We find that the best performance is achieved when the number of intents is set to 6, 4, and 6 for the three datasets, respectively. Specifically, performance is poorer when the number of intents is set small, i.e., in the case of coarse-grained intent modelling. This justifies encouraging multiple user intents. However, when the number of intents is set larger, the model performance decreases instead. A reasonable explanation is that overly fine-grained intent modelling encodes irrelevant information and noise, which is detrimental to the accurate representation of the model.

Fig. 6 AUC results of FIRE with different user intent numbers

5.3.3 Impact of different L2 regularization coefficients

Figure 7 summarizes the effect of different L2 regularization coefficients on AUC performance, and we reach the consistent conclusion that the model has different tolerances for different regularization coefficients. We find that the best performance is achieved when the L2 regularization coefficient is set to 1e-5, 1e-5, and 1e-6 for the three datasets, respectively. Specifically, when the regularization coefficient is too small, the model can overfit. Conversely, L2 regularization is a double-edged sword: when the coefficient is too large, it shifts the optimization away from the correct direction and leads to the severe consequence of underfitting. Therefore, choosing an appropriate regularization coefficient allows the model to achieve its best performance.

Fig. 7 AUC results of FIRE with different L2 regularization coefficients

5.3.4 Impact of different aggregators

We explore the impact of the three aggregators on the AUC performance in Table 5 and find that the sum aggregator consistently outperforms the others. In contrast, the maxpooling aggregator causes the model performance to collapse. One possible reason is that the concat aggregator may encode some irrelevant features, while the maxpooling aggregator undoubtedly loses important attribute information, which leads to poor performance.

Table 5 AUC results of FIRE with different aggregators

5.3.5 Impact of different GNNs layers

We verify how the GNNs depth affects model performance by varying the number of GNNs layers from 1 to 4. Table 6 shows that FIRE achieves the best performance when L is set to 2, 1, and 2 for the three datasets, respectively. In addition to the well-known over-smoothing problem of deep GNNs, two other significant reasons are: 1) high-order propagation inevitably introduces irrelevant information along with remote knowledge, especially on large-scale datasets; and 2) representation degradation. Both problems can lead to model overfitting and thus poor performance. Therefore, keeping an appropriate layer depth in high-order information propagation maximizes recommendation performance.

Table 6 AUC results of FIRE with depth of GNNs layer

5.3.6 Impact of the size of the triple set

We fine-tune the sizes of the user and item triple sets to explore their impact on FIRE within the range {8, 16, 32, 64}. Here we choose Last.FM and MovieLens-1M for our experiments. Tables 7 and 8 show the experimental results. We find that the best results are obtained when the user triple-set size is uniformly set to 16, while the best performance is achieved when the item triple-set size is 32 and 64, respectively. One possible reason is that the initial number of user/item entities differs across datasets, which directly determines the number of triples that can be associated and thus causes performance differences. In addition, when the user triple-set size is set too large, the model's performance is weakened, which we attribute to a degree of overfitting.

Table 7 AUC results for Last.FM w.r.t. different sizes of the triple set
Table 8 AUC results for MovieLens-1M w.r.t. different sizes of the triple set

5.3.7 Impact of convolutional kernels

We fine-tune the numbers of the vertical and horizontal convolutional kernels to explore their impact on FIRE within the range {2, 4, 8, 16}. Here we choose Last.FM and MovieLens-1M for our experiments. Tables 9 and 10 show the experimental results. We find that the best results are obtained when the number of horizontal convolutional kernels is uniformly set to 4 and the number of vertical convolutional kernels is set to 2 and 4, respectively. A consistent conclusion is that when the number of convolutional kernels is too small, the feature interaction information cannot be encoded adequately. Conversely, when it is too large, some noise is encoded, leading to overfitting. Therefore, choosing a suitable number of convolutional kernels is crucial for improving model performance.

Table 9 AUC results for Last.FM w.r.t. different numbers of the two convolutional kernels (n: number of vertical convolutional kernels, \(\tilde {n}\): number of horizontal convolutional kernels)
Table 10 AUC results for MovieLens-1M w.r.t. different numbers of the two convolutional kernels (n: number of vertical convolutional kernels, \(\tilde {n}\): number of horizontal convolutional kernels)

5.4 Network visualization

To explore the effects produced by the two core modules in the FIRE model (i.e., the feature interaction module (Bi-CNNs) and the intent modelling module), we inspect some details of the trained network. Figure 8(a) shows the convolutional values of the five vertical convolutional kernels after training FIRE on the Last.FM dataset with L = 3. We find that the five convolutional kernels are trained to be diverse. After analysis, we believe that the vertical convolutional kernels produce an effect similar to the attention mechanism by assigning a corresponding weighted sum to each dimension of the high-order features (cf. (20)), which is sufficient to capture fine-grained bit-level feature interactions.

Fig. 8 Network visualization w.r.t. the vertical convolutional kernels and intent-level attention weights

Next, we explore the practical effects produced by the intent-level attention mechanism in Fig. 8(b). We again conduct experiments on Last.FM (with the number of intents set to 5) and randomly select five users (U204, U32, U1324, U305, U472) to visualize the intent-level attentive weights (cf. (27)-(28)). We find that, after network training, the five attentive weights diverge significantly. This indirectly coincides with our vision of modelling the complex and diverse intent-perception factors of users to pinpoint their intents and interests.

5.5 Ablation analysis (RQ3)

To answer RQ3, as shown in Figs. 9 and 10, a comprehensive ablation analysis is conducted from two views (i.e., macro- and micro views) to assess the effectiveness and performance of all key components of the model. The specific variants are shown below.

  • FIRE w/o Intent: Removes the user intent component entirely.

  • FIRE Dual-Bi-CNNs: Adds a Bi-CNNs module to the user side as well.

  • FIRE-light: Removes both the Bi-CNNs and intent components.

  • FIRE w/o H-kernel: Removes the horizontal convolution module.

  • FIRE w/o V-kernel: Removes the vertical convolution module.

  • FIRE w/o GNNs-att: Removes the knowledge-aware attention mechanism module.

  • FIRE w/o Intent-att: Removes the intent-aware attention mechanism module.

Fig. 9 AUC results of FIRE w.r.t. macro-ablation analysis

Fig. 10 AUC results of FIRE w.r.t. micro-ablation analysis

Impact of the intent-aware attention mechanism

To investigate the impact of the intent-aware attention module in FIRE, we disable the whole intent module and denote the variant as FIRE w/o Intent. Based on the results reported in Fig. 9, the AUC performance decreases significantly after removing the whole intent component, which demonstrates that intent-aware modelling is essential for performance improvement. In addition, we conduct an extra experiment to verify the effectiveness of the two-level intent-aware attention mechanism. We construct a variant model, FIRE w/o Intent-att, which replaces attention aggregation with average aggregation. The results in Fig. 10 indicate a significant drop in performance, demonstrating the importance of the two-level intent-aware attention mechanism for distinguishing intents.

Impact of the feature interaction mechanism

Similarly, to investigate the need for the feature interaction module, we set up a variant named FIRE-light, which removes both the Bi-CNNs and intent-aware components. As shown in Fig. 9, a significant performance degradation occurs when the feature interaction module is removed, indicating that it is critical to knowledge-aware recommendation models and contributes significantly to performance improvement. Next, to investigate the effect of the two convolutional kernels on model performance, two additional variants are set up: FIRE w/o H-kernel and FIRE w/o V-kernel. As shown in Fig. 10, we find that using only one type of convolutional kernel for feature fusion significantly degrades the model accuracy. Furthermore, the performance degradation is more pronounced when only the horizontal convolutional kernels are retained. Therefore, we conclude that 1) the best performance is achieved by using both types of convolutional kernels, and 2) bit-level feature interactions (i.e., vertical convolutional kernels) make a substantial contribution to the model.

Impact of the knowledge-aware attention mechanism

To explore the role of the knowledge-aware attention mechanism in the overall model, we set up the variant FIRE w/o GNNs-att. As shown in Fig. 10, the model performance collapses when average aggregation is used instead of the knowledge-aware attention mechanism. This demonstrates that the knowledge-aware attention mechanism plays a crucial role in user/item representation: it allows adaptive aggregation of entity information and ensures that the model learns accurate representations.

Other aspects

To further explore the efficiency of the feature interaction module in Fig. 9, we construct a variant, FIRE Dual-Bi-CNNs, which also provides feature interaction functionality on the user side. Interestingly, we find that performance collapses when Bi-CNNs modules are applied to both the user and item sides. One possible reason is that the model undergoes severe overfitting, rendering the feature interaction module useless. We leave the exploration of a more fine-grained and balanced approach to feature interaction for future work.

5.6 Training cost and efficiency analysis (RQ4)

To answer RQ4, in this section we examine the time efficiency of the FIRE model and the two macro-variants, and select four baseline methods for the CTR task as controls. All methods are run in the same hardware environment, and the corresponding results are reported in Fig. 11. Specifically, the upper part of Fig. 11 shows the specific training time of each model, and the lower part reports the overall trade-off between training time and accuracy (AUC) as a scatter plot, where the top-right corner implies the best performance.

Fig. 11 Model comparison of efficiency and accuracy (AUC)

We have the following observations: 1) Compared to the GNNs-based KGR models, FIRE's additional overhead is justified by the introduction of the feature interaction module and the intent-aware attention module, which make the model more robust. 2) Compared to RippleNet, FIRE's training time is relatively superior on large-scale datasets, demonstrating the effectiveness and efficiency of the adaptive knowledge-aware attention mechanism.

Unfortunately, while the performance is sufficiently good compared to traditional GNNs-based KGR methods, the stacking of additional neural modelling mechanisms means the time consumption is not advantageous. In future work, we will design a lightweight neural recommendation model to achieve a double win between accuracy and time cost.

6 Conclusion and future work

In this paper, we propose a novel end-to-end KGs-based GNNs recommendation method. Specifically, the method leverages an attention mechanism to capture the high-order attribute information of users and items on the KGs. Next, a multi-granular convolutional neural network is adopted to capture high-order feature interactions on the item side. Last but not least, a two-level attention mechanism is utilized to model the latent intent of the user, thus achieving enhanced user embeddings. Empirical results on three large-scale benchmark datasets demonstrate the superiority and efficiency of our FIRE method.

In future work, we will consider how to distill and refine the sub-KGs in KGs-based recommendation and attempt to integrate self-supervised learning techniques into KGs-based GNNs recommendation. Besides, in view of the shortcoming that our work consumes more time than existing methods, we aim to design more efficient and lightweight neural recommendation models [41, 42] in the future.

On the other hand, we will pay more attention to negative sampling techniques for KGs-based GNNs recommendation methods to generate higher-quality negative samples. We leave these directions for future work.