1 Introduction

The development and popularization of the Internet have brought great convenience to people’s lives and work, but people must also face the information overload caused by the sheer volume of available content. Recommendation systems alleviate this problem by recommending content that matches user preferences, inferred from the information and interest signals users leave on websites. Most traditional recommendation systems, such as collaborative filtering [7, 20, 22], make recommendations based on information that users actively leave on websites or on user profiles [11]. However, with growing awareness of privacy protection, many users leave no information or profile on a website they visit for the first time, which limits the applicability of traditional recommendation algorithms. In recent years, session-based recommendation (SBR) systems have gradually developed; they predict a user’s next action from the sequence of previous actions within each session [14, 29], and therefore do not rely on user profiles or historical information left on a website [1, 20]. A session is a record of a user interacting with items over a period of time, e.g., a sequence of items that a user clicks on consecutively within a 20-min window [6].

Early SBR methods were mostly chain based [17]. For example, Markov chain-based methods [16, 31] make recommendations based on chains of items that appear consecutively in a session, inferring the likely sequence of user choices over all items. However, the independence assumptions of Markov chains deviate substantially from real user behavior, so the resulting models rarely reach usable performance. In recent years, with the vigorous development of deep learning, many deep learning-based SBR methods have achieved promising results. These methods exploit the transition relationships between items to model user interest preferences within a given session [16, 18].

Many deep learning-based SBR methods, such as GRU4Rec [3] and NARM [8], use a recurrent neural network (RNN), because RNNs have great advantages in modeling sequential transitions between consecutive items. However, since a session contains multiple user interest preferences as well as noise, RNN-based methods inevitably include information from irrelevant items, which has a negative impact on subsequent item recommendation.

In recent years, graph neural networks (GNNs) [9, 14, 23, 27,28,29] have demonstrated powerful modeling capabilities on graph-structured data and have been adopted by many SBR methods, which use improved GNNs to investigate the complex transition relationships within a session. However, most current GNN-based approaches model each session in isolation: they neither exploit global relationships among items to capture useful collaborative information nor consider the relevance of user interests across sessions. Some existing approaches do exploit cross-session information or capture item transitions between different sessions; for example, CSRM [22] extracts collaborative information by measuring the similarity between the current session and neighborhood sessions. However, such methods introduce a large amount of noise, which may weaken model performance, so they do not accurately exploit the collaborative information between different user sequences. In other words, they mostly focus on encoding each user’s own sequence and ignore the higher-order connections between different user sequences. In addition, owing to their structural limitations, GNN models do not capture the sequential dependencies of item sequences in a session well, so the sequential dependencies and other order information that could improve recommendation effectiveness are ignored. For example, GCE-GNN [24] uses a global graph to learn item transitions across all sessions, but it does not make good use of the sequence information within a session or the user’s sequential dependencies. Although GCE-GNN incorporates position-aware attention when generating session representations, this only captures positional information and does not model the long-range dependencies contained in the sessions.

Therefore, we propose a new method: a sequence-aware graph neural network incorporating neighborhood information for session-based recommendation, referred to as SAN-GNN. First, we build a neighborhood graph using the concept of a neighborhood, a topological structure on a set that represents proximity to a "target" by defining a "distance" [26]. We design a GCN-based neighborhood information extractor to extract item neighborhood information from this graph; since multilayer graph neural networks suffer from over-smoothing when capturing node information over long distances, we integrate the neighborhood information as implicit information with appropriate weights into the final item representation. Then, for a single session sequence, we use a time-aware LSTM (Ta-LSTM) module to capture the sequential dependencies in the session. At the same time, we construct the session as a session graph and obtain item representations using the session graph attention (SGA) module, which combines the item transitions on the session graph with the sequential dependencies learned by Ta-LSTM to produce the final item representations at the session level. Finally, we fuse the item representations at both levels and obtain the final session representation through a soft attention mechanism.

Owing to this framework and our improvements to the LSTM, our proposed SAN-GNN not only appropriately captures the sequential dependencies in sessions but also effectively incorporates transition information related to the target item from other sessions, yielding better model performance. The main contributions of our work are summarized as follows:

  • We design a session graph attention mechanism that combines the graph neural network with the sequence information contained in the session.

  • We propose a time interval-based LSTM, named Ta-LSTM, to capture the sequential dependencies of sessions.

  • We propose a neighborhood information extractor module that generates the neighborhood information of items and incorporates it into the item representations with adjustable weights.

  • Extensive experiments on three datasets show that SAN-GNN outperforms the baseline methods, including the state of the art.

2 Related Work

Methods for session-based recommendation can be divided into three main categories: Markov chain-based, RNN-based, and GNN-based SBR. Several existing session-based recommendation models are outlined in this section.

2.1 Markov Chain-Based SBR

Markov chain-based methods, such as MDPs [18] and FPMC [16], use the short-term interactions of users to predict their next possible actions. Specifically, these methods use the Markov chain to model the current session, and then make predictions based on previous interactions. However, Markov chain-based methods mainly consider users’ last interactions, ignore other interaction information in the session, and cannot capture the high-order dependencies between items. This negatively affects the prediction.

2.2 RNN-Based SBR

RNN-based methods are mainly used to model sequential data [12, 21]. GRU4Rec [3] was the first method to use RNN for session-based recommendation, and used an RNN constructed by the gated recurrent unit (GRU) to capture user preferences in sessions. ISLF [19] considered changes in user interests and adopted a variational autoencoder (VAE) and RNN to capture the features of users’ sequential behaviors. NARM [8] utilized an attention mechanism to learn users’ long-term preferences, and considered the last item clicked by a user to be the user’s short-term preference. Influenced by NARM, STAMP [10] captured users’ long-term preferences and short-term interests using long-term and short-term memory networks, respectively. Although most of the RNN-based methods consider users’ sequential behaviors, they can only model adjacent items, and cannot model the dependencies between long-distance and non-adjacent items [4, 15].

2.3 GNN-Based SBR

In recent years, GNN-based SBR methods have received extensive attention because GNNs are able to learn complex transition information from the graph structures constructed from session sequences [23]. For example, SR-GNN [25] used a gated graph neural network (GGNN) to obtain more accurate item embeddings and combined users’ long-term and short-term preferences for prediction. GC-SAN [29] added a self-attention mechanism on top of SR-GNN to capture the interaction information and dependencies of adjacent items. FGNN [13] aggregated the information of neighbor nodes through a multi-head attention mechanism to generate the item embedding of the target node. MBGCN [5] used a multi-behavior graph convolutional network to make recommendations based on users’ multiple interaction behaviors within sessions. An interval graph of facial regions [33] has also been used to extract geometric features of the face from the attributes of the interval map. Additionally, CSRM [22] combined the collaborative information between the current and neighbor sessions to obtain the final session representation, and GCE-GNN [24] aggregated the neighborhood information of items using a two-level graph model to generate the final item representation, thereby improving recommendation performance. Although GNNs have achieved positive results in SBR tasks, their structure prevents them from capturing the sequential dependencies in a session well, especially high-order dependencies between long-distance, non-adjacent items. At the same time, excessive stacking of GNN layers can lead to overfitting and noise propagation.

3 Proposed Method

3.1 Problem Statement

Session-based recommendation systems (SBR) predict the next item a user is likely to interact with in the current session by modeling the user’s session data. Because session data are anonymous, SBR cannot access users’ profile information or their interactions in historical sessions. We therefore state the problem as follows:

In session-based recommendation, \(V=\{ {v_1,v_2,...,v_m}\}\) denotes the set of all items appearing in any session, where each \(v_i\) is a unique item that a user may interact with during a session. The items in an anonymous session s are sorted according to the order in which they were clicked, represented by \(S=\{ {v_s^1,v_s^2,...,v_s^n}\}\), where \(v_s^n\in V\) is the nth item clicked by the anonymous user in session s.

For a given session S, the session-based recommendation problem aims to recommend the top-N items \((1\le N\le |V|)\) from V that are most likely to be clicked next by the user in the current session S.

3.2 Graph Models: Neighborhood Graph and Session Graph

In this section, we outline how to build two kinds of graph models: neighborhood and session graphs. SAN-GNN generates item representations by learning the complex transitions of items on two different graph levels.

3.2.1 Session Graph Model

The session graph is a weighted directed graph, expressed as \(G_s=(V_s,E_s)\). It is constructed from the transition relationships between items in the session, and its purpose is to learn item representations at the session level. Specifically, for a given session \(S=\{ {v_s^1,v_s^2,...,v_s^n}\}\), \(V_s\subseteq V\) is the set of all items that appear in the session. The edge set \(E_s=\{e^s_{ij}\}\) contains the transition relationships between items in session S, where each edge \(e_{ij}^s\) indicates that the user clicked item \(v_j\) immediately after item \(v_i\). We assign each edge a normalized weight, calculated as the number of occurrences of the edge divided by the out-degree of its starting node.

3.2.2 Neighborhood Graph Model

The neighborhood graph is an undirected weighted graph, expressed as \(G_N=(V_N,E_N)\), constructed from the adjacency transition relationships of the target item across all sessions. Its purpose is to learn an aggregated representation of the nodes related to a target item from all sessions. Specifically, for a given session \(S_p=\{ v_p^1,v_p^2,...,v_p^i\}\), let \(v_p^i\) be any item in the session. Denoting the order of the neighborhood by a, we select the sequence segment of \(S_p\) covering positions \([i-a,i+a]\) to form the subset \(S_q\); the neighborhood set of \(v_p^i\) is the set of all nodes in \(S_q\). The weights \(w_{ij}\) of the edge set \(E_N\) are generated as follows: for each node, the weight of each adjacent edge is the frequency with which that edge appears across all sessions. To reduce the complexity of the graph model, we keep only the N highest-weight edges for each item; keeping all edges between items would make the cross-session graph prohibitively large.

3.3 Model Framework

To address the underutilization of item transitions and the neglect of sequential dependencies in GNN-based SBR models, we use Ta-LSTM to capture the sequential dependencies in the target session and construct a neighborhood graph to learn the neighborhood information of items, combining the two to achieve stronger recommendation performance.

Based on this, we propose a new sequence-aware graph neural network incorporating neighborhood information for session-based recommendation, named SAN-GNN. Figure 1 shows the architecture of SAN-GNN. The model consists of four parts: (1) the neighborhood information extractor, which learns the neighborhood information of items with an attention-based GCN; (2) the sequence-aware item representation learning layer, which learns the sequential dependencies in the current session with a time interval-based LSTM, learns item representations on the session graph with our newly designed SGA module, and combines the two to generate the final item representations; (3) the session representation generation module, which combines the session-level item representations learned on the session graph with the neighborhood information learned on the neighborhood graph using appropriate weights, and then applies a soft attention mechanism to generate the final session representation; and (4) the prediction layer, which takes the final session representation as input and outputs the predicted probabilities of the candidate items. Each part is described in detail in the following sections.

Fig. 1: An overview of our proposed SAN-GNN model

3.4 Neighborhood Information Extractor

In this section, we outline how to utilize the attention mechanism-based GNN network to learn item representations on the neighborhood graph. We use an attention mechanism to generate attention weights based on the importance of each connection.


In traditional methods, the neighbor features of an item v are typically obtained by average pooling. However, not all items in the neighborhood set of v are related to the user preference of the current session. Hence, we use an attention mechanism to distinguish the importance of the items in \(N_\varepsilon (v)\). We use \(\omega _{ij}\) to represent the neighbor-related attention: the larger the value of \(\omega_{ij}\), the more similar the user preferences and behaviors at nodes i and j, and the greater the impact of these nodes on the prediction results. First, we aggregate the information from the neighborhood graph for each node and linearly combine the neighborhood information according to the neighbor-related attention, as follows:

$$\begin{aligned} h_{v_{i}}= & {} \sum _{v_{j} \in N_{\varepsilon }(v)} \omega _{i j} h_{v_{j}}, \end{aligned}$$
(1)
$$\begin{aligned} s= & {} \frac{1}{n} \sum _{v_{i} \in S} h_{v_{i}}, \end{aligned}$$
(2)
$$\begin{aligned} \omega _{i j}= & {} \frac{\exp \left( q_{\alpha }^{T} {LeakyRelu}\left( W_{\alpha }\left[ \left( s \odot h_{v_{j}}\right) : w_{i j}\right] \right) \right) }{\sum _{v_{k} \in N_{\varepsilon }(v)} \exp \left( q_{\alpha }^{T} {LeakyRelu}\left( W_{\alpha }\left[ \left( s \odot h_{v_{k}}\right) : w_{i k}\right] \right) \right) }, \end{aligned}$$
(3)

where \(N_\varepsilon (v)\) is the neighborhood set of the current node \(v_i\), \(\odot\) represents the element-wise product,  :  represents the concatenation operation, \(w_{ij}\) is the weight of the edge \((v_i,v_j)\) on the neighborhood graph, \(W_\alpha\) and \(q_\alpha\) are learnable parameters, and s is the feature vector of the current session, obtained by average pooling over the representations \(h_{v_i}\) of the nodes in the current session. This averaging aggregates the information of all nodes in the session into a session-level representation that guides the neighborhood attention.


Then, to capture more neighborhood information related to the current node, we perform multiple aggregation operations. The aggregation layer is described as follows:

$$\begin{aligned} h_{v_{i}}^{k}={agg}\left( h_{v_{i}}^{k-1}, s\right) , \end{aligned}$$
(4)

where \(h_{v_i}^{k-1}\) is the item embedding produced by the previous aggregation layer, and agg denotes the attention-based aggregation defined in Eqs. (1)–(3).


Finally, to alleviate the overfitting problem caused by stacking aggregation layers multiple times, we adopt a gated network to obtain the final neighborhood item representation, which is described as follows:

$$\begin{aligned} \mu= & {} \sigma \left( W_{\beta } h_{i}+W_{\gamma } h_{v_{i}}^{k}\right) , \end{aligned}$$
(5)
$$\begin{aligned} h_{v}^{N}= & {} (1-\mu ) \odot h_{i}+\mu \odot h_{v_{i}}^{k}, \end{aligned}$$
(6)

where \(W_\beta\) and \(W_\gamma\) are trainable parameters and \(\sigma\) is the sigmoid activation function.

3.5 Sequence-Aware Item Representation Learning Layer

In this section, we introduce how to generate item representations on the session graph using the item transition relationships and the sequence information in the session. The traditional GCN model is good at capturing complex item transitions but tends to ignore the sequence information between neighboring nodes, whereas the long-range sequential dependencies and sequence information in a session can make the SBR task more effective and stable. Therefore, we use a time-aware LSTM to learn the sequential dependencies in the session, use our proposed SGA module to learn the item transitions on the session graph, and combine the two to generate the item representations on the session graph. We present the proposed modules separately below.

3.5.1 Sequential Dependency Learning Module

Real-world purchases are usually traceable rather than isolated and sudden. For example, a user who buys sneakers and jerseys may have a latent intention to buy basketballs or sports pants, which indicates a sequential dependency between purchase behaviors. The time interval between interactions is also informative: a user who intensively clicks on apple-related goods for a period of time and clicks on ball-related goods only after a long gap is likely pursuing different goals, whereas behaviors within a short period are often related. To our knowledge, traditional RNN solutions usually ignore the time intervals between adjacent user behaviors, which are important for capturing the sequential dependencies in a session. Therefore, we design a time-aware LSTM to capture these dependencies. The standard LSTM formulation is listed as follows.

$$\begin{aligned} e_{i}^{t}= & {} \sigma \left( W_{e}^{1} x_{i}^{t}+W_{e}^{2} \eta _{i}^{t-1}+b_{e}\right) , \end{aligned}$$
(7)
$$\begin{aligned} f_{i}^{t}= & {} \sigma \left( W_{f}^{1} x_{i}^{t}+W_{f}^{2} \eta _{i}^{t-1}+b_{f}\right) , \end{aligned}$$
(8)
$$\begin{aligned} o_{i}^{t}= & {} \sigma \left( W_{o}^{1} x_{i}^{t}+W_{o}^{2} \eta _{i}^{t-1}+b_{o}\right) , \end{aligned}$$
(9)
$$\begin{aligned} c_{i}^{t}= & {} f_{i}^{t} c_{i}^{t-1}+e_{i}^{t} \tanh \left( W_{c}^{1} x_{i}^{t}+W_{c}^{2} \eta _{i}^{t-1}+b_{c}\right) , \end{aligned}$$
(10)
$$\begin{aligned} \eta _{i}^{t}= & {} o_{i}^{t} \tanh \left( c_{i}^{t}\right) , \end{aligned}$$
(11)

where \(W_e^1,W_e^2, W_f^1, W_f^2, W_o^1, W_o^2,W_c^1, W_c^2\in \textbf{R}^{d\times d}\) and \(b_e, b_f, b_o, b_c\in \textbf{R} ^d\) are all learnable weight parameters; \(x_i^t\) is the input embedding of the LSTM, representing the embedding of item i at time t; \(e_i^t, f_i^t, o_i^t\) represent the input gate, the forget gate, and the output gate, respectively, and have the same dimensions as \(x_i^t\); \(c_i^t\) represents the cell state of the node; and \(\eta_i^t\) represents the hidden state at time t.


To make the LSTM more sensitive to temporal changes, we modified the gating logic in the LSTM by adding a time-aware feature, as follows.

$$\begin{aligned} \delta _{t_{i}}= & {} \phi \left( W_{\delta } \log \left( t_{i}-t_{i-1}\right) +b_{\delta }\right) , \end{aligned}$$
(12)
$$\begin{aligned} T_{\delta }= & {} \sigma \left( W_{x \delta } x_{i}^{t}+\delta _{t_{i}} W_{t \delta }+b_{t \delta }\right) , \end{aligned}$$
(13)

where \(W_{\delta } \in \mathbb {R}^{d}\) and \(W_{t \delta } \in \mathbb {R}^{d \times d}\) are learnable weight parameters. The time interval feature \(\delta _{t_{i}}\) encodes the relative time distance between two consecutive interactions. Following previous work [32], we add a fully connected layer to transform the time-aware features into dense vectors and compute the time gate simultaneously.

$$\begin{aligned} c_{i}^{t}= & {} f_{i}^{t} \cdot c_{i}^{t-1} \cdot T_{\delta }+e_{i}^{t} \tanh \left( W_{c}^{1} x_{i}^{t}+W_{c}^{2} \eta _{i}^{t-1}+b_{c}\right) , \end{aligned}$$
(14)
$$\begin{aligned} o_{i}^{t}= & {} \sigma \left( W_{o}^{1} x_{i}^{t}+W_{o}^{2} \eta _{i}^{t-1}+W_{\delta o} \delta _{t_{i}}+b_{o}\right) . \end{aligned}$$
(15)

Thus, the sequential dependencies occurring in the session s are well encoded into \(\eta _{i}^{t}\). In this article, \(\eta _{i}^{t}\) is referred to as the sequential preference representation of the user.

3.5.2 Session Graph Attention Mechanism

In general, the GNN model is better at capturing transition relationships with neighboring nodes but ignores the sequence information in the session, while the RNN model has the opposite strengths. Therefore, to effectively distinguish the importance of the neighbors of the target item in the session graph and to make full use of the sequential information of user–item interactions, we design an SGA module that combines the GNN with the sequential information encoding of the session.


Each neighbor node of an item in the session graph has a different importance for the target node, so we use an attention mechanism, combined with the sequential dependency extracted by Ta-LSTM, to calculate the attention scores. We define \(\xi _{i}\) as the relative position between item i and the last item, and for each neighbor j of the target item i we assign a parameter vector \(P_{i j} \in \mathbb {R}^{d}\) as the relative order embedding of items i and j. Since the attention coefficient between items i and j is influenced by \(P_{i j}\), we define a relative order-aware attention coefficient \(E_{ij}\) to distinguish the influence of item j on item i, with the following equation.

$$\begin{aligned} E_{i j}=\frac{1}{\sqrt{d}}\left( \left( W_{1} h_{v_{i}}\right) ^{\textrm{T}}\left( W_{2} h_{v_{j}}+P_{i j} \cdot \eta _{i}\right) \right) , \end{aligned}$$
(16)

where \(h_{v_{i}}\) is the representation of item i, \(\eta _{i}\) is the hidden state at the last step of the Ta-LSTM, and the scaling factor \(\sqrt{d}\) prevents excessively large dot products and accelerates convergence. The attention weight of each item is then obtained by the softmax function:

$$\begin{aligned} \alpha _{i j}={\text {softmax}}\left( E_{i j}\right) . \end{aligned}$$
(17)

Thus, the representation of target node i is obtained by aggregating its own features with those of its surrounding neighbors.

$$\begin{aligned} h_{i}^{S}=\sum _{v_{j} \in N_{i}} \alpha _{i j} h_{v_{j}}. \end{aligned}$$
(18)

Next, we merge the learned sequential dependencies in Ta-LSTM with the item representations obtained in the SGA module through a gated network to obtain the final session graph item representations, as described below.

$$\begin{aligned} \phi= & {} \sigma \left( W_{4} h_{v_{i}}+W_{5} \eta _{i}+b_{\beta }\right) , \end{aligned}$$
(19)
$$\begin{aligned} h_{v}^{S}= & {} (1-\phi ) \odot h_{v_{i}}+\phi \odot \eta _{i}, \end{aligned}$$
(20)

where \(\phi \in \mathbb {R}^{d}\) is the gated fusion vector, \(W_{4},W_{5}\in \mathbb {R}^{d \times d}\) and \(b_{\beta } \in \mathbb {R}^{d}\) are learnable parameters, \(h_{v_{i}} \in \mathbb {R}^{d}\) is the embedding of the transition relationships between items, \(\sigma\) is the sigmoid function, and \(\eta _{i} \in \mathbb {R}^{d}\) is the embedding of the sequential dependency. The resulting item representation on the session graph is \(h_{v}^{S}\).

3.6 Session Generation Layer

We combine each item’s representation on the neighborhood graph and on the session graph to obtain its final representation. However, multilayer graph neural networks suffer from over-smoothing when capturing node information over long distances, which can make the model unstable on certain datasets. To address this, we treat the neighborhood information as implicit information and integrate it into the final item representation using adaptive weights as follows.

$$\begin{aligned} h_{v}^{\prime }=(1-\rho ) \times {\text {dropout}}\left( h_{v}^{N}\right) +\rho \times h_{v}^{S}. \end{aligned}$$
(21)

Here, we add a dropout layer to item representations at two levels to prevent overfitting that may arise when extracting high-order information.

$$\begin{aligned} \varphi _{i}= & {} \tanh \left( W_{6}\left[ h_{v(i)}^{\prime } \Vert \eta _{i}\right] \right) , \end{aligned}$$
(22)
$$\begin{aligned} \tau _{i}= & {} q_{\beta }^{\top } \sigma \left( W_{7} \varphi _{i}+W_{8} s^{\prime }+b_{\delta }\right) , \end{aligned}$$
(23)
$$\begin{aligned} s= & {} \sum _{i=1}^{l} \tau _{i} h_{v(i)}^{\prime }, \end{aligned}$$
(24)

where \(W_{6} \in \mathbb {R}^{d \times 2 d}\), \(W_{7}, W_{8} \in \mathbb {R}^{d \times d}\), and \(q_{\beta }, b_{\delta } \in \mathbb {R}^{d}\) are trainable parameters; \(h_{v(i)}^{\prime }\) is the item representation; \(\eta _{i}\) is the sequential dependency of the current session; and \(s^{\prime }\) is the session information, obtained by averaging the item representations of the session: \(s^{\prime }=\frac{1}{l} \sum _{i=1}^{l} h_{v(i)}^{\prime }\). First, we combine item embeddings and sequential information through concatenation and a nonlinear transformation. Then, the corresponding weights are learned using a soft attention mechanism. Finally, the session representation s is obtained as a linear combination of all items of the current session.

3.7 Prediction Layer

Based on the resulting session representation s, we score each candidate item by combining its initial embedding with the current session representation, and then apply the softmax function to obtain the output \(\hat{\gamma }_{i}\), the final recommendation probability:

$$\begin{aligned} \hat{\gamma }_{i}={Softmax}\left( s^{\top } h_{v_{i}}\right) , \end{aligned}$$
(25)

where \(\hat{\gamma }_{i}\in \hat{\gamma }\) represents the probability of item \(v_i\) being the next item to be clicked in the current session.


The loss function is defined as the cross entropy of the prediction result \(\hat{\gamma }_{i}\):

$$\begin{aligned} \mathcal {L}\left( \hat{\gamma }\right) =-\sum _{i=1}^{m} \gamma _{i} \log \left( \hat{\gamma }_{i}\right) +\left( 1-\gamma _{i}\right) \log \left( 1-\hat{\gamma }_{i}\right) , \end{aligned}$$
(26)

where \(\gamma _i\) is an element of the one-hot vector encoding the ground-truth item.

4 Experiments

We conducted extensive experiments around the following five questions to investigate the effectiveness of our proposed SAN-GNN:

RQ1: For the three real-world datasets, does SAN-GNN outperform the most advanced session-based recommendation (SBR) baseline model?

RQ2: Is the sequential dependence of session sequences real and valid, and is Ta-LSTM more effective than the LSTM model?

RQ3: How effective is the SGA module for capturing session graph item representations?

RQ4: Can neighborhood information improve the performance of SAN-GNN?

RQ5: How do weights \(\rho\) affect model performance?

4.1 Datasets and Preprocessing

We adopt three benchmark datasets: Diginetica, Tmall, and Nowplaying [30]. The Diginetica dataset comes from CIKM Cup 2016 and contains typical commodity transaction data. The Tmall dataset comes from the IJCAI-15 competition and contains shopping records of anonymous users on the Tmall online shopping platform. The Nowplaying dataset comes from the Kaggle competition and contains rich music listening behaviors of users.

First, we preprocessed the three datasets following [25] by filtering out sessions of length 1 and items that appear fewer than 5 times. Then, we split each session \(s=[s_1,s_2,...,s_{n}]\) to generate sequences and corresponding labels, namely \(([s_1 ],s_2 ),([s_1,s_2 ],s_3 )\),..., \(([s_1,s_2,...,s_{n-1}],s_n)\). Table 1 outlines the statistics of the preprocessed datasets.

4.2 Evaluation Metrics

In this experiment, P@N and MRR@N were used as evaluation metrics. P@N, widely used in session-based recommendation, measures prediction accuracy as the proportion of cases in which the ground-truth item appears among the top-N recommended items. MRR@N is the mean reciprocal rank: each prediction scores the reciprocal of the rank of the correctly predicted item (0 if it falls outside the top N), and the scores are averaged over all predictions; the closer the correct item is to the top of the recommendation list, the higher the metric. In the experiments, we set N to 10 and 20.

Table 1 Statistics of the data sets

4.3 Comparison Models

To verify the performance of the SAN-GNN model, we use the following recommendation models as the comparison models:

POP: This model makes recommendations based on items that appear most frequently in the training set.

Item-KNN: The similarity between the target item and other items is calculated, and items similar to the current item are recommended [17].

FPMC: This model combines matrix factorization and a Markov chain and considers both the long-term user preference and the temporal information of the sequence [16].

GRU4Rec: GRU units are designed to train mini-batch sequence groups and a ranking-based loss function is introduced to utilize the interaction information of items in the entire sequence [3].

NARM: This model is based on GRU4Rec; it applies an attention mechanism to capture users’ main interests and sequential behaviors in sessions [8].

STAMP: Here, an attention mechanism is applied to capture users’ general interests from the long-term memory of the session context, and users’ current interests are captured from the short-term memory of the last click [10].

SR-GNN: A gated GNN is employed to capture complex transitions of items, and then users’ long-term preferences and current preferences are extracted and integrated to better predict users’ next preferred items [25].

GCE-GNN: This model generates session representations by learning item representations on both the global and local graphs, and using reversed position vectors and attention mechanisms [24].

4.4 Parameter Setting

Following previous work, we fix the dimension d of the latent vectors to 100 and the batch size to 100. In the sequential dependency learning module, we set the number of LSTM layers to 1. For a fair comparison, we set the hyperparameters of each model to the same values as SR-GNN, such as the learning rate and learning rate decay. We initialized our model with a Gaussian distribution with mean 0 and standard deviation 0.1. The learning rate was 0.001, decayed by a factor of 0.1 every three epochs; the regularization coefficient was \(\lambda =10^{-5}\); the number of training epochs was set to 30; and a 10% subset was randomly selected from the training set for validation. We adopted the Adam algorithm to optimize the model parameters. Furthermore, we set the number of neighbors on the neighborhood graph and the maximum distance \(\alpha\) of adjacent items to 12 and 3, respectively. The weight \(\rho\) is set to 0.3.

Table 2 Effectiveness comparison on the three datasets

4.5 Overall Comparison (RQ1)

Table 2 shows the results of the baseline models and our model on the three datasets; the best result in each column is shown in bold. On all three datasets and both metrics, SAN-GNN outperforms all baseline models, which shows that our model is effective. It is worth noting that the average session lengths of Tmall and Nowplaying are long, so information about user interactions farther back in the session may be overlooked. SAN-GNN addresses this issue by capturing the sequential dependencies and sequence information of the session, providing more auxiliary information.

Of the traditional methods, the performance of POP was the worst, because it only recommended the first few items with the highest frequency of appearance to users. However, each user has a unique preference. POP cannot make recommendations based on each user’s interest, resulting in a poor performance. FPMC uses matrix factorization and a first-order Markov chain and performs well on the Tmall dataset. Item-KNN calculates item similarity and achieves relatively good results on the Nowplaying dataset. Although FPMC and Item-KNN can achieve better performance on a single dataset, they cannot achieve better results on multiple datasets due to their limitations.

The performance of GRU4Rec is significantly higher than that of traditional methods, which shows that RNN-based methods are more capable of mining effective high-order information and capturing users’ current preferences. However, GRU4Rec still cannot capture the changes in user interests within sessions. NARM and STAMP outperform GRU4Rec because these two methods combine RNN and an attention mechanism. This demonstrates the effectiveness of assigning different attention weights to different items in a session, and at the same time shows that the sequential information in a session can represent the current change in the user’s interest.

SR-GNN and GCE-GNN outperform other comparison methods on the Diginetica and Nowplaying datasets, which indicates that GNN methods are more effective for session-based recommendations. This shows that GNNs exhibit better performance in capturing complex item transitions, and at the same time shows that the neighborhood node information of items in a session contains the common interest and preference of different users.

Compared to GCE-GNN and SR-GNN, our proposed model, SAN-GNN, outperforms all comparison methods by combining the neighborhood node information of items and the sequential dependencies in sessions. SAN-GNN outperforms GCE-GNN by an average of 3.95% on the Diginetica dataset, an average of 13.02% on the Tmall dataset, and an average of 15.35% on the Nowplaying dataset.

4.6 The Effect of Sequential Dependency Information (RQ2)

In SAN-GNN, we capture the sequential dependencies in a session using an improved model of LSTM based on temporal augmentation, Ta-LSTM, and use it in the SGA module to jointly determine the influence of neighbor nodes on the target node in the session graph. To demonstrate the validity of sequential dependencies, we implemented the following variant model.

  • SAN-GNN-w/s: We remove the Ta-LSTM module from SAN-GNN, use only GCN to extract item representations on the session graph, and use the last clicked item as a trigger in the session generation module to distinguish the impact of different items in the session.

  • SAN-GNN-w/t: To verify the role of Ta-LSTM, we use LSTM to replace Ta-LSTM.

Table 3 shows the performance of the variant models; the best results are boldfaced.

SAN-GNN-w/s does not perform well on any of the three datasets, which may be because, without the sequential dependency, the model cannot capture the drift in user interest. It performs better on Diginetica than on Tmall and Nowplaying, probably because Diginetica has the shortest average session length, so the effect of the missing sequential dependency is smaller.

SAN-GNN-w/t performs worse than SAN-GNN, which indicates that our proposed Ta-LSTM is more effective than the standard LSTM and demonstrates that considering the time intervals between item interactions improves the recommendation effect.

4.7 The Effect of SGA Module (RQ3)

In learning the item representations in the session graph, we construct the SGA module by combining the sequential dependency and the relative position embedding of the target node with its surrounding neighbors. The purpose of the SGA module is to solve the problem that GNN cannot use the sequence information in the session well, and through the SGA module, we can combine the advantages of GNN and RNN to generate better recommendation results. To verify the effectiveness of SGA, we implemented the following variant model.

  • SAN-GNN-w/a: We remove the SGA and directly combine the item representations obtained on the session graph with the item representations obtained on the neighborhood graph.

  • SAN-GNN-o/g: The SGA module is replaced with the attention mechanism mentioned in GCE-GNN, i.e., the attention scores are calculated by training four weight vectors separately with different relationships between nodes.

  • SAN-GNN-w/p: To verify the usefulness of the relative position embedding used in the SGA module, we remove the relative position embedding P and keep only the final state of the Ta-LSTM to distinguish the influence of neighbors on the target node.

Table 4 shows the performance of the variant models; the best results are boldfaced. It can be seen that the performance of all three variants is unsatisfactory.

The SAN-GNN-w/a model performs the worst, which may be due to the inability to distinguish the importance of different neighbors to the target node, resulting in introducing too much sequence noise in the information propagation.

SAN-GNN-o/g performs better than SAN-GNN-w/a and is second only to SAN-GNN-w/p. We attribute this result to the improved attention mechanism, which can determine the importance of neighbors based on different interactions.

SAN-GNN-w/p performs best among the three variants, second only to SAN-GNN. Meanwhile, it performs worse on Tmall than on Diginetica, probably because the sessions in Diginetica are shorter, so the attention scores calculated by the SGA module are less affected by the missing relative position information.

Table 3 Performance comparison on sequential information
Table 4 Performance comparison on SGA module
Table 5 Performance comparison on Neighbor information

4.8 The Effect of Neighborhood Information (RQ4)

In the neighborhood graph, we use neighbor-aware attention to calculate the importance of different neighbors for the target node and obtain higher-order interaction information by stacking multiple GCN layers, finally generating the neighborhood information of each node. This neighborhood information is combined with the item representations on the session graph via adaptive weights to generate more stable recommendation results. To verify the validity of the neighborhood information, we design the following variant models.

  • SAN-GNN-w/n: We remove the neighborhood information extractor and keep only the session graph features.

  • SAN-GNN-k-hop: We set the maximum distance on the neighborhood graph to k (k = 1, 2).

From Table 5, the SAN-GNN model with neighborhood information performs best. Compared with SAN-GNN-w/n, which lacks neighborhood information, SAN-GNN can exploit item transition information from other sessions, which helps the model make more accurate decisions. However, comparing the models with k = 1 and k = 2 shows that performance instead degrades when k = 2, suggesting that a larger neighborhood exposes SAN-GNN to too much noise as well as over-smoothing.

4.9 The Effect of Weights \(\rho\) (RQ5)

We use the weight \(\rho\) to represent the weight of the neighborhood information. It controls the relative proportions of neighborhood information and session graph item representations, and determines the extent to which the model uses neighborhood information to improve recommendations.

According to the results in Fig. 2, model performance first increases and then gradually decreases as the proportion of neighborhood information grows. For the Tmall dataset, \(\rho\) between 0.2 and 0.4 achieves better performance; for the Diginetica dataset, \(\rho\) between 0.2 and 0.3 gives more stable performance; and for the Nowplaying dataset, \(\rho\) between 0.4 and 0.6 is more appropriate. These results further show that neighborhood information helps improve recommendation performance, while integrating it with an adjustable weight alleviates the problem of excessive noise.

Fig. 2: P@20 and MRR@20 on the three datasets

5 Conclusion

In this article, we proposed a sequence-aware and neighborhood information enhanced graph neural network (SAN-GNN) for session-based recommendation. Specifically, it constructs two hierarchical graph models and designs a neighborhood information extractor and a sequence-aware module to capture, respectively, the neighborhood information of items and the sequential dependencies present in each session, incorporating them into the item representations as auxiliary information. In the sequence-aware module, we design a Ta-LSTM for capturing the sequential dependencies of session sequences, whose main improvement is that it takes into account the interaction intervals between items. Meanwhile, we build the SGA module to combine the strength of GNNs in capturing complex item transitions with the strength of RNNs in modeling sequence information, enhancing the recommendation effect. Comprehensive experiments show that our proposed model significantly outperforms eight baseline models on the three public datasets. In the future, we plan to study further neural network frameworks to effectively combine sequential dependencies and item transitions, and to verify the recommendation performance of SAN-GNN in other domains.