1 Introduction

Recommender systems are widely used for short-video recommendation, online shopping, and news recommendation because of their ability to predict users' interests from historical behaviour. Most existing recommender systems rely on users' personal information and long-term historical behaviour data as input features and apply machine learning and deep learning techniques to predict what users are interested in. However, on many modern online platforms, capturing a user's full historical browsing behaviour incurs a performance cost, and the preference expressed in the user's current session can easily be overwhelmed by the long behaviour history. Session-based recommendation systems have therefore been developed to improve the user experience. As a sub-task of recommendation, session-based recommendation has the advantage of relying only on the user's click behaviour in the current session. Since the only available input features are user and item IDs, capturing the user's interest becomes more difficult. Session-based recommendation aims to predict the items a user will be interested in next, given a chronologically ordered sequence of short-term behaviour from anonymous users.

In early research, Markov chain-based approaches were first proposed for session recommendation [1]. Rendle et al. [1] combined matrix factorization with Markov chains to capture the interests of users. With the advancement of deep learning techniques, many deep learning-based methods have been proposed for session recommendation, mainly built on recurrent neural networks (RNNs), attention networks, and combinations of the two [2,3,4]. Since session sequences contain transition relationships more complex than simple temporal dependencies, extracting transitions in temporal order alone is insufficient for session recommendation. Methods based on graph neural networks (GNNs) have demonstrated their effectiveness in capturing the complex transition relationships among items in a given session [5,6,7,8]. GNNs have a strong ability to model the dependencies between nodes in a graph. GNN-based recommendation models differ from previous models in that they represent session sequences as session graphs and employ graph encoders to further mine the rich hidden information between items in the session graph, yielding good prediction results. Although GNN-based approaches model the transition relationships of items in a session sequence more accurately, they still face the problem of data sparsity. Due to the lack of long-term historical user behaviour data, session-based recommendation can rely only on the interaction records generated within short sessions, and the number of user interactions in a session is very limited, far less than long-term behaviour data. This lack of available data prevents session-based recommendation models from learning accurate user preferences, resulting in sub-optimal recommendation performance.

Graph contrastive learning exploits the intrinsic relationships of the data itself and learns features from different augmented views without relying on manually annotated data, which gives it a great advantage in addressing the data sparsity problem. Most current recommendation models based on graph contrastive learning generate contrastive views by randomly dropping nodes or edges. However, since the session sequences in session-based recommendation are extremely sparse, randomly dropping nodes or edges is likely to disrupt the current session context; the traditional data augmentation methods of graph contrastive learning are therefore unsuitable for session-based recommendation. Existing graph contrastive learning methods for session-based recommendation instead focus on utilizing information from other sessions to help generate different views [9,10,11,12]. S2-DHCN employed hypergraphs to generate two augmented views, constructed global graphs to extract information from other sessions, and utilized contrastive learning as an auxiliary task to improve the recommendation performance of the main task [9]. COTREC proposed a contrastive learning-based framework to enhance the accuracy of session recommendation and used information from other sessions to generate session views and item views for contrastive learning [10]. SimCGNN increased the diversity of sessions sharing the same last item by using them as negative samples and designing a contrastive module based on cosine similarity [11]. While these methods successfully prevent the corruption of session context caused by traditional graph augmentation techniques, incorporating additional session information may introduce items that are irrelevant or even contrary to the user's current interests. This interference impedes the accurate modeling of user interests, ultimately leading to suboptimal performance. A series of papers in collaborative filtering recommendation investigates whether data augmentation in contrastive learning is necessary at all [13,14,15]; they argue that the effectiveness of graph contrastive learning in recommendation tasks stems mainly from the contrastive loss, and accordingly propose several simple but effective data augmentation methods. Although these studies are not specifically tailored to session-based recommendation, they have great reference value and provide valuable insights for further exploration in this field.

To address the aforementioned issues, we propose a new session-based recommendation model, called Session-based Recommendation with Multi-layer Aggregated Contrastive Learning (SR-MACL). We abandon both the graph augmentation methods of randomly dropping nodes or edges that are common in traditional contrastive learning and the generation of augmented views by introducing information from other sessions. Instead, we use a simple and effective noise-based, multi-layer aggregated embedding augmentation to create contrastive views. In our model, the two views share initial embeddings and adjacency matrices. The complex transition patterns of the session sequence are then modelled by stacking Star Graph Neural Networks (SGNN): on top of the gated graph neural network, a star node is added to connect non-adjacent items and solve the problem of long-distance transition information propagation. Uniform noise is added to the representations learned at each layer, and a new contrastive view is generated by aggregating the representations of all layers, thereby achieving effective representation-level data augmentation without disrupting the context of the session sequence. Through contrastive learning, we maximize the mutual information between the session representations learned from the two views to improve the performance of item/session feature extraction. Finally, we unify the recommendation task and the self-supervised task under one framework through multi-task learning; by jointly optimizing these two tasks, we learn more robust embedding representations to accurately predict the next item the user is interested in. The main contributions of this paper are as follows:

  1. We introduce noise-based contrastive learning to alleviate the data sparsity problem in session-based recommendation.

  2. We propose a novel multi-layer aggregated contrastive learning method that avoids the influence of irrelevant items introduced by information from other sessions, achieves more effective representation-level data augmentation, and provides a new perspective on applying graph contrastive learning to session-based recommendation.

  3. The experimental results show that our model outperforms the state-of-the-art baseline models with significant performance improvements.

The remainder of this paper is organized as follows. Section 2 introduces related work on contrastive learning and session-based recommendation. Section 3 details the implementation of our SR-MACL model. Section 4 presents a series of experiments, including performance comparisons between the proposed SR-MACL model and other baseline models, demonstrating the effectiveness of our model. Finally, we summarize our current work and look forward to future work in Sect. 5.

2 Related Work

We provide an overview of related research on session-based recommendation, primarily categorized into four areas: traditional methods, deep learning-based methods, GNN-based methods, and GCL (graph contrastive learning)-based methods. We then focus on GNN-based and GCL-based research, as these are most closely related to our study.

2.1 Session Recommendation

In earlier studies, Markov chain-based approaches converted the session sequence into a Markov chain and predicted the user's next action from the previous one. FPMC captures sequential patterns and long-term user preferences by combining matrix factorization and first-order Markov chains [2]. However, Markov chain-based approaches model only the sequential transitions between adjacent items and do not consider connections between non-adjacent items. Subsequent research shifted its focus to deep learning techniques for learning transition relationships between items. For example, GRU4REC utilized a GRU to model session sequences. However, recurrent neural network techniques carry strong sequential assumptions and fail to capture the transitive connections between items that are widely separated in the session sequence. In addition, the attention mechanism has been widely used in SBR [3] to distinguish items in the session by their importance; it can also be combined with other methods such as RNNs to emphasize the user's main intention [16, 17]. SR-GNN was the pioneering method for modelling sessions as graph-structured data [5]. It transforms a session sequence into an unweighted directed graph and utilizes a gated graph neural network (GGNN) to capture intricate item transitions within a session [18]. NISER pointed out the presence of popularity bias in GNN-based recommendation models, provided evidence that this issue is partly associated with the size or norm of the learned item and session graph representations (embedding vectors), and proposed a training procedure that alleviates the problem by employing normalized representations [19]; this approach has been adopted by a series of subsequent GNN-based models. GCE-GNN learns item representations at two levels, from the session graph and the global graph respectively [6]: the session graph is formed from the current session, while the global graph is formed by consolidating all item sequences and their adjacent items. SGNN-HN argued that previous methods ignore information from items that are not directly connected and suffer from the overfitting problem commonly observed in graph neural networks [7]; it therefore uses a star graph neural network (SGNN) to learn the complex transition relationships between items in the session sequence and, to avoid overfitting, a highway network (HN) to select embeddings from item representations in a weighted manner. GC-HGNN employed hypergraphs to construct global graphs and obtained global contextual information through hypergraph convolution [8]; it incorporates local information with a graph attention network and uses an attention mechanism on the merged features to learn the final representation of the session sequence.

2.2 Contrastive Learning in Recommender Systems

Contrastive learning has shown impressive performance in computer vision and natural language processing and has recently gained traction in many fields of artificial intelligence, including graph neural networks, leading to a series of significant advances [20,21,22,23,24]. Because contrastive learning can learn general features from unlabelled data, it is an effective method for addressing data sparsity [25,26,27], making it a popular research direction in the field of recommender systems. Many graph contrastive learning recommendation models have been proposed and have achieved significant results [9, 11, 12, 28,29,30,31].

SGL generated different augmented views through node dropout, edge dropout, and random walks [28], and then maximized the consistency of the representations learned by the graph encoder under the different views. DCRec proposed a debiased contrastive learning paradigm for recommendation [29], which incorporates global information into the augmented views and combines sequential pattern encoding with the modelling of global collaborative relationships through adaptive conformity-aware augmentation. CL4SRec performed contrastive learning by generating different augmentations of a sequence through item cropping, masking, and re-ordering [30]. S2-DHCN used hypergraphs to generate two augmented views, constructed a global graph to extract information from other sessions, and employed contrastive learning as an auxiliary task to improve the recommendation performance of the main task [9]. COTREC proposed a contrastive learning-based framework to enhance the accuracy of session recommendation and utilized information from other sessions to generate session views and item views for contrastive learning [10]. SimCGNN used sessions sharing the same last item as negative samples and designed a cosine-similarity-based contrastive module to enhance the differences between such sessions [11]. CGL employed a self-supervised module that combines global and session graphs, decoupled the current session's intention to enrich item representations, and designed a label confusion method to prevent overfitting [12]. Existing graph contrastive learning methods for session-based recommendation thus focus on introducing information from other sessions, via a global graph, to construct augmented views. However, other sessions may contain items that are irrelevant or even contrary to the user's current interests, which interferes with accurately modelling those interests.

3 Methodology

This section commences by elucidating the fundamental concepts of session-based recommendation and graph construction. Subsequently, we explicate our model in detail; the overall architecture is shown in Fig. 1.

Fig. 1 Overview of the SR-MACL model

3.1 Problem Definition

Session-based recommendation primarily aims to provide next-item recommendations for anonymous users. Since the complete historical record of interactions is unavailable, the model must meticulously apprehend the user's general interest from the short session sequence alone.

Assuming there are m items and n sessions, let \({\text{V}}=\left\{{{\text{v}}}_{1},{{\text{v}}}_{2},...,{{\text{v}}}_{{\text{m}}}\right\}\) and \({\text{S}}=\left\{{{\text{s}}}^{1},{{\text{s}}}^{2},...,{{\text{s}}}^{{\text{n}}}\right\}\) represent the sets of items and sessions, respectively. \({{\text{s}}}^{{\text{i}}}\) represents the i-th session. Each session \({{\text{s}}}^{{\text{i}}}\) is an ordered sequence of clicks arranged in chronological order \(\left[{{\text{v}}}_{1}^{{\text{i}}},{{\text{v}}}_{2}^{{\text{i}}}{,...,{\text{v}}}_{{\text{k}}}^{{\text{i}}}\right]\), where \({{\text{v}}}_{{\text{j}}}^{{\text{i}}}\in {\text{V}}\) represents the j-th clicked item in the i-th session, and k represents the length of the session. Our goal is to predict the next item \({{\text{v}}}_{{\text{k}}+1}^{{\text{i}}}\) for any session \({{\text{s}}}^{{\text{i}}}\). The process involves creating a session representation using the item representations within the session. From this, probabilities are determined by measuring the similarity between the session representation and all item embeddings. Finally, the model performs top-N recommendation based on these probabilities.

3.2 Graph Construction

For each session sequence S, we model it as a session graph \({\text{G}}=\left({\text{V}},{\text{E}}\right)\), where \({\text{V}}=\left\{\left\{{{\text{v}}}_{1}^{{\text{s}}},{{\text{v}}}_{2}^{{\text{s}}}{,...,{\text{v}}}_{{\text{m}}}^{{\text{s}}}\right\},{{\text{v}}}_{{\text{s}}}\right\}\) denotes the nodes of session S, \({{\text{v}}}_{{\text{s}}}\) is the star node, and each edge \(\left({{\text{v}}}_{{\text{i}}}^{{\text{s}}},{{\text{v}}}_{{\text{i}}+1}^{{\text{s}}}\right)\) connects items clicked at two adjacent time points in the session. We divide the edges into incoming and outgoing edges and assign each a normalized weight, calculated as the occurrence frequency of the edge divided by the out-degree of the edge's starting node. An example of the graph construction is shown in Fig. 2. Inspired by [32], we treat the items in the session sequence as satellite nodes and add a star node \({{\text{v}}}_{{\text{s}}}\) to capture long-range dependencies between non-adjacent nodes. Edges between satellite nodes are unidirectional, while edges between the star node and satellite nodes are bidirectional. The construction of the star graph is shown in Fig. 1.

Fig. 2 An illustration of the construction of the session graph
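To make the edge weighting concrete, the following NumPy sketch builds such a weighted adjacency matrix for a single session under the normalization just described; `build_session_adjacency` is an illustrative helper, not the paper's released code.

```python
import numpy as np

def build_session_adjacency(session):
    """Build the weighted in/out adjacency A_s of a session graph.

    `session` is the click sequence, e.g. [2, 3, 6, 3, 5]. Each edge
    weight is its occurrence count divided by the degree of its source
    row, matching the normalization described above. Returns the unique
    nodes and A_s in R^{k x 2k} (outgoing block | incoming block).
    """
    nodes = list(dict.fromkeys(session))        # unique items, order kept
    idx = {v: i for i, v in enumerate(nodes)}
    k = len(nodes)
    counts = np.zeros((k, k))
    for u, v in zip(session, session[1:]):      # adjacent clicks -> edge
        counts[idx[u], idx[v]] += 1.0

    def row_normalize(m):
        s = m.sum(axis=1, keepdims=True)
        return np.divide(m, s, out=np.zeros_like(m), where=s > 0)

    a_out = row_normalize(counts)               # outgoing edges
    a_in = row_normalize(counts.T)              # incoming edges
    return nodes, np.concatenate([a_out, a_in], axis=1)

# Example with the session [2, 3, 6, 3, 5] used later in the text:
nodes, A_s = build_session_adjacency([2, 3, 6, 3, 5])
print(nodes)      # [2, 3, 6, 5]
print(A_s.shape)  # (4, 8)
```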

3.3 Model Overview

In this section, we propose SR-MACL, a session-based recommendation model based on multi-layer aggregated contrastive learning. The architecture of SR-MACL is shown in Fig. 1. We divide the model into three modules: the session representation module, the multi-layer aggregation contrastive module, and the multi-task learning module. We first give a brief overview; later subsections describe these three modules in detail. The session representation module generates the session embedding for the main task. The contrastive module generates the session embedding for the auxiliary task and, through contrastive learning, enables information exchange between the two session embeddings. Finally, the multi-task learning module combines the two tasks to jointly optimize the model.

3.4 Session Representation Module

3.4.1 Initialization

Before learning item representations, we encode all items in V into a unified embedding space \({{\text{R}}}^{{\text{d}}}\), where d is the embedding dimension. Each session S and item \({{\text{v}}}_{{\text{i}}}\) are embedded in the same space. Note that satellite nodes and the star node are initialized differently. We directly use the embeddings of the unique items in the session as the representations of the satellite nodes:

$${{\text{h}}}^{0}=\left[{{\text{v}}}_{1},{{\text{v}}}_{2},{{\text{v}}}_{3},...,{{\text{v}}}_{{\text{k}}}\right]$$
(1)

Here, \({{\text{v}}}_{{\text{i}}}\in {{\text{R}}}^{{\text{d}}}\) represents the d-dimensional embedding of the satellite node i in the star graph. The initial embedding of the star node is obtained by averaging the initial embeddings of the satellite nodes:

$${{\text{v}}}_{{\text{s}}}^{0}=\frac{1}{{\text{k}}}{\sum }_{{\text{i}}=1}^{{\text{k}}}{{\text{v}}}_{{\text{i}}}$$
(2)

3.4.2 Item Embedding Learning

We employ the Star Graph Neural Network (SGNN) to learn representations of the satellite nodes in the star graph, updating the satellite-node embeddings by propagating information from their neighbouring nodes and the star node.

First, we consider the information from the neighbouring nodes. For the satellite nodes \({{\text{v}}}_{{\text{i}}}\) in the star graph, the update function is shown as follows:

$${{\text{a}}}_{{\text{s}},{\text{i}}}^{{\text{l}}}={{\text{A}}}_{{\text{s}},{\text{i}}:}{\left[{{\text{v}}}_{1}^{\left({\text{l}}-1\right)},{\cdots ,{\text{v}}}_{{\text{k}}}^{\left({\text{l}}-1\right)}\right]}^{{\text{T}}}{\text{W}}+{\text{b}}$$
(3)
$${{\text{z}}}_{{\text{s}},{\text{i}}}^{{\text{l}}}=\upsigma \left({{\text{W}}}_{{\text{z}}}{{\text{a}}}_{{\text{s}},{\text{i}}}^{{\text{l}}}+{{\text{U}}}_{{\text{z}}}{{\text{v}}}_{{\text{i}}}^{\left({\text{l}}-1\right)}\right)$$
(4)
$${{\text{r}}}_{{\text{s}},{\text{i}}}^{{\text{l}}}=\upsigma \left({{\text{W}}}_{{\text{r}}}{{\text{a}}}_{{\text{s}},{\text{i}}}^{{\text{l}}}+{{\text{U}}}_{{\text{r}}}{{\text{v}}}_{{\text{i}}}^{\left({\text{l}}-1\right)}\right)$$
(5)
$$ \widetilde{{{\text{v}}_{{\text{i}}}^{{\text{l}}} }} = {\text{tanh}}\left( {{\text{W}}_{{\text{o}}} {\text{a}}_{{{\text{s}},{\text{i}}}}^{{\text{l}}} + {\text{U}}_{{\text{o}}} \left( {{\text{r}}_{{{\text{s}},{\text{i}}}}^{{\text{l}}} \odot {\text{v}}_{{\text{i}}}^{{\left( {{\text{l}} - 1} \right)}} } \right)} \right) $$
(6)
$$ \widehat{{{\text{v}}_{{\text{i}}}^{{\text{l}}} }} = \left( {1 - {\text{z}}_{{{\text{s}},{\text{i}}}}^{{\text{l}}} } \right) \odot {\text{v}}_{{\text{i}}}^{{\left( {{\text{l}} - 1} \right)}} + {\text{z}}_{{{\text{s}},{\text{i}}}}^{{\text{l}}} \odot { }\widetilde{{{\text{v}}_{{\text{i}}}^{{\text{l}}} }} $$
(7)

where \({\text{W}}\), \({\text{W}}_{{\text{z}}}\), \({\text{W}}_{{\text{r}}}\), \({\text{W}}_{{\text{o}}} \in {\text{R}}^{{{\text{d}} \times 2{\text{d}}}}\) and \({\text{U}}_{{\text{z}}}\), \({\text{U}}_{{\text{r}}}\), \({\text{U}}_{{\text{o}}} \in {\text{R}}^{{{\text{d}} \times {\text{d}}}}\) are trainable parameters. \(\left[ {{\text{v}}_{1}^{{\left( {{\text{l}} - 1} \right)}} , \cdots ,{\text{v}}_{{\text{k}}}^{{\left( {{\text{l}} - 1} \right)}} } \right]\) is the list of node embeddings of session S at layer l-1, ⊙ denotes the element-wise product, and σ is the Sigmoid activation function. \({{\text{A}}}_{{\text{s}}}\in {{\text{R}}}^{{\text{k}}\times 2{\text{k}}}\) denotes the concatenation of the adjacency matrices of the incoming and outgoing edges; for a session \({\text{s}}=\left[{{\text{v}}}_{2},{{\text{v}}}_{3},{{\text{v}}}_{6},{{\text{v}}}_{3},{{\text{v}}}_{5}\right]\), the corresponding matrix \({{\text{A}}}_{{\text{s}}}\) is shown in Fig. 2. \({{\text{A}}}_{{\text{s}},{\text{i}}:}\in {{\text{R}}}^{1\times 2{\text{k}}}\) denotes the two columns of blocks in \({{\text{A}}}_{{\text{s}}}\) corresponding to node \({{\text{v}}}_{{\text{s}},{\text{i}}}\). The update gate \({{\text{z}}}_{{\text{s}},{\text{i}}}^{{\text{l}}}\) and the reset gate \({{\text{r}}}_{{\text{s}},{\text{i}}}^{{\text{l}}}\) decide which information should be kept and which discarded. The final state \(\widehat{{{\text{v}}}_{{\text{i}}}^{{\text{l}}}}\) is a combination of the previous hidden state \({{\text{v}}}_{{\text{i}}}^{\left({\text{l}}-1\right)}\) and the candidate state \(\widetilde{{{\text{v}}}_{{\text{i}}}^{{\text{l}}}}\); this update is applied to all satellite nodes in the star graph.
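The update of Eqs. (3)–(7) can be sketched in PyTorch as follows; `GatedSatelliteUpdate` is an illustrative name, and the block is a minimal reading of the equations rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class GatedSatelliteUpdate(nn.Module):
    """One GGNN-style satellite-node update (Eqs. 3-7); a sketch."""

    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(2 * d, d)           # W, b of Eq. (3)
        self.Wz = nn.Linear(d, d, bias=False)  # update-gate weights
        self.Uz = nn.Linear(d, d, bias=False)
        self.Wr = nn.Linear(d, d, bias=False)  # reset-gate weights
        self.Ur = nn.Linear(d, d, bias=False)
        self.Wo = nn.Linear(d, d, bias=False)  # candidate-state weights
        self.Uo = nn.Linear(d, d, bias=False)

    def forward(self, A_s, v_prev):
        # A_s: (k, 2k) out/in adjacency; v_prev: (k, d) layer l-1 embeddings
        k = v_prev.size(0)
        a_out, a_in = A_s[:, :k], A_s[:, k:]
        a = self.W(torch.cat([a_out @ v_prev, a_in @ v_prev], dim=1))  # Eq. (3)
        z = torch.sigmoid(self.Wz(a) + self.Uz(v_prev))                # Eq. (4)
        r = torch.sigmoid(self.Wr(a) + self.Ur(v_prev))                # Eq. (5)
        v_tilde = torch.tanh(self.Wo(a) + self.Uo(r * v_prev))         # Eq. (6)
        return (1 - z) * v_prev + z * v_tilde                          # Eq. (7)
```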

Next, we consider how to integrate the information of the star node into the satellite nodes to capture long-range dependencies. We use a gating mechanism to fuse the information of the neighbourhood-updated state \(\widehat{{{\text{v}}}_{{\text{i}}}^{{\text{l}}}}\) and the star node \({{\text{v}}}_{{\text{s}}}^{{\text{l}}-1}\).

$${{\text{v}}}_{{\text{i}}}^{{\text{l}}}=\left(1-{{\text{a}}}_{{\text{i}}}^{{\text{l}}}\right)\widehat{{{\text{v}}}_{{\text{i}}}^{{\text{l}}}}+{{\text{a}}}_{{\text{i}}}^{{\text{l}}}{{\text{v}}}_{{\text{s}}}^{{\text{l}}-1}$$
(8)

Here, \({{\text{a}}}_{{\text{i}}}^{{\text{l}}}\) is a weight that estimates the relative importance of the neighbourhood-updated state \(\widehat{{{\text{v}}}_{{\text{i}}}^{{\text{l}}}}\) and the star node \({{\text{v}}}_{{\text{s}}}^{{\text{l}}-1}\). We compute \({{\text{a}}}_{{\text{i}}}^{{\text{l}}}\) as follows:

$$ {\text{a}}_{{\text{i}}}^{{\text{l}}} = \frac{{\left( {{\text{W}}_{1} \widehat{{{\text{v}}_{{\text{i}}}^{{\text{l}}} }}} \right)^{{\text{T}}} {\text{W}}_{2} {\text{v}}_{{\text{s}}}^{{{\text{l}} - 1}} }}{{\sqrt {\text{d}} }} $$
(9)

\({{\text{W}}}_{1}\), \({{\text{W}}}_{2}\in {{\text{R}}}^{{\text{d}}\times {\text{d}}}\) are weight matrices, \(\widehat{{{\text{v}}}_{{\text{i}}}^{{\text{l}}}}\) and \({{\text{v}}}_{{\text{s}}}^{{\text{l}}-1}\) are the item representations of the satellite node \({{\text{v}}}_{{\text{i}}}\) and the star node \({{\text{v}}}_{{\text{s}}}\), respectively, and \(\sqrt{{\text{d}}}\) is a scaling coefficient.
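A minimal sketch of the fusion in Eqs. (8)–(9), with illustrative names and a (k, d) satellite layout:

```python
import torch

def fuse_with_star(v_hat, v_star, W1, W2):
    """Blend satellite states with the star node (Eqs. 8-9); a sketch.

    v_hat: (k, d) neighbourhood-updated states; v_star: (d,) star node
    of layer l-1; W1, W2: (d, d) weight matrices.
    """
    d = v_hat.size(1)
    alpha = (v_hat @ W1.T) @ (W2 @ v_star) / d ** 0.5  # (k,), Eq. (9)
    alpha = alpha.unsqueeze(1)
    return (1 - alpha) * v_hat + alpha * v_star        # Eq. (8)
```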

Inspired by XSimGCL [26], we achieve data augmentation by adding noise to the representations of the satellite nodes. Formally, for a satellite node \({{\text{v}}}_{{\text{i}}}\) and its representation at the l-th layer, we apply the following representation-level augmentation:

$${{\text{v}}}_{{\text{i}}}^{{\text{l}}+{\text{n}}}={{\text{v}}}_{{\text{i}}}^{{\text{l}}}+{\Delta }_{{\text{i}}}^{\mathrm{^{\prime}}}$$
(10)
$${\Delta }_{{\text{i}}}^{\mathrm{^{\prime}}}={\text{X}}\odot {\text{sign}}\left({{\text{v}}}_{{\text{i}}}^{{\text{l}}}\right),\,{\text{X}}\in {{\text{R}}}^{{\text{d}}}\sim {\text{U}}\left(0,1\right)$$
(11)

\({\Delta }_{{\text{i}}}^{\mathrm{^{\prime}}}\) is the added scaled noise vector with \({\parallel \Delta \parallel }_{2}=\epsilon \), where \(\epsilon \) is a small constant. Geometrically, adding the scaled noise vector rotates the original vector by a small angle, as shown in Fig. 3. Each rotation corresponds to a deviation of \({{\text{v}}}_{{\text{i}}}^{{\text{l}}}\) and generates an augmented representation. Since the angle of rotation is small enough, the representation after adding noise preserves most of the information of the original representation while introducing variation. We generate the noise from a uniform distribution, which provides uniformity to the augmentation.

Fig. 3 An illustration of the noise-based representation enhancement
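A minimal sketch of this noise injection, assuming the reconstruction of Eq. (11) above and treating the magnitude `eps` as a hypothetical value for the constant \(\epsilon \):

```python
import torch
import torch.nn.functional as F

def add_uniform_noise(v, eps=0.1):
    """Representation-level augmentation (Eqs. 10-11); a sketch.

    v: (k, d) satellite embeddings of one layer. A random vector with
    fixed L2 norm `eps` is added, sign-matched to `v`, so each
    embedding is rotated by a small angle within its own orthant.
    """
    x = torch.rand_like(v)                # X ~ U(0, 1)
    delta = F.normalize(x, dim=-1) * eps  # ||Delta||_2 = eps per row
    return v + delta * torch.sign(v)      # Eq. (10)
```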

After updating the embedding representations of the satellite nodes, we also need to update the embedding representation of the star node. We use an attention mechanism to distinguish the importance of different satellite nodes, where the importance of each satellite node is determined by its similarity to the star node:

$$ {\upbeta }_{{\text{i}}} = {\text{softmax}}\left( {\frac{{\left( {{\text{W}}_{3} {\text{v}}_{{\text{i}}}^{{\text{l}}} } \right)^{{\text{T}}} {\text{W}}_{4} {\text{v}}_{{\text{s}}}^{{{\text{l}} - 1}} }}{{\sqrt {\text{d}} }}} \right) $$
(12)

\({{\text{W}}}_{3}, {{\text{W}}}_{4}\in {{\text{R}}}^{{\text{d}}\times {\text{d}}}\) are weight matrices. We update the embedding representation of the star node by taking the linear combination of the satellite nodes weighted by their coefficients:

$${{\text{v}}}_{{\text{s}}}^{{\text{l}}}=\sum_{{\text{i}}=1}^{{\text{k}}}{{\upbeta }_{{\text{i}}}{\text{v}}}_{{\text{i}}}^{{\text{l}}}$$
(13)
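A sketch of this attention-based readout (Eqs. (12)–(13)); names and tensor shapes are illustrative:

```python
import torch

def update_star_node(v_l, v_star_prev, W3, W4):
    """Recompute the star node from the satellites (Eqs. 12-13); a sketch.

    v_l: (k, d) satellite embeddings at layer l; v_star_prev: (d,).
    """
    d = v_l.size(1)
    beta = torch.softmax((v_l @ W3.T) @ (W4 @ v_star_prev) / d ** 0.5, dim=0)
    return beta @ v_l  # Eq. (13): importance-weighted sum of satellites
```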

To alleviate the overfitting problem in graph neural networks, we apply a highway network [33] after the last layer of the SGNN. The highway gate computes the final hidden state \({h}^{f}\) as the weighted sum of the initial satellite-node embedding \({h}^{0}\) and the last-layer embedding \({v}^{L+n}\). The highway network can be described as follows:

$$ h^{{\text{f}}} = {\upgamma } \odot h^{0} + \left( {1 - {\upgamma }} \right) \odot v^{L + n} $$
(14)
$$\upgamma =\upsigma \left({{\text{W}}}_{5}\left[{h}^{0}||{v}^{L+n}\right] \right)$$
(15)

where \(\odot\) is the element-wise product, \(\upsigma \) is a Sigmoid function, || represents the concatenation operation, and \({{\text{W}}}_{5}\in {{\text{R}}}^{{\text{d}}\times 2{\text{d}}}\) is a trainable weight matrix.
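The gate of Eqs. (14)–(15) amounts to a small module; this is a sketch under the shapes given above, not the released code.

```python
import torch
import torch.nn as nn

class HighwayGate(nn.Module):
    """Highway combination of initial and final embeddings (Eqs. 14-15)."""

    def __init__(self, d):
        super().__init__()
        self.W5 = nn.Linear(2 * d, d, bias=False)

    def forward(self, h0, v_last):
        # h0, v_last: (k, d) initial and last-layer satellite embeddings
        gamma = torch.sigmoid(self.W5(torch.cat([h0, v_last], dim=-1)))
        return gamma * h0 + (1 - gamma) * v_last
```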

3.4.3 Session Embedded Learning

After obtaining the embedding representations of the satellite and star nodes, we can read the item embeddings x \(\in {{\text{R}}}^{{\text{d}}\times {\text{k}}}\) of the session out of the corresponding satellite nodes \({h}^{{\text{f}}}\in {{\text{R}}}^{{\text{d}}\times {\text{m}}}\). We then consider the user's global and current preferences to generate the final session representation as the user's preference. Similar to previous research [5, 6], we take the last item in the session sequence, i.e., \({{\text{S}}}_{{\text{last}}}={{\text{x}}}_{{\text{k}}}\), as the user's recent preference.

For the user's global preference, we consider generating a session embedding that represents the global preference by aggregating the embeddings of all the satellite nodes in the session sequence. Since different items have different levels of importance for modelling user preferences, we use a soft attention mechanism to weight the importance of each item. It is worth noting that the importance of each item in the session sequence is jointly determined by the star node \({{\text{v}}}_{{\text{s}}}\), the current item \({{\text{x}}}_{{\text{i}}}\), and the user's recent preference \({{\text{S}}}_{{\text{last}}}\):

$${{\text{u}}}_{{\text{i}}}={{\text{q}}}_{0}^{{\text{T}}}\upsigma \left({{\text{q}}}_{1}{{\text{x}}}_{{\text{i}}} +{{\text{q}}}_{2}{{\text{v}}}_{{\text{s}}}+{{\text{q}}}_{3}{{\text{S}}}_{{\text{last}}}\right)$$
(16)
$$ {\text{S}}_{{\text{g}}} = \mathop \sum \limits_{{{\text{i}} = 1}}^{{\text{k}}} {\text{u}}_{{\text{i}}} {\text{x}}_{{\text{i}}} $$
(17)

\({{\text{q}}}_{0}\in {{\text{R}}}^{{\text{d}}}\), \({{\text{q}}}_{1}\in {{\text{R}}}^{{\text{d}}\times {\text{d}}}\), \({{\text{q}}}_{2}\in {{\text{R}}}^{{\text{d}}\times {\text{d}}}\) and \({{\text{q}}}_{3}\in {{\text{R}}}^{{\text{d}}\times {\text{d}}}\) are trainable parameters. We then combine the user's global preference \({{\text{S}}}_{{\text{g}}}\) with the current interest \({{\text{S}}}_{{\text{last}}}\) to obtain the final session representation.

$${{\text{S}}}_{{\text{h}}}={{\text{q}}}_{4}\left[{{\text{S}}}_{{\text{g}}}||{{\text{S}}}_{{\text{last}}}\right]$$
(18)

|| represents the concatenation operation and \({{\text{q}}}_{4}\in {{\text{R}}}^{{\text{d}}\times 2{\text{d}}}\) is a trainable weight matrix.
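Putting Eqs. (16)–(18) together, the readout can be sketched as follows; `SessionReadout` and the tensor layout are our own illustrative choices.

```python
import torch
import torch.nn as nn

class SessionReadout(nn.Module):
    """Soft-attention session representation (Eqs. 16-18); a sketch."""

    def __init__(self, d):
        super().__init__()
        self.q0 = nn.Linear(d, 1, bias=False)
        self.q1 = nn.Linear(d, d, bias=False)
        self.q2 = nn.Linear(d, d, bias=False)
        self.q3 = nn.Linear(d, d, bias=False)
        self.q4 = nn.Linear(2 * d, d, bias=False)

    def forward(self, x, v_star):
        # x: (k, d) item embeddings; v_star: (d,) star node
        s_last = x[-1]                                    # recent interest
        u = self.q0(torch.sigmoid(self.q1(x) + self.q2(v_star)
                                  + self.q3(s_last)))     # Eq. (16)
        s_g = (u * x).sum(dim=0)                          # Eq. (17)
        return self.q4(torch.cat([s_g, s_last], dim=-1))  # Eq. (18)
```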

3.5 Multi-layer Aggregation Contrastive Module

Data augmentation is a key component of contrastive learning. We abandon traditional graph augmentation methods and instead employ a simple yet effective combination of noise-based embedding augmentation and multi-layer aggregation to create the views for contrastive learning. Specifically, both views share the same initial embeddings and adjacency matrix, and we use cross-layer contrastive learning to obtain different contrastive views: the per-layer embeddings of the item produced by the stacked SGNN layers are aggregated into a new view \({v}^{c}\) of the item, using mean aggregation:

$${{\text{v}}}_{{\text{i}}}^{{\text{c}}}=\frac{1}{{\text{L}}}\sum_{{\text{l}}=1}^{{\text{L}}}{{\text{v}}}_{{\text{i}}}^{{\text{l}}+{\text{n}}}$$
(19)
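This aggregation is a one-liner in practice; `noised_layers` below stands for the per-layer outputs after the noise of Eq. (10):

```python
import torch

def aggregate_layers(noised_layers):
    """Mean-aggregate the L noised layer embeddings into the
    contrastive view v^c (Eq. 19); each tensor has shape (k, d)."""
    return torch.stack(noised_layers, dim=0).mean(dim=0)
```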

The contrastive module is essentially the same as the generating module, with only two differences. First, the input of the highway network differs. Second, we add positional encoding to the items in the contrastive module to integrate sequential information into the representation. The highway network of the contrastive module is shown below:

$$ {\text{h}}_{{\text{c}}}^{{\text{f}}} = {\upgamma }_{{\text{c}}} \odot h^{0} + \left( {1 - {\upgamma }_{{\text{c}}} } \right) \odot {\text{v}}^{{\text{c}}} $$
(20)
$${\upgamma }_{{\text{c}}}=\upsigma \left({{\text{W}}}_{6}\left[{h}^{0}||{{\text{v}}}^{{\text{c}}} \right] \right)$$
(21)

where \(\odot\) is the element-wise product, \(\upsigma \) is a Sigmoid function, || represents the concatenation operation, and \({{\text{W}}}_{6}\in {{\text{R}}}^{{\text{d}}\times 2{\text{d}}}\) is a trainable weight matrix. The final session representation \({{\text{S}}}_{{\text{c}}}\) of the contrastive module is generated according to Eqs. (16)–(18). Note that the two modules share the same star node \({{\text{v}}}_{{\text{s}}}\).

3.6 Multi-Task Learning Module

We adopt multi-task learning, which can improve the performance of the session recommendation model by optimizing multiple objectives simultaneously. In our model, the main task is next-item recommendation, while contrastive learning serves as an auxiliary task that helps extract general item features from the different contrastive views. We unify the two tasks and jointly optimize them:

$${L}_{total}={L}_{main}+{\lambda L}_{cl}$$
(22)

where \(\lambda \) controls the magnitude of the contrastive loss.

For the next-item recommendation task, the session representation module has produced the embedded representation of the session sequence, and a prediction layer performs the next-item recommendation. To alleviate the popularity bias problem common in recommendation, we apply layer normalization separately to the session embedding \({{\text{S}}}_{{\text{h}}}\) and the item embeddings \({{\text{v}}}_{{\text{i}}}\), and compute the product of the normalized session embedding \(\widetilde{{{\text{S}}}_{{\text{h}}}}\) and the normalized item embedding \(\widetilde{{{\text{v}}}_{{\text{i}}}}\) to obtain the recommendation scores \(\widehat{{{\text{z}}}_{{\text{i}}}}\). Finally, we apply the Softmax function to obtain the final output probabilities \(\widehat{{\text{y}}}\) over all items:

$$\widetilde{{{\text{S}}}_{{\text{h}}}}={\text{LayerNorm}}\left({{\text{S}}}_{{\text{h}}}\right)$$
(23)
$$\widetilde{{{\text{v}}}_{{\text{i}}}}={\text{LayerNorm}}\left({{\text{v}}}_{{\text{i}}}\right)$$
(24)
$$\widehat{{{\text{z}}}_{{\text{i}}}}={\widetilde{{{\text{S}}}_{{\text{h}}}}}^{{\text{T}}}\widetilde{{{\text{v}}}_{{\text{i}}}}$$
(25)
$$\widehat{{\text{y}}}={\text{Softmax}}\left(\widehat{{\text{z}}}\right)$$
(26)

\(\widehat{{{\text{z}}}_{{\text{i}}}}\) represents the recommendation score of each candidate item \({{\text{v}}}_{{\text{i}}}\in {\text{V}}\). We employ the cross-entropy loss as the loss function of the main task, which can be expressed as:

$${\text{L}}\left(\widehat{{\text{y}}}\right)=-\sum_{{\text{i}}=1}^{{\text{m}}}{{\text{y}}}_{{\text{i}}}{\text{log}}\widehat{{{\text{y}}}_{{\text{i}}}}+\left(1-{{\text{y}}}_{{\text{i}}}\right){\text{log}}\left(1-\widehat{{{\text{y}}}_{{\text{i}}}}\right)$$
(27)

where \({{\text{y}}}_{{\text{i}}}\) indicates whether item \({{\text{v}}}_{{\text{i}}}\) is the next clicked item, i.e., the one-hot encoding of the ground truth.
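A sketch of the scoring pipeline of Eqs. (23)–(26); `predict_scores` is an illustrative name, with the layer-normalization calls standing in for the debiasing step described above.

```python
import torch
import torch.nn.functional as F

def predict_scores(s_h, item_emb):
    """Normalized dot-product scoring and softmax (Eqs. 23-26); a sketch.

    s_h: (d,) session embedding; item_emb: (m, d) all item embeddings.
    """
    s = F.layer_norm(s_h, s_h.shape)                 # Eq. (23)
    v = F.layer_norm(item_emb, item_emb.shape[-1:])  # Eq. (24)
    z = v @ s                                        # Eq. (25): (m,) scores
    return torch.softmax(z, dim=0)                   # Eq. (26): y_hat
```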

Contrastive learning can be viewed as maximizing the mutual information between two latent representations. We adopt InfoNCE [34] as our contrastive loss function and take the different representations of the same session sequence as positive pairs \(({\text{i}}.{\text{e}}.,\left\{\left(\widetilde{{{\text{S}}}_{{\text{h}},{\text{i}}}},\widetilde{{{\text{S}}}_{{\text{c}},{\text{i}}}}\right)|{\text{i}}\in {\text{S}}\right\})\), where \(\widetilde{{{\text{S}}}_{{\text{h}},{\text{i}}}}\) and \(\widetilde{{{\text{S}}}_{{\text{c}},{\text{i}}}}\) are the normalized session representations generated by the session representation module and the contrastive module, respectively. The negative pairs are formed with the other sessions \(({\text{i}}.{\text{e}}.,\left\{\left(\widetilde{{{\text{S}}}_{{\text{h}},{\text{i}}}},\widetilde{{{\text{S}}}_{{\text{c}},{\text{j}}}}\right)|{\text{i}},{\text{j}}\in {\text{S}},{\text{i}}\ne {\text{j}}\right\})\) in the same batch. We implement the \({\text{sim}}\left({\text{a}},{\text{b}}\right)\) function simply as the dot product between two vectors:

$${{\text{L}}}_{{\text{cl}}}=-{\text{log}}\frac{{\text{exp}}\left({\text{sim}}\left(\widetilde{{{\text{S}}}_{{\text{h}},{\text{i}}}},\widetilde{{{\text{S}}}_{{\text{c}},{\text{i}}}}\right)/\uptau \right)}{{\sum }_{{\text{j}}=1,{\text{j}}\ne {\text{i}}}^{{\text{B}}}{\text{exp}}\left({\text{sim}}\left(\widetilde{{{\text{S}}}_{{\text{h}},{\text{i}}}},\widetilde{{{\text{S}}}_{{\text{c}},{\text{j}}}}\right)/\uptau \right)}$$
(28)
$${\text{sim}}\left(\widetilde{{{\text{S}}}_{{\text{h}},{\text{i}}}},\widetilde{{{\text{S}}}_{{\text{c}},{\text{i}}}}\right)=\widetilde{{{\text{S}}}_{{\text{h}},{\text{i}}}}\widetilde{{{\text{S}}}_{{\text{c}},{\text{i}}}}$$
(29)

where τ is the temperature parameter and B is the batch size. The contrastive loss encourages consistency between \(\widetilde{{{\text{S}}}_{{\text{h}},{\text{i}}}}\) and \(\widetilde{{{\text{S}}}_{{\text{c}},{\text{i}}}}\), which are positive samples of each other, while minimizing the consistency between \(\widetilde{{{\text{S}}}_{{\text{h}},{\text{i}}}}\) and \(\widetilde{{{\text{S}}}_{{\text{c}},{\text{j}}}}\), which are negative samples of each other. Optimizing the InfoNCE loss in fact maximizes a tight lower bound on the mutual information. Finally, the entire training procedure of SR-MACL is shown in Algorithm 1.
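A batch-level sketch of the loss; for brevity it uses the standard in-batch InfoNCE form, whose denominator also contains the positive pair (a slight variation on Eq. (28)), and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(s_h, s_c, tau=0.2):
    """In-batch InfoNCE between the two session views (Eqs. 28-29).

    s_h, s_c: (B, d) normalized session representations from the two
    modules; row i of each forms a positive pair, and all other rows
    in the batch act as negatives.
    """
    logits = (s_h @ s_c.T) / tau        # (B, B) dot-product similarities
    labels = torch.arange(s_h.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```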

Algorithm 1 The whole procedure of SR-MACL

4 Experiments

4.1 Datasets and Preprocessing

To thoroughly evaluate the proposed approach, we selected three public datasets containing user interactions: Diginetica, Tmall, and Nowplaying. These three datasets differ in size, sparsity, and scenario.

  • Diginetica is from the CIKM Cup 2016 and consists of typical transaction data.

  • Tmall is from the IJCAI 2015 competition and contains the shopping logs of anonymous users on the Tmall online shopping platform.

  • Nowplaying describes music listening behaviour extracted from Twitter.

Our preprocessing of the datasets is consistent with previous work [5, 7, 35]. Specifically, sessions of length 1 and items appearing fewer than 5 times were filtered out of all three public datasets. In addition, each session sequence \({\text{S}}=\left\{{{\text{v}}}_{1}^{{\text{s}}},{{\text{v}}}_{2}^{{\text{s}}}{,...,{\text{v}}}_{{\text{m}}}^{{\text{s}}}\right\}\) is split into prefix sequences with corresponding labels, i.e., \(\left(\left[{{\text{v}}}_{1}^{{\text{s}}}\right],{{\text{v}}}_{2}^{{\text{s}}}\right)\), \(\left(\left[{{\text{v}}}_{1}^{{\text{s}}},{{\text{v}}}_{2}^{{\text{s}}}\right],{{\text{v}}}_{3}^{{\text{s}}}\right)\), …, \(\left(\left[{{\text{v}}}_{1}^{{\text{s}}},{{\text{v}}}_{2}^{{\text{s}}},...,{{\text{v}}}_{{\text{n}}-1}^{{\text{s}}}\right],{{\text{v}}}_{{\text{n}}}^{{\text{s}}}\right)\). The statistics of the processed datasets are shown in Table 1.
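The split can be written in a few lines of Python; `split_session` is an illustrative helper, not the paper's preprocessing script.

```python
def split_session(session):
    """Split one session into (prefix, label) training pairs,
    following the preprocessing described above."""
    return [(session[:i], session[i]) for i in range(1, len(session))]

# Example: split_session([7, 3, 9]) -> [([7], 3), ([7, 3], 9)]
```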

Table 1 Statistics of the dataset

4.2 Evaluation Metric

As described in [2, 7, 36], the evaluation metrics include P@20 and MRR@20. P@20 is widely used to measure predictive accuracy: it is the percentage of sessions whose target item is correctly recommended among the top 20 items, as defined by Eq. (30).

$${\text{P}}@20=\frac{1}{{\text{N}}}\sum\limits_{{\text{i}}=1}^{{\text{N}}}{{\text{y}}}_{{\text{i}}}$$
(30)

N is the total number of sessions, and \({{\text{y}}}_{{\text{i}}}\) indicates whether the top 20 recommended results in the session contain the target item. If the recommended item contains the corresponding label, the value is 1. Otherwise, it is 0.

MRR@20 (Mean Reciprocal Rank) is calculated based on the average rank of the target items in the top 20 recommendations. As soon as the rank surpasses 20, the reciprocal rank’s value is 0. The MRR metric considers the order in which the recommendations are sorted, where a larger MRR value indicates that the correct recommendation is at the front of the sorted list. As shown in Eqs. (31) and (32):

$${\text{MRR}}@20=\frac{1}{{\text{N}}}\sum\limits_{{\text{i}}=1}^{{\text{N}}}{\text{Rec}}\left({\text{i}}\right)$$
(31)
$${\text{Rec}}\left({\text{i}}\right)=\left\{\begin{array}{c}\frac{1}{{\text{Rank}}\left({\text{i}}\right)}, Rank\left({\text{i}}\right)\le 20 \\ 0 , Rank\left({\text{i}}\right)>20\end{array}\right.$$
(32)

where Rank(i) is the rank of the target item in session i, and Rec(i) is its reciprocal rank; if the rank is greater than 20, the value of Rec(i) is set to 0. For both P@20 and MRR@20, larger values indicate better recommendation performance.
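Both metrics reduce to simple functions of the target item's 1-based rank in each session, as in this sketch:

```python
def p_at_20(ranks):
    """P@20 (Eq. 30): fraction of sessions whose target item
    appears among the top-20 recommendations."""
    return sum(r <= 20 for r in ranks) / len(ranks)

def mrr_at_20(ranks):
    """MRR@20 (Eqs. 31-32): mean reciprocal rank, zero beyond 20."""
    return sum(1.0 / r if r <= 20 else 0.0 for r in ranks) / len(ranks)
```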

4.3 Baseline Algorithm

We compared our method with ten baseline models, which fall into three categories: (1) traditional deep learning models: GRU4REC, NARM, and STAMP; (2) graph neural network models: SR-GNN, SGNN-HN, GCE-GNN, and GC-HGNN; (3) graph contrastive learning models: COTREC, S2-DHCN, and CGL.

  • GRU4REC [2]: It applied RNNs with GRUs to session-based recommendation for the first time, demonstrating the effectiveness of deep learning methods in this setting.

  • NARM [3]: It incorporated an attention mechanism that apprehends the user’s primary goal and integrated it with persistent behavioural features to form a final representation to predict the next item.

  • STAMP [4]: It is an approach that uses a simple multilayer perceptron (MLP) augmented by an attention mechanism to capture both the general interest of the user and the current interest of the current session.

  • SR-GNN [5]: It is the first application of graph neural networks to SBR, introducing gated graph neural networks (GGNNs) to capture complex item transitions. To generate the next item for the current session, it follows the same idea as STAMP and uses an attention mechanism to capture the user's general and current interests.

  • SGNN-HN [7]: It used a star node to capture long-range dependencies and employed a highway network (HN) to adaptively select embeddings from item representations to prevent overfitting.

  • GCE-GNN [6]: It employed graph attention networks (GAT) on local and global graphs to capture item transition relations from the local and global contexts, and used reverse position encoding to generate session representations for SBR.

  • GC-HGNN [8]: It constructed a global graph, modelled information from other sessions using hypergraph convolution, and fused global and local information by pooling.

  • S2-DHCN [9]: It employed hypergraph convolutional networks and graph attention networks to obtain global contextual and local information, and used attention mechanisms to process the fused features and learn the final representation of the session sequence.

  • COTREC [10]: It proposed a self-supervised collaborative training method based on contrastive learning as a secondary task to alleviate the data sparsity problem and retained the complete session information by generating enhanced intra-session and inter-session views.

  • CGL [12]: It constructed a self-supervised module to enrich item representation using global graph and decoupling learning and designed the label obfuscation method to prevent overfitting.

4.4 Parameter Settings

For a fair comparison, we employed the same data preprocessing method for all baseline models. All parameters were initialized from a Gaussian distribution with mean 0 and standard deviation 0.1. All baseline models use L2 regularization as a penalty term with a coefficient of \(10^{-5}\). All models have an embedding size of 100 and use Adam as the optimizer, with an initial learning rate of 0.001 that decays by 0.1 every 3 epochs. Hyperparameters were tuned on a validation set formed by randomly selecting 10% of the training data. The number of SGNN layers in the model is three.

Table 2 shows the overall performance of SR-MACL compared to the baseline models; the best results are highlighted in bold and the second-best results are italicized. We report the average over five runs as the final result, with standard deviations in parentheses. From the experimental results, we can draw the following three conclusions:

  1. The graph neural network-based models outperform the traditional deep learning-based models (RNN-based and attention-based), which shows the powerful ability of graph neural networks to model the transition relations of session sequences and indicates that graph neural network approaches are better suited to session recommendation.

  2. As shown in Table 2, the existing session recommendation models based on graph contrastive learning perform worse than the models based on graph neural networks alone. This indicates that using other sessions' information to generate augmented views for contrastive learning may introduce information about irrelevant items, resulting in sub-optimal model performance.

  3. SR-MACL outperforms the other baseline models on all three datasets (except for the P@20 metric on Nowplaying), which indicates that our approach of augmenting item representations through multi-layer aggregation for contrastive learning is superior to the above approach of generating augmented views from other sessions' information.

Table 2 Comparison of the performance of the baseline models

4.5 Ablation Experiments

We further analysed the model by experimentally measuring the impact of each component of SR-MACL on performance. We design two variants, SR-MACL-HW and SR-MACL-CL, and compare them with the full SR-MACL model on the Diginetica, Nowplaying, and Tmall datasets. Note that all ablation experiments were conducted with the number of layers set to 3.

  • SR-MACL-HW: Removes the highway network.

  • SR-MACL-CL: Removes the contrastive learning module.

The experimental results are shown in Table 3, in which the best results are highlighted in bold. The two key components of SR-MACL, the highway network and contrastive learning, both contribute to the performance improvement. The highway network has the greatest impact on model performance because it counteracts the overfitting caused by stacking three graph encoders. These experiments also demonstrate the effectiveness of contrastive learning in our model.

Table 3 Impact of different components

4.6 Effect of the Hyper Parameter

We introduce a hyperparameter \(\uplambda \) in SR-MACL to balance the contrastive module. We experimentally investigate the performance of SR-MACL for different values of \(\uplambda \) in {0, 0.001, 0.01, 0.1} to explore the impact of the contrastive module. The experimental results are shown in Fig. 4. The results on the three datasets are similar: the model achieves its best results on all three datasets at λ = 0.001, and performance drops markedly as the value increases, which we attribute to excessive contrastive loss interfering with the learning of the main task. When λ = 0, performance also decreases compared to the best setting, which suggests that the contrastive module learns richer representations and thus improves model performance.

Fig. 4 Performance comparison under different contrastive loss parameters

4.7 Impact of Aggregation Operations

We conducted further analysis of the model to investigate the influence of different aggregation methods on its performance. Aggregating item representations within the session sequence is vital for session-based recommendation. Consequently, we conducted several comparative experiments to assess the impact of various aggregation methods on the model's performance.

  • Mean-Pooling: Average pooling is used to aggregate the item representations in the session sequence into the session embedding.

  • Max-Pooling: Max pooling is used to aggregate the item representations in the session sequence into the session embedding.

  • SA-Pooling: Soft attention pooling is used to aggregate the item representations in the session sequence into the session embedding.

As can be seen from Table 4, in which the best results are highlighted in bold, Mean-Pooling and Max-Pooling do not achieve satisfactory results. Compared with these two aggregation methods, SA-Pooling performs better because it assigns a different weight to each item in the session sequence. This enables features to be aggregated according to their relative importance, resulting in improved results.

Table 4 Effects of different aggregation approaches

4.8 Data Sparsity

We compare the performance of models trained with different proportions of the training data in Table 5, in which the best results are highlighted in bold. The experimental results show that our model performs significantly better on smaller datasets than the other models that do not use contrastive learning, which shows that our model helps mitigate the data sparsity problem.

Table 5 Experimental results trained on sparse data

5 Conclusions and Future Work

Existing session-based recommendation methods based on graph contrastive learning usually incorporate information from other sessions to generate augmented views, which inevitably introduces irrelevant item information, interferes with accurately modelling user interests, and results in sub-optimal model performance. We propose a new session-based recommendation method based on multi-layer aggregated contrastive learning, named SR-MACL. In SR-MACL, we construct a contrastive view by adding noise to the embedding representations and forming a contrastive embedding through multi-layer aggregation, which both avoids the destruction of the session context caused by traditional graph augmentation methods and prevents the interference of irrelevant items. The experimental results show that our model outperforms other session recommendation models and provides a new way of thinking about applying graph contrastive learning to session recommendation. In the current work we focus on enhancing the item representation; in future work we intend to further investigate how to enhance the session representation in session-based recommendation.