1 Introduction

With the rapid development of e-commerce and social media platforms in recent years, recommender systems have gathered notable attention [1, 2]. They provide a methodology to identify users’ requirements and predict their interests by mining users’ histories and their interactions with items (e.g., purchase, watch, click, and read). Recommender systems take various forms depending on the application, e.g., playlist generators for video and music services (Netflix, YouTube), friend suggestions on Instagram and Facebook, and product suggestions on eBay and Amazon. One of the most common and general approaches to recommendation is Collaborative Filtering (CF) [3, 4], which assumes that similar users have similar preferences and hence like similar items. This approach models explicit feedback (e.g., ratings) or implicit feedback (e.g., clicks, reads) to reconstruct the user’s interactions.

Recently, approaches based on Graph Neural Networks (GNNs) have been demonstrated to be highly effective on various tasks defined over relational data, such as protein structures and knowledge graphs [5]. The main idea of a GNN is to produce the representation of a node by iteratively aggregating features from its neighbouring nodes, as shown in Fig. 1. Each GNN layer gathers the embeddings (messages) of a node’s immediate neighbours and summarizes them via an aggregation function (e.g., sum); stacking k such layers propagates information from neighbours up to k hops away. After aggregation, the node’s current state is updated. Many of these approaches treat recommendation as link prediction in bipartite graphs via matrix completion [6, 7]. The bipartite graph can be represented as an adjacency matrix between user and item nodes, where the task is to predict the missing entries of the matrix (also known as link prediction). Recently, many researchers have contributed towards GNN-based collaborative filtering, modelling user-item interactions with message passing neural networks between user and item nodes [8, 9].

Fig. 1 Graph Neural Network with message passing up to k-hop neighbours. Each neighbouring node or edge shares information and influences the others’ updated embeddings
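To make the message-passing scheme of Fig. 1 concrete, the following minimal sketch (our illustration, not code from any cited system; all names are hypothetical) implements one such layer in PyTorch: neighbour embeddings are aggregated by a degree-normalized sum and then transformed to update each node’s state.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One GNN layer: aggregate neighbour messages, then update node states."""

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.lin = nn.Linear(dim_in, dim_out)

    def forward(self, x, adj):
        # Aggregate: sum the neighbours' embeddings (rows of `adj` select them),
        # normalized by node degree so high-degree nodes do not dominate.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        msg = (adj @ x) / deg
        # Update: transform the aggregated message into the new node state.
        return torch.relu(self.lin(msg))

# Stacking k layers lets each node see its k-hop neighbourhood.
layers = nn.ModuleList([MessagePassingLayer(16, 16) for _ in range(2)])
```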

A wide range of techniques, including CF based approaches for recommender systems, solely focus on the rating information provided by users. Despite the popularity of these approaches, they have limited performance in real-world applications, as they neglect side information such as static features of nodes (user and item profiles) and surrounding context information (e.g., mood, time, weather) that can improve performance by enhancing the personalization of recommender systems. The surrounding context reflects the fact that user choices change with time and are highly dependent on the context under which users interact with items. For example, time and weather information highly impact the choice of users in restaurant recommendation, while the user’s mood influences which song they are most likely to listen to. As such, it is important to develop context-aware recommender systems that can effectively accommodate the static features of users as well as the surrounding context information while making predictions [10]. The contextual prefiltering technique [11] filters the originally available data based on the current context information, and the recommendation is based on the filtered data. On the other hand, the contextual postfiltering paradigm [12] takes the recommendation results from two-dimensional recommendation techniques and filters these results based on the current context. In [13], the context-based recommendation problem is mapped to a tensor completion task, inspired by the CF approach (matrix completion), but it suffers from high complexity. SocialMF [14] integrates a trust factor as a social context between users in the social network to enhance the performance of the matrix factorization approach. Following this line of research, several deep learning based matrix factorization approaches have been proposed for context-aware recommendation tasks [15,16,17].

Fig. 2 A user’s interaction with an item (e.g., a movie) is surrounded by a certain context (e.g., weather, mood, weekend) that influences the user’s opinion of the item. This data takes the form of a 3D matrix between users, items and contexts

The existing approaches for context-aware recommendation are incapable of capturing deep, dynamic user-item-context interactions, and discount the fact that the same person can behave differently when interacting with the same item under different contexts [18]. It is therefore reasonable to expect an improvement in the quality of personalized recommendations when incorporating dynamic context information; this is the key focus and motivation underlying this work. Fig. 2 depicts the user’s interaction with items together with the knowledge about the surrounding context. This can also be represented as a bipartite graph between users and items, with edges labelled with context and ratings/opinions. We introduce a novel GNN based matrix completion approach with an attention mechanism that effectively integrates the following three kinds of information from the graph (Fig. 2) between user and item nodes:

  • the user’s opinion/rating on items;

  • the context information on the edges between users and items;

  • the static features of users and items.

In particular, we leverage a context-aware graph convolutional autoencoder for matrix completion. Our graph convolutional autoencoder learns from the static features of nodes, the user-item interaction information (ratings), and the context. We also introduce an attention factor for the three kinds of embeddings (static feature-based, opinion-based, and context-based) generated by the encoder. The resulting embeddings are given as input to the decoder, whose objective is to reconstruct the matrix with minimum loss. A preliminary version of this work has appeared as a conference article [19]. This work extends the original article by including:

  • multiple aggregation functions for the user-item opinion graph inside a customized weight-sharing graph convolutional network;

  • an attention mechanism for integrating the multiple representations of users and items, i.e., the opinion, contextual and static feature representations;

  • a performance evaluation of the proposed algorithm on two additional datasets for music and travel recommendation;

  • an extended analysis of the algorithm, including the impact of the attention mechanism on the aggregation of multiple representations.

2 Related Work

The vast majority of the work in the field of context-aware recommendation has been devoted to the improvement of matrix factorization (MF) approaches, which work by decomposing the user-item interaction matrix into lower-dimensional matrices [20, 21]. Despite their good performance, these approaches are unable to capture user/item-context correlations, as they treat context as features of the user and item [22]. The Neural Factorization Machine (NFM) is a deep learning method that models high-order nonlinear feature interactions for sparse data [15]. In [23], a neural network model has been proposed that captures the impact of context on users and items. It learns the importance of context, but the simplicity of the model limits its ability to capture the real influence of the relationships between features.

Recently, GNN based approaches have been introduced to tackle recommendation tasks on graph-structured representations of the problem [24]. These methods can model node interactions over graph structural features in a flexible and explicit way. Fi-GNN [25] utilizes a graph structure to naturally represent multiple feature fields, in which every node corresponds to a feature field and the different fields interact through edges to model node interactions in the graph. STAR-GCN [7] stacks multiple identical GCN encoder-decoders combined with intermediate supervision to improve the final prediction performance. GCMC [6] leverages the bipartite graph between user and item nodes to learn the node representations. Both GCMC and STAR-GCN treat all neighbours of a node equally. IGMC [26] is an inductive approach for user-item matrix completion in recommendation tasks, which does not consider any side information.

Earlier GNN based collaborative filtering approaches [27, 28] are unable to capture the collaborative filtering effect, as they discard the collaborative signals hidden in user-item interactions. In [8], the NGCF model successfully encodes user-item high-order connectivity by exploiting the user-item bipartite graph. GCF-YA [29] is a deep graph neural network implementation of collaborative filtering, based on information propagation and an attention mechanism, which predicts missing links between users and items. GraphRec [30] tackles social recommendation by aggregating the historical behaviour of individuals from user-user and user-item bipartite graphs.

Context information on the user has been successfully used to improve recommendation performance [16, 31]. Recently, we have seen work on dynamic graphs that integrates interaction times as context information [32,33,34]. DGCF [35] integrates the time interval between the previous and current interactions of user-item pairs inside their embeddings to obtain up-to-date node representations for recommendation. DyRep is an inductive deep learning approach that learns from the temporally evolving interactions between user and item nodes. These approaches solely consider time information and hence cannot integrate any other kind of context information.

The above GNN based approaches consider the rating information as the user’s opinion on the edges between user and item nodes in a bipartite graph. Some approaches only consider user and item static features, or integrate time as context to capture dynamically evolving environments. All these approaches ignore the surrounding context information that could improve performance. In the following, we show how it is possible to extend such approaches to consider dynamic, time-varying contextual features influencing recommendations.

3 Problem Definition

We categorize the data for context-aware recommendation into four categories: items, users, context, and interactions. Context can be defined as the surrounding knowledge associated with a user-item interaction, e.g., time, company, mood, location, etc. In this work, we define a 3D rating/opinion interaction matrix between users, items and contexts \(A_{uvc} \in {\mathbb {R}}^{N_u\times N_v\times N_c}\), where \(N_u\) is the total number of users, \(N_v\) is the total number of items and \(N_c\) is the total number of different context attributes (as shown in Fig. 2). The rating scale ranges from one to five stars, such that \(A_{uvc} \in \{1,\ldots ,5\}^ {N_u \times N_v \times N_c}\), except for the InCarMusic dataset, where the maximum rating is six. Users and items are associated with multiple static features describing the characteristics of individuals; for example, static user features include gender and age, while static item features can be colour, brand, category, etc. Let \(N_{F_u}\) and \(N_{F_v}\) denote the total number of features of users and items, respectively. The importance of the contextual features varies from person to person and from item to item.
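As an illustration of this data layout, the following sketch (toy data with hypothetical shapes, not taken from any of the datasets) builds the binary tensor \(A_{uvc}\) and an accompanying rating matrix from a list of interaction records:

```python
import numpy as np

N_u, N_v, N_c = 4, 5, 3   # toy numbers of users, items, context attributes

# Hypothetical records: (user, item, rating, indices of active contexts).
interactions = [(0, 1, 5, [0, 2]),
                (2, 3, 3, [1])]

A_uvc = np.zeros((N_u, N_v, N_c), dtype=np.int8)   # binary context tensor
ratings = np.zeros((N_u, N_v), dtype=np.int8)      # 0 = no interaction
for u, v, r, ctx in interactions:
    ratings[u, v] = r
    A_uvc[u, v, ctx] = 1   # mark the contexts active at interaction time
```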

Given such data, the recommendation problem is then cast as a task aiming to predict the existence of a labelled link between a user and an item considering the knowledge about the surrounding context. This work aims to introduce context information to matrix completion tasks with mechanisms for finding which context attributes are important for a target user and items. Details of the learning model are discussed in Sect. 4.

4 Context-Aware GNN Model

Fig. 3 High-level architecture of the proposed context-aware graph convolutional autoencoder. The user’s opinion on an item is modelled using a local weight-sharing GCN. User and item features, as well as user-context and item-context relations, are modelled using dense neural networks, while the user-context-item interaction is modelled with a GCN with global weight sharing

In this section, we present our link prediction model for the bipartite graph between users and items with context information on the edges. We extend the graph convolutional autoencoder of [6] (\(GCMC+feat\) in the following). \(GCMC+feat\) leverages rating information using a 2D user-item opinion/rating matrix along with static node features, ignoring the context information on the edges. The major contribution of our approach, dubbed context-aware graph convolutional matrix completion (\({cGCMC}_F\)), is to utilize the context features on the edges. The proposed architecture has three main blocks, shown in Fig. 3. From top to bottom: the first block represents the input data, i.e., the user’s opinion/rating on items, the profiles of users and items, the user-item-context interaction graph with edges labelled with context and rating, and the favourite contexts of users and items. The second block represents the graph encoder. Inside the graph encoder, \(GCMC+feat\) operates on the 2D user-item rating matrix, while \({cGCMC}_F\) is our proposed extension that leverages the context information on edges and maps the user-item-context interaction to a 3D matrix. The graph encoder is composed of two graph convolutional layers and two dense neural network layers. Each layer operates on different data to produce user and item representations with respect to rating opinion, static node features, and context information. In our earlier algorithms \(cGCMC^{old}\) and \(cGCMC_{F}^{old}\) [19], this multi-perspective representation of each user and item is accumulated without attention weights, while in \(cGCMC\) and \(cGCMC_{F}\) the accumulation is performed with an attention mechanism. Further details regarding the encoder are given in Sect. 4.1. The decoder (discussed in Sect. 4.2) uses the encoded representations to predict the links in the bipartite graph.

4.1 Graph Encoder

Our graph encoder takes the following data as input:

  1. User’s Opinion on Items. The matrix \(A \in {\mathbb {R}}^{N_u \times (N_v \cdot R)}\) represents the user’s rating/opinion on items. It is the horizontal concatenation of R binary sub-matrices \(A_r \in {\mathbb {R}}^{N_u \times N_v}\), one per rating level \(r \in \{1,2,\ldots ,R\}\) (see the sketch after this list).

    $$\begin{aligned}&A=\begin{bmatrix} \begin{bmatrix} A_1 \end{bmatrix}_{N_u \times N_v}&\begin{bmatrix} A_2 \end{bmatrix}_{N_u \times N_v}&\cdots&\begin{bmatrix} A_R \end{bmatrix}_{N_u \times N_v} \end{bmatrix}\end{aligned}$$
    (1)
    $$\begin{aligned}&A_r[u][v]=1 \iff \text {rating}(u,v)=r : r \in \{1,2,\ldots ,R\} \end{aligned}$$
    (2)

  2. Static User’s Features. The matrix \(U_F \in {\mathbb {R}}^{N_u \times N_{F_u}}\) consists of normalized static feature attributes for users.

  3. Static Item’s Features. The matrix \(V_F \in {\mathbb {R}}^{N_v \times N_{F_v}}\) consists of normalized static feature attributes for items.

  4. Surrounding Context of User-Item Interaction (\(A_{uvc}\)). We represent the user-item-context interaction using a 3D binary matrix \(A_{uvc} \in {\mathbb {R}}^{N_u \times N_v \times N_c}\), which records the surrounding context under which the user has provided a specific opinion on the item. For example, if user \(U_A\) has rated item \(V_B\) with rating 5 under contexts \(c_1, c_2, c_3\), then the matrix contains entries set to 1 for \(U_A\), \(V_B\) and \(c_1, c_2, c_3\).

  5. Favourite Context of Users. The matrix \(U_C \in {\mathbb {R}}^{N_u \times N_c}\) denotes the importance of each context for individual users. We use information from the matrix A (Eq. 1) to give more weight (\(\alpha \)) to the contexts under which a user has given high ratings, compared to the contexts under which the user has rated lower.

  6. Favourite Context of Items. The matrix \(V_C \in {\mathbb {R}}^{N_v \times N_c}\) uses the \(A_r\) sub-matrices in the same way as \(U_C\) above. The value of a context attribute for an item is high if the item is more likely to get a high rating under that context, thus giving more importance to the context attributes under which an item is rated highly.

Next, we explain how the graph encoder operates on the matrices defined above to learn the representations of users and items with respect to rating, context and static features.

4.1.1 User-Rating-Item Representation

The user opinions represented in the adjacency matrix A (Eq. 1) capture the user’s preferences for items in the bipartite graph. We use a local weight-sharing graph convolutional layer for modelling the user’s opinion. The local weight-sharing mechanism allows different convolutional weights for different edge types; the number of weight matrices equals the number of available rating levels R. The customized message propagation for graph convolutions uses an edge-type-specific parameter matrix \(W_r\). After the message propagation step, we aggregate the incoming messages at each node with one of two alternative aggregation functions, sum and stack:

  • stack aggregation: concatenation of all edge-specific matrices along their first dimension;

  • sum aggregation: addition of all edge-specific matrices.

Overall, this edge-specific message propagation is more effective than general global message propagation. Our model selection experiments considered summation and concatenation as alternatives, and we selected the former for its best overall performance in validation. The details of this spectral convolutional layer are as follows:

$$\begin{aligned} z_{u}^{o}&= \underset{i: 1 \rightarrow R}{Agg} (GCN(X_v,A_i)) = \sigma \left( \underset{i: 1 \rightarrow R}{Agg} \left( \tilde{A_i} X_v {W_{i}^{v}}\right) \right) \end{aligned}$$
(3)
$$\begin{aligned} z_{v}^{o}&= \underset{i: 1 \rightarrow R}{Agg} (GCN(X_u,{A_{i}^{T}})) =\sigma \left( \underset{i: 1 \rightarrow R}{Agg} \left( {{\tilde{A}}_i}^T X_u W_{i}^{u}\right) \right) \end{aligned}$$
(4)

where \(X_u\) and \(X_v\) are the unique one-hot vectors for the user and item nodes, R is the maximal rating a user can give to an item, \(W_{i}^{u}\) and \({W_{i}^{v}}\) denote the R trainable weight matrices, and \(\sigma \) is a non-linear activation function such as ReLU. The matrices \(\tilde{A_i}\) and \({{\tilde{A}}_i}^T\) are the normalized adjacency matrix \({A_i}\) and its transpose, respectively:

$$\begin{aligned} \tilde{A_i} = D^{-1/2} A_i D^{-1/2} \quad \forall \, i= 1,\ldots ,R \end{aligned}$$
(5)

where D represents the diagonal degree matrix, so that \(D^{-1/2}\) rescales each entry of \(A_i\) by the inverse square root of the corresponding node degrees. Similarly, \(A_i^T\) is normalized to obtain \({{\tilde{A}}_i}^T\) (using Eq. 5).
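A minimal PyTorch sketch of this layer is given below (our illustration of Eqs. (3)-(5), with hypothetical dimensions): one trainable weight matrix per rating level, symmetric degree normalization, and either sum or stack aggregation of the per-rating messages.

```python
import torch
import torch.nn as nn

def normalize(A, eps=1e-8):
    # Eq. (5): D^{-1/2} A D^{-1/2}, with user degrees on the left and
    # item degrees on the right for the rectangular bipartite A.
    d_u = A.sum(1).clamp(min=eps).pow(-0.5)
    d_v = A.sum(0).clamp(min=eps).pow(-0.5)
    return d_u.unsqueeze(1) * A * d_v.unsqueeze(0)

class EdgeTypeGCN(nn.Module):
    """Local weight sharing: one weight matrix W_i per rating level."""

    def __init__(self, dim_in, dim_out, R, agg='sum'):
        super().__init__()
        self.W = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(dim_in, dim_out))
             for _ in range(R)])
        self.agg = agg

    def forward(self, X, A_list):
        # One message per rating level: \tilde{A}_i X W_i, as in Eqs. (3)-(4).
        msgs = [normalize(A) @ X @ W for A, W in zip(A_list, self.W)]
        if self.agg == 'sum':
            out = torch.stack(msgs).sum(dim=0)
        else:                                 # 'stack' = concatenation
            out = torch.cat(msgs, dim=1)
        return torch.relu(out)

# z_u^o = layer(X_v, [A_1 ... A_R]); z_v^o uses the transposed matrices.
```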

4.1.2 Context Representation

The user-item-context interaction matrix \(A_{uvc}\) is normalized by dividing each context attribute by the total count of context attributes recorded at the time of the user-item interaction. The normalized context attributes are then accumulated to obtain \(A_c \in {\mathbb {R}}^{N_u\times N_v}\):

$$\begin{aligned} A_{c}\left[ u \right] \left[ v \right] = \sum _{i=1}^{N_{c}^{uv}} \frac{c_{i}^{uv}}{N_{c}^{uv}} \end{aligned}$$
(6)

where u and v are the user and item indexes in the matrix, \(N_{c}^{uv}\) is the number of context attributes recorded when user u rated item v, and \(c_{i}^{uv}\) denotes the value of the i-th such context attribute.
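A small sketch of Eq. (6) is shown below (our illustration; it assumes the \(A_{uvc}\) tensor layout from Sect. 3, with possibly non-binary context values):

```python
import numpy as np

def context_matrix(A_uvc):
    """Accumulate normalized context attributes into A_c (Eq. 6)."""
    counts = (A_uvc != 0).sum(axis=2)          # N_c^{uv}: attributes recorded
    totals = A_uvc.sum(axis=2).astype(float)   # sum of context values c_i^{uv}
    return np.divide(totals, counts,
                     out=np.zeros_like(totals), where=counts > 0)
```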

We propose to leverage graph convolutions to model the user-context-item interactions in the matrix \(A_c\), with the same message propagation rule used for modelling the user’s opinion (Eqs. 3 and 4) but with a single global weight matrix. We denote the user and item representations with respect to the context attributes as \(z_{u}^{c_1}\) and \(z_{v}^{c_1}\), respectively. A user’s behaviour varies with changes in the surrounding context, making them react differently to the same item under different contexts; similarly, an item gets different ratings when the surrounding context changes. This makes context information naturally dynamic. To model this dynamic user-context and item-context relation, we performed a statistical analysis of the training data and identified an \(\alpha \) importance factor for each user and item. The \(\alpha \) factor gives more importance to the favourite contexts of users and items. We store the extracted user preferences in \(U_C\):

$$\begin{aligned} U_C[u][c]=\sum _{i,j}^{N_u,N_{c}^{uv_{i}}} A_{uvc}[u][v_i][c_j]\cdot \alpha [r] : r \in \left\{ 1,\ldots ,R \right\} \end{aligned}$$
(7)

where \(N_u\) here denotes the set of neighbours (rated items) of user u, and \(N_c^{uv}\) represents the number of context attributes under which the user provided opinion r. We obtain the context importance for each item in a similar way (Eq. 7) and store it in \(V_C\). Both matrices are normalized to take values between 0 and 1. A simple dense neural network layer processes this information; its weight matrices are initialized randomly from a uniform distribution, and node dropout is applied to the hidden layers to prevent overfitting. The operations of this layer are defined as:

$$\begin{aligned} {z_{u}^{c_2}}&= \sigma ( U_C W_{3}^{c} + b_c) \end{aligned}$$
(8)
$$\begin{aligned} {z_{v}^{c_2}}&= \sigma ( V_C W_{4}^{c} + b_c) \end{aligned}$$
(9)

To obtain the final context representations of users and items, we integrate \({z_{u}^{c_1}}\) with \({z_{u}^{c_2}}\), and \({z_{v}^{c_1}}\) with \({z_{v}^{c_2}}\):

$$\begin{aligned} {z_{u}^{c}}&= \sigma \left( \left[ z_{u}^{c_1} \oplus z_{u}^{c_2}\right] W_{5}^{c}+b_c\right) \end{aligned}$$
(10)
$$\begin{aligned} {z_{v}^{c}}&= \sigma \left( \left[ z_{v}^{c_1} \oplus z_{v}^{c_2}\right] W_{6}^c+b_c\right) \end{aligned}$$
(11)

where W represents trainable weight matrices and \(b_c\) is a bias term.
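The sketch below (our illustration, with hypothetical layer sizes) mirrors this pipeline for the user side: the dense layer of Eq. (8) over \(U_C\), followed by the concatenation-based fusion with the GCN output as in Eq. (10); the item side is symmetric.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Dense layer over U_C (Eq. 8) fused with the GCN output (Eq. 10)."""

    def __init__(self, n_ctx, d_c1, d_c2, d_out, p_drop=0.7):
        super().__init__()
        self.dense = nn.Sequential(nn.Dropout(p_drop),
                                   nn.Linear(n_ctx, d_c2), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(d_c1 + d_c2, d_out), nn.ReLU())

    def forward(self, z_c1, U_C):
        z_c2 = self.dense(U_C)                            # Eq. (8)
        return self.fuse(torch.cat([z_c1, z_c2], dim=1))  # Eq. (10)
```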

4.1.3 User’s and Item’s Profile Representation

The static features of the user and item nodes are represented as \(U_F\) and \(V_F\), respectively. We do not feed these features directly into the graph convolution layer, as doing so degrades performance in the case of sparse user and item content features. Therefore, we use a separate dense neural network layer to obtain the static feature representations for user and item nodes.

$$\begin{aligned} z_{u}^{f}&= \sigma \left( U_F W_{3}^{f} + b_f\right) \end{aligned}$$
(12)
$$\begin{aligned} z_{v}^{f}&= \sigma \left( V_F W_{4}^{f} + b_f\right) \end{aligned}$$
(13)

where \(W_{3}^{f}\) and \(W_{4}^{f}\) represent trainable weight matrices and \(b_f\) is a bias.

4.1.4 Accumulation with Attention

We accumulate the user’s representations from the rating/opinion (Eq. 3), feature (Eq. 12) and context (Eq. 10) perspectives. Here, we introduce learnable attention weights for the three representations in \({cGCMC}_{F}\). In \({cGCMC}^{old}\) [19], we accumulated these embeddings without any learnable attention weights. The last layer of the graph encoder is a dense neural network layer and is responsible for producing the final embedding, with or without attention weights. For \({cGCMC}_{F}\), the user’s final representation is defined as:

$$\begin{aligned} z_u = \sigma \left( \left[ w^{o}_{u} * z_{u}^{o} \oplus w^{c}_{u} * z_{u}^{c} \oplus w^{f}_{u} * z_{u}^{f}\right] W_{6}+b\right) . \end{aligned}$$
(14)

Similarly, the item’s representations from the rating/opinion, context and feature perspectives are weighted by their attention weights and concatenated to obtain the final item embedding:

$$\begin{aligned} z_v = \sigma \left( \left[ w^{o}_{v} * z_{v}^{o} \oplus w^{c}_{v} * z_{v}^{c} \oplus w^{f}_{v} * z_{v}^{f}\right] W_{7}+b\right) . \end{aligned}$$
(15)
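The following sketch (our illustration, with scalar attention weights as a simplifying assumption) shows how the three views can be weighted and concatenated as in Eqs. (14)-(15):

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Weight the opinion, context and feature views, then fuse (Eqs. 14-15)."""

    def __init__(self, d_o, d_c, d_f, d_out):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3))          # w^o, w^c, w^f (learned)
        self.lin = nn.Linear(d_o + d_c + d_f, d_out)  # W_6 (or W_7) and b

    def forward(self, z_o, z_c, z_f):
        z = torch.cat([self.w[0] * z_o,
                       self.w[1] * z_c,
                       self.w[2] * z_f], dim=1)
        return torch.relu(self.lin(z))
```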

4.2 Decoder

We use a bilinear decoder that takes the context-aware embeddings of users and items and reconstructs the rating matrix (\({\hat{A}}\)) between them. We address this problem as a classification task in which each rating level is treated as a separate class. The decoder produces a probability distribution over all classes through a bilinear operation, and the prediction is the expected rating under this distribution:

$$\begin{aligned} {\hat{A}}_{ij}=\sum _{r \in R} r \cdot p ({\hat{A}}_{ij}=r) : p ({\hat{A}}_{ij}=r) = \frac{e^{u_iQ_rv_{j}^{T}}}{\sum _{k \in R}{e^{u_iQ_kv_j^T}}} \end{aligned}$$
(16)

where the \(Q_r\) are R trainable matrices of dimension \(D \times D\), D is the hidden dimension of the user and item embeddings obtained from the encoder, and R is the set of available rating levels. In our setting, we define each \(Q_r\) as a linear combination of shared basis matrices:

$$\begin{aligned} Q_r= \sum _{s=1}^{n_b} \alpha _{sr}W_s \end{aligned}$$
(17)

Here, \(n_b\) is the number of shared basis matrices \(W_s\), chosen to be lower than the number of rating levels to avoid overfitting, and the coefficients \(\alpha _{sr}\) are learnable.
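A compact sketch of the decoder (our illustration of Eqs. (16)-(17), hypothetical shapes) is given below; each \(Q_r\) is assembled from the shared bases, and the prediction is the expectation over rating levels:

```python
import torch
import torch.nn as nn

class BilinearDecoder(nn.Module):
    """Bilinear decoder with weight sharing across rating levels."""

    def __init__(self, D, R, n_b):
        super().__init__()
        self.W_s = nn.Parameter(0.01 * torch.randn(n_b, D, D))  # shared bases
        self.alpha = nn.Parameter(0.01 * torch.randn(R, n_b))   # alpha_{sr}
        self.register_buffer('levels', torch.arange(1, R + 1).float())

    def forward(self, z_u, z_v):
        Q = torch.einsum('rs,sij->rij', self.alpha, self.W_s)   # Eq. (17)
        logits = torch.einsum('bi,rij,bj->br', z_u, Q, z_v)
        p = torch.softmax(logits, dim=1)                        # Eq. (16)
        return p @ self.levels        # expected rating for each (u, v) pair
```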

We have tested our model under two different settings, named cGCMC and \({cGCMC}_F\): cGCMC models the effect of context along with the opinion matrix, while \({cGCMC}_F\) combines the context effect with opinions as well as static features. We tested both models with and without the attention mechanism, and found that attention improves performance.

4.2.1 Rating Prediction and Model Training

We evaluate the performance of the proposed algorithm using the MAE (Eq. 18) and RMSE (Eq. 19) metrics with respect to the rating assigned by the user to their interaction with the item. The choice of these metrics over classification-based ones is driven by the nature of the ratings, which is ordinal rather than multinomial; it is therefore important to capture how closely the prediction approximates the expected rating, which classification-based metrics do not. Our model is trained in an end-to-end fashion by minimizing the root mean square error between the actual (\(A_{ij}\)) and reconstructed (\({\hat{A}}_{ij}\)) ratings.

$$\begin{aligned} MAE= & {} \sum _{i,j}{\frac{|{\hat{A}}_{i,j}-A_{i,j}|}{n}} \end{aligned}$$
(18)
$$\begin{aligned} RMSE= & {} \sqrt{\sum _{i,j}{\frac{{({\hat{A}}_{i,j}-A_{i,j}})^2}{n}}} \end{aligned}$$
(19)

where n represents the number of observed user-item pairs.
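For completeness, both metrics are straightforward to compute over the held-out pairs (a minimal sketch):

```python
import numpy as np

def mae(pred, true):
    # Eq. (18): mean absolute error over the n test pairs.
    return np.mean(np.abs(pred - true))

def rmse(pred, true):
    # Eq. (19): root mean square error over the same pairs.
    return np.sqrt(np.mean((pred - true) ** 2))
```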

5 Experiments

5.1 Datasets

To demonstrate the effectiveness of our proposed algorithms cGCMC and \({cGCMC}_F\), we conduct experiments on five real-world publicly available datasets for movies, music and travel. We summarize the statistics of the datasets in Table 1, where density is defined as the ratio between the number of edges and the cardinality of the set of (user, item) pairs.

Table 1 Statistics of the datasets used in our experiments: number of users, items and context attributes, along with the edge density and rating levels

LDOS-CoMoDa\(^{1}\) is a popular movie dataset collected via a survey. It contains users’ opinions on movies together with the surrounding context. The context information includes location (home, friend’s house, public place), time (morning, afternoon, evening, night), day-type (working day, weekend, holiday), weather (sunny, cloudy, rainy, stormy, snowy), decision (users chose the movie themselves or were given one), mood (positive, negative, neutral), season (summer, winter, spring, autumn), endEmo, i.e., the emotional state at the end of watching the movie (sad, happy, angry, surprised, neutral, scared, disgusted), domEmo, i.e., the emotional state experienced most while watching the movie (sad, happy, angry, surprised, neutral, scared, disgusted), interaction (\(1^{st}\) or \(N^{th}\) interaction with a movie), physical state (ill, healthy), and companion (alone, friends, partner, family, colleagues, parents, public). Besides this information, LDOS-CoMoDa also has profile features for users (gender, age, city, country) and movies (director, language, actor, genre).

DePaulMovie\(^{2}\) is a movie dataset collected by researchers at DePaul University, with ratings acquired by survey. Students were asked to rate movies subject to three context variables: location (home, cinema), time (weekend, weekday), and companion (partner, family, alone). This dataset does not have user or item profile features.

The Travel-STS\(^{3}\) dataset contains information about places visited by tourists. The context information includes distance (nearby, far away), time available (half a day, one day, more than one day), temperature (warm, hot, burning, cool, cold, freezing), season (summer, winter, spring, autumn), crowdedness (empty, crowded, not crowded), mood (happy, active, sad, lazy), budget (high spender, budget traveler, price for quality), weather (sunny, cloudy, rainy, clear sky, thunderstorm, snowing), companion (with children, with friends/colleagues, alone, with family, with girlfriend/boyfriend), weekend (weekday, weekend), travel goal (visiting friends, religion, business, health care, education, social event, scenic/landscape, hedonistic/fun, activity/sport), means of transport (bicycle, car, public transport, no transportation means) and knowledge of surroundings (returning visitor, completely new area, citizen of the area). This dataset also contains user profile features (age, gender).

InCarMusic\(^{3}\) dataset consists of music tracks recommended to passengers based on the surrounding contextual information. The context information includes driving style (sport driving, relaxed driving), road type (highway, city, serpentine), landscape (mountains, coast line, urban, country side), sleepiness (sleepy, awake), traffic conditions (busy road, free road, traffic jam), mood (happy, active, sad, lazy), weather (sunny, cloudy, rainy, snowing), and natural phenomena (day time, morning, night, afternoon).

Tijuana Restaurant\(^{3}\) is a restaurant dataset gathered via a survey consisting of 8 inquiries per person about various nearby restaurants. Every restaurant picked was assessed multiple times, once for each possible context setting. The context information consists of combinations of time and location (\(c_1\): weekday and school, \(c_2\): weekday and home, \(c_3\): weekday and work, \(c_4\): weekend and school, \(c_5\): weekend and home, and \(c_6\): weekend and work).

The density values in Table 1 represent the fraction of positive links between nodes. The Tijuana-Restaurant dataset has a small number of nodes connected by a high number of edges, while the LDOS-CoMoDa dataset has a greater number of nodes connected by few edges (compared to the other datasets). Overall, the effect of high or low density values on the performance of our models is shown to be negligible in Sect. 6.

5.2 Implementation Setup

Our PyTorch implementation\(^{4}\) of the cGCMC and \({cGCMC}_F\) models is publicly available. We use \(60\%\) of the data as a training set, \(20\%\) as a validation set and \(20\%\) as a test set for each dataset. The data splitting is performed five times; each time, the data is shuffled with a different random seed before being divided into splits. The average performance of all algorithms over the five runs with different random splits is presented in Sect. 6.
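A minimal sketch of this splitting protocol (our illustration; the seed values and the total count are arbitrary) is:

```python
import numpy as np

def split_indices(n, seed):
    """Shuffle n interaction indices and split 60/20/20 train/val/test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_tr, n_va = int(0.6 * n), int(0.2 * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# Five runs, each with a different random seed (hypothetical n = 10000).
splits = [split_indices(10_000, seed) for seed in range(5)]
```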

5.2.1 Computational Cost

We report the computational costs (in seconds) of cGCMC and \(\hbox {cGCMC}_F\), obtained by computing the average time required by a single training epoch and the average time required by the prediction step (i.e., over the whole test set). Results are presented in Table 2.

Table 2 The average time (sec.) taken by cGCMC and \(\hbox {cGCMC}_{F}\) for each dataset

5.2.2 Hyper-parameters

We have evaluated our approach under different configurations; the best value for each hyper-parameter is shown in bold. We searched the embedding size for the user’s opinion representation \(d_o\) in \([300, 400, \mathbf{500}, 600]\), the static feature representation \(d_f\) in \([5, \mathbf{10}, 15, 20, 25]\), and the contextual representations \(d_{c_1}\) in \([50, 100, \mathbf{150}, 200, 250]\) (for the GCN) and \(d_{c_2}\) in \([5, \mathbf{10}, 15, 20, 25]\) (for the dense layer), as shown in Table 3. The batch size is chosen from \([\mathbf{40}, 80, 120, 150, 200]\). The last layer of the encoder is set to produce embeddings of size 75. The node dropout rate \(P_{drop}\) is tuned in \([0.3, 0.4, 0.5, 0.6, \mathbf{0.7}]\); \(P_{drop}\) is the probability of randomly dropping all outgoing messages from specific nodes, to train under a denoising setup. The \(\alpha \) importance factor is set to \([0.2, 0.3, 0.5, 0.7, 0.8]\) for \(r \in \{1,\ldots ,R\}\), with initial values chosen randomly subject to the constraint \(\alpha [r_1]< \alpha [r_2] \iff r_1<r_2\). Any set of initial values can be used provided it satisfies this constraint: the contexts in which the user gives a high rating should receive more weight. The attention weights for the opinion, feature, and context representations are initialized randomly and then learned, so as to give an appropriate weight to each representation before combining them. All neurons use the ReLU nonlinearity and Adam is employed as the optimization algorithm. The model is trained for 200 epochs. For the baseline algorithms, all parameters are initialized as described in the corresponding papers.

Table 3 \(\hbox {cGCMC}_{F}\) encoder and decoder layers and their respective best output dimension hyperparameter values

5.3 Benchmarks

In the evaluation phase, we assess predictive performance on the test set in terms of mean absolute error (MAE) and root mean square error (RMSE). We compare our approach with the following link prediction algorithms from the literature:

  • \(\mathbf {SocialMF}\) [14] is a matrix factorization approach that exploits user-user trust information along with user opinion on the item to predict items for users.

  • \(\mathbf {SVD++}\) [36] improves the conventional SVD approach by allowing the joint use of explicit (e.g., user’s rating opinion), and implicit (e.g., purchases, visited items) information.

  • \(\mathbf {PMF}\) [37] is a matrix factorization approach for sparse datasets. This exploits the user-item interactions only to learn user and item embeddings, while forgoing the context features.

  • \(\mathbf {BiasedMF}\) [38] is an improvement over traditional matrix factorization that incorporates user, item, and global bias factors.

  • \(\mathbf {GCMC}\) [6] models the user’s opinion by leveraging the rating matrix between users and items for the matrix completion task.

  • \(\mathbf {GCMC}\)+\(\mathbf {feat}\) [6] extends GCMC by integrating static features inside the user and item nodes for link prediction in a bipartite graph.

  • \({\mathbf {GraphRec}}_{{\mathbf {uv}}}^{{\mathbf {uu}}}\) [30] exploits the social relations between users along with user-item interactions for link prediction in a user-item bipartite graph.

6 Performance Comparison

Table 4 presents a comparison between the previous version of our algorithm (subscript ’old’) and the extended version, and Table 5 presents the performance comparison of our approach with other state-of-the-art algorithms. Two of our datasets (LDOS-CoMoDa and Travel-STS) contain user and item (description) features along with the user’s opinion on items and context information. For the other three datasets (DePaul, InCarMusic, Tijuana-Restaurant), we only have the user’s opinion on items and contextual information. The algorithms that integrate user and item feature information are not applicable to the latter category of datasets (indicated in the tables with the NA mark, as in “Not Applicable”).

Table 4 Test performance comparison with state-of-the-art algorithms. Best results are marked in bold letters
Table 5 Test-set performance comparison with state-of-the-art algorithms. Best results are marked in bold
  • A clear performance difference can be seen between the old and extended versions of our model on all datasets (provided in Table 4). This is purely due to the newly introduced attention factor in the last layer of the encoder.

  • Basic matrix factorization approaches, PMF and BiasedMF, model user-item interactions as isolated instances and ignore side information, which limits their representation ability. They perform worse than all other baseline algorithms on all datasets because of their inability to integrate knowledge about the surroundings.

  • SVD++, SocialMF, and \(GraphRec_{uv}^{uu}\) perform better than the basic matrix factorization approaches, as they capture and integrate knowledge about individual users in the form of social trust or implicit feedback. Despite integrating this side information, they still perform worse than our method, which benefits from learning the surrounding context.

  • When comparing our proposed algorithm with the GNN based approaches (GCMC and \(GCMC+feat\)), we observe a significant improvement in performance, driven by the capability of providing context-aware recommendations.

Overall, our model outperforms all baseline approaches on all datasets, providing strong evidence of the importance of taking the surrounding context into consideration to produce accurate recommendations.

6.1 Impact of Context Modeling

The major contribution of our approach is to organize context features on the edges of the user-item interaction graph in an effective way. We use the \(\alpha \) importance factor to learn the favourite surrounding context features of the target user and item for context-aware link prediction. We therefore perform an ablation study to validate the rationality and usefulness of \(\alpha \). As explained earlier, context importance varies from person to person, and different context attributes affect items differently. Fig. 4 demonstrates the positive effect of capturing this importance factor in our model, which is clearly due to prioritizing the contexts that are important for users and items by giving them more weight.

Fig. 4 Effect of the importance factor \(\alpha \) on cGCMC in terms of MAE

6.2 Impact of Attention Weights

We have three kinds of representations for each individual user and item (opinion, feature, and context, as described in Sect. 4.1). For the accumulation of these three representations, we determined that concatenation performs better than summation; we therefore report results with concatenation only. We introduced learnable attention weights for each representation before accumulating them. These learnable weights assign a different significance to each representation (i.e., the opinion, contextual, and feature representations of users and items) in the final embedding. The user’s (or item’s) opinion representation contains information about the neighbouring nodes with respect to opinion information; similarly, the contextual representation captures the neighbouring nodes with respect to contextual information. The final representation of the user is an accumulation of these, along with a dense feature representation. We believe that each of these representations has its own impact on the final node representation, with a factor that we learn as an attention weight: it may be that for some users the opinion-based neighbourhood is more prominent than the context-based neighbourhood, and vice versa. We performed an ablation study of this design to demonstrate the effectiveness and rationality of the weighted representations (Eqs. 14 and 15). The positive impact of the attention weights on \(\hbox {cGCMC}_F\) (LDOS-CoMoDa and Travel-STS) and on cGCMC (DePaul, InCarMusic, Tijuana-Restaurant and Travel-STS) in terms of MAE is shown in Fig. 5.

Fig. 5 Effect of accumulation with attention in terms of MAE

7 Conclusion

We have focused this work on emphasizing the impact of knowledge about the surrounding context on user-item interactions. To this end, we organized context, opinion, and item features into a bipartite graph and an associated multidimensional matrix, and approached the resulting matrix completion task using a graph convolutional autoencoder. Our graph encoder captures the context information along with the opinions in user-item interactions. We also showed how the model leverages context information to capture the user’s behaviour in relation to the surrounding context, giving attention to the most important contextual aspects of users and items. Finally, the bilinear decoder predicts the labelled edges between users and items. To demonstrate the effectiveness of our approach, we tested it on five public datasets, showing significant improvements over state-of-the-art baselines, and conducted various experiments to verify the benefits of the context representation. The application of our model is not limited to product recommender systems on smart devices, i.e., music/movie/travel/fashion recommendations; with further domain-specific development, it can also support other intelligent prediction tasks, such as personal medical reminders for the elderly or smart device setting controllers driven by the surrounding context.

In this work, the accumulative approach unifies all context information, neglecting the dynamic nature of some contextual attributes; this may result in losing the diversity of individual context attributes. In the future, we would like to explore multi-dimensional edge feature-based GNNs and multi-way interactions between users and items to capture dynamic behaviours more realistically. Furthermore, we intend to investigate the use of separate embeddings for user and item contexts and to evaluate performance on a large-scale dataset. Finally, we want to extend our model to deal with heterogeneous graphs, which consist of nodes of different types and different context information on different edges.