1 Introduction

Re-ranking [1,2,3,4,5,6] is crucial in recommendation systems: it refines the initial list of recommended items to attract more user engagement. An effective re-ranking model improves the final ranking list and thus directly influences the user experience.

The candidate list sent for re-ranking is often generated to include a limited set of items, selected from a commodity pool that contains a huge number of items. Items in the candidate list are often ranked by their relevance to users’ interests. It is thus important to exploit the mutual influence [7,8,9,10] between items in the re-ranking process. Users often show successive interest in a group of relevant items, e.g., a mobile phone, phone covers, and a wireless headset. Moreover, users pay attention to only a small window of items when they browse the list. An abrupt change of categories in the ranking list may give users a feeling of disorder and reduce their willingness to keep browsing. For example, an irrelevant item, a female jacket, causes an abrupt change in the list mobile phone, female jacket, phone covers, wireless headset. It is thus important to exploit the local mutual relationship between items in the same scene during re-ranking. However, previous re-ranking models [11,12,13,14,15,16,17,18,19,20] take a single item as the basic unit and focus only on the pairwise influence between two items. Ignoring local mutual influence can cause abrupt category changes in the ranking list.

We first design a symmetric pretraining module to learn contextual information about users and items using the Masked Language Model (MLM) approach. More specifically, we preprocess the dataset to obtain sequences of users and items and mask the sequences at a certain ratio. Additionally, we classify each sequence to increase the difficulty of the pretraining task and improve the representation ability of the pretrained embeddings.

To exploit local mutual influence, we propose a new re-ranking framework that takes a ‘scene’ as the basic unit of study. A scene is formally defined as a group of items containing a key item that determines the property of the scene; all items in the scene belong to different categories but maintain high preference similarity with the key item. The preference similarity can be learned from the application goal or defined by a constructed knowledge graph. Given a candidate list and the preference similarity, we can construct the scene candidate list from the candidate item list under the definition of scene. Then, instead of re-ranking every item, we re-rank all scenes in the scene candidate list for recommendation.

Re-ranking the scene candidate list requires representing each scene as a vector, which should be learned from the item features and the preference similarity. We represent one scene as an attributed graph, whose nodes are items, whose edges are determined by the preference similarity, and whose node features are naturally given by the item features. The representation vector of a scene can then be learned by integrating the graph structure and node features. We thus apply GCNs to the constructed graph of each scene to learn its representation vector.

Re-ranking the scenes also requires considering the global mutual influence between different scenes. One scene can have different scene neighbors (context) in different scenarios, like a polysemous word that has multiple possible meanings and appears with different words in different sentences (i.e., in different contexts). It is thus necessary to learn different representations for one scene given its different contexts, and to wrap these representations by taking into account the mutual influence among scenes. In our method, we apply multi-head attention to the representations of all scenes to generate new representations that capture the mutual influence between scenes automatically.

Re-ranking the scenes should also be personalized. Personalization is important in all kinds of recommendation applications [6, 7, 9, 21,22,23,24,25,26,27], as users have different interests. How we learn the interactive relation between users and a list of scenes directly influences the re-ranking result. In the existing study [15], user and item representations are concatenated and then fed to a multi-layer perceptron (MLP) to generate the final result. However, the weights of the MLP strongly depend on the ranking position of the user-item pair: a small perturbation of the items’ order in the candidate list can lead to different results. To overcome this, we use matrix factorization to learn the interaction between users and scenes. Given the embedding of a user and the embeddings of all scenes, the inner product is computed for all user-scene pairs. As the computation for each pair is independent of the other scene embeddings, we obtain a unique result for the same candidate list regardless of its input order.

The objective function determines what the model optimizes, and it should be consistent with the aim of the application. As mentioned before, the aim of re-ranking is to attract users to click or purchase more items, so the user’s interest is the most important factor the objective function should capture. To achieve this, we use a least-squares loss to align the preference score with the user’s interest. Since the results of the matrix factorization between user and scene representations can lie in a large range, we use a clip function to map them into [0, 1] to fit the binary labels. The ranking performance of a re-ranking model can be evaluated by many metrics, such as precision, recall and NDCG, but none of them evaluates the local quality of the result sequence. To quantitatively evaluate the local mutual relationship in the re-ranking results, we propose a new metric based on our scene definition. The results on different datasets show that our algorithm outperforms all other state-of-the-art methods and produces better local sequence quality.

The contribution of this paper is summarized as follows:

  • We propose a new re-ranking framework that exploits local mutual influence by taking a group of related items, namely a scene, as the basic unit of study.

  • We propose a pretraining module that learns user and item contextual information from the dataset with the MLM approach.

  • We apply GCNs and multi-head attention to learn scene representations that integrate both local and global mutual influence. Matrix factorization is applied to generate a unique result for the same candidate list under different input orders.

  • To evaluate the local mutual relationship of the results, we propose a new metric based on our scene definition. Our method outperforms all other state-of-the-art methods on different datasets.

2 Related Work

In this section, we discuss the works related to re-ranking. We divide them into two categories based on their optimization solutions. One type constructs an objective function that can be approximately optimized by greedy methods; these works are also called post-processing methods, because re-ranking is conducted on the candidate list as a post-processing step. The other type utilizes machine learning methods to learn a model from historical data.

2.1 Post Processing Methods

Carbonell et al. [28] propose maximal marginal relevance (MMR) to perform re-ranking, which aims at diversifying the ranked list in information retrieval engines. It combines the relevance score and the diversity score in the objective function to select the optimal list. Considering the interest decay of different categories, IA-Select [11] sets the product of the category preference and item preference as the objective function, where the category preference decays as more items of that category are selected. xQuAD [12] considers relevance, diversity and novelty together in a combined objective function with two terms, which has the same form as [28]. Recently, a new re-ranking method [13] was proposed to address the fairness problem in recommendation systems, i.e., the discrimination against specific user communities caused by data imbalance. Kullback–Leibler (KL) divergence is utilized as the main part of the objective function and as the evaluation metric. Their results show that this method trades accuracy for fairness. However, all these methods are based on hand-crafted objective functions and do not directly utilize the interaction history in the re-ranking process.

2.2 Learning to Rank Methods

Users’ history data contain a huge amount of interaction information between users and items, which should be fully utilized in the re-ranking process. Traditional recommendation algorithms usually consider users’ primary preferences and ignore their needs for long-tail items. As a result, the display rate of long-tail items in the recommendation list is usually very low, while these items may be exactly what users are more interested in. Kim et al. [14] propose a recommendation framework based on sequences and diversity. They use the gross merchandise volume (GMV) as the objective and learn the score of an input sequence with an RNN; beam search is used to find the optimal result from limited candidate lists. The framework consists of two steps: the first generates a sequence of items based on the user’s historical behavior and preferences, and the second enhances diversity by adding long-tail items to the sequence to improve the diversity of the recommendation list.

With the great success of the Transformer [29] in machine translation, researchers have started to apply it to recommendation systems. Pei et al. [15] are the first to explicitly introduce personalized information into the re-ranking task in a large-scale online system. They use the Transformer to learn mutual influence: the self-attention mechanism can model user-specific interactions between any two items more efficiently than RNN-based approaches. This method concatenates user and item features as the input of the Transformer and uses a multi-layer perceptron with softmax to generate the final score of every item; the final result is generated by sorting all items by their preference scores.

IRGPR [16] designs a heterogeneous graph that combines an item relationship graph and a user-item scoring graph, capturing both item-item and user-item relationships to learn user and item embeddings.

3 Definition

3.1 Problem Definition

Given a user set \(U = \{u_1, u_2, \ldots , u_m\}\) with m users and an item set \(V = \{v_1, v_2, \ldots , v_n\}\) with n items, we use \(C = \{c_1,c_2, \ldots , c_n\}\) to denote the categories of all items in V. The association matrix is denoted as \(R\in \mathbb {R}^{m\times n}\), where \(R_{i,j} = 1\) means user \(u_i\) likes item \(v_j\); otherwise, \(R_{i,j} = 0\). Note that \(R_{i,j} = 0\) does not indicate that user \(u_i\) will not buy \(v_j\); it may simply mean that user \(u_i\) has not found \(v_j\) yet. The goal of re-ranking is, given the candidate list \(\{v_{i1}, v_{i2}, \ldots , v_{ik}\}\) for every user \(u_i\), to generate a reordered list that leads the user to click or buy more items than the original candidate list.

3.2 Scene Definition

To utilize the local mutual influence between items explicitly, we first need to define the local relationship. In our work, we mainly consider two kinds of local mutual relationship: diversity and similarity. We construct a group of adjacent items, namely a ‘scene’, to satisfy these two properties. A scene has a key item that indicates the main characteristic of the scene, and all items in one scene belong to different categories but maintain high preference similarity with the key item. The choice of preference similarity varies across applications and situations; for example, it may be highly related to categories for a movie application, or depend on content for a news website. There are mainly two ways to construct the preference relationship between items: one is to learn it from the application goal, and the other is to extract it from a constructed knowledge graph. In either case, the preference similarity can be represented as a graph structure, and the preference similarity between items \(v_i\) and \(v_j\) is denoted as \(pr(v_i,v_j)\).

Fig. 1: The details of the Mask Pretraining Module. The left part denotes the item pretraining block and the right part denotes the user pretraining block

Therefore, given a scene \(S = \{v_{s1},v_{s2},\ldots ,v_{sk}\}\) with key item \(v_{o}\in S\), the property of this scene can be formulated as follows:

$$\begin{aligned} \begin{aligned}&c_i \ne c_j, \quad \forall v_i,v_j \in S,\; i \ne j,\\&pr(v_i, v_{o}) > \gamma , \quad \forall v_i \in S, \end{aligned} \end{aligned}$$
(1)

where \(\gamma \) is a threshold ensuring that the items in one scene have a compact preference relationship. Under this definition, the scene keeps the properties of diversity and similarity locally. Then, given the maximal cardinality \(n_s\) of a scene, the candidate item list \(\{v_{i1}, v_{i2}, \ldots , v_{ik}\}\) for user \(u_i\) can be converted into a scene candidate list \(\{S_1,S_2, \ldots ,S_{k_s}\}\); a sketch of this construction is given below. More details about the construction of scenes can be found in the experiment settings.
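
To make this concrete, the following is a minimal sketch of a greedy procedure that groups a candidate item list into scenes under the two conditions of (1). The function name, the `category` and `pr` inputs, and the greedy strategy itself are illustrative assumptions; the paper only fixes the two scene conditions and the maximal cardinality \(n_s\).

```python
def build_scene_list(candidates, category, pr, gamma=0.6, n_s=3):
    """Greedily group a candidate item list into scenes (a sketch).

    candidates: ranked list of item ids
    category:   dict item_id -> category id
    pr:         callable pr(v_i, v_j) -> preference similarity
    gamma:      similarity threshold from Eq. (1)
    n_s:        maximal cardinality of a scene
    """
    scenes, used = [], set()
    for key in candidates:                  # each unused item may start a scene
        if key in used:
            continue
        scene, cats = [key], {category[key]}
        used.add(key)
        for v in candidates:                # attach compatible items to the key item
            if len(scene) == n_s:
                break
            if v in used or category[v] in cats:
                continue                    # diversity: categories must differ
            if pr(v, key) > gamma:          # similarity: close to the key item
                scene.append(v)
                cats.add(category[v])
                used.add(v)
        scenes.append(scene)
    return scenes
```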

4 Proposed Method

4.1 Mask Pretraining for Embedding

Obtaining accurate initial user and item embeddings is essential for improving the performance of the re-ranking model. Direct initialization, using techniques such as random initialization and xavier_uniform initialization, is the most common way to acquire the initial embeddings. Unfortunately, direct initialization is known to cause varying model performance and instability, leading to reduced accuracy in the final ranking list. To mitigate this issue, researchers in the NLP and CV fields have adopted pretraining methods [30, 31] to acquire robust weights and embeddings that capture global structural knowledge from the dataset [32]. These methods perform generic tasks, such as generation and classification, and use the resulting model weights and input embeddings to improve downstream tasks. In our approach, we use the Masked Language Model (MLM) to pretrain user and item embeddings on the dataset. The MLM predicts the masked tokens in a given sequence, which allows the model to learn contextual relationships between tokens. This ensures that the pretrained embeddings capture the semantic meaning of the user and item features, leading to better downstream performance.

In NLP, the input is typically a sequence of tokens, which makes pretraining techniques like MLM applicable. To apply the MLM to re-ranking, we use the sequences of items that a user has previously interacted with as item sequences; likewise, we use the sequences of users that have interacted with the same item as user sequences. We mask a certain proportion of tokens in both the item and user sequences and predict the masked tokens. Additionally, we classify each sequence to increase the difficulty of the pretraining task and improve the representation ability of the pretrained embeddings. Specifically, we classify each item sequence into the corresponding user category and each user sequence into the corresponding item category, where the total number of user categories equals the number of users in the dataset and the total number of item categories equals the number of items. This ensures that each sequence is classified into a meaningful category, enabling the pretrained embeddings to capture the semantic meaning of the user and item features more accurately.

Due to the symmetry of the pretraining processes for users and items, we mainly elaborate on the pretraining process for items. The pretraining module is shown in Fig. 1. Given the list of items \(\textbf{u}= \{ v_{1}, v_{2}, \ldots , v_{k}\}\) that a user has interacted with, we randomly mask some items and replace them with the special token [MASK]. Note that the first and last items are not masked, since they serve as significant prompt positions. To enable information extraction, the special token [CLS] is prepended to the list and serves as the classification header for the entire list.
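
As an illustration of this masking step, here is a minimal sketch. The mask ratio and helper names are assumptions; only the rules stated above (first and last items kept, [CLS] prepended, masked positions remembered as labels) come from the text.

```python
import random

MASK, CLS = "[MASK]", "[CLS]"

def mask_sequence(items, mask_ratio=0.15):
    """Mask an interaction sequence for MLM pretraining (a sketch)."""
    tokens, labels = [CLS], [None]          # position 0 is the [CLS] header
    for i, v in enumerate(items):
        maskable = 0 < i < len(items) - 1   # never mask the first or last item
        if maskable and random.random() < mask_ratio:
            tokens.append(MASK)
            labels.append(v)                # remember the original item id
        else:
            tokens.append(v)
            labels.append(None)             # no loss at unmasked positions
    return tokens, labels
```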

For each \(v_{i} \in \textbf{u}\), we first transform each item ID to embedding as follows:

$$\begin{aligned} \textbf{p}_{i} = \text {TokEmb}(v_{i}) + \text {PosEmb}(i), \end{aligned}$$
(2)

where \(\text {TokEmb}(\cdot )\) denotes the token embedding layer and \(\text {PosEmb}(\cdot )\) denotes the learnable position encoding. Then, \(\textbf{u}\) is transformed into the embedding sequence \(\textbf{P}= \{ \textbf{p}^{cls}, \textbf{p}_{1}, \textbf{p}_{2}^{mask}, \ldots ,\textbf{p}_{k}\}\). Next, we use two Transformer encoder blocks as the encoder layer to learn the user’s history information as follows:

$$\begin{aligned} \textbf{P}^\prime = \text {TransformerEncoder}(\textbf{P}), \end{aligned}$$
(3)

where the masked items can be recovered from contextual information through the global attention of the Transformer. Then, we use a linear layer and softmax to compute the probabilistic output as follows:

$$\begin{aligned} y_{i}^{mask} = \text {softmax}(\text {Linear}(\textbf{p}_{i}^{mask})), \end{aligned}$$
(4)

where \(y_{i}^{mask}\) is a \(|V|\)-dimensional vector and \(\textbf{p}_{i}^{mask} \in \textbf{P}^\prime \). Note that we only compute outputs at the masked positions. We use a similar approach, but with a different linear layer, to compute the sequence classification probability as follows:

$$\begin{aligned} y^{cls} = \text {softmax}(\text {Linear}(\textbf{p}^{cls})), \end{aligned}$$
(5)

Note that [CLS] is fixed at position 0 of a user’s item sequence.

We then evaluate the cross-entropy loss over all masked positions of a user’s item sequence as follows:

$$\begin{aligned} L^{mask} = -\sum _{i \in \mathcal {M}}\sum _{j = 1}^{|V|} y_{ij} \text {log}y_{ij}^{mask}, \end{aligned}$$
(6)

where \(\mathcal {M}\) is the set of masked positions in the user’s item sequence and \(y_{ij}\) denotes the label of the masked item at position i in this sequence. The sequence classification loss is computed in a similar way:

$$\begin{aligned} L^{cls} = -y\log y^{cls}, \end{aligned}$$
(7)

where y is the user category label.

Finally, we sum the losses over all users:

$$\begin{aligned} L_{pre} = \sum _{k = 1}^{|U|} (L_k^{mask} + L_k^{cls}). \end{aligned}$$
(8)

Through the above process we obtain item embeddings; the user embedding process is symmetric. After pretraining, we obtain user embeddings \(\textbf{P}= \{\textbf{p}_1, \textbf{p}_2, \ldots , \textbf{p}_m\}\) and item embeddings \(\textbf{Q}= \{\textbf{q}_1, \textbf{q}_2, \ldots , \textbf{q}_n\}\) that are used in the subsequent tasks.
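
To summarize Eqs. (2)–(8), here is a minimal PyTorch sketch of the item pretraining block. The two encoder blocks, the [CLS]-at-position-0 convention and the two losses follow the text; the embedding dimension, number of heads, maximal sequence length and the use of -1 to mark unmasked positions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskPretrainer(nn.Module):
    """A sketch of the item pretraining block, Eqs. (2)-(8)."""

    def __init__(self, n_items, n_user_cats, d=64, max_len=64, n_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(n_items + 2, d)   # item ids + [MASK] + [CLS]
        self.pos_emb = nn.Embedding(max_len, d)       # learnable positions, Eq. (2)
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # Eq. (3)
        self.mask_head = nn.Linear(d, n_items)        # masked-item prediction, Eq. (4)
        self.cls_head = nn.Linear(d, n_user_cats)     # sequence classification, Eq. (5)

    def forward(self, seq):                           # seq: (B, L) token ids
        pos = torch.arange(seq.size(1), device=seq.device)
        h = self.encoder(self.tok_emb(seq) + self.pos_emb(pos))     # P' in Eq. (3)
        return self.mask_head(h), self.cls_head(h[:, 0])   # [CLS] sits at position 0

def pretrain_loss(mask_logits, mask_labels, cls_logits, cls_label):
    """L^mask + L^cls for one sequence batch (Eqs. 6-8)."""
    ce = nn.functional.cross_entropy
    keep = mask_labels >= 0                           # -1 marks unmasked positions
    l_mask = ce(mask_logits[keep], mask_labels[keep])   # Eq. (6)
    l_cls = ce(cls_logits, cls_label)                 # Eq. (7)
    return l_mask + l_cls                             # summed over users in Eq. (8)
```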

4.2 Local Mutual Influence

Automatically learning scene representations is the first task to tackle in our framework. Given the scene property in (1), we use graph convolutional networks (GCNs) to learn the representation of each scene (Fig. 2).

Fig. 2: The framework

To apply GCNs to our problem, we construct a graph G from the preference similarity scores \(pr(v_i,v_j)\) for all \(v_i, v_j \in V\). According to the construction of scenes in (1), we only utilize high preference similarity scores to aggregate items into one scene so as to preserve similarity. To make the graph G consistent with the items in one scene, we construct a truncated graph as follows:

$$\begin{aligned} G_{i,j} = {\left\{ \begin{array}{ll} pr(v_i,v_j) &{} \quad \text {if} \quad pr(v_i,v_j) > \gamma \quad \text {and}\quad i\ne j\\ 0 &{} \quad \text {otherwise} \end{array}\right. }, \end{aligned}$$
(9)

where \(\gamma \) is the same threshold as in (1). In this way, only highly related items are connected in G, which makes G sparse. Moreover, the most important property of a scene is to capture accurate user preference, which is indicated by the key item of the scene. We therefore construct the adjacency matrix A with a weight \(\alpha \) that controls the influence of neighbors as follows:

$$\begin{aligned} A = I + \alpha \text {Norm}(G), \end{aligned}$$
(10)

where I is the identity matrix and \(\text {Norm}(G)\) is the normalization function that divides every entry by the L1-norm of its corresponding row vector. As every scene is a subset of the item set V, we only need a subgraph of G to generate the scene representation with GCNs. Given one scene \(S= \{v_{s1},v_{s2}, \ldots v_{sk}\}\) with key item \(v_o\), we obtain the representation \(\textbf{s}\) of this scene as:

$$\begin{aligned} \textbf{s}= A_o\textbf{P}_S\textbf{W}, \end{aligned}$$
(11)

where \(A_o \in \mathbb {R}^{1\times k}\) is the row vector of the adjacency matrix A for the key item \(v_o\), \(\textbf{P}_S\in \mathbb {R}^{k\times d}\) is the stack of the features \(\{\textbf{q}_{s1},\textbf{q}_{s2}, \ldots ,\textbf{q}_{sk}\}\) of all items in the scene S obtained from the pretraining module, d is the dimension of the item features, and \(\textbf{W}\in \mathbb {R}^{d\times d_s}\) is a learned parameter that projects the combined item features into a new space of dimension \(d_s\). By applying GCNs to scenes in this way, we learn the scene representations automatically during training.
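
The following is a minimal NumPy sketch of Eqs. (9)–(11) for a single scene. The function signature and the dense-matrix treatment are illustrative assumptions; a production implementation would operate on the sparse global graph.

```python
import numpy as np

def scene_representation(P_S, pr_S, key_idx, W, alpha=0.01, gamma=0.6):
    """Compute a scene embedding s = A_o @ P_S @ W (Eqs. 9-11); a sketch.

    P_S:     (k, d) item features of the scene from the pretraining module
    pr_S:    (k, k) preference similarity among the scene's items
    key_idx: index of the key item v_o within the scene
    W:       (d, d_s) learned projection matrix
    """
    G = np.where(pr_S > gamma, pr_S, 0.0)       # truncated graph, Eq. (9)
    np.fill_diagonal(G, 0.0)                    # no self-loops in G
    row_norm = np.abs(G).sum(axis=1, keepdims=True)
    G_norm = np.divide(G, row_norm, out=np.zeros_like(G), where=row_norm > 0)
    A = np.eye(len(G)) + alpha * G_norm         # Eq. (10)
    return A[key_idx] @ P_S @ W                 # Eq. (11): the scene vector s
```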

4.3 Global Mutual Influence

To avoid the influence of the input order, we apply the Transformer encoder, whose main component is multi-head attention, to learn the mutual relationship between different samples. For a list of samples, attention treats each sample equally without incorporating sequence-order information. Given the scene matrix \(\textbf{S}\in \mathbb {R}^{k\times d_s}\), which stacks all scene representations \(\{\textbf{s}_1, \textbf{s}_2, \ldots , \textbf{s}_k\}\) in the candidate list, the attention over \(\textbf{S}\) is:

$$\begin{aligned} Att(\textbf{S}) = softmax \left(\frac{\textbf{S}\textbf{W}_Q(\textbf{S}\textbf{W}_K)^\top }{\sqrt{d_s}} \right)\textbf{S}\textbf{W}_V, \end{aligned}$$
(12)

where \(\textbf{W}_Q \in \mathbb {R}^{d_s\times d_k}\), \(\textbf{W}_K\in \mathbb {R}^{d_s\times d_k}\) and \(\textbf{W}_V\in \mathbb {R}^{d_s\times d_k}\) are learned parameters that generate the queries, keys and values, respectively, and the softmax maps the influence between different samples into the range [0, 1]. In this way, we obtain a new representation for each scene that automatically captures the mutual influence with the other scenes.

The mutual influence between different scenes is complicated. To learn different kinds of mutual relations, we apply multi-head attention to generate the scene embedding \(\textbf{S}_M\in \mathbb {R}^{k \times d_{s}}\), which contains h independent mutual relations:

$$\begin{aligned} \begin{aligned}&\textbf{S}_M = [Att(\textbf{S})^1;Att(\textbf{S})^2; \ldots ;Att(\textbf{S})^h],\\&\text {where } Att(\textbf{S})^i = softmax \left(\frac{\textbf{S}\textbf{W}^i_Q(\textbf{S}\textbf{W}^i_K)^\top }{\sqrt{d_s}} \right)\textbf{S}\textbf{W}^i_V, \quad \forall i\in [1:h]. \end{aligned} \end{aligned}$$
(13)

In (13), \([\cdot \,;\cdot ]\) denotes the concatenation operation that concatenates the outputs of all heads along the feature dimension, h is the number of heads, \(\{\textbf{W}^i_K, \textbf{W}^i_Q, \textbf{W}^i_V\}\) are the learned parameters of the i-th attention \(Att(\textbf{S})^i\), and the attention dimension is set to \(d_k = \frac{d_s}{h}\). A feed-forward network (FFN) is applied to the output of the multi-head attention to improve the quality of the representations:

$$\begin{aligned} \textbf{S}_E = F\!F\!N(\textbf{S}_M) = \max (0, \textbf{S}_M\textbf{W}_{F}^1 + \textbf{b}^1)\textbf{W}_{F}^2 + \textbf{b}^2, \end{aligned}$$
(14)

where \(\{\textbf{W}_{F}^1\in \mathbb {R}^{d_s\times d_{f}},\textbf{W}_{F}^2 \in \mathbb {R}^{d_f\times d_{s}},\textbf{b}^1,\textbf{b}^2\}\) are learned parameters of the FFN and \(d_f\) is the FFN dimension. The multi-head attention and FFN block is applied \(n_e\) times to generate the scene embeddings, and we use \(\textbf{S}_E^{n_e}\) to denote the final scene embeddings generated by the \(n_e\) blocks. With this structure, we learn the mutual influence between different scenes automatically; a sketch of one block is given below.
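
For clarity, here is a minimal PyTorch sketch of one multi-head attention plus FFN block over the scene matrix (Eqs. 12–14). The fused per-head projections and the default dimensions are implementation assumptions; the \(\sqrt{d_s}\) scaling follows the equations above.

```python
import torch
import torch.nn as nn

class SceneEncoderBlock(nn.Module):
    """One multi-head attention + FFN block, a sketch of Eqs. (12)-(14)."""

    def __init__(self, d_s=40, h=2, d_f=40):
        super().__init__()
        self.W_q = nn.Linear(d_s, d_s, bias=False)  # stacks the h query projections
        self.W_k = nn.Linear(d_s, d_s, bias=False)
        self.W_v = nn.Linear(d_s, d_s, bias=False)
        self.h, self.d_k = h, d_s // h              # d_k = d_s / h
        self.ffn = nn.Sequential(                   # Eq. (14): max(0, xW1+b1)W2+b2
            nn.Linear(d_s, d_f), nn.ReLU(), nn.Linear(d_f, d_s))

    def forward(self, S):                           # S: (k, d_s) scene matrix
        k, d_s = S.shape
        # split the projections into h heads: (h, k, d_k)
        Q = self.W_q(S).view(k, self.h, self.d_k).transpose(0, 1)
        K = self.W_k(S).view(k, self.h, self.d_k).transpose(0, 1)
        V = self.W_v(S).view(k, self.h, self.d_k).transpose(0, 1)
        att = torch.softmax(Q @ K.transpose(1, 2) / d_s ** 0.5, dim=-1) @ V
        S_M = att.transpose(0, 1).reshape(k, d_s)   # concatenate heads, Eq. (13)
        return self.ffn(S_M)                        # S_E for this block
```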

4.4 User-Scene Interaction

For a user \(u_i\), we apply a multi-layer perceptron (MLP) to the feature \(\textbf{p}_i\) obtained in the pretraining module to generate the hidden embedding \(\textbf{e}_i\) as follows:

$$\begin{aligned} \textbf{e}_i = M\!L\!P(\textbf{p}_i) = \textbf{W}_U\textbf{p}_i+ \textbf{b}_U, \end{aligned}$$
(15)

where \(\{\textbf{W}_U\in \mathbb {R}^{d_s\times d_U}, \textbf{b}_U\}\) are learned parameters that project the user feature (of dimension \(d_U\)) into a space of the same dimension as \(\textbf{S}_E^{n_e}\).

Then we can get the preference score for every scene embedding independently as follows:

$$\begin{aligned} \textbf{r}_i = \textbf{e}_i\otimes \textbf{S}_E^{n_e} = \{\textbf{e}_i^\top {\tilde{\textbf{s}}}_1,\textbf{e}_i^\top {\tilde{\textbf{s}}}_2, \ldots ,\textbf{e}_i^\top {\tilde{\textbf{s}}}_k\}, \end{aligned}$$
(16)

where \(\otimes \) is the vector-wise inner product and \({\tilde{\textbf{s}}}_i\) is the embedding of the i-th scene in \(\textbf{S}_E^{n_e}\). However, there is no constraint on the value range of \(\textbf{r}_i\), which makes it hard to optimize with existing loss functions. Instead of directly using \(\textbf{r}_i\) as the preference score, we refine it as follows:

$$\begin{aligned} {\tilde{\textbf{r}}}_i = clip(\textbf{r}_i), \end{aligned}$$
(17)

where clip truncates all elements to the range [0, 1]. Once we obtain the predicted scores \({\tilde{\textbf{r}}}_i\) for user \(u_i\), we sort them to generate the scene list. Within each scene, items keep the order in which they joined the scene.
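
A minimal sketch of the user-scene interaction (Eqs. 15–17) follows; the parameter shapes are assumptions chosen so that the projection lands in the scene-embedding space, and `torch.clamp` stands in for the clip function.

```python
import torch

def user_scene_scores(p_i, S_E, W_U, b_U):
    """Preference scores of one user over all scenes (Eqs. 15-17); a sketch.

    p_i: (d_U,) user feature from pretraining; S_E: (k, d_s) scene embeddings;
    W_U: (d_s, d_U) and b_U: (d_s,) are the learned MLP parameters.
    """
    e_i = W_U @ p_i + b_U                # Eq. (15): project the user into scene space
    r_i = S_E @ e_i                      # Eq. (16): inner product with every scene
    return torch.clamp(r_i, 0.0, 1.0)    # Eq. (17): clip the scores to [0, 1]
```

Because each score depends only on the user embedding and one scene embedding, permuting the input order permutes the scores in the same way, which is consistent with the order-independence argument above.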

4.5 Model Optimization

There are two main kinds of objective functions for the re-ranking problem in the learning-to-rank framework: list-wise and point-wise. A list-wise objective function produces a single score that evaluates the quality of the whole ranking list; its main drawback is that we would need to input all possible orders of the candidate list to find the best result.

Unlike the list-wise objective, a point-wise objective generates a preference score for each basic unit of the input, so we can sort the list by score without time-consuming enumeration. However, a key problem is that the predicted score of a positive item may be far smaller than its label, especially when the candidate list is large or contains multiple positive items. To overcome this, for a user \(u_u\) we apply the least-squares loss to the difference between the predicted preference scores and the scene labels. For every scene, we use the label of its key item as the scene’s label. Then, given the scene list \(s = \{s_1,s_2, \ldots ,s_k\}\) and the predicted scores \({\tilde{\textbf{r}}}_u\), the loss for user \(u_u\) is

$$\begin{aligned} L_s^u = \sum _{i=1}^k\Vert R_{u,i_o } - \tilde{r}_{u,i}\Vert ^2 \end{aligned}$$
(18)

where \(\tilde{r}_{u,i}\) is the preference score of the i-th scene in \({\tilde{\textbf{r}}}_u\) and \(v_{i_o}\) is the key item of scene \(s_i\).

To mitigate overfitting and gradient explosion, we apply an L2-norm penalty on all learned parameters, giving the overall loss function L:

$$\begin{aligned} L = \sum _{u,s}L_s^u +\lambda \sum _{\textbf{W}\in \textbf{W}^*}\Vert \textbf{W}\Vert ^2 \end{aligned}$$
(19)

where \(\textbf{W}^*\) is the set of all parameters to learn and \(\lambda \) is the weight that controls the contribution of the regularization term.
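
Putting Eqs. (18) and (19) together, here is a minimal sketch of the training loss for one user; the tensor shapes and the parameter list are assumptions.

```python
import torch

def rerank_loss(scores, labels, params, lam=1e-3):
    """Point-wise least-squares loss with L2 regularization (Eqs. 18-19).

    scores: (k,) clipped preference scores for one user's scene list
    labels: (k,) binary labels R_{u,i_o} of the scenes' key items
    params: iterable of learned weight tensors (the set W* in Eq. 19)
    """
    l_user = ((labels - scores) ** 2).sum()         # Eq. (18)
    l_reg = sum((W ** 2).sum() for W in params)     # L2 norm of learned weights
    return l_user + lam * l_reg                     # Eq. (19)
```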

5 Experiment

In this section, we detail our experiment settings and results, and compare the performance of our algorithm under different parameter settings. Finally, an online test on a real recommendation system is presented with analysis.

5.1 Dataset

We use three public datasets to evaluate the performance of our algorithm and the baselines. Two of them are MovieLens-1M and MovieLens-20M [33], collected from a movie website with users’ ratings; we treat ratings greater than or equal to 3 as positive labels and the rest as negative labels. The third is a video game recommendation dataset. For every dataset, we filter out users with fewer than 15 positive items. We sort each user’s interactions by timestamp and use the first 80% of the positive data as the training set and the last 20% as the test set. In the training process, we use every positive item as a key item and randomly select \((n_s-1)\) different but relevant items to construct the scene. Moreover, at least 2/3 of the scenes in one batch are positive, and the negative ones are generated randomly in every epoch. To simulate re-ranking in the testing process, we use a basic recommendation method to generate relevance scores for all user-item pairs; for every user, we select the top 60 items and append all remaining positive items of the test set to generate the candidate list.

5.2 Parameter Selection

The parameters of our model are set by searching over the following ranges: \(\alpha \in \{0.001,0.01, 0.1\}, \) \(\gamma = 0.6, \) \(d_s \in \{10,20,30,40,50\}, \) \(h \in \{1,2,5\}, \) \(n_s\in \{1,2,3,4,5\}, \) \(n_e\in \{1,2,4,8\}, \) \(\lambda \in \{0.0001,0.001,0.01\}, \) and \(d_f \in \{10,20,30,40,50\}. \)

To evaluate the quality of the local mutual relationship quantitatively, we propose a new metric, the scene ratio (SR@N), based on our definition of scene. Given the recommendation list \(V^r = \{v^r_1,v^r_2, \ldots ,v^r_N\}\), we construct scenes sequentially from the head of the list to the end. Starting with \(i=1\), the item \(v^r_i\) is set as the key item, and we find the largest k such that \(\{v^r_i, v^r_{i+1}, \ldots ,v^r_{i+k}\}\) is a scene with key item \(v^r_i\); k can be 0, in which case the scene has cardinality 1. We then set \(i=i+k+1\) and repeat this process until \(v^r_N\) belongs to the last scene. We use PS to denote the set of scenes whose cardinality is larger than 1 and NS to denote the set of scenes that contain only one item. Then SR is computed as follows:

$$\begin{aligned} S\!R@N = \frac{|P\!S|}{|P\!S|+|N\!S|}, \end{aligned}$$
(20)

where \(|\cdot |\) is the cardinality of the set.
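
A minimal sketch of SR@N as described above follows; it reuses the scene conditions of (1) with the same \(\gamma \) and maximal cardinality \(n_s\), and the `category` and `pr` inputs are the same illustrative assumptions as before.

```python
def scene_ratio(rec_list, category, pr, gamma=0.6, n_s=3):
    """SR@N (Eq. 20): segment the top-N list into scenes from left to
    right and return the fraction of scenes with more than one item."""
    ps = ns = 0                             # |PS| and |NS|
    i = 0
    while i < len(rec_list):
        key = rec_list[i]                   # v_i^r is the key item
        cats, k = {category[key]}, 0
        for j in range(i + 1, min(i + n_s, len(rec_list))):
            v = rec_list[j]
            if category[v] in cats or pr(v, key) <= gamma:
                break                       # the scene ends at the first violation
            cats.add(category[v])
            k += 1                          # largest k found so far
        if k > 0:
            ps += 1                         # a scene with cardinality > 1
        else:
            ns += 1                         # a singleton scene
        i += k + 1                          # the next key item follows this scene
    return ps / (ps + ns)
```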

In our experiments, we set \(N\in \{5,10,15\}\) for precision, recall, NDCG and SR. For each metric, we first compute the value for each user on the test data and then report the average over all users.

5.3 Metrics

We use three metrics to evaluate the ranking performance of our algorithm: NDCG@N, Precision@N and Recall@N. NDCG@N is widely used in ranking tasks to evaluate the quality of the sequential list, while Precision@N and Recall@N together assess the quality of the recommendation list.

5.4 Baselines

We compare our algorithm with several widely used or recently published state-of-the-art baselines:

  • BPRMF [34]: A recommendation method with a pair-wise loss based on personal visiting history; it proposes an optimization criterion based on maximum a posteriori estimation. In all our experiments, we use this algorithm as the basic method to generate the candidate list for re-ranking.

  • MMR [28]: This algorithm uses maximal marginal relevance (MMR) for re-ranking, combining the relevance score and diversity score in the objective function to select the optimal list. In our experiments, we use MMR-0.1 and MMR-0.9 to denote the results with diversity-score weights 0.1 and 0.9, respectively.

  • \({\textbf{C}_{\mathbf {K\!L}}}\) [13]: This method addresses the fairness problem in recommendation systems, i.e., the discrimination against specific user communities caused by data imbalance. Kullback–Leibler (KL) divergence is used as the main part of the objective function and as the evaluation metric. We set the weight of the KL divergence to 0.1 in our experiments.

  • PRM [15]: This method utilizes the Transformer encoder to generate item embeddings that encode the mutual influence between items. The user and item representations are concatenated and then fed to a multi-layer perceptron with softmax to generate the final score of every item. The final result is generated by sorting all items by their preference scores.

Table 1 Experiment results on MovieLens-1M
Table 2 Experiment results on MovieLens-20M
Table 3 Experiment results on Video Games
Table 4 Ablation experiment on MovieLens-1M
Table 5 Ablation experiment on MovieLens-20M
Table 6 Ablation experiment on Video Games

5.5 Main Results

The evaluation results of our model and the baselines are presented in Table 1 for the MovieLens-1M dataset, Table 2 for the MovieLens-20M dataset, and Table 3 for the Video Games dataset. From these results, we make the following observations:

  • From the comparison between BPRMF and our algorithm on NDCG, precision and recall, we find that our re-ranking strategy improves performance significantly. The main reason is that standard recommendation algorithms usually focus only on the user’s main interest and cannot overcome the long-tail problem; items at the top of the list are thus likely to share the same category, and the low diversity makes users lose interest.

  • From the results of MMR and \(C_{K\!L}\), we find that the performance of the post-processing methods is similar to that of the basic recommendation method BPRMF when the weight of the relevance score is large. Our method outperforms them by a large margin, as the way they utilize contextual information is fixed and they cannot learn the complicated relationships comprehensively.

  • From the comparison between MMR-0.9 and our algorithm, we see that MMR makes the list so diverse that adjacent items have no relation at all. But MMR can still improve the recommendation list generated by BPRMF, showing that the candidate list before re-ranking lacks diversity.

  • From the comparison between PRM and our algorithm, we see that our algorithm outperforms it on all metrics. The main reason is that our algorithm not only utilizes the global mutual influence but also exploits the local mutual influence. Moreover, our algorithm learns the interaction between users and items better, as matrix factorization can model the interaction for different scenes independently.

  • From the comparison between all baselines and our algorithm on the SR metric, we see that none of them considers the local mutual influence. The superior performance of our method on the other metrics demonstrates the importance of considering the local mutual influence.

5.6 Ablation Experiments

We set up several experiments to verify the effectiveness of our model components:

  • w/o pretraining: We replace the pretraining embeddings with randomly initialized embeddings.

  • w/o MF: We replace matrix factorization with concatenation.

The results of the ablation experiments are presented in Table 4 for the MovieLens-1M dataset, Table 5 for the MovieLens-20M dataset, and Table 6 for the Video Games dataset. From these results, we make the following observations:

  • The pretraining module effectively improves the performance of our model because it learns users’ interaction histories and items’ contextual information from the dataset. The pretraining module is trained on two tasks, history sequence classification and recovering the [MASK] tokens, which makes the learned representations more robust in capturing user and item information. In contrast, randomly initialized embeddings are affected by the choice of initialization method and contain no contextual information about the dataset.

  • Compared to concatenation, matrix factorization is not sensitive to the ranking position of the user-item pair. The inner product is widely used in recommendation to compute users’ preferences, and its computation is independent of the other scene embeddings.

6 Conclusion

In this paper, we propose a new framework for re-ranking that utilizes the local mutual influence between items. We define a group of related items, namely a scene, as the basic unit of re-ranking. Each scene has a key item that denotes the property of the scene, and all other items in the scene belong to different categories but have a preference relationship with the key item. We first design a symmetric pretraining module for user and item embeddings to learn contextual information from the dataset. To exploit the local mutual relations, we use GCNs to generate the scene representations; we also incorporate the global mutual influence by applying multi-head attention to the representations of all scenes. To learn the interactive relation between a user and scenes independently, we apply matrix factorization to all user-scene pairs. We further propose a new metric to evaluate the local mutual relationship quantitatively. The results on different public datasets show that our algorithm outperforms all other state-of-the-art methods significantly.