# A Variational Latent Variable Model with Recurrent Temporal Dependencies for Session-Based Recommendation (VLaReT)

## Abstract

This paper presents an innovative deep learning model, namely the Variational Latent Variable Model with Recurrent Temporal Dependencies for Session-Based Recommendation (VLaReT). Our method combines a Recurrent Neural Network with Amortized Variational Inference (AVI) to enable increased predictive learning capabilities for sequential data. We use VLaReT to build a session-based Recommender System that can effectively deal with the data sparsity problem. We posit that this capability will allow for producing more accurate recommendations on a real-world sequence-based dataset. We provide extensive experimental results which demonstrate that the proposed model outperforms currently state-of-the-art approaches.

## Keywords

Recurrent networks, Latent variable models, Deep learning recommender systems

## 1 Introduction

Recommender Systems (RS) aim to enhance user experience and provide accurate personalized recommendations when used as smart tools in e-commerce applications [1, 2]. Studies on RS have mainly focused on classic collaborative neighborhood techniques or matrix factorization methods. These techniques work well when a solid user profile is available, which helps to ameliorate the fundamental challenges RS face, such as the cold-start problem and data sparsity [3]. In recent years, researchers have examined the use of RS in a broader range of applications, in order to produce accurate suggestions in contexts that have not been addressed in the past, or in complex problems such as sequence-based and session-based recommendation.

Session-based recommendation is a recently introduced challenge in the context of RS, first presented in the RecSys Challenge 2015 [4]. In the session-based context, an RS delivers recommendations taking into account only the users’ actions in the current session [5]. To achieve this, the RS processes the historical data of users captured in an active session, while utilizing only a small piece of information describing the behavior of the current user, in order to predict their next move (recommended item).

The success of Deep Neural Networks (DNNs) in image and speech recognition [6] constitutes the main motivation that has inspired the use of such models in the context of RS. Moreover, the utilization of Recurrent Neural Networks (RNNs) for modeling variable-length sequence data has gained tremendous attention, and has also been used to deal with the session-based problem [4]. The main difference between feed-forward deep models and RNNs is that RNNs construct a recurrent latent state by leveraging appropriate connections between the units of the network. In a session-based problem, the RS considers as the initial input of the RNN the first item that a user selects when opening a session. Subsequently, each click that follows is used to generate an output that relies upon the previous clicks; this is essentially the recommendation the model generates. The main challenges of session-based recommendation are: (i) the large set of available items, which can be on the order of millions; and (ii) the scalability issues that arise because click-stream datasets are vast, so that the time needed to train the model is enormous. In order to tackle these challenges, RS use ranking loss functions to train the neural networks, and recommend only a set of the top-k items to a user.

This study outlines a model, namely the Variational Latent Variable Model with Recurrent Temporal Dependencies for Session-Based Recommendation (VLaReT), that utilizes scalable (amortized variational) Bayesian inference [7] to increase the performance of classic RNN-based session-based RS by better dealing with data sparsity. Specifically, the proposed approach treats the inferred latent variables of the system as stochastic ones, over which some prior distribution is imposed; this helps the RS engine to account for uncertainty over sparse data, thus producing more accurate results. In addition, the Amortized Variational Inference (AVI) technique, introduced in [8], is used in the proposed model to enable scalability of Bayesian inference to real-world datasets. We provide experimental results that demonstrate that the proposed model outperforms modern rival methodologies in terms of accuracy, without suffering from scalability issues.

The remainder of this paper is structured as follows: Sect. 2 presents an overview of current literature, while Sect. 3 describes the methodology of the proposed model. Section 4 evaluates the model on a challenging public benchmark dataset, and compares the best performing configuration with state-of-the-art models. Finally, the last section concludes the paper, summarizing its contributions and outlining future steps.

## 2 Related Work

Current literature has mainly focused on neighborhood and matrix factorization models. The work presented in [2] utilizes item-based collaborative filtering approaches to deal with the key challenges of RS. Item-based methods analyze the user-item matrix to identify the relationships that exist between the various items, and then utilize those correlations to produce recommendations. A range of methodologies for calculating similarities between items is examined in [2, 3, 9], with evaluation outcomes showing that item-based techniques perform better than user-based approaches in terms of accuracy.

The study in [9] outlines a series of latent models based on matrix factorization (MF) techniques. These techniques represent both users and items as vectors in the same space, and combine scalability with high accuracy when modeling real-world scenarios. When explicit ratings are not available, RS use MF approaches that provide extra information to facilitate inference of user preferences. According to the literature, MF methods yield better results when compared with neighborhood models; the main reason for this is that MF can combine various kinds of data, such as confidence levels and temporal dynamics.

Shani et al. [10] argue that Markov decision processes (MDPs) provide an enhanced methodology that can be utilized in RS to deal with the sequential optimization problem. The MDP model presented in [10] takes into consideration the long-term effects and the estimated value of each suggestion; this has allowed it to outperform the classic Markov Chain model when implemented on commercial websites.

Deep learning approaches have been successfully applied to image and speech recognition [11]. Wang et al. [12] were among the first to present a model that leverages deep learning methods to learn the patterns connecting the content and the ratings matrix, so as to address the data sparsity challenge. Experimental results on a series of real-world scenarios from various contexts have shown that the model of [12] performs better than state-of-the-art alternatives.

Salakhutdinov et al. [13] state that most current collaborative filtering techniques cannot deal with large datasets. To address this problem, they used the Restricted Boltzmann Machine (RBM), which is a two-layer undirected graphical model that can model tabular data. A set of learning and inference methodologies is introduced for the RBM model, which is applied to the Netflix dataset; the results presented in [13] demonstrate its superior performance against mainstream Singular Value Decomposition (SVD) models.

The work in [14] claims that click prediction is one of the main challenges on the World Wide Web, and that most studies in current literature have focused on dealing with this problem using machine learning techniques. On a real-world commercial website, users’ behavior depends on how they acted in the past; thus, the authors in [14] present a model based on an RNN that takes into account users’ previous steps. The proposed model was evaluated on click-through logs of a commercial engine, with the results showing improved click prediction accuracy compared against sequence-independent approaches.

Furthermore, Hidasi et al. [4] outline an RNN to deal with long sequence-type data that can be obtained from e-commerce websites. As previously mentioned, in such cases of sequential data modeling, the frequently used MF techniques are not accurate enough. The proposed model introduces various alterations to the classic RNN, such as the use of Gated Recurrent Units (GRUs) and the ranking loss function used for model training. These are designed in a way that also takes into account practical aspects of the session-based recommendation task. For evaluation purposes, the model is executed on two datasets; the first one is the RecSys Challenge 2015 dataset, and the second one is a dataset collected from an OTT video service platform. Experimental results show that the proposed model outperforms item-KNN, which is the best-performing approach from the large corpus of collaborative filtering techniques that are not based on elaborate machine learning models.

Moreover, the work presented in [5] analyzes RNN-based models for session-based recommendation in more depth, and introduces two techniques that improve the model’s performance. The proposed work was evaluated on the RecSys Challenge 2015 dataset, and the final outcomes were compared with the results presented in Hidasi et al. [4], indicating that the proposed model performs much better. In addition, Jannach et al. [15] demonstrate how a heuristics-based nearest neighbor (kNN) framework, utilized in session-based recommendation, can lead to better accuracy compared against the classic approach proposed in [4]. Experimental results indicate that the proposed hybrid model, which combines the kNN approach with the classic methodology introduced in [4], leads to better results.

Finally, the work in [16] introduces innovative ranking loss functions custom-made for RNNs applied in recommendation frameworks. The proposed model was evaluated on various datasets such as the RecSys 2015 dataset; the final outcomes indicate an increase in the system’s accuracy when training the model with novel ranking loss functions compared with the previously mentioned approach presented in [4].

## 3 Proposed Approach

The leading contribution of this study lies in the development of a novel deep learning model, capable of extracting abstract temporal dynamics from sparse session-based sequence data and then using that information to generate accurate recommendations.

The VLaReT model formulates the session-based recommendation challenge as a sequence-based prediction problem. Let us denote as \(\{ x_{i} \}_{i = 1}^{n}\) a user session, where \(x_{i}\) is the *i*th clicked item; then, we formulate session-based recommendation as the problem of predicting the score vector \(y_{i + 1} = [y_{i + 1,j} ]_{j = 1}^{m}\) of the \(m\) items available to users, where \(y_{i + 1,j} \in \mathbb{R}\) is the predicted score of the *j*th item. We wish to recommend more than one item at a time; therefore, at each time point we select the *top*-*k* items to present back to the user. The core inferential engine we develop in this work is a novel deep learning model for predicting the vector \(y_{i+1}\).
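The top-*k* selection step described above can be illustrated with a minimal sketch. The function name `top_k_items` and the example score vector are hypothetical and serve only as an illustration; they are not part of the original implementation.

```python
import heapq


def top_k_items(scores, k):
    """Return the indices of the k highest-scoring items, best first."""
    # heapq.nlargest avoids sorting the full catalogue when k is much smaller than m.
    return heapq.nlargest(k, range(len(scores)), key=lambda j: scores[j])


# Hypothetical score vector y_{i+1} over m = 6 items.
y_next = [0.1, 2.3, -0.7, 1.9, 0.4, 2.0]
recommended = top_k_items(y_next, k=3)
```

Only the ordering of the scores matters at this stage, which is why the model can be trained with ranking losses rather than with calibrated probabilities.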

### 3.1 Methodological Background

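At each time point \(i\), the network is presented with the current clicked item \(x_i\), and then predicts a score vector for the next user action. The recurrent units’ activation vectors of the GRU-based network, \(h_i\), are updated at time \(i\) using the following formula:

\[ h_i = (1 - z_i) \cdot h_{i-1} + z_i \cdot \hat{h}_i \tag{1} \]

where \(\cdot\) denotes elementwise multiplication and \(z_i\) is the update gate output, which controls when and to what degree an update to a latent state of the recurrent units should be made; it is given by:

\[ z_i = \tau(W_z x_i + U_z h_{i-1}) \tag{2} \]

The candidate activation \(\hat{h}_i\) is computed as:

\[ \hat{h}_i = \tanh\big(W x_i + U (r_i \cdot h_{i-1})\big) \tag{3} \]

where \(\tau\) is the logistic sigmoid function. On the other hand, \(r_i\), which is given in Eq. (4), is the output of the reset gate of the GRU network; it decides when the internal memory of the GRU units must be reset. We have

\[ r_i = \tau(W_r x_i + U_r h_{i-1}) \tag{4} \]

\(W\), \(U\), \(W_z\), \(U_z\), \(W_r\) and \(U_r\) in the above-mentioned equations are trainable network parameters.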
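As an illustration of how the update gate, the reset gate and the trainable parameters interact, the following is a minimal pure-Python sketch of a single GRU update. It assumes the standard GRU convention used in [4], where the new state is \((1 - z_i) h_{i-1} + z_i \hat{h}_i\); the helper names and the dictionary layout of the parameters are illustrative assumptions.

```python
import math


def sigmoid(a):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + math.exp(-a))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, p):
    """One GRU update; p maps 'W', 'U', 'Wz', 'Uz', 'Wr', 'Ur' to matrices."""
    # Update gate: how much of the state is overwritten.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    # Reset gate: how much of the previous state feeds the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate activation.
    h_hat = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x), matvec(p["U"], gated))]
    # Interpolate between the previous state and the candidate.
    return [(1.0 - zi) * hp + zi * hh for zi, hp, hh in zip(z, h_prev, h_hat)]
```

With all weight matrices set to zero, both gates evaluate to 0.5 and the candidate to 0, so the state is simply halved; this is a convenient sanity check for the gating logic.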

### 3.2 Model Formulation

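We consider the activations \(h_i\) of the recurrent layer to be stochastic latent variables, over which we impose a Gaussian prior:

\[ p(h_i) = \mathcal{N}(h_i \mid 0, I) \tag{5} \]

where \(\mathcal{N}(\xi \mid \mu, \Sigma)\) is a multivariate Gaussian density with mean \(\mu\) and covariance matrix \(\Sigma\), and \(I\) is the identity matrix.

The sought variational posteriors \(q(h)\) take the form of Gaussians with means and isotropic covariance matrices parameterized via GRU networks as follows:

\[ q(h_i; \theta) = \mathcal{N}\big(h_i \mid \mu_\theta(x_i), \sigma^2_\theta(x_i) I\big) \tag{6} \]

Here, the means \(\mu_\theta(x_i)\) and the variances \(\sigma^2_\theta(x_i)\) are the outputs of a GRU network with parameter set \(\theta\); hence, we now have:

\[ [\mu_\theta(x_i), \log \sigma^2_\theta(x_i)] = (1 - z_i) \cdot [\mu_\theta(x_{i-1}), \log \sigma^2_\theta(x_{i-1})] + z_i \cdot \hat{h}_i \tag{7} \]

\[ z_i = \tau\big(W_z x_i + U_z [\mu_\theta(x_{i-1}), \log \sigma^2_\theta(x_{i-1})]\big) \tag{8} \]

\[ \hat{h}_i = \tanh\big(W x_i + U (r_i \cdot [\mu_\theta(x_{i-1}), \log \sigma^2_\theta(x_{i-1})])\big) \tag{9} \]

\[ r_i = \tau\big(W_r x_i + U_r [\mu_\theta(x_{i-1}), \log \sigma^2_\theta(x_{i-1})]\big) \tag{10} \]

where \([\xi, \zeta]\) denotes the concatenation of vectors \(\xi\) and \(\zeta\). The values of the hidden variables \(h_i\) are obtained by drawing samples from the inferred posterior density.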

Let us now turn to the output layer of the proposed model. According to the literature, item ranking [4, 17, 18] can be pointwise, pairwise or listwise. The proposed approach utilizes various ranking loss functions, such as the Bayesian Personalized Ranking (BPR) loss presented in [18], which is a pairwise ranking loss originally proposed for matrix factorization, as well as the cross-entropy loss function and the TOP1 function introduced in [4]. In general, pointwise ranking estimates the score of each item independently, while pairwise ranking compares the scores of pairs of a positive and a negative item, and enforces that the score of the positive item be higher than that of the negative one for all pairs. Listwise ranking uses the scores of all items and compares them to the best ordering.

Here, \(w_y\) are trainable parameters of the output layer of the model.
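To make the difference between the ranking losses mentioned above concrete, the sketch below implements them for a single positive item and a list of negative items. The BPR and TOP1 expressions follow the definitions commonly given in [4, 18]; the function names are illustrative, and this is not necessarily the exact form used in the authors' implementation.

```python
import math


def sigmoid(a):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + math.exp(-a))


def bpr_loss(pos_score, neg_scores):
    """Pairwise BPR: penalize negatives that score close to or above the positive."""
    return -sum(math.log(sigmoid(pos_score - n)) for n in neg_scores) / len(neg_scores)


def top1_loss(pos_score, neg_scores):
    """TOP1: relative-rank term plus a regularizer pushing negative scores to zero."""
    return sum(sigmoid(n - pos_score) + sigmoid(n * n) for n in neg_scores) / len(neg_scores)


def cross_entropy_loss(scores, target):
    """Softmax cross-entropy over all scored items (log-sum-exp for stability)."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[target]
```

All three losses decrease as the score of the positive item rises relative to the negatives; they differ in whether items are compared in pairs (BPR, TOP1) or normalized jointly (cross-entropy).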

### 3.3 Training Algorithm

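Training is performed by maximizing the evidence lower bound (ELBO) of the model over the parameter sets \(\theta\):

\[ \mathrm{ELBO}(\theta) = \sum_{i} \Big\{ -\mathrm{KL}\big[ q(h_i; \theta) \,\|\, p(h_i) \big] - \mathbb{E}_{q(h_i; \theta)}[L_S] \Big\} \tag{12} \]

where \(L_S\) is the selected ranking loss function, and \(\mathrm{KL}[q \| p]\) is the KL divergence between the distribution \(q\) and the distribution \(p\). For a Gaussian posterior \(q(h_i; \theta)\) with mean \(\mu_\theta(x_i)\) and isotropic covariance \(\sigma^2_\theta(x_i) I\), and a prior \(p(h_i) = \mathcal{N}(h_i \mid 0, I)\), it takes the closed form shown in the formula below, where \(D\) is the dimensionality of \(h_i\):

\[ \mathrm{KL}\big[ q(h_i; \theta) \,\|\, p(h_i) \big] = \frac{1}{2} \Big( D\, \sigma^2_\theta(x_i) + \| \mu_\theta(x_i) \|^2 - D - D \log \sigma^2_\theta(x_i) \Big) \]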
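As a small worked example, the KL term can be evaluated in closed form when the posterior is a Gaussian with isotropic covariance and the prior is a standard Gaussian. The standard Gaussian prior is an assumption made for this sketch, as is the function name.

```python
import math


def kl_isotropic_to_standard(mu, sigma2):
    """KL[N(mu, sigma2 * I) || N(0, I)] for a mean vector mu and scalar variance sigma2."""
    d = len(mu)
    squared_norm = sum(m * m for m in mu)
    return 0.5 * (d * sigma2 + squared_norm - d - d * math.log(sigma2))


# The divergence vanishes only when the posterior coincides with the prior.
kl_at_prior = kl_isotropic_to_standard([0.0, 0.0, 0.0], 1.0)
kl_shifted = kl_isotropic_to_standard([1.0, 0.0], 1.0)
```

Because this term is available analytically, only the expectation of the loss in the objective needs to be approximated by sampling.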

The challenge here is that the posterior expectation *E*[*L*_{ S }] cannot be computed analytically. This is due to the non-conjugate formulation of the proposed approach, which stems from its nonlinear assumptions, e.g., the fact that we employ nonlinear activation functions. As a result, training the entailed parameter sets *θ* is not directly possible. To resolve this problem, one has to resort to approximating the posterior expectation by drawing Monte Carlo (MC) [20] samples. However, such a naïve approximation suffers from unacceptably high variance, which would prevent the learning algorithm from converging to a good solution.

AVI deals with these issues by means of a smart re-parameterization of the MC samples of the postulated Gaussian posterior density [8]. Specifically, the drawn MC samples are now expressed as differentiable functions of the parameter sets *θ* and some random noise variable *ε*; thus, the problematic posterior expectation *E*[*L*_{ S }] is now computed by sampling over a low-variance random noise variable. Then, to perform inference by means of maximization of the ELBO (12), we can resort to an off-the-shelf stochastic gradient descent algorithm. Specifically, in this work we use Adagrad as the stochastic gradient algorithm of choice, following the suggestions of [21].
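The re-parameterization described above can be sketched as follows for a Gaussian posterior: the sample is written as the mean plus the standard deviation times standard normal noise, so that it is a deterministic, differentiable function of the posterior parameters. Parameterizing the variance through its logarithm is an assumption of this sketch, as are the names used.

```python
import math
import random


def sample_latent(mu, log_sigma2, rng):
    """Draw h = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    # All randomness lives in eps; mu and log_sigma2 enter deterministically.
    return [m + math.exp(0.5 * ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma2)]


rng = random.Random(0)
h = sample_latent([0.5, -1.0], [0.0, -2.0], rng)
```

In an automatic differentiation framework, gradients with respect to the mean and the log-variance flow through this expression, which is what allows a standard stochastic gradient method to be applied to the objective.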

## 4 Experimental Evaluation

### 4.1 VLaReT Model Configuration

The proposed model was implemented and trained in Theano [22] on an Intel Xeon 2.5 GHz Quad-Core server with 64 GB RAM and an NVIDIA Tesla K40 GPU accelerator. The model was evaluated using the RecSys Challenge 2015 dataset, which was split into training and test sets following the same procedure as in [4]. The training dataset comprises 7,966,257 sessions of 31,637,239 clicks on 37,483 items, and the test dataset contains 15,324 sessions of 71,222 click actions on the same items.

To experimentally evaluate our model, we utilize a variety of loss functions to perform its training; these include BPR, cross-entropy, and TOP1. Moreover, to implement Adagrad in the context of our approach, we perform session-parallel mini-batch training, and apply dropout at each time step in order to reduce overfitting [23]. VLaReT is trained for a specific number of epochs in order to minimize the loss, while avoiding randomization of the order of sessions in each epoch. The latent state is reset to zero when a session is completed. We use Adaptive Normalization to transform the time series into data sequences, as suggested in [24]. Finally, during training, Nesterov momentum [25] is applied; parameter initialization is performed using the Glorot uniform technique [26].
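Session-parallel mini-batch training, as described in [4], can be sketched as follows: each slot of the mini-batch walks through one session, and when that session ends the slot is refilled with the next unused session and flagged so that its latent state can be reset to zero. The generator below is a simplified illustration under these assumptions; its name and the returned triple are hypothetical.

```python
def session_parallel_batches(sessions, batch_size):
    """Yield (inputs, targets, reset) triples; reset[b] marks a slot starting a new session."""
    # Only sessions with at least two clicks provide an (input, target) pair.
    usable = [s for s in sessions if len(s) >= 2]
    if len(usable) < batch_size:
        return
    slots = list(range(batch_size))   # session currently held by each slot
    pos = [0] * batch_size            # position of the current input click
    next_session = batch_size
    reset = [True] * batch_size
    while True:
        inputs = [usable[s][p] for s, p in zip(slots, pos)]
        targets = [usable[s][p + 1] for s, p in zip(slots, pos)]
        yield inputs, targets, reset
        reset = [False] * batch_size
        for b in range(batch_size):
            pos[b] += 1
            if pos[b] + 1 >= len(usable[slots[b]]):   # session exhausted
                if next_session >= len(usable):
                    return                            # nothing left to refill the slot
                slots[b], pos[b] = next_session, 0
                next_session += 1
                reset[b] = True


batches = list(session_parallel_batches([[1, 2, 3], [4, 5], [6, 7, 8], [9, 10]], 2))
```

In this simplified version, iteration stops as soon as a slot cannot be refilled, so a few trailing clicks of the remaining sessions may be left unused.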

According to [4], computing a score for every item in the available list would limit the scalability of the training algorithm of our approach. To alleviate this computational burden, it is essential to sample the output and calculate the score only for a small subset of items. Specifically, for each output, we compute the scores of the desired item and of some negative samples, and adjust the weights so that the desired output is highly ranked; items are sampled in proportion to their popularity. Our model uses the items from the other training examples of the mini-batch as negative examples. The benefit of this setup is that we can reduce the computational time by omitting an explicit sampling step; hence, matrix operations become quicker and can scale to large datasets. As also pointed out in [4], this approach is essentially a form of popularity-based sampling, since the likelihood of an item being in the other training examples of the mini-batch is proportional to its popularity. Finally, our methodology uses only single-layer recurrent GRUs; this is motivated by the related findings presented in [4, 5], which show that adding further layers does not improve the performance of the RNN model in the context of session-based recommendation.
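The use of the other examples of the mini-batch as negative samples can be illustrated with a short sketch. It assumes a square score matrix in which row *b* holds the scores that example *b* assigns to the target items of all examples in the mini-batch, and it uses the BPR loss for concreteness; the function name is hypothetical.

```python
import math


def in_batch_bpr(scores):
    """Mean BPR loss where scores[b][c] is the score example b gives to the target of example c.

    Diagonal entries are the positives; off-diagonal entries act as negatives.
    """
    batch_size = len(scores)
    total = 0.0
    for b in range(batch_size):
        pos = scores[b][b]
        negs = [scores[b][c] for c in range(batch_size) if c != b]
        pair_losses = [-math.log(1.0 / (1.0 + math.exp(-(pos - n)))) for n in negs]
        total += sum(pair_losses) / len(negs)
    return total / batch_size


uninformed = in_batch_bpr([[0.0] * 3 for _ in range(3)])
confident = in_batch_bpr([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]])
```

One caveat of this scheme is that two examples in the same mini-batch may share a target item, in which case a positive is also treated as a negative; the sketch does not handle this case specially.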

### 4.2 Performance Metrics

The accuracy of the obtained recommendations was evaluated using the same evaluation metrics as the ones presented in [4]. Recall@20 is the main evaluation metric employed. It expresses the proportion of test cases where the desired item is among the *top*-*20* recommended items; it does not take into consideration the actual rank of the item. The second metric used is MRR@20 (Mean Reciprocal Rank), which is the average of the reciprocal ranks of the desired items; the reciprocal rank is set to zero if the rank is above 20. This metric takes into consideration the position of the item, which is crucial in cases where the rank of a recommendation matters to the system’s users.
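The two metrics can be computed as in the following minimal sketch. It assumes that each test case comes with a full score list and the index of the desired item, and that ties are counted against the desired item; these conventions and the function names are illustrative assumptions.

```python
def rank_of(scores, target):
    """1-based rank of the target item, counting ties against the target."""
    return 1 + sum(1 for j, s in enumerate(scores) if j != target and s >= scores[target])


def recall_and_mrr_at_k(score_lists, targets, k=20):
    """Return (Recall@k, MRR@k) over a collection of test cases."""
    hits, reciprocal_sum = 0, 0.0
    for scores, target in zip(score_lists, targets):
        rank = rank_of(scores, target)
        if rank <= k:                 # outside the top-k contributes zero to both metrics
            hits += 1
            reciprocal_sum += 1.0 / rank
    n = len(targets)
    return hits / n, reciprocal_sum / n


recall, mrr = recall_and_mrr_at_k([[0.9, 0.1, 0.5], [0.2, 0.8, 0.4]], [2, 0], k=2)
```

In this toy case the first desired item is ranked second and the second one falls outside the top-2, so Recall@2 is 0.5 and MRR@2 is 0.25.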

### 4.3 Considering Various Loss Functions

Best parameterization settings

Loss function | BPR | Cross-entropy | TOP1 |
---|---|---|---|
No. of latent units | 750 | 1000 | 1500 |
Step size | 0.1 | 0.1 | 0.05 |
Momentum | 0.3 | 0 | 0 |
Recall@20 | 0.7971 | 0.6250 | 0.6507 |
MRR@20 | 0.7845 | 0.2727 | 0.3527 |

Best performance of VLaReT when utilizing various loss functions

Method | Recall@20 | MRR@20 |
---|---|---|
BPR | 0.7971 | 0.7845 |
TOP1 | 0.6250 | 0.2727 |
Cross-entropy | 0.6507 | 0.3527 |

### 4.4 VLaReT-BPR Versus Baselines

In this section, the best performing configuration of our approach, namely the VLaReT-BPR model variant, is compared against the best (baseline) algorithms presented in [4, 5, 15, 16].

Comparison of the VLaReT-BPR model against various baseline algorithms

Method | Recall@20 | MRR@20 |
---|---|---|
GRU w/BPR Loss | 0.6322 | 0.2467 |
GRU w/TOP1 Loss | 0.6206 | 0.2693 |
M2 | 0.7129 | 0.3091 |
M4 | 0.6676 | 0.2847 |
WH-1 | 0.6910 | 0.2650 |
WH-2 | 0.6660 | 0.2760 |
GRU-SAMP1 | 0.7112 | 0.3059 |
GRU-SAMP2 | 0.7102 | 0.3107 |
VLaReT-BPR | 0.7971 | 0.7845 |

### 4.5 Adjusting the Size of Hidden Units

Accuracy of the VLaReT-BPR with different hidden units

No. of hidden units | Recall@20 | MRR@20 |
---|---|---|
100 | 0.5756 | 0.2127 |
500 | 0.7354 | 0.6793 |
750 | 0.7971 | 0.7845 |
900 | 0.7801 | 0.7563 |
1000 | 0.7760 | 0.7443 |
1250 | 0.7712 | 0.7318 |
1500 | 0.7629 | 0.7271 |
2000 | 0.7326 | 0.6737 |

## 5 Conclusions

This work introduced a model that couples deep learning approaches with Variational Bayes, to tackle the increased complexity that exists in RS when using session-based datasets and, at the same time, to deal with data sparsity. The proposed model, called VLaReT, augments the benefits of RNN-driven session-based recommendation by utilizing variational inference for scalable inference under uncertainty. Indeed, as we theoretically explained and experimentally showed, combining a Bayesian inference technique with RNNs that use GRU layers makes it possible to analyze the temporal patterns that exist in sequence-based data, and to deal with uncertainty in sparse data when producing recommendations. Evaluation was performed utilizing various setups on a real-world benchmark dataset. The final outcomes indicate that the proposed model using a BPR loss function outperforms the state-of-the-art approaches considered, on both Recall@20 and MRR@20. Future work will focus on validating our methodology on datasets with longer sessions and on using additional samples.

## References

1. Konstan, J.A., Riedl, J.: Recommender systems: from algorithms to user experience. User Model. User-Adap. Inter. **22**(1–2), 101–123 (2012)
2. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: Proceedings of the 10th International Conference on World Wide Web, pp. 285–295 (2001)
3. Ning, X., Desrosiers, C., Karypis, G.: A comprehensive survey of neighborhood-based recommendation methods. In: Recommender Systems Handbook, pp. 37–76. Springer US (2015)
4. Hidasi, B., Karatzoglou, A., Baltrunas, L., Tikk, D.: Session-based recommendations with recurrent neural networks. CoRR, abs/1511.06939 (2015)
5. Tan, Y.K., Xu, X., Liu, Y.: Improved recurrent neural networks for session-based recommendations. In: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 17–22 (2016)
6. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Berg, A.C.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. **115**(3), 211–252 (2015)
7. Chatzis, S.P.: A coupled Indian Buffet process model for collaborative filtering. In: Journal of Machine Learning Research: Workshop and Conference Proceedings, vol. 25: ACML 2012, pp. 65–79 (2012)
8. Kingma, D., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of ICLR’14 (2014)
9. Koren, Y., Bell, R.M., Volinsky, C.: Matrix factorization techniques for recommender systems. IEEE Comput. **42**(8), 30–37 (2009)
10. Shani, G., Brafman, R.I., Heckerman, D.: An MDP-based recommender system. In: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 453–460 (2002)
11. Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A.C., Bengio, Y.: A recurrent latent variable model for sequential data. In: Advances in Neural Information Processing Systems, pp. 2980–2988 (2015)
12. Wang, H., Wang, N., Yeung, D.Y.: Collaborative deep learning for recommender systems. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 1235–1244 (2015)
13. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning, pp. 791–798 (2007)
14. Zhang, Y., Dai, H., Xu, C., Feng, J., Wang, T., Bian, J., Wang, B., Liu, T.Y.: Sequential click prediction for sponsored search with recurrent neural networks. arXiv preprint arXiv:1404.5772 (2014)
15. Jannach, D., Ludewig, M.: When recurrent neural networks meet the neighborhood for session-based recommendation. In: Proceedings of RecSys ’17 (2017)
16. Hidasi, B., Karatzoglou, A.: Recurrent neural networks with top-k gains for session-based recommendations. arXiv preprint arXiv:1706.03847 (2017)
17. Steck, H.: Gaussian ranking by matrix factorization. In: Proceedings of the 9th ACM Conference on Recommender Systems, pp. 115–122 (2015)
18. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 452–461 (2009)
19. Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An introduction to variational methods for graphical models. In: Jordan, M.I. (ed.) Learning in Graphical Models, pp. 105–162. Kluwer, Dordrecht (1998)
20. Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using Markov Chain Monte Carlo. In: Proceedings of ICML’11 (2011)
21. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. **12**, 2121–2159 (2011)
22. Team, T.T.D., Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., … Belopolsky, A.: Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688 (2016)
23. Gal, Y., Ghahramani, Z.: A theoretically grounded application of dropout in recurrent neural networks. In: Advances in Neural Information Processing Systems, pp. 1019–1027 (2016)
24. Ogasawara, E., Martinez, L.C., De Oliveira, D., Zimbrão, G., Pappa, G.L., Mattoso, M.: Adaptive normalization: a novel data normalization approach for non-stationary time series. In: The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2010)
25. Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Networks **12**(1), 145–151 (1999)
26. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of AISTATS (2010)