Deep variational models for collaborative filtering-based recommender systems

Deep learning provides accurate collaborative filtering models to improve recommender system results. Deep matrix factorization and the related neural collaborative filtering networks are the state of the art in the field; nevertheless, both models lack the necessary stochasticity to create the robust, continuous, and structured latent spaces that variational autoencoders exhibit. On the other hand, data augmentation through variational autoencoders does not provide accurate results in the collaborative filtering field due to the high sparsity of recommender systems. Our proposed models apply the variational concept to inject stochasticity into the latent space of the deep architecture, introducing the variational technique to the neural collaborative filtering field. This method does not depend on the particular model used to generate the latent representation, so it can be applied as a plugin to any current or future model. The proposed models have been tested using four representative open datasets, three different quality measures, and state-of-the-art baselines. The results show the superiority of the proposed approach in scenarios where the variational enrichment exceeds the effect of the injected noise. Additionally, a framework is provided to enable the reproducibility of the conducted experiments.


Introduction
Recommender Systems (RSs) is an artificial intelligence field that provides methods and models to predict and recommend items to users (e.g., films to persons, e-commerce products to customers, services to companies, Quality of Service (QoS) to Internet of Things (IoT) devices, etc.) (Beel et al., 2013). Popular current RSs include Spotify, Netflix, TripAdvisor, and Amazon. RSs are usually categorized according to their filtering strategy, mainly demographic (Bobadilla et al., 2021), content-based (Deldjoo et al., 2020), context-aware (Kulkarni and Rodd, 2020), social (Shokeen and Rana, 2020), Collaborative Filtering (CF) (Bobadilla et al., 2020a; Beel et al., 2013) and filtering ensembles (Forouzandeh et al., 2021; Çano and Morisio, 2017). CF is the most accurate and widely used filtering approach to implement RSs. CF models have evolved from the K-Nearest Neighbors (KNN) algorithm to Probabilistic Matrix Factorization (PMF) (Mnih and Salakhutdinov, 2007), non-Negative Matrix Factorization (NMF) (Févotte and Idier, 2011) and Bayesian non-Negative Matrix Factorization (BNMF) (Hernando et al., 2016). Currently, deep learning research approaches are growing in strength: they improve accuracy compared to the Machine Learning (ML)-based Matrix Factorization (MF) models (Rendle et al., 2020). Additionally, deep learning architectures are usually more flexible than MF-based ones, introducing combined deep and shallow learning (He et al., 2017), integrated content-based ensembles (Narang and Taneja, 2018), and generative approaches (Bobadilla et al., 2020b; Gao et al., 2021), among others.
Deep Matrix Factorization (DeepMF) (Xue et al., 2017) is a neural network model that implements the popular MF concept. DeepMF was designed to take as input a user-item matrix with explicit ratings and non-preference implicit feedback, although current implementations use two embedding layers whose inputs are, respectively, users and items. The experimental results evidence the superiority of DeepMF over traditional ML-based RS approaches, particularly the most used MF models: PMF, NMF, and BNMF. Currently, DeepMF is a popular model that is rapidly replacing the traditional MF models based on classical ML. Additionally, DeepMF has been used in the RS field to combine social behaviors (clicks, ratings, etc.) with images (Wen et al., 2018), and a social trust-aware RS has been implemented by using DeepMF to extract features from the user-item rating matrix to improve initialization accuracy (Wan et al., 2020). QoS predictions have also been addressed by using DeepMF (Zou et al., 2020). To learn attribute representations, a DeepMF model has been used that creates a low-dimensional representation of a dataset that lends itself to a clustering interpretation (Trigeorgis et al., 2016). Finally, the classical matrix completion task has been addressed by using the DeepMF approach (Fan and Cheng, 2018).
The not so widely spread Neural Collaborative Filtering (NCF) model (He et al., 2017) may be seen as an augmented DeepMF model, where deeper layers are added on top of the 'Dot' one. Additionally, the 'Dot' layer can be replaced by a 'Concatenate' layer. Figure 1 shows these concepts. NCF slightly outperforms the DeepMF accuracy results, but it increases the runtime required to train the model and to run the forward process: it is necessary to execute the 'extra' Multi-Layer Perceptron (MLP) on top of the 'Dot' or 'Concatenate' layers. Moreover, compared to DeepMF, the NCF architecture adds new hyper-parameters to set: mainly the number of hidden layers (depth) and their size (number of neurons in each layer) of the MLP architecture. In a different setting, Variational AutoEncoders (VAEs) act like regular autoencoders: they aim to compress the input raw values into a latent space representation by means of an encoder neural network, whereas the decoder neural network performs the opposite operation, seeking to decompress from the latent space to the output raw values. The main difference between classical autoencoders and VAEs is the latent space design, meaning, and operation. Classical autoencoders do not generate structured latent spaces, whereas VAEs introduce a statistical process that forces them to learn continuous and structured latent spaces. In this way, VAEs turn the samples into parameters of a statistical distribution, usually the means and variances of a Gaussian distribution. This concept is illustrated in Figure 2. From the parameters of the multivariate distribution, a random sample is drawn and a latent space sample is obtained for each training input (center of fig. 2). The stochasticity of the random sampling improves the robustness and forces the encoding of continuous and meaningful latent space representations, as can be seen in fig. 3, which shows the difference between a regular autoencoder latent space representation and its equivalent VAE one.
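The statistical process behind a VAE's latent space is usually implemented with the so-called reparameterization trick. The following is an illustrative sketch (not the paper's code) of how a latent sample is drawn from learned mean and log-variance vectors:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    # Writing the sample this way keeps the stochastic node differentiable
    # with respect to mu and log_var, so a VAE encoder can be trained by
    # backpropagation despite the random sampling.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(42)
mu = np.zeros(4)        # mean of the latent Gaussian (illustrative values)
log_var = np.zeros(4)   # log-variance; exp(0) = 1, i.e. unit variance
z = reparameterize(mu, log_var, rng)  # one stochastic latent sample
```

Each forward pass draws a fresh `eps`, which is exactly the stochasticity that spreads the samples over the latent space illustrated in figs. 2 and 3.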
Due to their properties, VAEs have been used as generative deep learning models in the image processing field. Reconstruction of a multispectral image has been performed by means of a VAE (Liu et al., 2020a) that parameterizes the latent space with Gaussian distribution parameters. VAEs have also been used to create super-resolution images, as in Liu et al. (2020c), where a model is proposed to encode low-resolution images into a dense latent space vector that can be decoded into the target high-resolution image. The blurred image problem is tackled with a VAE in Liu et al. (2020b) by adding a conditional sampling mechanism that narrows down the latent space, making it possible to reconstruct high-resolution images. Moreover, in Zhang et al. (2021), the authors propose a flexible autoencoder model able to adapt to data patterns that vary with time. By importing the VAE concept from image processing, several papers have used these models to improve RS results. For instance, denoising and variational autoencoders are tested in Liang et al. (2018), where the authors report the superiority of the VAE option against other models, or in Nisha and Mohan (2019), where variational autoencoders are combined with social information to improve the quality of the recommendations.
The aim of this paper is to propose a neural architecture that joins the best of the DeepMF and NCF models with the VAE concept. These novel models will be called, respectively, Variational Deep Matrix Factorization (VDeepMF) and Variational Neural Collaborative Filtering (VNCF). In contrast with the autoencoder and Generative Adversarial Network (GAN) approaches in the CF field (Gao et al., 2021), we shall not use the generative decoder stage, and we maintain the regression output layer present in the DeepMF and NCF models. The main advantage of the VAE operation is the robustness that it confers to the latent representation. This robustness can be seen by observing fig. 3. If we consider each dot drawn as a training sample representation in the latent space, then test samples are more likely to be correctly classified by the VAE model (right graph in fig. 3) than by the regular autoencoder model (left graph in fig. 3). In short, the variational approach stochastically 'spreads' the samples in the latent space, improving the chances of correctly classifying unseen samples.
In our proposed RS CF scenario, we expect that rating values can be better predicted when a variational latent space has been learnt, because this space covers a wider, more robust, and more representative latent area. Whereas with a traditional autoencoder each sample would be coded as a single value in the latent space (white circle in fig. 4), the VAE encodes the parameters of a multivariate distribution (e.g., the mean and variance of both the blue and the orange Gaussian distributions in fig. 4). From the learnt distribution parameters, random sampling is carried out to generate stochastic latent space values (gray circles in fig. 4). Each epoch in the learning process generates a new set of latent space values. Once the proposed model has been trained, when a (user, item) tuple is presented to the model, the obtained latent space value (green circle in fig. 4) can be better predicted in the VAE scenario than in the regular autoencoder scenario: the randomly sampled values (gray circles) of the enriched latent space will help to associate the predicted sample (green circle) with its associated training samples (white circle), making the prediction process much more robust and accurate.
Figure 4: Latent space representation of the proposed variational model. From the learnt means and variances of the multivariate Gaussian distribution, a random sampling process is run to spread the latent space sample values (gray circles) that will help to accurately predict the unknown sample rating values (green circle).
Current CF-based variational autoencoders usually produce raw augmented data: mainly synthetic ratings from users to items, or generated relevant versus not relevant votes from users to items (Liang et al., 2018; Gao et al., 2021). This strategy forces us to sequentially run two separate models: the generative model (GAN or VAE) that provides augmented data, and the regression CF model that makes predictions and recommendations. This approach presents three main drawbacks: 1) complexity, as two separate models are necessary, 2) large time consumption, and 3) sparsity management. As we will explain in depth in the following section, our proposed model does not generate raw augmented data. On the contrary, its innovation is based on the use of a single model to internally manage both the augmentation and prediction aims. Particularly significant is the way in which the proposed model addresses the sparsity problem: we do not perform augmentation on the sparse raw data (ratings cast by users to items), but an internal 'augmentation' process in the dense latent space of the model (figs. 3 and 4). Each sample that is randomly generated from the latent space feeds the model's regression layers. Thereby, we propose a model that first generates stochastic variational samples in a dense latent space, and then these generated samples act as inputs to the regression stage of the model.
To test these ideas, the hypothesis considered in this paper is that the augmented samples will be more accurate and effective if they are generated in an inner and dense latent space rather than in a very sparse input space. It is important to realize that enriching the inner latent space can improve the recommendation results, but it also injects noise into the latent space that may potentially worsen the results. It is expected that the proposed approach will work better with poor latent spaces, whereas when it is applied to rich spaces, the spurious entropy added by the variational stage could worsen recommendations. Thus, medium-size CF datasets, or large and complex ones, are better candidates to improve their results when the variational proposal is applied, whereas large datasets with predictable data distributions will probably not benefit from the noise injection of the variational architecture.
The rest of the paper has been structured as follows. In section 2, the proposed model is explained. Section 3 shows the experiments' design, results and their discussion. Finally, section 4 contains the main conclusions of the paper and future work.

Proposed model
The proposed neural architecture mixes the VAE and the DeepMF (or NCF) models. From the VAE we take the encoder stage and its variational process, and from the DeepMF or NCF model we use its regression layers. This is an innovative approach in the RS field, since VAE and GAN neural networks have only been used as a separate stage to perform data augmentation, i.e., to obtain enriched input datasets to feed the CF DeepMF or NCF models. Hence, the traditional approach needs to train two models separately: first the VAE and then the DeepMF/NCF networks.
In sharp contrast, our proposed approach efficiently joins the VAE and the deep CF regression concepts to obtain improved predictions with a single training process. In the learning stage, the training samples feed the model (left hand side of fig. 5). Each training sample consists of the tuple ⟨user, item, rating⟩ (the rating cast by the user to the item). In the DeepMF/NCF architecture, each user is represented by his/her vector of voted ratings, and each item is represented by its vector of received ratings. The model learns the ratings (third element in the tuples) cast by the users to the items (first and second elements in the tuples). In other words, the ratings are the outputs of the neural network (right hand side of fig. 5).
Figure 5: Proposed VDeepMF/VNCF approach. CF samples are encoded in the latent space by means of a variational process and then predictions are obtained by using a regression neural network.

To fix the notation, let us suppose that our dataset contains U users and I items. In general, the aim of any deep learning model for CF-based prediction is to train a (stochastic) neural network that implements a function h : R^U × R^I → R.

Formalization of the model
This function h operates as follows. Let us codify the u-th user of the dataset (resp. the i-th item) using one-hot encoding as the u-th canonical basis vector e_u (resp. the i-th canonical basis vector e_i). Then, h(e_u, e_i) ∈ R seeks to predict the score that the u-th user would assign to the i-th item. To train this function h, in the learning phase the neural network is fed with a set of training tuples ⟨u, i, r⟩ of user u having rated item i with score r, and the function h is trained to fit h(e_u, e_i) = r.
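As a minimal illustration of this setup (with hypothetical toy indices and ratings, not the paper's data), the one-hot encoding and the training tuples can be sketched as:

```python
import numpy as np

def one_hot(index, size):
    # Canonical basis vector e_index of R^size.
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Toy training tuples <u, i, r>: user u rated item i with score r.
ratings = [(0, 2, 4.0), (1, 0, 3.0), (2, 1, 5.0)]
num_users, num_items = 3, 3

# The network h is trained so that h(e_u, e_i) fits r for every tuple.
for u, i, r in ratings:
    e_u, e_i = one_hot(u, num_users), one_hot(i, num_items)
```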
Our proposal for the VDeepMF consists of decomposing h as a composition of an 'Embedding' layer, followed by a 'Variational' stage and a final 'Dot' layer, as shown in fig. 6. The first 'Embedding' layer (left hand side of fig. 6) is borrowed from the natural language processing field (He et al., 2017). The idea is that this layer provides a fast translation of users and items into their respective representations in the latent spaces. To be precise, this layer implements a function Embedding that maps a pair (e_u, e_i) into a pair of dense vectors (v_u, w_i) ∈ R^L × R^L that represents the u-th user and the i-th item, where L > 0 is the dimension of the representations.
It is worth mentioning that, even though from a conceptual point of view the 'Embedding' layer is a regular MLP dense layer, to save time and space these 'Embedding' layers are typically implemented through lookup tables. In this way, instead of feeding the network with the one-hot encoding of user u (resp. item i), we input it via its user ID (resp. item ID). The lookup table efficiently recovers the u-th (resp. i-th) column of the embedding matrix that contains v_u (resp. w_i), so that the translation can be conducted more efficiently than with a standard MLP layer by exploiting the sparsity of the input.
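The equivalence between the dense-layer view and the lookup-table view can be checked with a small sketch (random, illustrative embedding matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, L = 5, 3
E = rng.standard_normal((num_users, L))  # embedding matrix (illustrative)

u = 2
e_u = np.zeros(num_users)
e_u[u] = 1.0  # one-hot encoding of user u

# Dense-layer view: multiply the one-hot vector by the embedding matrix.
v_dense = e_u @ E
# Lookup-table view: simply read row u of the matrix.
v_lookup = E[u]

# Both views yield the same latent vector v_u, but the lookup avoids
# the sparse matrix product entirely.
assert np.allclose(v_dense, v_lookup)
```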
Figure 6: Proposed VDeepMF architecture. The VNCF architecture has identical 'Embedding' and 'Variational' layers to the VDeepMF one; it just replaces the 'Dot' layer with a 'Concatenate' layer, followed by an MLP.
The variational process is carried out by the 'Variational' stage (labeled as 'variational layers' in the middle of fig. 6). From the latent space representation (v_u, w_i) ∈ R^L × R^L of the u-th user and the i-th item, two separate dense layers return the mean and variance parameters of two Gaussian multivariate distributions. In this way, if we fix a latent space dimension K > 0, the first part of this 'Variational' stage (left part of the middle rectangle of fig. 6) computes a map

(v_u, w_i) ↦ (μ_1(v_u), σ_1^2(v_u), μ_2(w_i), σ_2^2(w_i)),

where μ_1(v_u), μ_2(w_i) represent the means of the Gaussian distributions associated to the user and the item respectively, and σ_1^2(v_u), σ_2^2(w_i) their variances. Thus, the output of the 'Variational' stage (right part of the middle rectangle of fig. 6) is a pair of random vectors (P, Q) with

P ~ N(μ_1(v_u), σ_1^2(v_u) I_K),    Q ~ N(μ_2(w_i), σ_2^2(w_i) I_K).

Here, N(μ, Σ) denotes a K-dimensional multivariate normal distribution with mean vector μ and diagonal covariance matrix Σ (I_K is the K × K identity matrix, so our covariance matrix is always diagonal). Each time a sample is drawn, the 'Variational' stage thus returns a pair (p, q) ∈ R^K × R^K, which represents the stochastic latent representations associated to (v_u, w_i).
The final 'Dot' layer (labeled as 'regression layer' at the right hand side of fig. 6) in the VDeepMF model is very simple: it is a linear layer that computes the dot product of the latent vectors p and q, that is, Dot(p, q) = p · q.
In the case of VNCF, this simple layer is replaced by a fully connected MLP that extracts non-linear relations from p and q.
Therefore, summarizing the process, the proposed VDeepMF model computes h(e_u, e_i) = P · Q, with P and Q the random vectors defined above. This is a random variable that, when sampled, returns a value that should be interpreted as the rating predicted by h for user u regarding item i.

Implementation of the model
The model described in Section 2.1 has been implemented in Keras (Chollet et al., 2015), a widely used Python library for deep learning and neural computing. For the sake of reproducibility, the code framework that implements the architecture shown in fig. 6 (both in its VDeepMF and VNCF versions) and the experiments explained in the next section is available at the GitHub repository1. Additionally, as an example, Listing 1 shows the source code of the proposed VDeepMF kernel: lines 8 to 13 implement the user side of the fig. 6 architecture, whereas lines 15 to 20 do the same job on the item side. Please note the use of the Keras Embedding layers in lines 9 and 16. Lines 10-12 and 17-19 carry out the 'Variational' stage. In particular, both the user and the item Lambda layers (lines 12 and 19) run the variational process. They use the sampling function (lines 3 to 6) to combine the mean and variance latent values, which makes use of the Keras backend random_normal procedure to implement the stochasticity (line 5). Finally, the latent values of users and items are combined by means of the 'Dot' layer (line 22) to produce the final output.

Listing 1: VDeepMF kernel code.

Empirical evaluation
In this section, we describe the empirical experiments carried out to evaluate the performance of the variational approach in the DeepMF and NCF models.

Experimental setup
The experimental evaluation has been performed over four different datasets to measure the performance of the proposed method in different environments. The selected datasets are: FilmTrust (Guo et al., 2013), a small dataset that contains the ratings of thousands of users to movies; MovieLens 1M (Harper and Konstan, 2015), the gold standard dataset in CF-based RS; MyAnimeList (Azathoth, 2018), a dataset extracted from Kaggle2 that contains the ratings of thousands of users to anime comics; and Netflix (Bennett et al., 2007), a popular dataset with hundreds of millions of ratings used in the Netflix Prize competition. Table 1 shows the main parameters of these datasets. The corpus of these datasets has been randomly split into training ratings (80% of the ratings) and test ratings (20% of the ratings).
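The random 80/20 split described above can be sketched as follows (an illustrative helper, not the framework's exact code):

```python
import numpy as np

def train_test_split(ratings, test_fraction=0.2, seed=0):
    # Randomly shuffle the rating tuples and cut the list so that
    # (1 - test_fraction) of them form the training set and the rest
    # the test set.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(ratings))
    cut = int(len(ratings) * (1 - test_fraction))
    train = [ratings[j] for j in idx[:cut]]
    test = [ratings[j] for j in idx[cut:]]
    return train, test
```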

The evaluation of the proposed method has been analyzed from three different points of view: the quality of the predictions, the quality of the recommendations, and the quality of the recommendation lists.
Due to the stochastic nature of the variational embedded space of the proposed method, the test predictions used to evaluate the proposed method have been computed as the average of 10 predictions performed for each pair of user u and item i.
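This averaging step can be sketched as follows, where `stochastic_predict` is a hypothetical stand-in for one stochastic forward pass of the trained model:

```python
import numpy as np

def stochastic_predict(u, i, rng):
    # Hypothetical stand-in for one stochastic forward pass: a real model
    # would sample the variational latent space and apply the regression head.
    return 3.5 + 0.1 * rng.standard_normal()

def predict(u, i, rng, n_samples=10):
    # Average several stochastic predictions for the pair (u, i),
    # as done in the evaluation protocol.
    return float(np.mean([stochastic_predict(u, i, rng)
                          for _ in range(n_samples)]))
```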

Experimental results
Table 2 includes the quality of the predictions performed by the proposed model. Best values for each dataset are highlighted in bold. Table 2a contains the MAE (eq. (1)), table 2b contains the MSE (eq. (2)), and table 2c contains the R^2 score (eq. (3)). We can observe that the proposed variational approach improves the prediction capability of DeepMF in all datasets except Netflix, and reports worse predictions when it is applied to NCF.
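The three prediction-quality measures follow their standard definitions (the paper's eqs. (1)-(3) are not reproduced in this excerpt); a sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean Absolute Error: average absolute deviation of predictions.
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true, y_pred):
    # Mean Squared Error: average squared deviation of predictions.
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```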
We justify these results by taking into account the features of the deep learning models used and the properties of each dataset. On the one hand, the larger the size of the dataset, the less necessary it is to enrich the votes with the proposed variational approach. In other words, when the dataset is small, the amount of Shannon entropy (Shannon and Weaver, 1949) that it contains might be quite limited. By using a variational method to generate new samples, we add some extra entropy that enriches the dataset, giving the regression part the chance to exploit this extra data. However, large datasets usually present a large entropy, in such a way that the regression models can effectively extract very subtle information from them. In this setting, if we add a variational stage, instead of adding relevant new variability to the dataset, we only add noise that muddies the underlying patterns. For this reason, the variational approach is of no benefit in huge datasets like Netflix.
On the other hand, the NCF model is more complex than the DeepMF one, so data enrichment has less impact for complex models that are able to find more sophisticated relationships between data than simpler models. In fact, based on these results, we can assert that including the variational approach in a simple model such as DeepMF is equivalent to using a more complex model such as NCF.

Furthermore, fig. 7 contains the precision and recall results. In FilmTrust (fig. 7a) we can observe that the proposed variational approach reports a huge benefit for the DeepMF model and significantly worsens the results of the NCF model. In MovieLens (fig. 7b) and MyAnimeList (fig. 7c) the same tendency as in FilmTrust is observed but, in this case, the proposed VDeepMF model is the one that computes the best recommendations for these datasets. In Netflix (fig. 7d) the proposed variational approach decreases the quality of the recommendations. These results are consistent with those analyzed when measuring the quality of the predictions. Consequently, it is evident that the proposed variational approach works adequately when the dataset is not too large and the model used is not too complex.
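Precision and recall over top-N recommendation lists follow their standard definitions; a minimal sketch (hypothetical item IDs, not the paper's code):

```python
def precision_recall_at_n(recommended, relevant, n):
    # recommended: ranked list of item IDs returned by the model.
    # relevant: collection of item IDs the user actually considers relevant.
    top_n = set(recommended[:n])
    relevant = set(relevant)
    hits = len(top_n & relevant)
    precision = hits / n                              # hits among the N recommended
    recall = hits / len(relevant) if relevant else 0  # hits among all relevant items
    return precision, recall
```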

FilmTrust MovieLens
Additionally, Figure 8 contains the nDCG results. From it, we can observe the same trends as those shown in fig. 7. In FilmTrust (fig. 8a), the quality of the recommendation lists does not vary regardless of whether the variational approach is used or not. In MovieLens (fig. 8b) and MyAnimeList (fig. 8c), the combination of the variational approach with simple modeling such as DeepMF provides the best results. In Netflix (fig. 8d), the variational approach significantly worsens the quality of the recommendation lists.

Finally, table 3 shows the total time and epochs required by each model to be fitted to each dataset using a Quadro RTX 8000 GPU. Best time for each dataset is in bold. We can observe that adding a variational layer to the model significantly reduces the time required for fitting. Variational models are able to generate Shannon entropy that is transferred to the regression stage, leading to a more effective training that requires fewer epochs to be fitted. Therefore, the fitting time needed to reach acceptable results is substantially lower.
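The nDCG measure used above can be sketched as follows (standard definition over graded relevances, not the paper's code):

```python
import numpy as np

def ndcg_at_n(relevances, n):
    # relevances: graded relevance of each item in the ranked list, in order.
    rel = np.asarray(relevances[:n], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # 1/log2(rank + 1)
    dcg = np.sum(rel * discounts)
    # Ideal DCG: same relevances sorted in the best possible order.
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:n]
    idcg = np.sum(ideal * discounts[:ideal.size])
    return float(dcg / idcg) if idcg > 0 else 0.0
```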

Conclusions
In the latest trends, the accuracy of RSs is being improved by using deep learning models such as deep matrix factorization and neural collaborative filtering. However, these models do not incorporate stochasticity in their design, as variational autoencoders do. Variational random sampling has been used to create augmented raw input data in the collaborative filtering context, but the inherent collaborative filtering data sparsity makes it difficult to get accurate results. This paper applies the variational concept not to generate augmented sparse data, but to create augmented samples in the latent space codified at the dense inner layers of the proposed neural network. This is an innovative approach that combines the potential of variational stochasticity with the augmentation concept. Augmented samples are generated in the dense latent space of the neural network model; in this way, we avoid the sparse scenario in the variational process.
The results show an important improvement when the proposed models are applied to middle-size representative collaborative filtering datasets, compared to the state-of-the-art baselines, testing both prediction and recommendation quality measures. In contrast, testing on the huge Netflix dataset not only leads to no improvement, but the recommendation quality worsens: increasing Shannon entropy in rich latent spaces causes the negative effect of the introduced noise to exceed its benefit. Therefore, the proposed deep variational models should be applied seeking a fair balance between their positive enrichment and their negative noise injection. The results presented in this work can be considered generalizable, since they were analyzed on four representative and open CF datasets. Researchers can reproduce our experiments and easily create their own models by using the provided framework referenced in section 2. The authors of this work are committed to reproducible science, so the code used in these experiments is publicly available.
Among the most promising lines of future work, we propose the following: 1) introducing the variational process in alternative inner layers of the relevant architectures in the collaborative filtering area; 2) screening the learning evolution in the training process, since it is faster than in the classical models but also requires early stopping in the training stage; 3) providing further theoretical explanations of the properties of the CF datasets, in terms of Shannon entropy or other statistical features, that ensure a good performance of the proposed models; 4) applying probabilistic deep learning models in the CF field to capture complex non-linear stochastic relationships between random variables; and 5) testing the impact of the proposed concept when recommendations are made to groups of users.

Figure 3 :
Figure 3: Representation of an autoencoder latent space for the MNIST dataset (left side) versus the equivalent VAE latent space representation (right side).
The architectural details of the proposed models are shown in fig. 6. For simplicity, only the Variational Deep Matrix Factorization (VDeepMF) architecture is shown in this figure. The corresponding model for NCF, named Variational Neural Collaborative Filtering (VNCF), is analogous to the VDeepMF one: it has the same 'Embedding' and 'Variational' layers and only replaces the 'Dot' layer of DeepMF with a 'Concatenate' layer followed by an MLP.
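The difference between the two regression heads can be sketched in plain NumPy (random placeholder weights; a trained model learns them, and the latent vectors p and q here are illustrative samples):

```python
import numpy as np

rng = np.random.default_rng(0)
K_dim = 4  # latent space dimension (illustrative)
p, q = rng.standard_normal(K_dim), rng.standard_normal(K_dim)

# VDeepMF head: plain dot product of the two latent samples.
vdeepmf_out = float(p @ q)

# VNCF head (sketch): concatenate p and q and pass them through a tiny MLP.
# Weights here are random placeholders; a trained model would learn them.
W1 = rng.standard_normal((2 * K_dim, 8))
b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1))
b2 = np.zeros(1)

h = np.maximum(0.0, np.concatenate([p, q]) @ W1 + b1)  # ReLU hidden layer
vncf_out = float(h @ W2 + b2)
```

The MLP head can capture non-linear interactions between p and q that the dot product cannot, at the cost of extra hyper-parameters and training time, as discussed above.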
 3 def sampling(args):
 4     z_mean, z_var = args
 5     epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0., stddev=1)
 6     return z_mean + K.exp(z_var) * epsilon
 7
 8 user_input = Input(shape=[1])
 9 user_embedding = Embedding(num_users, latent_dim)(user_input)
10 user_embedding_mean = Dense(latent_dim)(user_embedding)
11 user_embedding_var = Dense(latent_dim)(user_embedding)
12 user_embedding_z = Lambda(sampling)([user_embedding_mean, user_embedding_var])
13 user_vec = Flatten()(user_embedding_z)
14
15 item_input = Input(shape=[1])
16 item_embedding = Embedding(num_items, latent_dim)(item_input)
17 item_embedding_mean = Dense(latent_dim)(item_embedding)
18 item_embedding_var = Dense(latent_dim)(item_embedding)
19 item_embedding_z = Lambda(sampling)([item_embedding_mean, item_embedding_var])
20 item_vec = Flatten()(item_embedding_z)
21
22 dot = Dot(axes=1)([item_vec, user_vec])

Figure 7 :
Figure 7: Quality of the recommendations measured by precision and recall. The higher the better.

Figure 8 :
Figure 8: Quality of the recommendation lists measured by nDCG. The higher the better.

Table 1: Main parameters of the datasets used in the experiments (number of users, number of items, number of ratings, and range of scores).

Table 2 :
Quality of the predictions.

Table 3 :
Fitting time using a Quadro RTX 8000.