Introduction

The massive volume of user-item interaction data on the internet today has expedited the creation of diverse personalized recommendation models that present to users a set of unseen items that may be of interest to them [6, 19, 43, 44]. Among the several recommendation techniques available, matrix factorization (MF) is a widely adopted traditional model-based collaborative filtering technique [19, 20, 28]. Nevertheless, these MF models have pertinent issues. First, because MF models are typically linear, their modeling capability is limited. Second, making a rating-based prediction for a new user/item requires an arduous optimization process to discover the related user/item embeddings, and this process does not capture the essential non-linear user-item interactions. Third, MF models are prone to overfitting because only a few ratings are available for most users and items. As a result, MF models are heavily regularized; however, standard \( L_1 \) or \( L_2 \) regularizers are hard to tune to obtain optimal results. To address these problems, deep neural network (DNN) based collaborative filtering models such as the autoencoder-based recommender system (Autorec) [29], collaborative deep learning (CDL) [37], and the collaborative denoising autoencoder (CDAE) [40] employ a parameterized function to translate user feedback into user embeddings. These DNN-based models can capture the non-linear interaction patterns within the user-item feedback, regularize the model differently, and predict item ratings for new users without requiring extensive iterative training.

Recently, the variational autoencoder (VAE) has been adopted for collaborative filtering, where it yields significantly better performance than Autorec, CDL, and CDAE. Specifically, Lee et al. [21] designed an augmented CF model using the ladder VAE [32] and used adversarial learning to regularize their proposed collaborative recommendation models. Also, the underfitting problem that arises when a VAE is used to model huge, sparse, and high-dimensional data was addressed by Liang et al. [23], who presented Multi-VAE, a VAE-based CF framework that employs a multinomial conditional likelihood. Again, a few research efforts focused on improving the objective function of VAE-based CF models by introducing flexible and robust priors other than the usual standard normal priors, which lead to flawed model learning performance [16, 30]. Despite these efforts to integrate Bayesian inference and uncertainty modeling, VAE-based CF models cannot be optimized easily, mainly because of the inherent bias of variational inference. Generally, VAE-based CF models are premised on the idea that the posterior can be decomposed into numerous independent factors and that variational inference must determine the best posterior approximation within a chosen parametric family of distributions. Nonetheless, the models are not adaptable enough to reflect the real posterior and the uncertainty of the data unless a precise family of distributions is selected. We also highlight that the KL divergence with the aggregated posterior prior (AP) adopted in [16] cannot be computed in closed form, which limits the full potential of this optimal prior.

In this paper, we propose an Implicit Optimal Variational Autoencoder model for collaborative filtering (IOVA-CF) to address the current limitations of VAE-based CF models. Notably, IOVA-CF utilizes a new implicit optimal prior, which aids in generating better latent representations. This implicit optimal prior (IoP) is an aggregated posterior prior (AP), i.e., the expectation of the variational posterior over the data distribution. However, unlike earlier AP approaches, IoP computes the KL divergence via the density ratio technique rather than explicitly modeling the aggregated posterior. Moreover, the density ratio trick lets us calculate the KL divergence between the aggregated posterior and the encoder model in closed form, making IoP an optimal prior for maximizing the objective function. Additionally, unlike earlier VAE-based recommendation models, IOVA-CF significantly alleviates the over-regularization issue. Furthermore, IOVA-CF can adequately capture the latent space’s uncertainty. In summary, the following are our major contributions:

  • We propose IOVA-CF, a novel variational collaborative filtering model that takes advantage of an implicit optimal aggregated posterior prior for effective latent representation learning. Unlike traditional MF methods and representative DNN models, IOVA-CF can capture non-linear user-item interactions and effectively generalize user feedback data.

  • The proposed implicit aggregated posterior prior allows the KL divergence to be computed in closed form, making the IOVA-CF model’s objective function very useful and optimal and leading to a very efficient optimization process.

  • Unlike previous VAE-based CF models, IOVA-CF adequately captures the latent space’s uncertainty and drastically reduces over-regularization. Furthermore, IOVA-CF produces superior latent representations from which we can generate excellent recommendations.

  • Finally, we conduct empirical evaluations with several competitive baseline models on four (4) real-world datasets to demonstrate the superior performance of IOVA-CF.

The remainder of this paper is organized as follows. In the “Related works” section, we discuss the related literature. The “Preliminaries and background” section presents the paper’s definitions and notations as well as the salient fundamental concepts that form the basis of our research. In “The IOVA-CF recommendation model” section, we provide the details of our proposed IOVA-CF model. The empirical setup, analysis, and results are described in the “Empirical study” section. In the “Conclusion” section, we state our concluding remarks.

Related works

This section reviews traditional collaborative filtering techniques and deep neural network-based collaborative filtering frameworks, focusing on related deep generative models. Finally, we compare our study to these existing models, underlining the significant differences.

Traditional collaborative filtering models

Traditional CF methods intend to predict the preferences of users for items by learning from user-item historical interactions, employing either explicit feedback (e.g., ratings and reviews) or implicit feedback (e.g., clicks and views) [3, 11, 19]. In general, there are two kinds of CF-based techniques: memory- and model-based methods [19]. The former typically uses the original user-item interaction data (e.g., rating matrices) to infer unseen ratings by combining the preferences of similar users or similar items. The latter is based on the idea that a low-dimensional latent vector can represent a user’s taste or an item’s attributes. Mainly, the high-dimensional user-item rating matrix is decomposed into low-dimensional user and item latent vectors, and the recommendation prediction is subsequently made by computing the dot product of the user and item latent vectors [20, 28]. Some of the leading traditional model-based CF methods include matrix factorization (MF) [20], non-negative matrix factorization (NMF) [42], factorization machine (FM) [25, 26], probabilistic matrix factorization [28], tensor factorization (TensorF) [1], SVD++ [18], collective matrix factorization (CMF) [31], and SVDFeature [5].

Deep neural networks for collaborative filtering

Recent progress in deep neural network (DNN) research has resulted in several studies adopting deep learning techniques for recommendation tasks [41, 43, 44]. The main goals of using DNNs for recommendation tasks are two-fold: modeling the non-linear user-item interactions given the representations, and modeling the representations of users and items. According to current research trends, DNNs provide good insights into user demands, item characteristics, and historical interactions. Besides, DNNs have largely alleviated the limited prediction power of traditional RS models, which stems from the conflict between users’ complex preferences and the simple linear modeling ability of such models.

Some representative DNN-based works for collaborative filtering include the autoencoder-based recommender system (Autorec) [29], collaborative deep learning (CDL) [37], and the collaborative denoising autoencoder (CDAE) [40]. Autorec and CDAE are autoencoder-based models that take the user and item information as input and utilize an encoder parameterized by a neural network to learn a hidden representation; the input is then reconstructed from the hidden representation by a decoder parameterized by another neural network. CDL integrates the stacked denoising autoencoder (DAE) and probabilistic matrix factorization (PMF) and is used to generate item embeddings from side information. Furthermore, the neural CF [12] model generalizes MF for CF by employing a neural network to overcome the limitation of linear interactions inherent in MF. Some other models adopted the CNN architecture for interaction modeling [13, 34]; the outer product of user and item embeddings is used to construct interaction maps for these models, which helps to directly acquire the pairwise correlations between embedding dimensions. Recently, the success of Graph Neural Networks (GNNs) for modeling graph-structured data has resulted in several works that model the user-item bipartite graph structure for neural graph-based representation learning. Neural Graph Collaborative Filtering (NGCF) [38], a representative GNN-based CF model, utilizes graph convolutions to model user-item behavior in the original space. Also, Disentangled Graph Collaborative Filtering (DGCF) [39] was proposed to learn useful representations of users and items from their interaction data; DGCF pays close attention to user-item linkages at the finer granularity of user intents. Both NGCF and DGCF produce excellent recommendation performance.

Deep generative models (DGMs) have garnered much traction due to their capacity to learn the joint density of data and to learn relevant features from massive unlabeled datasets. DGMs are attractive because they are not task-specific and can assist various downstream tasks, such as recommendation. Generative Adversarial Nets (GAN) [9] and Variational Autoencoders (VAE) [17] are two well-known deep generative models. GANs are frequently utilized in recommender systems built on deep generative methods [7, 8]; they mainly assist in ranking or matching processes without density estimation. Many GAN-based models have been suggested, each focused on a different element of recommendation, such as enhancing model resilience using adversarial samples or optimizing model parameter inference with adversarial learning. Besides, GANs are used for data augmentation in RS models by approximating the actual data distribution with the minimax framework, which aids in addressing data sparsity issues.

In terms of variational autoencoders, Li et al. [22] designed the first VAE-based deep generative recommender system, referred to as the collaborative variational autoencoder (CVAE). CVAE uses the vanilla VAE [17] in the collaborative filtering (CF) scenario to jointly model content generation and rating information. Lee et al. [21] designed an augmented CF model using the ladder VAE [32] and used adversarial learning to regularize their collaborative recommendation models. According to Liang et al. [23], VAE models deteriorate because of underfitting when modeling massive, sparse, high-dimensional data; they proposed a multinomial conditional likelihood-based VAE framework to mitigate this problem. Subsequent VAEs utilized for CF-based recommendation tasks aim to incorporate multiple auxiliary features or to improve latent factor representation learning [2, 15]. A VAE was used in a recent study [36] to learn users’ latent interest space and to produce useful new items that were not in the training set; nonetheless, the subject of their research is beyond the scope of our work. Very recently, a few research efforts focused on improving the objective function of VAE-based RS models by introducing flexible and robust priors other than the usual standard normal priors, which lead to poor model learning performance [16, 30]. Finally, we refer readers to these review papers [33, 43], which are extensive surveys on recommendation systems.

Table 1 Notations used in this paper

Key differences

Our work differs from the previous works in a variety of ways. The essential differences are as follows. Unlike traditional CF-based recommendation models, we treat the recommendation problem in a probabilistic context, allowing Bayesian inference and capturing non-linear user-item interactions. In addition, our model differs from earlier VAE-based CF models in two critical ways. First, unlike previous models that utilized the aggregated posterior prior (AP), IoP calculates the KL divergence via the density ratio technique rather than explicitly estimating the aggregated posterior. Furthermore, IoP uses the density ratio technique to compute the KL divergence between the aggregated posterior and the encoder model in closed form, making IoP an effective prior for maximizing the objective function. Second, for the item recommendation problem, we introduce a novel evidence lower bound (ELBO) that aids in the production of high-quality latent representations and considerably reduces the prevalent over-regularization issue. In practice, the IOVA-CF model we present can learn excellent latent features for delivering high-quality recommendations.

Preliminaries and background

We start this section by defining and describing the notations used in this work. Next, we present the relevant preliminaries and background for our proposed model.

Definitions and notations

Here, the variable \( u \in \{1,\ldots ,m\} \) denotes a user, and items are denoted by \( i \in \{1,\ldots ,n\} \). The user-item rating matrix is expressed as \( X \in {\mathcal {R}}^{m\times n} \). Every user u is associated with a row of the matrix (\( x_u \in {\mathcal {R}}^n \)) that denotes the bag-of-items vector of the designated user. In the case of online rating-based reviews, \( x_u \) depicts user u’s ratings, with zero values indicating items that were not rated by user u. The notation is summarized in Table 1.

Previous VAE-based collaborative filtering models

Generally, most VAE-based CF models, including Multi-VAE, follow a similar architecture. Specifically, the generative process of Multi-VAE begins by sampling the latent representation \( z_u \in {\mathcal {R}}^k \) for every user u from the standard normal prior distribution \( {\mathcal {N}}(z|0,{\mathbf {I}}) \). Next, to map these latent representations \( z_{u} \) to a probability distribution over the items \( \pi (z_u)\in \Delta ^{n-1} \) (an \((n-1)\)-simplex), we apply a non-linear function \( f_\theta \) followed by a softmax function. The \((n-1)\)-simplex is the geometric space in which the n probabilities depicting a user’s taste for the items live.

$$\begin{aligned} \pi (z_u)\propto \exp \{f_\theta (z_u)\} \end{aligned}$$
(1)

Here, \( x^{\prime }_{u} \in {\mathcal {R}}^n \), the user’s bag-of-items vector, is sampled from a multinomial distribution with success probabilities \( \pi (z_{u}) \):

$$\begin{aligned} x^{\prime }_{u}\sim Mult(N_u,\pi (z_{u})) \end{aligned}$$
(2)

where \( N_u=\sum _i x_{u_i} \). The log-likelihood for \( x_{u} \) is:

$$\begin{aligned} \log {p_\theta (x^{\prime }_{u}|z_{u})}=\sum _i x_{u_i}{\log {\pi _i(z_{u})}} \end{aligned}$$
(3)
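For concreteness, the following is a minimal PyTorch-style sketch of this multinomial log-likelihood; the names `logits` (the unnormalized outputs \( f_\theta (z_u) \)) and `x_u` (the bag-of-items vector) are illustrative and not taken from the authors’ code.

```python
import torch
import torch.nn.functional as F

def multinomial_log_likelihood(logits: torch.Tensor, x_u: torch.Tensor) -> torch.Tensor:
    # log p(x'_u | z_u) = sum_i x_ui * log pi_i(z_u), with pi(z_u) = softmax(f_theta(z_u))
    log_pi = F.log_softmax(logits, dim=-1)   # shape: (batch, n_items)
    return (x_u * log_pi).sum(dim=-1)        # one log-likelihood value per user
```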

In the inference stage, the intractable posterior distribution \( p_\theta (z_{u}|x_{u}) \) is estimated. The fundamental goal is to discover a simpler, tractable variational distribution that approximates the true posterior:

$$\begin{aligned} q_\phi (z_{u}|x_{u}) = {\mathcal {N}}\left( \mu _\phi (x_{u}),diag\{\sigma ^2_{\phi }(x_{u})\}\right) \end{aligned}$$
(4)

where the parameters \( \mu _\phi (x_{u}) \in {\mathcal {R}}^k \) and \( \sigma _\phi (x_{u}) \in {\mathcal {R}}^k \) are computed by the inference network (encoder). Typically, in training VAE-based models, we attempt to maximize the marginal data likelihood \( P(X) = \int p(X|z)p(z) \mathrm{d}z \). A neural network is used to model the non-linear function \( f_\theta \); however, P(X) is intractable, and the optimization is difficult. This problem is resolved by employing amortized variational inference [10]. The per-datapoint evidence lower bound (ELBO) that is optimized is expressed as:

$$\begin{aligned} \begin{aligned}&{\mathcal {L}}(x_u;\theta ,\phi ) \equiv {\mathbb {E}}_{q_\phi (z_u|x_u)}[\log p_\theta (x^\prime _u|z_u)]\\&-\beta KL(q_\phi (z_u|x_u)||p(z_u)) \le \log p(x_u;\theta ) \end{aligned} \end{aligned}$$
(5)

Here, \( p_\theta (x^\prime _u|z_u) \) is the generative network (decoder), parameterized by a neural network with parameters \( \theta \), and \( p(z_u) \) denotes the prior distribution of the latent variables. An approximation of the true posterior is estimated with an inference network (encoder) \( q_\phi (z_u|x_u) \), also parameterized by a neural network with parameters \( \phi \). Note that the \( \beta \)-annealing technique is a typical heuristic utilized for training VAEs when there is concern that the latent representation is underutilized. In this case, \( \beta \)-annealing is achieved by gradually annealing the KL term over many gradient updates of \( \theta \) and \( \phi \) and recording the best \( \beta \) value when performance reaches its peak.
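The following is a hedged sketch of the per-user ELBO in Eq. (5) together with a simple linear \( \beta \)-annealing schedule; the function names and the exact schedule are assumptions for illustration rather than the precise heuristic used in [23].

```python
import torch

def elbo(log_lik: torch.Tensor, mu: torch.Tensor, logvar: torch.Tensor,
         beta: float) -> torch.Tensor:
    # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return log_lik - beta * kl               # Eq. (5); negate to obtain a loss

def anneal_beta(step: int, total_steps: int, beta_max: float = 1.0) -> float:
    # Gradually increase beta over many gradient updates and record the best value.
    return min(beta_max, beta_max * step / total_steps)
```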

Limitation of VAE-based RS models

The primary objective in training VAE-based RS models, including Multi-VAE, is to maximize the reconstruction term while regularizing with the KL divergence between the inference network (encoder) and the prior distribution. The standard Gaussian distribution \( {\mathcal {N}}(z|0,{\mathbf {I}}) \) is usually selected as the prior. Nevertheless, this is not an optimal and effective prior for a VAE: this simple prior results in poor density estimation caused by over-regularization. The over-regularization issue causes latent variables to be ignored by the model during the generation phase. As a result, these VAE-based RS models cannot realize their full potential in generating useful recommendations.

Fig. 1 An illustration of the IOVA-CF model architecture

The aggregated posterior prior

The optimal prior that maximizes the VAE objective function is called the aggregated posterior prior (AP). This optimal prior (\( p^*_\lambda (z) \)) can be obtained by evaluating the expression:

$$\begin{aligned} p^*_\lambda (z)=\int p_D(x)q_\phi (z|x)dx \equiv q_\phi (z) \end{aligned}$$
(6)

Here, \( p_D(x) \) denotes the data distribution, \( q_\phi (z|x) \) depicts the variational posterior and \(q_\phi (z)\) is referred to as the aggregated posterior. We highlight here that when the aggregated posterior prior is used, the KL divergence as depicted in Eq. (7) cannot be computed in closed form. Hence, the potential usage of this optimal prior is greatly limited.

$$\begin{aligned} D_{KL}(q_\phi (z|x)\parallel q_\phi (z))={\mathbb {E}}_{q_\phi (z|x)}\left[ \ln \frac{q_\phi (z|x)}{q_\phi (z)}\right] \end{aligned}$$
(7)

To overcome this challenge, Tomczak et al. [35] employed a finite mixture of inference networks (encoders) to compute the KL divergence. Provided with a dataset \( {\mathbf {X}} = \left\{ x_1,...,x_N\right\} \), the aggregated posterior was modelled by an empirical distribution:

$$\begin{aligned} q_\phi (z)\simeq \frac{1}{N}\sum _{i=1}^{N}q_\phi (z|x_i) \end{aligned}$$
(8)

However, this empirical distribution is prone to over-fitting. Hence, they resorted to the following distribution instead:

$$\begin{aligned} q_\phi (z)\simeq \frac{1}{K}\sum _{k=1}^{K}q_\phi (z|u_k) \end{aligned}$$
(9)

Here, K is the number of mixture components (pseudo-inputs), and \( u_k \) is the D-dimensional vector that serves as the pseudo-input for the encoder. If \( K \ll N \), over-fitting can be managed considerably. The KL divergence with this prior, called the VampPrior, can be computed via Monte Carlo approximation. The VampPrior results in better density estimation than a standard Gaussian prior or a mixture of Gaussian priors.
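To make this concrete, below is a minimal PyTorch sketch (not the code of [35] or [16]) of a one-sample Monte Carlo estimate of the KL divergence under the VampPrior of Eq. (9); it assumes `encoder` returns the mean and log-variance of a diagonal Gaussian, and all names are illustrative.

```python
import math
import torch

def log_normal_diag(z, mu, logvar):
    # log N(z; mu, diag(exp(logvar))), summed over the latent dimensions
    return -0.5 * (math.log(2 * math.pi) + logvar
                   + (z - mu) ** 2 / logvar.exp()).sum(-1)

def vamp_kl_mc(z, mu_q, logvar_q, encoder, pseudo_inputs):
    # One-sample estimate of KL(q_phi(z|x) || (1/K) sum_k q_phi(z|u_k))
    log_q_zx = log_normal_diag(z, mu_q, logvar_q)                 # (batch,)
    mu_k, logvar_k = encoder(pseudo_inputs)                       # each (K, d)
    log_comps = log_normal_diag(z.unsqueeze(1), mu_k, logvar_k)   # (batch, K)
    log_prior = torch.logsumexp(log_comps, dim=1) - math.log(pseudo_inputs.shape[0])
    return log_q_zx - log_prior
```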

Limitations of VampPrior

The crucial limitations of this approach are as follows. First, sensitive hyperparameters such as K and \( u_k \) are hard to tune, and difficulty in obtaining the correct hyperparameter values can negatively impact the model’s performance. Second, this explicit modeling of the aggregated posterior incurs a high computational cost, making the VampPrior less practical. To address the limitations of traditional priors and the shortcomings of this aggregated posterior prior in VAE-based RS models, we adopt a novel implicit strategy to model the aggregated posterior prior and utilize it for the recommendation task. We describe the details in “The IOVA-CF recommendation model” section.

The IOVA-CF recommendation model

This section outlines our expressive prior estimation strategy that resolves the limitations of AP. We further describe our generation and inference procedures and the model optimization process. The overall architecture of IOVA-CF is illustrated in Fig. 1.

The expressive aggregated prior for IOVA-CF

As depicted in Eq. (7), the KL divergence between the inference network and the aggregated posterior prior is the expectation of the logarithm of the ratio \( q_\phi (z|x)/q_\phi (z) \). To estimate the ratio of these two distributions, we adopt the density ratio technique, which allows us to avoid the problematic explicit modeling of the AP. Particularly, we employ a probabilistic binary classifier \( D(x, z) \) to estimate the ratio \( q_\phi (z|x)/q_\phi (z) \) (see Fig. 1).

However, directly using the density ratio trick for the recommendation task with high-dimensional data will result in poor performance because the density ratio trick has serious limitations in high-dimensional settings. For instance, if we have high-dimensional data x, \( q_\phi (z|x)/q_\phi (z) \) also yields a high-dimensional density ratio. Since \( q_\phi (z|x) \) is a conditional distribution of z given x, the density ratio trick has to use a probabilistic binary classifier \( D(x, z) \), which accepts x and z jointly as inputs. In fact, \( D(x, z) \) estimates the density ratio of the joint distributions of x and z. We therefore rewrite the KL divergence to eliminate density ratio estimation in the high-dimensional space. Here, \( D_{KL}(q_\phi (z|x)\parallel q_\phi (z)) \) is expressed as:

$$\begin{aligned}&D_{KL}(q_\phi (z|x)\parallel q_\phi (z))\nonumber \\&\quad ={\mathbb {E}}_{q_\phi (z|x)}\left[ \ln \frac{q_\phi (z|x)}{q_\phi (z)}\right] \nonumber \\&\quad =\int q_\phi (z|x) \ln \frac{q_\phi (z|x)}{p(z)} \mathrm{d}z + \int q_\phi (z|x) \ln \frac{p(z)}{q_\phi (z)} \mathrm{d}z\nonumber \\&\quad =D_{KL}(q_\phi (z|x)\parallel p(z))- {\mathbb {E}}_{q_\phi (z|x)}\left[ \ln \frac{q_\phi (z)}{p(z)}\right] \end{aligned}$$
(10)

Here, the first term in Eq. (10) represents the KL divergence between the inference network (encoder) and the standard Gaussian distribution; this term can be computed in closed form. The second term is the expectation of the logarithm of the density ratio \( q_\phi (z)/p(z) \), which we estimate using the density ratio technique. Because the latent variable z is low-dimensional, the density ratio approach is effective here. We compute the density ratio \( q_\phi (z)/p(z) \) as follows. Firstly, we draw samples from \( q_\phi (z) \) and from p(z). Since p(z) and \( q_\phi (z|x) \) are Gaussian, we can easily sample from them. We can also sample from \( q_\phi (z) \) (the aggregated posterior) by employing ancestral sampling: we randomly select a data point x from the dataset and sample z from the encoder given this data point x. Secondly, samples from \( q_\phi (z) \) are labelled \( y = 1 \), and samples from p(z) are labelled \( y = 0 \). Next, \( p^*(z|y) \) is expressed as:

$$\begin{aligned} p^*(z|y) \equiv {\left\{ \begin{array}{ll} q_\phi (z) &{} (y = 1)\\ p(z) &{} (y = 0) \end{array}\right. } \end{aligned}$$
(11)

Thirdly, we define D(z), a probabilistic binary classifier that discriminates between the samples from \( q_\phi (z) \) and those from p(z). If D(z) discriminates these samples perfectly, we can rewrite the density ratio \( q_\phi (z)/p(z) \) by employing Bayes’ theorem and D(z) as follows:

$$\begin{aligned} \frac{q_\phi (z)}{p(z)}&= \frac{p^*(z|y=1)}{p^*(z|y=0)} = \frac{p^*(y=0)p^*(y=1|z)}{p^*(y=1)p^*(y=0|z)} \end{aligned}$$
(12)
$$\begin{aligned}&= \frac{p^*(y=1|z)}{p^*(y=0|z)} \equiv \frac{ D(z)}{1- D(z)} \end{aligned}$$
(13)

Here, \( p^*(y=0) \) equals \( p^*(y=1) \) because we draw the same number of samples from each distribution. Now, D(z) is modelled as \(\sigma (T_\psi (z)) \), where \( T_\psi (z) \) is a neural network with parameters \( \psi \) and input z, and \(\sigma (\cdot )\) denotes the sigmoid function. Next, \( T_\psi (z) \) is trained to maximize the objective function below:

$$\begin{aligned} T^*(z) = \max _\psi {\mathbb {E}}_{q_\phi (z)}[\ln (\sigma (T_\psi (z)))]\nonumber \\ +{\mathbb {E}}_{p(z)}[\ln (1-\sigma (T_\psi (z)))] \end{aligned}$$
(14)
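In practice, maximizing Eq. (14) amounts to minimizing a binary cross-entropy loss with label 1 for samples from the aggregated posterior and label 0 for samples from the prior. Below is a hedged sketch of this classifier objective; the network shape and the names `T_psi` and `latent_dim` are our assumptions, not the paper’s specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 128                                  # illustrative
T_psi = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                      nn.Linear(256, 1))          # outputs the logit T_psi(z)

def discriminator_loss(z_agg: torch.Tensor, z_prior: torch.Tensor) -> torch.Tensor:
    # z_agg: ancestral samples from q_phi(z) (pick a user x, sample z from the encoder)
    # z_prior: samples from the standard normal prior p(z)
    logits = torch.cat([T_psi(z_agg), T_psi(z_prior)], dim=0).squeeze(-1)
    labels = torch.cat([torch.ones(len(z_agg)), torch.zeros(len(z_prior))])
    return F.binary_cross_entropy_with_logits(logits, labels)  # minimizing this maximizes Eq. (14)
```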

We can compute the density ratio \( q_\phi (z)/p(z) \) by employing \( T^*(z) \) in the following way:

$$\begin{aligned} \frac{q_\phi (z)}{p(z)}&= \frac{\sigma (T^*(z))}{1-\sigma (T^*(z))} \Leftrightarrow T^*(z) = \ln \frac{q_\phi (z)}{p(z)} \end{aligned}$$
(15)

Finally, we can compute the KL divergence between the inference network and the aggregated posterior \( D_{KL}(q_\phi (z|x) \parallel q_\phi (z)) \) as:

$$\begin{aligned} D_{KL}(q_\phi (z|x) \parallel p(z))- {\mathbb {E}}_{q_\phi (z|x)}[T^*(z)] \end{aligned}$$
(16)
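For illustration, a minimal sketch of this estimator follows, assuming a diagonal Gaussian encoder and a trained classifier `T_psi` whose output approximates \( \ln (q_\phi (z)/p(z)) \); the function name is hypothetical.

```python
import torch

def kl_to_aggregated_posterior(mu, logvar, z, T_psi):
    # Analytic KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian posterior ...
    kl_to_prior = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    # ... minus a one-sample estimate of E_q[T*(z)], where T_psi(z) ~ ln q_phi(z)/p(z)
    return kl_to_prior - T_psi(z).squeeze(-1)
```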

Robust variational autoencoder architecture of IOVA-CF

The IOVA-CF model utilizes a robust variational autoencoder architecture for collaborative filtering. Unlike previous VAE-based CF models, our model employs an expressive implicit optimal prior for quality latent representation learning. We highlight that this technique considerably reduces the prevalent over-regularization problem, leading to the generation of high-quality recommendations. The details of the generation and inference processes of the IOVA-CF model are as follows:

Generation process

The generation procedure starts by sampling the latent representation \( z \in {\mathbb {R}}^k \) for a designated user u using the expressive implicit optimal prior method described in “The expressive aggregated prior for IOVA-CF” section. The input x employed in the model is binarized to depict implicit feedback, which is the typical strategy used for top-N recommendations. Next, we assume that the ratings of user u for every item follow a Bernoulli distribution: \( x_u^\prime |z_u \sim Bernoulli(\sigma (f_\theta (z_u))) \), where \( \sigma \) is the sigmoid function. Additionally, via our generative network \( p_\theta (x^{\prime }_{u}|z_{u}) \), a new vector \( {x^{\prime }_{u}} \) is produced as a distribution over the items. The model is expected to assign higher probability values to the items that are more likely to be picked by a user. Consequently, our model fits well in the context of top-N ranking-based recommendation.

Inference process

For our inference procedure, the primary goal is to approximate the intractable posterior distribution \( p_\theta (z_{u}|x_{u}) \). Here, we utilize the amortized variational inference technique [10]. The aim is to approximate the true posterior with a tractable variational distribution. The expression below depicts the output of our inference process:

$$\begin{aligned} q_\phi (z_{u}|x_{u}) = {\mathcal {N}}\left( \mu _\phi (x_{u}),diag\{\sigma ^2_{\phi }(x_{u})\}\right) \end{aligned}$$

where the parameters \( \mu _\phi (x_{u}) \in {\mathcal {R}}^k \) and \( \sigma _\phi (x_{u}) \in {\mathcal {R}}^k \) are computed by the encoder (inference network). Note that we employ the implicit optimal aggregated prior described in “The expressive aggregated prior for IOVA-CF” section to estimate the intractable posterior distribution. The advantage of our solution is that we can compute the KL divergence in closed form using the density ratio trick. Overall, this method aids in obtaining robust latent representations.
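As a concrete illustration, the following is a self-contained PyTorch sketch of one possible encoder/decoder pair implementing the inference and generation processes described above; the class name, layer sizes, and hidden width are illustrative assumptions rather than the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class IOVACFNet(nn.Module):
    def __init__(self, n_items: int, k: int = 128, hidden: int = 600):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_items, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, k)          # mu_phi(x_u)
        self.logvar = nn.Linear(hidden, k)      # log sigma^2_phi(x_u)
        self.decoder = nn.Linear(k, n_items)    # f_theta; Bernoulli logits

    def forward(self, x_u: torch.Tensor):
        h = self.encoder(x_u)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)              # reparameterization trick
        z = mu + eps * torch.exp(0.5 * logvar)
        logits = self.decoder(z)                # sigmoid(logits) gives the Bernoulli means
        return logits, mu, logvar, z
```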

Model optimization process

The final objective function for training our IOVA-CF model is obtained by combining the inference and generation processes with the discriminator network that aids in computing the KL divergence in closed form. Note that the inference and generation procedures are based on the implicit optimal aggregated posterior prior method introduced above. Overall, the final objective function is expressed as:

$$\begin{aligned}&\max _{\theta ,\phi } \int p_D(x){\mathcal {L}}(x;\theta ,\phi )\mathrm{d}x\nonumber \\&\quad = \max _{\theta ,\phi }\int p_D(x)\{-D_{KL}(q_\phi (z|x) \parallel p(z))\nonumber \\&\qquad + {\mathbb {E}}_{q_\phi (z|x)}[\ln p_\theta (x|z) + T_\psi (z)]\} \mathrm{d}x \end{aligned}$$
(17)

Here, \( T_\psi (z) \) is the maximizer of Eq. (14). Given the dataset \( {\mathbf {X}} = \left\{ x^{(1)},...,x^{(N)}\right\} \), we optimize the Monte Carlo approximation of this objective:

$$\begin{aligned}&\max _{\theta ,\phi }\frac{1}{N}\sum _{i=1}^{N}\{ -D_{KL}(q_\phi (z|x^{(i)}) \parallel p(z)) \nonumber \\&\quad +{\mathbb {E}}_{q_\phi (z|x^{(i)})}[\ln p_\theta (x^{(i)}|z) + T_\psi (z)]\} \end{aligned}$$
(18)

It is worth noting that we approximate the expectation term in Eq. (18) using:

$$\begin{aligned}&{\mathbb {E}}_{q_\phi (z|x^{(i)})}[\ln p_\theta (x^{(i)}|z) + T_\psi (z)]\nonumber \\&\quad \simeq \frac{1}{L}\sum _{l=1}^{L}\left[ \ln p_\theta (x^{(i)}|z^{(i,l)}) + T_\psi (z^{(i,l)})\right] \end{aligned}$$
(19)

where \( z^{(i,l)} = \mu _\phi (x^{(i)}) + \epsilon ^{(i,l)} \odot \sigma _\phi (x^{(i)}) \), \( \epsilon ^{(i,l)} \) is a sample drawn from \( {\mathcal {N}}(0,{\mathbf {I}}) \), \( \odot \) denotes the element-wise product, and L is the sample size of the reparameterization trick. The resultant objective function is:

$$\begin{aligned}&\max _{\theta ,\phi }\frac{1}{N}\sum _{i=1}^{N}\{ -D_{KL}(q_\phi (z|x^{(i)}) \parallel p(z)) \nonumber \\&\quad +\frac{1}{L}\sum _{l=1}^{L}\left[ \ln p_\theta (x^{(i)}|z^{(i,l)}) + T_\psi (z^{(i,l)})\right] \} \end{aligned}$$
(20)

Using stochastic gradient descent (SGD), we optimize the model by iterating the following two-step process: first, we fix \(\psi \) and update \(\theta \) and \(\phi \) to maximize Eq. (20); second, we fix \(\theta \) and \(\phi \) and update \(\psi \) to maximize the Monte Carlo approximation of Eq. (14) as follows:

$$\begin{aligned}&\max _{\psi }\frac{1}{M}\sum _{i=1}^{M} \ln (\sigma (T_\psi (z^{(i)}_1))) \nonumber \\&\quad + \frac{1}{M}\sum _{j=1}^{M} \ln (1-\sigma (T_\psi (z^{(j)}_0))) \end{aligned}$$
(21)

where the sample \( z^{(i)}_1 \) is drawn from \( q_\phi (z) \), the sample \( z^{(j)}_0 \) is drawn from p(z), and M is the sampling size of the Monte Carlo approximation. During training, \(\psi \) is updated more often than \(\theta \) and \(\phi \): if \(\theta \) and \( \phi \) are updated for \( J_1 \) steps, \(\psi \) is updated for \( J_2 \) steps, where \( J_2 > J_1 \). The optimization procedure of IOVA-CF is depicted in Algorithm 1, where the minibatch size of SGD is denoted as K.

Algorithm 1 The optimization procedure of IOVA-CF
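Since Algorithm 1 is presented as a figure, the following self-contained PyTorch sketch illustrates the alternating two-step optimization described above under our own assumptions; the layer sizes, learning rates, and names (`enc`, `T_psi`, `train_round`) are illustrative, `loader` is assumed to yield binarized user-feedback minibatches, and this is not the authors’ released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from itertools import cycle, islice

n_items, k = 10000, 128                       # illustrative sizes
enc = nn.Sequential(nn.Linear(n_items, 600), nn.Tanh())
mu_head, logvar_head = nn.Linear(600, k), nn.Linear(600, k)
dec = nn.Linear(k, n_items)                   # Bernoulli logits f_theta(z)
T_psi = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, 1))

vae_params = (list(enc.parameters()) + list(mu_head.parameters())
              + list(logvar_head.parameters()) + list(dec.parameters()))
opt_vae = torch.optim.SGD(vae_params, lr=1e-2)
opt_disc = torch.optim.SGD(T_psi.parameters(), lr=1e-2)

def encode(x):
    h = enc(x)
    mu, logvar = mu_head(h), logvar_head(h)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
    return z, mu, logvar

def vae_step(x):                              # fix psi; update theta and phi
    z, mu, logvar = encode(x)
    log_lik = -F.binary_cross_entropy_with_logits(dec(z), x, reduction="none").sum(-1)
    kl_std = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    loss = (kl_std - log_lik - T_psi(z).squeeze(-1)).mean()   # negative of Eq. (20)
    opt_vae.zero_grad(); loss.backward(); opt_vae.step()

def disc_step(x):                             # fix theta and phi; update psi
    with torch.no_grad():
        z_agg, _, _ = encode(x)               # ancestral samples from q_phi(z)
    z_prior = torch.randn_like(z_agg)         # samples from p(z)
    logits = torch.cat([T_psi(z_agg), T_psi(z_prior)]).squeeze(-1)
    labels = torch.cat([torch.ones(len(z_agg)), torch.zeros(len(z_prior))])
    loss = F.binary_cross_entropy_with_logits(logits, labels)  # maximizes Eq. (21)
    opt_disc.zero_grad(); loss.backward(); opt_disc.step()

def train_round(loader, J1=1, J2=5):          # J2 > J1, as stated in the text
    stream = cycle(loader)
    for x in islice(stream, J1):
        vae_step(x)
    for x in islice(stream, J2):
        disc_step(x)
```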

Empirical study

This section outlines the experimental settings and reports our findings and analysis from the empirical evaluation results. Here, we want to answer the following research questions to justify our model and scientific contributions:

  • RQ1: How does IOVA-CF fare when compared to other competitive CF methods?

  • RQ2: What is the impact of key model components and hyperparameters, such as the implicit optimal prior (IoP) and the dimension of the latent representation (d), on the overall performance of IOVA-CF?

  • RQ3: How robust is the IOVA-CF model in terms of computational efficiency and model convergence?

Experimental settings

The datasets used, the baseline models and metrics for evaluation, and the parameter settings are all described in this subsection.

Datasets

We conducted our empirical evaluations on four real-world and publicly available datasets: MovieLens-10m (ML-10m),Footnote 1 MovieLens-1m (ML-1m),Footnote 2 Amazon Electronics,Footnote 3 and Book-Crossing.Footnote 4 For all these datasets, we kept only ratings with a value of three (3) and above. We also kept only users with at least ten (10) interactions and items with at least ten (10) interactions. We additionally converted all scores to 1 because we focus on the implicit feedback setting. The dataset statistics after the preprocessing step are depicted in Table 2.
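For reference, a hedged pandas sketch of this preprocessing pipeline is shown below; the column names ("user", "item", "rating") and the iterative filtering loop are our assumptions about the raw files, not the authors’ exact script.

```python
import pandas as pd

def preprocess(ratings: pd.DataFrame, min_rating: float = 3.0,
               min_interactions: int = 10) -> pd.DataFrame:
    df = ratings[ratings["rating"] >= min_rating].copy()
    # Iterate because removing sparse items can push users below the threshold (and vice versa).
    while True:
        user_counts = df["user"].value_counts()
        item_counts = df["item"].value_counts()
        keep = (df["user"].isin(user_counts[user_counts >= min_interactions].index)
                & df["item"].isin(item_counts[item_counts >= min_interactions].index))
        if keep.all():
            break
        df = df[keep]
    return df.assign(rating=1)   # implicit-feedback setting: binarize all scores
```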

Table 2 Statistics of all datasets after pre-processing

Baseline models

We compare IOVA-CF to the following baseline models to validate its performance:

Pop: This model considers the most popular items in a dataset and recommends these items to users.

BPR [27]: BPR is a Bayesian-based framework that presents a generic optimization approach for personalized ranking.

WMF [14]: Weighted matrix factorization (WMF) is a low-rank factorization algorithm with a linear structure. We train the WMF model with alternating least squares, which generally outperforms SGD for this model.

ENMF [4]: ENMF is a well-optimized neural recommendation model that employs whole training data without sampling.

CDAE [40]: CDAE is a variant of a standard denoising autoencoder in which an additional per-user latent factor is added to the input. Following [23], we change the (user, item) entry subsampling used for SGD training in the original paper to user-level subsampling. Generally, this yields more stable convergence and better performance.

NeuMF [12]: NeuMF generalizes MF for CF using a neural network to overcome the constraint of linear interactions in MF. Similar to CDAE, the number of NeuMF’s parameters grows linearly with the number of users and items.

Multi-DAE [23]: This is a point-estimate, non-Bayesian, autoencoder-based recommendation model. Multi-DAE is a competitive model for top-N recommendations. Its objective function uses a multinomial likelihood, which is a notable characteristic.

Multi-VAE [23]: This is a representative VAE-based recommender systems model. It carries out the task of collaborative filtering for implicit interaction data using VAEs. Multi-VAE is a good top-N recommendation model. Its objective function uses multinomial likelihood, which is a distinguishing feature.

RecVAE [30]: RecVAE mainly adopts several new techniques to enhance Multi-VAE. These include a new composite prior distribution for the latent representations, a new method for hyperparameter tuning, and a new approach for model training.

MacridVAE [24]: MacridVAE is a model for learning disentangled representations from user behavior.

EVCF [16]: To investigate the effect of flexible priors in CF, EVCF uses the VampPrior, which was initially designed for image generation. The authors also adopted gating mechanisms in addition to the VampPrior for CF.

Evaluation metrics

We employ a variety of standard recommender system metrics, namely Precision@R (P@R), Recall@R (R@R), and the normalized discounted cumulative gain (NDCG@R). We contrast the predicted rank of the held-out items with their true rank using the Recall@R and NDCG@R metrics. The predicted rank is obtained by sorting the unnormalized predicted probabilities. NDCG@R employs a monotonically increasing discount to emphasize the relevance of higher rankings over lower ones, whereas Recall@R considers all items ranked within the first R to be equally relevant. Notably, we compute Recall and NDCG at rank positions 20, 50, and 100. Besides, we also measure Precision@20, Precision@50, and Precision@100.
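A minimal NumPy sketch of the truncated Recall@R and NDCG@R computations follows; Recall@R is normalized by \( \min (R, |\text{held-out items}|) \), as is common for this metric, and the function names are illustrative.

```python
import numpy as np

def recall_at_r(ranked, relevant, R=50):
    # `ranked`: item ids sorted by predicted score; `relevant`: set of held-out items
    hits = sum(1 for item in ranked[:R] if item in relevant)
    return hits / max(1, min(R, len(relevant)))

def ndcg_at_r(ranked, relevant, R=50):
    top_r = ranked[:R]
    gains = np.array([1.0 if item in relevant else 0.0 for item in top_r])
    discounts = 1.0 / np.log2(np.arange(2, len(top_r) + 2))
    dcg = float(np.sum(gains * discounts))
    ideal_hits = min(len(relevant), len(top_r))
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg if idcg > 0 else 0.0
```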

Parameter settings

We implemented all models, including IOVA-CF, with the PyTorch framework on a workstation with an NVIDIA GeForce 1070 GPU. For all our experiments, we divided each dataset into three parts: 80% training set, 10% validation set, and 10% test set. Concerning the baseline models, we employed the setup described in their respective papers unless otherwise stated and tuned the corresponding parameters to obtain the best performance. For all IOVA-CF models, we utilize a latent dimension of 100. We used the SGD optimizer for training the models and adjusted the learning rate and mini-batch size based on the validation results. Moreover, we utilized early stopping for training and evaluation: training stops when model performance does not increase for ten (10) successive epochs.

Table 3 Performance comparison of baseline models and IOVA-CF model on the Movielens-10M dataset
Table 4 Performance comparison of baseline models and IOVA-CF model on the Movielens-1M dataset
Table 5 Performance comparison of baseline models and IOVA-CF model on the Book Crossing dataset
Table 6 Performance comparison of baseline models and IOVA-CF model on the Amazon Electronics dataset

Performance comparison (RQ1)

Tables 3, 4, 5, and 6 depict the performance of IOVA-CF in comparison to existing state-of-the-art top-N recommendation models on the MovieLens-10m (ML-10m), MovieLens-1m (ML-1m), Book-Crossing, and Amazon Electronics datasets, respectively. A paired t-test is conducted for all the results; here, \(p < 0.001\) indicates statistical significance. After careful analysis, we observe the following:

  • Linear models such as Pop and WMF are not competitive. This result indicates the inability of linear models to effectively capture complex user-item interactions, which can aid in the recommendation task.

  • Generally, the neural network-based recommendation baseline models perform better than the linear baseline models. This trend demonstrates the importance of learning the non-linear user-item interactions. Notably, the neural network-based models can capture complex collaborative filtering signals.

  • Interestingly, we see that CDAE performs poorly on the Book-Crossing and the Amazon Electronics datasets. We attribute this pattern to CDAE’s sensitivity to the availability of data. Notably, the sparsity percentages of Book-Crossing and Amazon Electronics datasets are very high. Moreover, the observed interactions in these datasets are relatively smaller than the interactions in ML-1m and ML-10m.

  • Generally, Bayesian models (such as BPR and the VAE-based models) show strong performance over their non-Bayesian counterparts. A possible reason for this trend is that Bayesian models can efficiently capture the unobserved user-item interactions, thereby generalizing better than the non-Bayesian models. Surprisingly, BPR (a linear model) outperforms ENMF (a neural network-based model) across all the datasets in all the metrics. This demonstrates the effectiveness of framing the recommendation task in a Bayesian setting.

  • VAE-based methods that adopt complex prior distributions (RecVAE, EVCF, and our model: IOVA-CF) perform better than Multi-VAE, which essentially employs the vanilla VAE objective function. This trend suggests that the vanilla VAE-based method is limited in capturing the latent user-item interactions. Unlike Multi-VAE, RecVAE, EVCF, and IOVA-CF incorporate flexible and complex priors in the latent space. In effect, they can detect discernible hidden representations effectively. Besides, they can avoid the issue of user agnostic posterior approximation. These reasons justify the superior performance of RecVAE, EVCF, and IOVA-CF over Multi-VAE.

  • Regarding the models that employ complex prior distributions, EVCF and IOVA-CF show superior performance gains over RecVAE consistently across all the datasets in all the metrics. We attribute this trend to the fact that both EVCF and IOVA-CF are modeled with the aggregated posterior prior, which is an optimal prior distribution for maximizing the training objective of VAE-based CF models. In effect, it yields excellent density estimation of latent representations for producing good recommendation performance.

  • Consistently, our IOVA-CF model achieves the best performance for all the metrics on all the datasets. Notably, compared to the second-best model (EVCF), IOVA-CF achieves improvements of 4.12%, 4.78%, 6.37%, and 8.77% in the Recall@50 (R@50) metric on the MovieLens-10m (ML-10m), MovieLens-1m (ML-1m), Book-Crossing, and Amazon Electronics datasets, respectively. We attribute this superior performance to the fact that IOVA-CF models the user-item interactions via the implicit optimal aggregated posterior prior. Unlike EVCF, IOVA-CF can efficiently compute the whole training objective in closed form, yielding excellent latent representation learning and uncertainty estimation. Also, EVCF is very sensitive to hyperparameters that are difficult to tune, which is not the case for IOVA-CF.

Study of IOVA-CF (RQ2)

In this section, we investigate the impact and contribution of the implicit optimal prior (IOP) and the effect of the dimension of latent representation on model performance.

Impact of the implicit optimal prior (IOP)

Fig. 2 A study of the impact of IOP on IOVA-CF

To measure the impact of IOP, we conducted an extensive study of IOVA-CF by comparing it to six (6) other variants. The variants of IOVA-CF are as follows:

  • IOVA-CF_ML: This variant is formed by introducing multinomial likelihood in the learning framework of the IOVA-CF model.

  • IOVA-CF_G: In this variant, we embed the gating technique proposed in the EVCF model [16] into the IOVA-CF model.

  • IOVA-CF_H: Inspired by the model design of the EVCF model [16], we created a hierarchical architecture version of the IOVA-CF model to learn the contribution of such a design choice.

  • IOVA-CF_A: Here, we include the \( \beta \)-annealing regularization technique utilized in [23] into the learning framework of IOVA-CF.

  • IOVA-CF_DD: Motivated by the impact of a decoder without denoising in RecVAE [30], we use this variant to study the effect of decoder denoising.

  • IOVA-CF_noIOP: In this variant, we disabled the entire implicit optimal prior (IOP). This variant can be likened to a VAE-CF model with the vanilla objective function.

Note that, apart from IOVA-CF_noIOP, IOP is used in all the other variants. Figure 2 illustrates the performance of all these models on all the datasets for the Recall@50 and NDCG@50 metrics. Our findings after careful analysis are as follows:

  • We did not witness any increase in model performance with any of the variants of IOVA-CF across all the datasets for all the metrics. This implies that IOP is very robust even without additional model tweaks and regularizations. It also highlights the optimal nature of the aggregated posterior prior as it can independently produce excellent latent representations for high recommendation performance.

  • Three of the six variants, namely IOVA-CF_ML, IOVA-CF_G, and IOVA-CF_A, attained model performance generally similar to that of the IOVA-CF model.

  • Employing IOVA-CF_H and IOVA-CF_DD consistently resulted in decreased model performance. This trend implies that modeling IOP with a hierarchical architecture is ineffective in our context. Additionally, we see that utilizing decoder denoising is not helpful. Hence, we conclude that employing a decoder without denoising is preferable when using IOP. This finding is consistent with RecVAE [30].

  • IOVA-CF without IOP (IOVA-CF_noIOP) consistently resulted in poor model performance; its results were even below those of Multi-VAE. This vast discrepancy between IOVA-CF_noIOP and IOVA-CF highlights the disadvantage of modeling a VAE-CF model without a complex prior distribution. Complex priors like IOP can reasonably estimate the unobserved user-item interactions, leading to highly effective representation learning. Besides, complex priors help manage the over-regularization issue and yield good generalization.

Effect of the dimension of latent representations

Here, we study the impact of the dimension of the latent representations on the overall performance of IOVA-CF. The main motivation is that the density ratio trick is sensitive to high-dimensional data, so obtaining the dimension that yields excellent performance is extremely significant. We experiment with latent representation dimensions \( d = 32, 64, 96, 128, 160, 192, 224, 256 \). The experimental results, as depicted in Fig. 3, show that \( d = 128 \) consistently yields optimal model performance. Also, we see that the model performance deteriorates when \( d > 128 \).

Fig. 3 A study of the effect of the dimension of latent representations on model performance

Robustness of IOVA-CF (RQ3)

We examine the robustness of the IOVA-CF model in terms of computational efficiency and model convergence in this subsection.

Computational efficiency

It is worth noting that the aggregated posterior prior is the optimal prior for a VAE. However, the KL divergence between the inference network and the aggregated posterior prior cannot be computed in closed form. In EVCF, an explicit strategy is adopted to solve this issue. Particularly, the aggregated posterior in EVCF is formulated as \(p_\lambda (z)=\frac{1}{K}\sum _{k=1}^{K}q_\phi (z|u_k)\), where K is the number of pseudo-inputs and \(u_k\) is a D-dimensional vector that serves as a pseudo-input. Here, K and \(u_k\) are hyperparameters learned via backpropagation in the EVCF framework, and learning an optimal K and \(u_k\) is computationally expensive. Additionally, EVCF employs gating techniques and a hierarchical architecture that add to its computational cost. Overall, a high computational cost is required to obtain optimal results in the EVCF framework, which also makes hyperparameter tuning an arduous task. In contrast, IOVA-CF employs an implicit technique (the density ratio trick) that avoids the explicit computation of the aggregated posterior, making it a lightweight framework. Moreover, only a few hyperparameters need tuning in the IOVA-CF framework. Besides being efficient and lightweight, IOVA-CF can be trained within a reasonable time to obtain better performance than EVCF.

Fig. 4 The convergence of IOVA-CF

Table 7 Total training and evaluation times per epoch for all VAE-based models across all the datasets

Model convergence

Figure 4 depicts the training convergence curves of IOVA-CF on the four datasets for the Recall@20, NDCG@20, and Precision@20 metrics. We notice that the IOVA-CF model converges mainly within 50 to 60 epochs. Notably, the model converges after about 50 epochs on the Book-Crossing and Amazon Electronics datasets, and around 60 epochs on the ML-1m and ML-10m datasets. Our VAE-based baselines (Multi-VAE, RecVAE, MacridVAE, and EVCF) generally converge to stable performance after 150, 70, 100, and 75 epochs, respectively, on all the datasets. Additionally, in Table 7, we report the total training and evaluation time per epoch for all the VAE-based RS models. Overall, we conclude that IOVA-CF trains faster than the baseline VAE models. We remark that our proposed model’s rapid convergence to good performance is noteworthy: it implies that IOVA-CF can save training time while achieving exceptional outcomes. We ascribe this training efficiency to IOP, which aids IOVA-CF in generalizing quickly and learning quality user representations for delivering excellent recommendation performance.

Conclusion

We developed a novel variational collaborative filtering approach (IOVA-CF) in this paper. IOVA-CF employs an implicit optimal aggregated posterior prior for effective latent representation learning. Furthermore, the proposed IOVA-CF can adequately capture non-linear user-item interactions and generalize user feedback data well. Another significant innovation of IOVA-CF is that it can compute the KL divergence between the inference network and the proposed aggregated posterior prior (AP) in closed form, making IOVA-CF’s objective function very useful and optimal; we highlight that this enhances the model learning process. Besides, unlike previous VAE-based CF models, IOVA-CF significantly alleviates the over-regularization issue and can effectively capture the latent space’s uncertainty. Finally, through detailed empirical evaluations against a variety of competitive baseline models on four real-world datasets, we have validated the superior performance of IOVA-CF.