1 Introduction

When recommending books, movies, or certain products in subscription-based services, the long- or mid-term user profile plays a more important role than the sequence of the last few items consumed, which is typically available in non-subscription-based services. While the choice of short-term consumption products such as music depends more on the current mood of the user, products that users purchase less frequently and potentially engage with for a longer time [1] are better characterized by looking back at past user behavior. Although neural models have greatly improved performance for short-term, sequential recommendation compared with traditional approaches, they overemphasize the sequential information of sessions, that is, the order of items in each browsing session [2, 3]. In contrast to session-based recommendation, which predicts the next click or item for an ongoing session without knowing who is performing the actions, in this paper we concentrate on recommendation based on the long-term user profile, a task of practical relevance in itself or in a recommender ensemble combined with session-based methods.

In addition to focusing on recommendation based on long-term user profiles, we address the implicit recommendation task, where models rely on interaction feedback such as viewing an item or listening to a song, and only the lack of interaction serves as negative feedback. Several practitioners argue that most of the recommendation tasks they face involve implicit feedback [4]; in [5], the authors claim that 99% of recommendation industry tasks are implicit. Even if sufficiently many users are willing to explicitly rate items such as Netflix movies [6], much more detailed information is available in the form of implicit feedback on who interacted with which items. Even for Netflix movie ratings, the implicit feedback version “who rated what” is an interesting and practically important task [7]. Toward this end, we also consider explicit feedback research data by transforming ratings into interactions regardless of their value.

Our contribution to non-sequential implicit feedback recommendation is the mathematical understanding and improvement of nearest neighbor, one of the oldest and most popular approaches to collaborative filtering [8]. While a multitude of more sophisticated methods has been developed over the years [9,10,11], very few can match the simplicity of the classic k-nearest neighbor. This simplicity also makes the recommendations and behavior of the method easier to interpret than many more recent counterparts, which makes it attractive in practice. Recently, several thorough experiments showed that nearest neighbor methods are still often very competitive in terms of predictive performance [12], even when challenged by much more complex methods, and in some cases, even compared to sequential recommendation methods based on recurrent networks [13].

Despite the success of the nearest neighbor recommender, the literature lacks a theoretical analysis and understanding of the foundations of nearest neighbor methods for implicit feedback recommendation. Already in 2007, Bell and Koren [14] listed various concerns regarding certain theoretically unjustified components, including the lack of statistical foundations for the similarity functions and the difficulty of avoiding overfitting for similarities involving low-support items.

Our goal is to provide a probabilistic formulation, which has multiple uses. A probabilistic basis allows us to understand the behavior and properties of the algorithm in much more detail. Further, our work provides insight into not only the mechanics of the algorithm we investigate but of the data itself, both of which could potentially influence or inspire future research.

Even the most recent nearest neighbor implementations [15,16,17] depend on ad hoc choices and are heuristic in nature. In particular, a theoretical foundation is missing for the similarity functions [14]. Grounding machine learning approaches in probabilistic theory can become challenging, as the prediction tasks quickly become too complex for traditional statistical machinery to handle sufficiently. Oversimplification can fail to account for certain effects, which may result in the approach under-performing its heuristic counterparts, and accounting for these effects often requires nontrivial methods. In our approach, we investigate frequently used heuristics and reinterpret them in a probabilistic framework.

Prior to our work, research mostly focused on learning the similarities [14, 18] and there have been only a few attempts for a probabilistic interpretation of the nearest neighbor algorithm, mostly for the classification problem [19, 20]. While probabilistic analysis of nearest neighbor for explicit ratings exists [21], we are aware of no probabilistic formulations in the implicit case on how interactions with neighbors affect the preference of the item in question, which can also be combined with separating the popularity component of the implicit feedback.

Our work reinterprets the item-based nearest neighbor recommendation algorithm by considering similarity a conditional probability. We frame multiple heuristics as ways to deal with the uncertainty of the observed similarity and propose using the Bayesian credible interval lower bound as an alternative. For the selected set of neighbors, we argue that the classic prediction formula is in fact a probabilistic OR operation over multiple predictors, under the assumption that these predictors are independent. Although the independence assumption is invalid in general [14], empirical tests confirm that it works well in practice; additionally, methods to improve independence in the selected neighborhood remain an option for future work.

We also show that it is plausible that the independence assumption is responsible for introducing a strong popularity bias in the predictions. Our modeling of the popularity component in item similarity explains the popularity bias that has been observed many times either directly [22] or through the need for inverse frequency scaling [23,24,25].

Decoupling popularity from similarity is a key component of our work, which leads to more recommendations in the long tail without sacrificing accuracy. We give a Bayesian interpretation of the user behavior as a combination of popularity and personal taste. We consider each positive feedback observation as an OR operation over a popularity and a personal taste component. By assuming a Beta prior on the personal taste component, we derive a Bayesian formula for the posterior, which we then use in the nearest neighbor recommendation. Using this approach, we give a theoretically founded method to reduce the effect of selection bias toward the popular items when measuring personal taste.

As a summary, our key contributions toward a probabilistic interpretation of the nearest neighbor method are as follows:

  • We reinterpret similarity as a conditional probability.

  • We introduce the Bayesian credible interval lower bound to incorporate uncertainties.

  • We decouple popularity from personal taste by using Bayesian inference.

Limitations of our approach include the assumption that feedback is binary: we do not take into account the number of interactions between a given user-item pair, only whether one exists. Further, in this paper, we only consider the case of item-based nearest neighbor recommendation; however, most of the ideas presented could carry over to user-based nearest neighbor recommendation as well.

We experiment with a large variety of test data and compare to the recent baseline implementations of Dacrema et al. [12]. We find that our approach works well in practice, providing a notable recommendation accuracy increase over other item-based nearest neighbor approaches, while still closely matching the structure of the original method. Even though our goal is first and foremost to better understand the mechanics of the nearest neighbor method, this improvement also strongly supports the validity of our claims.

This paper is organized as follows. After reviewing related results, we give a brief overview of the main existing nearest neighbor approaches. The probabilistic interpretation and our new nearest neighbor algorithm are found in Sect. 4, and in Sect. 5, we describe our evaluation methodology and the conducted experiments.

2 Related results

Recommender systems [26] have become extremely common recently in a variety of areas including movies, music, news, books, and products in general. They produce a list of recommended items by either collaborative or content-based filtering. Collaborative filtering methods [27, 28] build models of the past user-item interactions, while content-based filtering [29] typically generates lists of similar items.

In the past few years, several papers addressed the so-called sequential recommendation problem, where the recommendation is based on the last sequence of items in a user session. This task is highly relevant when users are reluctant to create logins and prefer to browse anonymously. The most popular and successful algorithms use recurrent neural networks [30,31,32,33,34,35]; however, there are other approaches as well, such as graph neural networks [36]. A large-scale empirical evaluation of session-based recommendation algorithms [13] enumerates and compares these algorithms. Note that the main conclusion of that study is that “the progress in terms of prediction accuracy that is achieved with neural methods is still limited. In most cases, our experiments show that simple heuristic methods based on nearest neighbors schemes are preferable over conceptually and computationally more complex methods” [13].

Even more recently, several authors noticed that sequential recommendation provides suboptimal results as they cannot sufficiently capture long-term user preferences [2, 3, 37, 38]. While most of these papers concentrate on improving sequence recommenders by engineering features of past user interactions into their models, we believe that addressing the non-sequential component alone is an important contribution, since the best methods can then be combined in an ensemble recommender [39].

In this paper, we consider user-independent item-to-item recommendation based on the profiles of the entire past user interactions [27, 28]. The best known example of this task is the Amazon list of books related to the last visited one [27]. We model the implicit feedback of the user. The feedback may be explicit, such as one-to-five-star ratings of movies on Netflix [40]. Most recommendation tasks are, however, implicit, as the user provides no like or dislike information. In such cases, recommender systems have to rely on implicit feedback such as the time elapsed viewing an item or listening to a song. Although implicit recommendation is much more frequent in practice, explicit feedback is overrepresented in research data. For this reason, in our experiments, we consider a few explicit feedback research datasets by transforming a rating into an interaction regardless of its value.

Even when explicit feedback is available, it covers only a small fraction of the interactions, and due to this scarcity of explicit feedback, it is important to know who rated what [7]. Furthermore, each user often takes some implicit actions (e.g., click) before making an explicit decision (e.g., purchase); hence, the binarized, implicit feedback version of the original explicit feedback data is of high relevance [41]. Note that BPR [42] also turns both explicit and implicit feedback into user-dependent pairwise preferences regarding items. In our experiments, we consider a binarized version of the explicit feedback data of Amazon, MovieLens, Epinions, and more.

The first item-to-item recommender methods [27, 28, 43] used similarity information to directly find nearest neighbor transactions. Very early papers [44] already compared the performance of several variants; for a more recent comparison, see [45]. The two main variants are based either on user [43] or item [28] similarities; in [15], a reformulation is proposed that unifies the two approaches. Nearest neighbor algorithms differ in the similarity functions used [18, 46], in feature weighting [23], and in other normalization approaches [14]; some apply spectral methods to denoise the similarity matrix [17] or use different framings of item-to-item approaches [47].

However, the theory behind nearest neighbor methods has been criticized for at least two main reasons. First, the similarity metrics typically have no mathematical justification. Second, the confidence of the similarity values is often not taken into account when finding the nearest neighbors, which leads to overfitting in sparse cases. In [14], a method is given that learns similarity weights for users. Their method estimates user preference toward an item by regression over similar item ratings. By the nature of the regression, their method applies to explicit feedback recommendation only. One of our main goals is to provide an implicit feedback counterpart to their result, based on a probabilistic approach for implicit feedback recommendation.

Many works introduce probabilistic reasoning for various collaborative filtering approaches that use elements similar to ours. The notion of using conditional probability as similarity for nearest neighbor appears as early as [22]. In that work, the author already notes that the approach is known to lead to popularity bias and mentions TF-IDF and normalization as approaches to mitigate it; however, the proposed solution is based on a heuristic power function transformation of the conditional probability inspired by TF-IDF rather than a theoretically founded method. In [48], an undirected graphical model over the items is introduced with a fast training algorithm to learn the relations between items; however, there the main goal is to extend the heuristic solutions by neighbor interaction considerations. The work [49] computes probabilities for how likely two users are to be of the same personality type. Their method can be considered a Naive Bayes network with past positive user interactions as features; however, it does not account for the difference between sparse and dense observations. The paper [50] also deals with user behavior and collaborative filtering in a Bayesian setting; however, their work ventures rather far from the classic KNN formulation.

Of special note is the recent work [21] that has a very similar theme to ours, attempting to place KNN-based recommendation in a probabilistic framework, and also using a conditional probability-based explanation of similarity. However, the basic approach to problem formulation is different, with [21] marginalizing over categorical variables rather than treating events as binomial variables. As a result, the two approaches further diverge in how they treat ensemble prediction and related assumptions. We believe our formulation is more straightforward, while also allowing for a much more direct way of modeling popularity bias, and an intuitive solution to handling it through Bayesian inference. Our model also has the additional benefit of providing insight into the underlying user behavior.

One of our main theoretical and practical questions relates to separating popularity from personal taste in the feedback. Most authors address this question in the context of missing-not-at-random feedback [51,52,53,54,55]; however, they all take approaches different from ours. For example, in [56], the training procedure itself is de-biased. In [54], a thorough analysis is given of item popularity in recommendation, which focuses primarily on evaluation aspects such as the fraction of recommendations in the long tail. In [57], a probabilistic latent variable model is proposed to make use of the sparse signals from the users. Finally, for explicit ratings, the behavior on missing feedback is probabilistically modeled in [55]; however, it remains unclear how their model can be applied to implicit feedback. Since in our research, we have neither ground truth nor a testing user base, we restrict our attention to the user behavior as observed in the data. Popularity bias in the data does not primarily affect traditional offline recommendation accuracy metrics such as recall or NDCG, as the bias is present in the test set just as much as in the training set. Rather, it hinders recommendation in the long tail, which is considered particularly valuable for the users [58]. In our work, we are concerned with the bias introduced by the method as also observed in [22], instead of the bias present in the collected data.

Regarding popularity bias, as we discussed, several authors, for example [56, 57], attempt to de-bias the data itself to directly provide a better user experience. While several heuristics (regularization and shrink [14], popularity-stratified training [54], confidence weight [57], de-biasing the evaluation [56], etc.) were proposed to account for popularity, we require a well-grounded formulation for combining popularity and personal taste.

3 Background

In this section, we give an overview of the main variants of the classic item-based KNN using cosine similarity [28, 59]. We mention that the first variants were user similarity-based [43], and user-based approaches also perform well in recent experiments [12]. While our ideas could carry over to the user-based nearest neighbor problem, we restrict our attention to item similarity for clarity of presentation.

Nearest neighbor algorithms typically make an ad hoc choice of a similarity measure, which is only empirically justified. For example, different papers propose the Jaccard coefficient [18], Cosine [28], Asymmetric Cosine [46], and others such as Dice-Sorensen and Tversky similarities [12]. In some variants, we can consider the denominator in the similarity measure as a normalization term. Next, we describe Cosine similarity as an illustration and motivation for our definition of similarity as a conditional probability.

The item-based KNN algorithm, which is the main focus of this paper, uses a number of heuristics. First, the user-item interaction matrix may optionally be transformed using TF-IDF (or BM25) normalization [23]:

$$\begin{aligned} z_{u,i}^{\text {tfidf}} = \frac{\sqrt{z_{u,i}}}{\log {\left( \frac{M}{t_i}\right) }}\text {.} \end{aligned}$$
(1)

Here, \(z_{u,i}\in \{0,1\}\) is the observation of whether user u interacted with item i, M is the number of items, and \(t_i=\sum _u z_{u,i}\). Since we assume the matrix to be 0–1 valued, the square root in the numerator of the TF-IDF formula has no effect.
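As a minimal sketch of the normalization in Eq. (1) on a toy interaction matrix (the matrix values below are illustrative, not from the paper):

```python
import numpy as np

# Toy 0-1 interaction matrix Z (rows: users, columns: items); illustrative values.
Z = np.array([
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

M = Z.shape[1]       # M: number of items, as defined in the text
t = Z.sum(axis=0)    # t_i: number of interactions with item i

# Eq. (1); for a 0-1 valued matrix, sqrt(Z) equals Z itself.
Z_tfidf = np.sqrt(Z) / np.log(M / t)
```

Zero entries stay zero, while each positive interaction is scaled by the inverse of the item's IDF term.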

Next, the similarity values are computed between items. The cosine similarity value for items i and j is calculated as

$$\begin{aligned} \text {cos}(i,j) = \frac{\sum _u z_{u,i}z_{u,j}}{\sqrt{t_i}\sqrt{t_j}}\text {.} \end{aligned}$$
(2)

Optionally, this calculation involves a shrink value [14] to penalize items with too few observed interactions:

$$\begin{aligned} \text {sim}(i,j) = \frac{\sum _u z_{u,i}z_{u,j}}{\sqrt{t_i}\sqrt{t_j} + \text {shrink}}\text {.} \end{aligned}$$
(3)

Next, for each item i, the k items \(n(i)=\{n(i)_1, \ldots , n(i)_k\}\) with the highest similarity values are selected.

Finally, we calculate our prediction \({\hat{z}}_{u,i}\) for each item i and rank the items to produce a toplist of the items that the user is most likely to interact with by

$$\begin{aligned} {\hat{z}}_{u,i} = \sum _{j\in n(i)}\text {sim}(i, j)\;z_{u,j}\text {.} \end{aligned}$$
(4)

Note that we use one set of similarity values for ranking neighbors and another in the prediction formula of Eq. (4). The similarity values in the two cases do not necessarily have to be the same, although they are often chosen to be. In our work, however, we give separate interpretations for the two subproblems and use different similarity values.
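The classic pipeline of Eqs. (2)–(4) can be sketched as follows; the interaction matrix, shrink value, and neighborhood size are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Toy 0-1 interaction matrix (rows: users, columns: items); illustrative values.
Z = np.array([
    [1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
], dtype=float)

t = Z.sum(axis=0)            # t_i: interactions per item
co = Z.T @ Z                 # co-occurrence counts: sum_u z_ui * z_uj
shrink = 2.0                 # hypothetical shrink value, Eq. (3)
sim = co / (np.sqrt(np.outer(t, t)) + shrink)
np.fill_diagonal(sim, 0.0)   # an item is not its own neighbor

k = 2                        # hypothetical neighborhood size
A = np.zeros_like(sim)       # sparse matrix keeping only top-k similarities per item
for i in range(sim.shape[1]):
    n_i = np.argsort(sim[:, i])[-k:]   # n(i): the k most similar items to i
    A[n_i, i] = sim[n_i, i]

z_hat = Z[0] @ A             # Eq. (4): predicted scores for user 0
```

Ranking the items of `z_hat` in decreasing order yields the toplist for the user.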

4 Our new algorithm in the probabilistic interpretation of nearest neighbors

In this section, we introduce our main result, a probabilistic interpretation of the KNN method. The key ingredients are summarized as follows:

  • We interpret similarity as a conditional probability and propose using Bayesian credible intervals to select neighbors.

  • We present a plausible explanation for the neighbor ensemble prediction formula of the classic KNN method.

  • We explain the presence of popularity bias in the predictions and develop a formula to mitigate its effects by using Bayesian inference to acquire an estimate of popularity-free personal taste. Our Bayesian approach is reminiscent of Latent Dirichlet Allocation, albeit with a simpler distribution (binomial instead of multinomial) but a more complex inference that involves an operation between the random variables expressing item popularity and user personal taste.

The notation used throughout this section is summarized in Table 1.

We summarize the outline of this section along with our main contributions as follows:

  • In Sect. 4.1, we replace the naive formulation of the observed conditional probability of a new interaction given a past one by a Bayesian inference using Beta priors.

  • In the same section, we introduce Bayesian credible intervals to handle the uncertainties if too few interactions are available in the data.

  • In Sect. 4.2, using mild independence assumptions, we derive formula (11) for computing the OR of all past interactions that indicate a given new one and show how this formula can efficiently be computed.

  • In Sect. 4.3, we also involve negative events (items not consumed by the user) in our formulation.

  • In Sect. 4.4, we enhance the main formula (11) to decouple popularity from user preference. We introduce and handle separate taste and popularity components of a given user interaction.

  • The decoupled components of the interaction are handled separately in the next two subsections. In Sect. 4.5, we introduce a variant of the expected or maximum a posteriori estimates to handle the taste component and mathematically prove its properties.

  • In Sect. 4.6, we complete our formula by adding the popularity component as a Bernoulli variable.

Table 1 Summary of notations

4.1 Conditional probability as similarity

In an attempt to probabilistically interpret the influence of the nearest neighbor items j on the unknown feedback of an item i, let us interpret each observation \(z_{u,i}\) as a single sample taken from a Bernoulli variable \(Z_{u,i}\). Our goal is then to predict \(P(Z_{u,i}=1)\) for each i to produce a ranking list. Typical methods select a set n(i) of items j, for example those with strongest similarity to i, as an “expert committee” to decide on i.

To select neighbors n(i) with the highest predictive power, we want to rank them by the conditional probability \(P(Z_{u,i}=1 \mid Z_{u,j}=1)\). A naive way to estimate this value from the data is to calculate

$$\begin{aligned} P(Z_{u,i}=1 \mid Z_{u,j}=1) \approx \frac{\sum _u z_{u,i}z_{u,j}}{t_j}\text {.} \end{aligned}$$
(5)

However, this formula fails to account for the uncertainty of the estimation, i.e., the sample size \(t_j\). Rather, we propose estimating this probability as the parameter of a Beta-binomial variable with sample size \(t_j\) and a number of observed positive outcomes equal to \(\sum _u z_{u,i}z_{u,j}\). Since the observable events are binary, the Beta distribution is a natural and often used choice of prior for the parameter of the resulting binomial variables, as it has the convenient property of being a conjugate prior, i.e., the posterior also assumes a Beta distribution. Our estimate thus becomes

$$\begin{aligned} P(Z_{u,i}=1 \mid Z_{u,j}=1) \approx \frac{\sum _u z_{u,i}z_{u,j} + a}{t_j + b}\text {,} \end{aligned}$$
(6)

where a and b are the parameters of the Beta prior used. Note that if our prior assumes a low probability, i.e., a is significantly smaller than b, then this closely resembles the shrink heuristic mentioned in Sect. 3.

To emphasize the importance of sample size, we can calculate a Bayesian credible interval for the posterior distribution and rank the items based on the lower interval bound. This effectively means that we discount items by the uncertainty present in our estimate, i.e., we calculate the lower bound \(\gamma _{i,j}\) for which

$$\begin{aligned} P(\;P(Z_{u,i}=1 \mid Z_{u,j}=1) < \gamma _{i,j}) = q\text {,} \end{aligned}$$
(7)

where q is a given confidence value, for example \(q=0.05\). With the assumed Beta prior, the value of \(\gamma _{i,j}\) is given by

$$\begin{aligned} \gamma _{i,j} = B^{-1}\left( \;q\;,\; t_{i,j} + a\;,\; t_j - t_{i,j} + b\right) \text {,} \end{aligned}$$
(8)

where \(t_{i,j} = \sum _u z_{u,i}z_{u,j}\) and \(B^{-1}(\,\cdot \,, a, b)\) denotes the inverse of the cumulative distribution function of a Beta distribution with parameters a and b. To compute it, we use the numerical approximation implemented in the Python package SciPy [60] and a lookup table to avoid recomputing the same value.
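A minimal sketch of Eqs. (6)–(8), with illustrative counts and prior parameters (the values below are assumptions, not from the paper):

```python
from scipy.stats import beta

# Hypothetical counts: item j has t_j = 40 interactions,
# of which t_ij = 12 users also interacted with item i.
t_ij, t_j = 12, 40
a, b = 1.0, 4.0   # assumed Beta prior parameters (low-probability prior)
q = 0.05          # confidence level, as in Eq. (7)

# Eq. (6): point estimate of P(Z_ui = 1 | Z_uj = 1).
p_hat = (t_ij + a) / (t_j + b)

# Eq. (8): lower credible bound via the inverse Beta CDF (SciPy's ppf).
gamma = beta.ppf(q, t_ij + a, t_j - t_ij + b)
```

The bound `gamma` is always below the point estimate, so low-support pairs are discounted more strongly when ranking neighbors.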

4.2 Ensemble prediction

Next, we propose a probabilistic interpretation for aggregating the contribution of similar items in predicting user taste. In the literature [14], several authors argue that there is no theoretical foundation for the similarity measure and the summation of Eq. (4).

Our reasoning starts by defining a not directly observable event: that the reason for user u consuming item i is an earlier item j, or at least that j is indicative of some underlying reason, such as a certain component of the user’s personal taste. In other words, if \(Z_{u,j}=1\), we may in this case infer \(Z_{u,i}=1\) with a certain probability. We denote this abstract event by \(Z_{u,j}\mathrel {\!\twoheadrightarrow \!}Z_{u,i}\). In this formulation, the user will consume item i if there is any similar item \(j \in n^+_u(i)\) for which \(Z_{u,j}\mathrel {\!\twoheadrightarrow \!}Z_{u,i}\) is true, i.e., if the following formula is true:

$$\begin{aligned} \bigvee _{\begin{array}{c} j\in n^+_u(i) \end{array}}Z_{u,j}\mathrel {\!\twoheadrightarrow \!}Z_{u,i} \text {,} \end{aligned}$$
(9)

where \(n^+_u(i)\) is the set of neighbors of i that appear in the interaction history of user u.

Next, we approximate the probability that at least one neighboring item serves as a reason for u to consume i. Assuming that the events \(Z_{u,j}\mathrel {\!\twoheadrightarrow \!}Z_{u,i}\) are fully independent for \(j \in n^+_u(i)\) and approximating

$$\begin{aligned} P(Z_{u,j}\mathrel {\!\twoheadrightarrow \!}Z_{u,i}) \approx P(Z_{u,i}=1 \mid Z_{u,j}=1)\text {,} \end{aligned}$$
(10)

this yields

$$\begin{aligned}&P\Biggl [\bigvee _{\begin{array}{c} j\in n^+_u(i) \end{array}}\!\!\!\!Z_{u,j}\mathrel {\!\twoheadrightarrow \!}Z_{u,i}\Biggr ] \nonumber \\&\quad \approx 1-\!\!\!\!\prod _{j\in n^+_u(i)}\!\!\!\left( 1-P(Z_{u,i} \mid Z_{u,j}=1)\right) \text {.} \end{aligned}$$
(11)

Note that by expanding the product on the right of Eq. (11), we get a formula that can be rewritten into another form, also following from the inclusion-exclusion principle, in which the summation of Eq. (4) clearly appears as the first term:

$$\begin{aligned}&1-\prod _{j\in n^+_u(i)}\!\!\!(1-P(Z_{u,i} \mid Z_{u,j}=1)) \nonumber \\&\quad = 1-\left( 1 - \!\!\!\!\sum _{j\in n^+_u(i)}\!\!\!\! P(Z_{u,i} \mid Z_{u,j}=1) + \Delta \right) \nonumber \\&\quad = \sum _{j\in n(i)}\text {sim}(i, j)\;z_{u,j} - \Delta \text {.} \end{aligned}$$
(12)

The rest of the terms \(\Delta \) prove to be negligible in practice, as we show in Sect. 5.4. Similar to Sect. 4.1, we treat the conditional probability as the parameter of a Beta-binomial variable and estimate it using the formula (6).

It is important to note that the independence assumption is overly strong and most likely does not hold in realistic scenarios. At the same time, the predictive performance of the nearest neighbor methods indicates that this approximation works in practice. A consequence of this line of thinking is that one should try to select neighbors that are not only strong predictors but also independent of each other, as also observed in [14]. Equation (11) and the independence assumption will be further discussed in Sect. 4.4.

Another implicit assumption behind Eq. (11) as a prediction for \(P(Z_{u,i})\) is that we have found and taken into account all possible reasons for the user to consume i, which is unreasonable in general. However, it makes sense to view Eq. (11) as the probability that the interaction happened because one of the modeled events caused it to happen. In essence, we need to view the value defined in Eq. (11) with the caveat that it is the probability that can reasonably be inferred based on the evidence taken into account (i.e., the top k neighbors), ignoring any other event not explicitly modeled. Note that, in Sect. 4.4, we expand the modeled events to include a general item-dependent one. Further, this problem is alleviated by the fact that it is equally true for all items that we are trying to rank; thus, the limited amount of evidence is not a problem when we compare candidates for recommendation.

Although it may not be immediately clear, formula (11) can be implemented with the same computational complexity as (4). The predictions in Eq. (4) are usually calculated by first computing an item-item neighborhood matrix A, where

$$\begin{aligned} \forall i \,\forall j\in n(i): A_{j,i}=sim(i,j) \end{aligned}$$
(13)

and \(A=0\) elsewhere. The calculation of Eq. (4) then becomes a vector-matrix multiplication between the 0–1 valued user interaction vector and the sparse matrix A. By transforming Eq. (11) into an equivalent formula

$$\begin{aligned} 1-\exp \left( \;\sum _{j\in n(i)}z_{u,j}\;\log \left( 1-P(Z_{u,j}\mathrel {\!\twoheadrightarrow \!}Z_{u,i})\right) \right) \text {,} \end{aligned}$$
(14)

we can calculate predictions as a matrix-vector multiplication.
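The equivalence of Eqs. (11) and (14) can be checked numerically; the neighbor probabilities below are illustrative assumptions:

```python
import numpy as np

# Hypothetical values of P(Z_uj -» Z_ui) for three neighbors of item i
# that appear in the user's history, i.e., for j in n_u^+(i).
p = np.array([0.3, 0.2, 0.1])

# Eq. (11): probability of the OR under the independence assumption.
direct = 1.0 - np.prod(1.0 - p)

# Eq. (14): the same value computed in log space; over all items this
# becomes a single matrix-vector product with the log(1 - P) matrix.
log_space = 1.0 - np.exp(np.sum(np.log1p(-p)))
```

The log-space form replaces the per-item product with a summation, which is exactly the sparse vector-matrix multiplication used for Eq. (4).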

4.3 Predicting from negative events

For the ensemble formula in Eq. (9), we used predictors in the form

$$\begin{aligned} Z_{u,j}\mathrel {\!\twoheadrightarrow \!}Z_{u,i}\text {.} \end{aligned}$$
(15)

By the above equation, our method for selecting the nearest neighbors essentially means that we select events that we can use to predict the event \(Z_{u,i}=1\). However, so far, we only considered events in the form \(Z_{u,j}=1\). It is also possible to consider events in the form \(Z_{u,j}=0\), meaning that we also incorporate users not interacting with an item to predict new interactions.

To assess the power of predictors based on negative events, we can use a formula similar to the one used in Eq. (6), i.e.,

$$\begin{aligned}&P(Z_{u,i}=1 \mid Z_{u,j}=0)\nonumber \\&\quad \approx \frac{\sum _u z_{u,i}(1-z_{u,j}) + a}{|I |- t_j + b}\text {.} \end{aligned}$$
(16)

We measure the estimated strength of negative predictors and their effect in Sect. 5. We note in advance that using negative predictors is not very promising, as the basic premise of the mechanics described in Sect. 4.2 is that the predictor event has a causal relationship with the predicted event. Such a relationship is much harder to assume when the predictor event is a missing interaction.

4.4 Inverse frequency scaling

Next, we infer the popularity component of taste by introducing a new unobservable event \(Y_{u,i}\), the event that user u interacts with item i due to its popularity. We apply Bayesian inference to model the relation of popularity and personal taste by assuming that the user interacts with the item either because of personal motivation or because of popularity.

The intuition behind the usage of TF-IDF in Eq. (1) is that interacting with a very popular item carries less information about the user’s taste than interacting with a less frequent item. We model this intuition by assuming that observations \(z_{u,i}\) arise as the combination of multiple independent components of behavior

$$\begin{aligned} Z_{u,i} = X_{u,i} \vee Y_{u,i}\text {,} \end{aligned}$$
(17)

where \(X_{u,i}\) is the Bernoulli variable of u interacting with i because of personal motivation, \(Y_{u,i}\) is the Bernoulli variable of any particular user interacting with i because of its popularity, and \(Z_{u,i}\) is the variable that we are able to observe in the form of \(z_{u,i}\) samples. We assume X and Y to be independent by definition, as we view Y as common behavior and X as deviation from common behavior. Further, we assume \(Y_{u,i}\) to be independent of u and denote it by \(Y_{i}\) (with its parameter denoted by \(y_i\)) in the remainder of the paper. Hence, the probability \(P(Z_{u,i}=1)\) given the values of \(x_{u,i}\) and \(y_{i}\) can be expressed as

$$\begin{aligned} P(Z_{u,i}=1 \mid x_{u,i}, y_i) = x_{u,i} + y_{i} - x_{u,i} \cdot y_{i}\text {.} \end{aligned}$$
(18)

This ties back into the independence assumption of Eq. (11): If the observed variables \(Z_{u,i}\) contain a component \(Y_{i}\) that is independent of all other \(Z_{\cdot , j}\), then what we are actually estimating in Eq. (11) is instead

$$\begin{aligned} P(Z_{u,i} \mid z_{u,j}=1)\approx & {} P(Z_{u,j} \mathrel {\!\twoheadrightarrow \!}(X_{u,i}\vee Y_{i})) \nonumber \\= & {} P((Z_{u,j} \mathrel {\!\twoheadrightarrow \!}X_{u,i})\vee Y_{i}) \text {.} \end{aligned}$$
(19)

Thus, when in Eq. (11) we use the independence assumption to calculate

$$\begin{aligned} P(Z_{u,i}) \approx P\Biggl [\bigvee _{\begin{array}{c} j\in n^+_u(i) \end{array}}\!\!\! Y_i \vee (Z_{u,j}\mathrel {\!\twoheadrightarrow \!}X_{u,i})\Biggr ]\text {,} \end{aligned}$$
(20)

we introduce significant popularity bias, as we count the \(Y_i\) component multiple times. By this argument, if we increase the neighborhood size, we also increase popularity bias, which we will indeed measure in Sect. 5.6. To mitigate the effect of popularity in combination with neighborhood size, we propose using the formula

$$\begin{aligned} P(X_{u,i})\approx & {} P\Biggl [\bigvee _{\begin{array}{c} j\in n^+_u(i) \end{array}}\!\!\! Z_{u,j}\mathrel {\!\twoheadrightarrow \!}X_{u,i}\Biggr ]\nonumber \\\approx & {} 1-\!\!\!\!\prod _{j\in n^+_u(i)}\!\!\!(1-P(X_{u,i} \mid Z_{u,j}=1)) \end{aligned}$$
(21)

to get an estimate for the personal component \(X_{u,i}\). We discuss estimating \(P(X_{u,i} \mid Z_{u,j}=1)\) in Sect. 4.5 and the reintroduction of \(Y_i\) into the prediction in Sect. 4.6.
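The double counting of \(Y_i\) in Eq. (20) can be illustrated with a toy numeric sketch (the probabilities below are made up for illustration):

```python
def noisy_or(ps):
    """Probability that at least one of several independent events occurs."""
    prod = 1.0
    for p in ps:
        prod *= (1.0 - p)
    return 1.0 - prod

# Each of k neighbors contributes the same popularity component y on top of
# a small personal component q, as in Eq. (19).
y, q, k = 0.2, 0.1, 5
biased = noisy_or([q + y - q * y] * k)    # Eq. (20): y is counted k times
personal = noisy_or([q] * k)              # Eq. (21): popularity left out
debiased = personal + y - personal * y    # y reintroduced exactly once
```

With these numbers the biased score is about 0.81 versus about 0.53 after debiasing, and the gap widens with k, which is exactly the popularity inflation measured in Sect. 5.6.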

4.5 Estimating \(P(X_{u,i} \mid Z_{u,j}=1)\)

Estimating \(P(X_{u,i} \mid Z_{u,j}=1)\) is possible by treating \(X_{u,i}\) as a Beta-binomial variable and doing inference for it based on Eq. (18). For this, we also need the value of \(y_i\), which we estimate as

$$\begin{aligned} P(Y_i=1) = \frac{ct_i}{|U|}\text {;} \end{aligned}$$
(22)

in other words, we assume it to be proportional to the overall frequency of the item in the data divided by the number of users \(|U|\), with a global scaling coefficient c to be able to adjust the strength of the assumed bias. We treat c as a hyperparameter of the method.

Statement 1

Let X follow a binomial distribution with parameter x. Let us assume the relation of Eq. (17) between X, Y, Z. Then, assuming a \(\text {Beta}(a, b)\) prior distribution for X, we have

$$\begin{aligned} f_{x\mid z}(x) = \frac{ ((1-y)x + y)^r \; x^{a-1} \; (1-x)^{b+s-1} }{ \int _0^1 ((1-y)x_0 + y)^r \; x_0^{a-1} \; (1-x_0)^{b+s-1}dx_0 }\text {,}\nonumber \\ \end{aligned}$$
(23)

where r and s are the number of positive and negative samples observed, respectively.

Proof

Since observing \(z_{u,i}\) carries only indirect information about \(X_{u,i}\), our posterior probability will no longer be Beta. Further, we need to calculate the posterior while observing all of the samples at the same time, otherwise the ordering of the samples would influence the resulting posterior. Thus, we need to calculate

$$\begin{aligned} f_{x\mid z}(x_0)= & {} \frac{P(z\mid x=x_0)f_x(x_0)}{P(z)} \nonumber \\= & {} \frac{P(z\mid x=x_0)f_x(x_0)}{\int _0^1 P(z\mid x=x_1)f_x(x_1)dx_1}\text {,} \end{aligned}$$
(24)

where z is the observed sample of \(P(Z_{u,j} \mid X_{u,i})\) values and

$$\begin{aligned} f_x(x_0) = \frac{x_0^{a-1}(1-x_0)^{b-1}}{B(a,b)}\text {.} \end{aligned}$$
(25)

Further,

$$\begin{aligned} P(z\mid x = x_0) = ((1-y)x_0 + y)^r((1-x_0)(1-y))^s\text {,} \end{aligned}$$
(26)

where r and s are the number of positive and negative samples observed. \(\square \)

The number of positive and negative samples in our case is given by

$$\begin{aligned} r&= \sum _u z_{u,i}z_{u,j}\end{aligned}$$
(27)
$$\begin{aligned} s&= t_j\text {.} \end{aligned}$$
(28)

After deriving the density function for \(P(X_{u,i} \mid Z_{u,j})\) in Statement 1, we turn to calculating the or relation of all past items that can potentially be the reason for user u consuming item i in Eq. (21). Since this equation for the or relation involves a product of probabilities, it is prohibitively complex to compute the distribution of \(P(X_{u,i})\) by this equation. Instead, we approximate each \(P(X_{u,i} \mid Z_{u,j})\) value by a point estimate first, such as the expected a posteriori (EAP) or maximum a posteriori (MAP) estimate. Unlike in the case of a Beta distribution, in our case, neither estimate has a simple closed form, and both are nontrivial to calculate.

While an analytical formula of the EAP estimate exists for integers a and b, it is not practical to compute. Approximate integration can be done using various methods such as self-normalized importance sampling.

In our case, this means approximating the integral by the self-normalized sum

$$\begin{aligned} E(X_{ui}\mid Z_{uj}) \approx \frac{ \sum _{{\tilde{x}}_i}{\tilde{x}}_i\frac{f_{x\mid z}({\tilde{x}}_i)}{q({\tilde{x}}_i)} }{ \sum _{{\tilde{x}}_i}\frac{f_{x\mid z}({\tilde{x}}_i)}{q({\tilde{x}}_i)} }\text { where } {\tilde{x}}_i \sim q\text {,} \end{aligned}$$
(29)

by including only a finite number of terms. Note that, the denominator of \(f_{x\mid z}\) and optionally of q can be canceled out for the computation, thus the integral itself does not need to be calculated.

The distribution q is the so-called proposal distribution, which can be any density function that is reasonably similar to the one we are trying to approximate. Since effective sampling methods for Beta distributed variables exist, we can choose q to be Beta distributed with the shape parameters \(a_q\) and \(b_q\) chosen such that the maximum of the density function coincides with that of \(f_{x\mid z}\). For this, we need the MAP estimate of \(f_{x\mid z}\). Since in the case of the Beta distribution the expected value approximately coincides with the maximum of the density function, we need to set

$$\begin{aligned} \frac{a_q}{a_q+b_q} = \mathrm{{MAP}}(f_{x\mid z})\text {.} \end{aligned}$$
(30)

The sample size parameter (i.e., \(a_q + b_q\)) is chosen as \(p\cdot (a+b+r)\) where \(p\in (0,1]\) to make the computation more stable.
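The self-normalized estimator of Eq. (29) with a Beta proposal can be sketched using the standard-library Beta sampler (parameter choices and function name are illustrative, not the paper's implementation). Note that only unnormalized densities appear, since the normalizing constants of both \(f_{x\mid z}\) and q cancel in the ratio:

```python
import math, random

def eap_importance_sampling(a, b, r, s, y, a_q, b_q, n_samples=20000, seed=0):
    """Self-normalized importance sampling for E[X | z] under the posterior
    of Eq. (23), using a Beta(a_q, b_q) proposal as in Eq. (29)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        x = rng.betavariate(a_q, b_q)
        # unnormalized posterior density of Eq. (23)
        f = ((1 - y) * x + y) ** r * x ** (a - 1) * (1 - x) ** (b + s - 1)
        # unnormalized Beta(a_q, b_q) proposal density
        q = x ** (a_q - 1) * (1 - x) ** (b_q - 1)
        w = f / q
        num += x * w
        den += w
    return num / den
```

As a sanity check, with \(y=0\) the posterior is exactly \(\text{Beta}(a+r, b+s)\), whose mean the estimator should recover.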

While the formula is guaranteed to converge, in our experience, a large number of samples (on the order of \(10^4\)) are needed for stability. Since we need to calculate the estimate k times for each item i, the approximate integration proved to be too computationally expensive for our purposes. Because of this, we resort to using the MAP estimate itself

$$\begin{aligned} \mathop {\mathrm {arg\,max}}\limits _{x_0} f_{x\mid z} = \mathop {\mathrm {arg\,max}}\limits _{x_0} P(z\mid x=x_0)f_x(x_0)\text {,} \end{aligned}$$
(31)

which is relatively straightforward to compute by the next statement.

Statement 2

Assuming \(a>2\) and \(b+s>2\), the MAP estimate for the distribution described in Eq. (23) can be written as

$$\begin{aligned}&\frac{-c_2-\sqrt{c_2^2 - 4c_1c_3}}{2c_1},{\mathrm {where}}\end{aligned}$$
(32)
$$\begin{aligned} c_1&= (a'+b'+r) (y - 1)\end{aligned}$$
(33)
$$\begin{aligned} c_2&= (a'+ r) (1 - y) - ( a'+b') y\end{aligned}$$
(34)
$$\begin{aligned} c_3&= a' y \end{aligned}$$
(35)

and using substitutions \(a'=a-1\) and \(b' = b+s-1\).

Proof

We can find the maximum by taking the derivative of the numerator and finding its root in the (0, 1) interval. Note that, such a root of the derivative is guaranteed to exist since the function is continuous, zero at both 0 and 1, and nonzero in between, thus it must have an extreme point in the interval (0, 1). Calculating the derivative is straightforward:

$$\begin{aligned}&\frac{\mathrm{{d}}}{\mathrm{{d}}x_0} ((1-y)x_0 + y)^r \; x_0^{a'} \; (1-x_0)^{b'} \nonumber \\&\quad = ((1-y)x_0 + y)^{r-1} \; x_0^{a'-1} \; (1-x_0)^{b'-1} \omega (x_0) \text {,} \end{aligned}$$
(36)

meaning roots at \(-\frac{y}{1-y}\), 0, 1, and at the roots of the polynomial

$$\begin{aligned} \omega (x_0) = c_1 x_0^2 + c_2 x_0 + c_3\text {.} \end{aligned}$$
(37)

We can observe that \(c_1\) is always negative, as \(a',b',r > 0\) and \(0<y<1\), making the denominator of Eq. (32) also negative. Further, \(c_3\) is clearly always positive. This in turn makes the expression

$$\begin{aligned} -c_2 + \sqrt{c_2^2-4c_1c_3} \end{aligned}$$
(38)

positive, resulting in the corresponding root being negative, thus leaving as the only possible candidate the one described in Eq. (32). \(\square \)
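Statement 2 can be cross-checked numerically. In the sketch below, the quadratic's coefficients are re-derived from the factorization in Eq. (36) (note the sign of \(c_2\)), and the closed-form root is compared against brute-force maximization of the unnormalized posterior:

```python
import math

def map_estimate(a, b, r, s, y):
    """MAP of the posterior in Eq. (23): the admissible root of the quadratic
    omega(x) = c1*x^2 + c2*x + c3 obtained from the derivative in Eq. (36)."""
    ap, bp = a - 1.0, b + s - 1.0                # a' = a - 1, b' = b + s - 1
    c1 = (y - 1.0) * (ap + bp + r)
    c2 = (ap + r) * (1.0 - y) - (ap + bp) * y    # coefficient re-derived here
    c3 = ap * y
    return (-c2 - math.sqrt(c2 * c2 - 4.0 * c1 * c3)) / (2.0 * c1)

def map_by_grid(a, b, r, s, y, n=20000):
    """Brute-force check: maximize the unnormalized posterior on a grid."""
    best_x, best_f = 0.0, -1.0
    for k in range(1, n):
        x = k / n
        f = ((1 - y) * x + y) ** r * x ** (a - 1) * (1 - x) ** (b + s - 1)
        if f > best_f:
            best_x, best_f = x, f
    return best_x
```

For \(y=0\), the formula collapses to the usual Beta mode \((a'+r)/(a'+b'+r)\), a useful sanity check.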

We can also use a very similar calculation for estimating \(P(X_{u,i} \mid Z_{u,j}=0)\). In fact, Statements 1 and 2 still hold, we only need to use a different definition for the positive and negative samples r and s:

$$\begin{aligned} r&= \sum _u z_{u,i}(1-z_{u,j}),\end{aligned}$$
(39)
$$\begin{aligned} s&= |U |- t_j\text {.} \end{aligned}$$
(40)

4.6 Reintroducing popularity \(Y_i\)

Finally, to get a prediction for \(P(Z_{u,i}=1)\), we have to reintroduce the effect of selecting item i due to its popularity, an abstract event that we denoted by \(Y_{u,i}\).

In our next calculations, we assume that our inference for X might not be perfect in the sense that the prediction we calculate by Eq. (11) might still depend on \(Y_i\). We model the remaining effect of \(Y_i\) on the inferred \(X_{u,i}\) using a Bernoulli variable A with parameter \(\alpha \):

$$\begin{aligned} P(X_{u,i} \vee (A \wedge Y_{i})) = x_{u,i} + \alpha y_{i} - \alpha x_{u,i}y_{i}\text {,} \end{aligned}$$
(41)

which can be transformed to the formula of our final algorithm using hyperparameter \(\alpha \):

$$\begin{aligned} P(Z_{u,i}=1)= & {} x_{u,i} + y_{i} - x_{u,i}y_{i} \nonumber \\= & {} \left( x_{u,i} + \alpha y_{i} - \alpha x_{u,i}y_{i}\right) \cdot \frac{1-y_{i}}{1-\alpha y_{i}}\nonumber \\&+ \frac{y_{i} - \alpha y_{i}}{1-\alpha y_{i}}\text {.} \end{aligned}$$
(42)

Optimal values for \(\alpha \) are highly dependent on the values of other hyperparameters: Both the number of neighbors k and the global frequency scaling coefficient c in Eq. (22) heavily influence the amount of popularity bias in the recommendations.
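The algebraic identity in Eq. (42) is easy to verify numerically; the snippet below checks that rescaling the \(\alpha \)-weighted score of Eq. (41) recovers \(x + y - xy\):

```python
def recombine(x, y, alpha):
    """Right-hand side of Eq. (42): rescale the alpha-weighted score of
    Eq. (41) so that it recovers P(Z_{u,i}=1) = x + y - x*y exactly."""
    inferred = x + alpha * y - alpha * x * y       # Eq. (41)
    return (inferred * (1.0 - y) / (1.0 - alpha * y)
            + (y - alpha * y) / (1.0 - alpha * y))
```

In practice the algorithm ranks by the \(\alpha \)-weighted score, so the rescaling matters only for interpreting the output as a probability.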

4.7 Summary

To summarize the resulting method, we take the following steps when calculating a prediction for \(P(Z_{u,i})\).

  1. We calculate the credible interval \(\gamma _{u,i}\) with the selected percentile q for each possible neighbor candidate using Equation (8).

  2. We select the k top ranked neighbors as predictors.

  3. We calculate the MAP estimate for \(P(Z_{u,j}\mathrel {\!\twoheadrightarrow \!}X_{u,i})\) for each selected neighbor using Statement 2.

  4. By the MAP estimate, we return to Sect. 4.4 to calculate \(P(X_{u,i})\) using the ensemble formula (21).

  5. We obtain our final prediction score \(P(Z_{u,i})\) by reintroducing the global popularity component using Equation (42).

Throughout this derivation, we rely on the following hyperparameters: the percentile q in step (1); the (possibly distinct) parameters for the Beta priors in steps (1) and (3); the global scaling parameter c in step (3); and \(\alpha \) in step (5). We discuss hyperparameter selection in Sect. 5.9.

5 Experiments

5.1 Datasets and evaluation

We conduct our experiments on five explicit ratings datasets. The first three are also used in [61], with the same transformation into an implicit recommendation task as in that work.

Table 2 Statistics of the experimental data

The datasets used are as follows:

  • Amazon Instant Video [62]: ratings from Amazon.com. The records have been transformed by treating each rating as an interaction regardless of its value. Users with fewer than 5 ratings were removed. The training-test and then the training-validation sets are split randomly on a per-user basis with the given percentages (20% and 20%).

  • MovieLens 1M [63]: a widely used dataset of movie ratings. Implicit transformation was done the same way as above. The training-test and then the training-validation sets are split randomly on a per-user basis with given percentages (20% and 10%).

  • HetRec [64]: a dataset released by the Second International Workshop on Information Heterogeneity and Fusion in Recommender Systems. Implicit transformation was done the same way as above. The training-test and then the training-validation sets are split randomly on a per-user basis with given percentages (20% and 20%).

  • Epinions [65]: This dataset is collected from the Epinions.com website, which provides an online service for users to share product feedback. Implicit transformation was done the same way as above. The validation and test sets were each created by randomly leaving one interaction per user out of the training set.

  • Amazon Books [66]: book ratings from Amazon.com spanning May 1996–July 2014. To create an implicit dataset, we filtered for ratings of 5 and extracted the 10-core [67] of the remaining graph, resulting in a dataset that is still among the sparsest of those used. The training-test-validation sets are created by randomly assigning every record to a set with the given probabilities (0.9, 0.05, and 0.05).

The datasets used have varying sizes and densities, see Table 2 for specifics. For measuring predictive performance, results are reported in terms of Normalized Discounted Cumulative Gain with a cutoff size of 50 (N@50 or NDCG@50) and recall again with a cutoff size of 50 (R@50 or Recall@50). We evaluate both metrics without sampling on item ranks that are calculated by considering the full item set. We note that for this reason, the issues related to sampled ranking metrics described in [68] do not apply in our experiments.

These two metrics give a good indication of the performance of the methods, with NDCG focusing more on the top of the recommended list of items, and recall placing uniform weight on all positions.
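For reference, the two metrics under binary relevance can be sketched as follows (a simplified reading; the evaluation framework of [12, 69] may differ in details such as tie-breaking):

```python
import math

def recall_at_k(ranked_items, relevant, k=50):
    """Fraction of the user's held-out items that appear in the top-k list."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked_items, relevant, k=50):
    """Binary-relevance NDCG: DCG of the top-k list over the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

The logarithmic discount in the DCG term is what concentrates the metric's weight on the top of the list, while recall weights all positions within the cutoff equally.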

5.2 Baselines

We use the baseline implementations and automatic hyperparameter tuning framework of [12, 69] for measuring the performance of baseline algorithms. We optimize for NDCG@50 performance on the validation set in every case. Note that, while including the best algorithms for a wide selection of model types is important to put our contribution in context, our primary objective is to improve upon the Item KNN algorithm as the key competitor. We also note that we only consider algorithms that use no item sequence information. In future work, our new algorithm can be evaluated in an ensemble combined with recurrent neural networks, as for example in [37].

We use the splitting procedure implemented by [12, 69] for the HetRec, Amazon Instant Video, Epinions and MovieLens 1M datasets. The procedure is non-deterministic, thus the performance of different algorithms has some variance depending on the exact split generated. For this reason, we re-run the baselines on the same exact split that our models are also tested on. While not an exact match, the results we get are very much in line with the numbers reported in [12, 69].

We also selected one dataset, the Amazon Books, which is not part of the evaluation framework of [12, 69]. We split the interactions in this dataset randomly into training, validation, and test sets with probabilities \(90\%\), \(5\%\), and \(5\%\), respectively.

We briefly summarize the baseline algorithms that we compare our methods against.

  • Item KNN: The method described in Sect. 3 uses the presence of other items in the user interaction history to predict the likelihood of interacting with an item. Multiple different similarity measures can be used for selecting similar items [15], as well as the shrink and the TF-IDF or BM25 weighting heuristics [23].

  • \(\text {{\textbf {P}}}^3\alpha \) and \(\text {{\textbf {R}}}^3\beta \): Random-walk based methods, proposed in [70, 71]. While technically not nearest neighbor algorithms, both of them use closely related ideas.

  • iALS: Matrix factorization for implicit feedback datasets [72].

  • SLIMElasticNet: Sparse Linear Models is a well-performing regression-based method for top-n recommendation [73]. Similar to [12, 69], we use the more efficient version introduced in [74].

  • \(\text {{\textbf {EASE}}}^R\): A shallow autoencoder-like model, with a closed form solution for the training objective [75].

  • MultVAE: A variational autoencoder architecture for collaborative filtering recommendation, introduced in [57]. We chose this one as it performed best against baselines when reproduced by [12, 69].

An important recent class of recommender algorithms is based on recurrent neural networks with ideas stemming from natural language processing. Such algorithms include GRU4Rec [30], Transformers4Rec [34], and more [31, 35]. However, these methods deal with sequential or session-based recommendation, as opposed to the classic collaborative filtering model considered in this paper. Since neither our problem setting and datasets nor our algorithms involve sequential context for the items, they cannot be directly compared to recurrent neural network-based methods. We also note that a recent empirical evaluation study [13] finds that the nearest neighbor and other simple approaches remain competitive even for session recommendations, however, such a comparison is beyond the scope of the present paper.

5.3 Inverse frequency scaling

The traditionally used TF-IDF formula and the probabilistic inverse frequency scaling described in Sect. 4.4 fulfill similar roles in the algorithm. They are, however, applied at different steps in the process: TF-IDF is a pre-scaling of the data, while the probabilistic version is integrated into the equation that calculates the values for the ensemble formula. Another difference is that TF-IDF scales based on both the predicted and the predicting item (i.e., the item and its neighbor), while the probabilistic version only considers the frequency of the predicted item.

Still, we can compare them somewhat directly, because they both scale the weights used for calculating the score of an item based on its frequency in the dataset. Further complicating the comparison is that TF-IDF is applied before the similarity function. For example, with cosine similarity, this effectively results in an additional square root in the weighting formula when considered as a function of the frequency of item i:

$$\begin{aligned}&\frac{\sum _k \text {tf-idf}(x_{ik})\text {tf-idf}(x_{jk})}{\sqrt{\sum _k \text {tf-idf}(x_{ik})} \sqrt{\sum _k \text {tf-idf}(x_{jk})}}\qquad \nonumber \\&\quad = \frac{\text {tf-idf}_i}{\sqrt{\text {tf-idf}_i}} \frac{\sum _k x_{ik} \text {tf-idf}(x_{jk})}{\sqrt{\sum _k x_{ik}} \sqrt{\sum _k \text {tf-idf}(x_{jk})}}\text {.} \end{aligned}$$
(43)

Note that, as per Sect. 3, variables \(x_{i,k}\) are \(0-1\) valued, thus the TF-IDF scaling can indeed be treated as a multiplicative coefficient.

Based on the discussion of the interaction between TF-IDF and the similarity function, we plot both the regular TF-IDF weighting and its square root against the probabilistic version in Fig. 1 for the HetRec training set. Weights are normalized to \(f(0)=1\) to show their relative magnitudes. Popularity is given as a percentage of the most frequent item. Values also depend on the sample size and the number of positive samples, which, for this example, are chosen as 100 and 20, respectively.

Fig. 1

The relative weight of different frequency scaling schemes, at different rates, all of which discount the popular items

The main contribution of Sect. 4.5 is illustrated in Fig. 1. We can observe that TF-IDF starts heavily discounting the weight of items sooner, while the probabilistic version starts off more gradually. On the other hand, the probabilistic version ends up discounting the weights more heavily at higher popularity values. The slope is controlled by the hyperparameter c defined in Eq. (22). We observe that the theoretically justified method discounts for popularity better than the heuristic methods based on TF-IDF.

5.4 The ensemble formula

The “or” ensemble formula defined in Eqs. (11) and (21) calculates a prediction close to that of the “sum” ensemble formula (4). However, because we evaluate based on ranking lists, very small differences in score could potentially lead to significant changes in the performance as measured by our metrics. To evaluate the difference between the two possible ensemble formulas, we run experiments with the same settings for both formulas. A bit of caution is needed in the comparison since Eq. (4) carries no guarantees regarding whether the resulting values can be interpreted as probabilities. For this reason, applying the methods introduced in Sect. 4.4 and 4.5 would be questionable. We thus run the experiments using Eqs. (4) and (11), ignoring Eq. (21) and do not handle popularity bias in this comparison experiment.

Table 3 Performance difference using the ensemble formula of Eq. (11) (“or”) compared to the classic formula of Eq. (4) (“sum”) on the validation sets

The comparison of the two ensemble formulas is presented in Table 3. We can observe that the variation in the scores is very small, to the point of being negligible. The measurements confirm our interpretation of the prediction formula as Eq. (11). Without popularity bias mitigation, our new formula performs almost identically to the classical KNN prediction that uses the “sum” formula of Eq. (4). This supports the plausibility of our explanations.

5.5 Negative predictors

In Sect. 4.3, we described how we can incorporate an interaction not happening as a predictor into the ensemble formula. When ranking neighbors for each item, we indeed observe that we can find many such predictors with seemingly high enough predictive power to make it to the top-k predictors.

Fig. 2

Average percent of negative neighbors in the top-k for each popularity percentile in the Hetrec dataset. Higher percentiles contain more popular items

In Fig. 2, we plot the average percent of negative predictors ranked in the top-k for each popularity percentile of items when measured on the Hetrec dataset. We can observe that generally, lower popularity items tend to have more negative neighbors, while the most popular items have none. Note that, due to the power-law distribution of item popularity, the change of popularity is not linear in the percentile. This could explain the sharp drop at around the 50th percentile.

Although many negative predictors were estimated to be good predictors, including them in the ensemble formula costs at least 50% of the model's predictive performance according to our experiments. This leads us to believe that despite the high estimated predictive power of these events, they are spurious most of the time.

When we extend the set of possible predictor events to include those based on not interacting with items, the number of potential events increases dramatically, since most users interact with only a small percentage of items. With the number of possible events increasing, the likelihood of including spurious predictors increases as well. This effect is particularly problematic with less popular items, where small sample sizes make it harder to find successful predictors and result in higher variance in the measured predictive power. As a result, we exclude such events from our model in further experiments.

Regarding the average relative popularity of recommended items, we can observe a large decrease. When measuring item popularity as the number of interacting users, normalized by the maximum of such numbers, the average popularity of recommendations decreases from 0.4263 to 0.2073 when measured at a cutoff of 50 on the HetRec dataset. Such a significant change in itself can be expected to worsen the measured accuracy of the model, as the distribution of the recommendations should approximately reflect the data distribution to achieve best offline evaluation results. Nevertheless, some experiments [13, 76] suggest that offline evaluation does not necessarily reflect user satisfaction in such cases.

5.6 Neighborhood size and popularity bias

As we increase the number of neighbors k, we ground the recommendation on more items that each carry a popularity bias itself. The reasoning behind Eq. (20) implies that popularity bias increases with k. To verify, we measure the average popularity of the recommended items while changing k. We also run the experiment with different values of c, the parameter that controls the strength of the popularity bias in Eq. (22). We show the results in Fig. 3. Average popularity on the y axis is measured by a linear scale, with the value 1 for the most and 0 for the least popular item in the data.

Fig. 3

Change of average popularity of recommended item with different values of k and c on the Hetrec dataset. We can observe that using a larger number of neighbors indeed leads to higher average popularity, as predicted by Eq. (20). Our inference-based method, however, is able to counteract this effect to some degree, as higher values for parameter c result in less popularity increase

With \(c=0\), our algorithm behaves similarly to a classic one without any bias reduction. We can observe the average popularity rising with k, which reinforces our assumption that the popularity component is counted multiple times in a classical KNN method.

For larger values of c, we quantify our mitigation attempts. Larger values of c clearly reduce the average popularity of the recommendation list. By comparing Fig. 3 with Fig. 5, we can also note that the optimal value of \(c\approx 1.25\) causes the average popularity of the recommended items to flatten out to a constant after an initial rise with k.

Table 4 Predictive performance results on the test set with a cutoff of 50
Table 5 Predictive performance results on the test set with a cutoff of 20

5.7 Predictive performance

Overall predictive performance is summarized in Table 4 for our algorithm and the baseline KNN methods that we improve upon. We also separately report results for more advanced algorithms. Scores equal to or higher than our method are highlighted in bold. Unfortunately, we are not able to report the performance of the \(\text {EASE}^R\) algorithm on the Epinions dataset, as the computation would not complete in a reasonable amount of time, even with considerable computing resources, due to the relatively large number of items in this dataset.

We observe that our method improves predictive performance over the KNN baselines in every case when measured by the optimization target NDCG@50, and also in almost all cases when measured by Recall@50. We can observe very similar behavior when measuring on the smaller cutoff of 20 in Table 5.

Table 6 Recommended item diversity on the test set for the evaluated methods

The KNN baseline methods fluctuate with no clear winner. Also note that, the baseline methods themselves include ingredients very similar to those of our proposed method, however, based only on heuristic grounds and implemented to varying degrees. These heuristic elements cause the baseline methods' performance to vary from dataset to dataset, while our algorithm performs well in all cases.

Our approach also manages to outperform a number of more advanced methods, including the neural network-based recommender MultVAE that performed best in the comparison in [12, 69]. The only algorithm that performs better on multiple datasets is SLIMElasticNet from 2013 [74]. Note that, for our algorithm, hyperparameter selection was performed for the cutoff of 50, yet even SLIMElasticNet only outperforms ours in half of the cases in Table 5.

An interesting case study is the Amazon Video dataset, where our variant consistently wins in the optimization target NDCG and loses in Recall against other KNN algorithms. From this observation, we might conclude that our variant is quite effective at optimizing for NDCG in a trade-off against Recall on this relatively small dataset, possibly due to the high number of hyperparameters involved. However, we do not observe the same effect on other, larger datasets, such as Amazon Books or Epinions.

Fig. 4

Change of recall when compared with the Gini diversity of the recommendation lists of different parameter configurations. We also plot the best-fit linear regressor as the red dashed line. We can observe that these two metrics tend to be inversely proportional to each other

5.8 Recommendation diversity

Recommendation diversity measures how diverse the recommendation lists are for different users. Diversity can be measured in various ways, such as using the Gini diversity measure [77] or the Shannon entropy of the distribution of recommended items over all users.
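The two diversity measures can be sketched as follows. Definitions of the Gini-based measure vary in the literature [77]; here we use the Gini-Simpson index \(1-\sum_i p_i^2\) as an illustrative stand-in, together with the Shannon entropy of the recommended-item distribution:

```python
import math
from collections import Counter

def entropy_and_gini(all_recommendations):
    """Diversity of recommendation lists aggregated over users.
    Returns (Shannon entropy in bits, Gini-Simpson index) of the
    distribution of recommended items over all users' lists."""
    counts = Counter(item for rec_list in all_recommendations
                     for item in rec_list)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    gini = 1.0 - sum(p * p for p in probs)
    return entropy, gini
```

Both measures are maximal when every item is recommended equally often and shrink as the lists concentrate on a few popular items.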

Several authors [78, 79] observed a natural trade-off between diversity and recommendation accuracy. In Fig. 4, we plot the trade-off between recall and Gini diversity using the data points of Fig. 3. We can clearly observe an inverse relation between recall and the amount of diversity in the ranking lists.

In Table 6, we present the Gini diversity and the Shannon entropy of the recommendation lists of our method and the baselines, at a cutoff of 50 items. Our KNN variant achieves a balanced diversity score. It is notable, however, that the scores have very high variance across the evaluated methods. Comparing with Table 4, the accuracy-diversity trade-off is arguably present across methods as well. Still, some methods clearly perform better than average in both regards, such as RP3\(\beta \), which often achieves very high results in both metrics.

While the notion of diversity is related to popularity, our goal was not to improve diversity, but to better model user behavior by explicitly incorporating popularity bias in our model. Accordingly, all of our hyperparameter configurations are optimized for accuracy, as described in Sect. 5.2. In this paper, we attempt no optimization for a combined measure of accuracy and diversity [78], which could yield different results across methods. Exploring whether the presented model can be used for an improved accuracy-diversity trade-off remains future research.

5.9 Hyperparameters

Next, we measure performance as the function of the various hyperparameters introduced throughout Sect. 4. All newly introduced hyperparameters have the favorable property that a certain minimum or maximum value turns the selected component of the algorithm off: the credible interval lower bound returns the probability itself at \(q=0.5\); the prior parameters a and b have very little effect when they are small; value 0 for the global popularity scaling parameter c denotes no popularity bias, i.e., \(Y_i=0\) when \(c=0\); and finally, \(\alpha \) of Eq. (41) has no effect when \(\alpha =1\). With this setup, whenever an optimal parameter value differs from the ones listed above, we can conclude that the corresponding step improves prediction accuracy.

Table 7 Optimal hyperparameters for the datasets

Optimal values vary with the data; we present them in Table 7. For q, optimal values range from \(10^{-4}\) to 0.1. For the prior parameters, we fixed \(a=2.01\) to simplify the optimization process, which still allows us to change the expectation of the prior, i.e., \(\frac{a}{a+b}\), by changing the value of b. Note that we use two separate Beta priors in two steps of our algorithm, in Eq. (8) and in Eq. (32). To distinguish them, in Table 7, we denote the corresponding parameters b as \(b_1\) and \(b_2\). Optimal values for \(b_1\) were around 10 in most cases, but ranged from 4 to 250 for \(b_2\). Optimal values for c ranged from around 1 to 5, while optimal values for \(\alpha \) tended to be near 0 in all cases.
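To see what the reported ranges of b imply, recall that with a fixed at 2.01 the prior expectation is \(\frac{a}{a+b}\), so b alone controls how strongly the prior pulls estimates toward small probabilities. A one-line helper (our own illustration, not from the paper):

```python
def beta_prior_mean(b, a=2.01):
    """Expected value a/(a+b) of a Beta(a, b) prior.
    With a fixed at 2.01 (as in the paper), larger b shrinks the
    prior expectation toward 0."""
    return a / (a + b)
```

For the reported optima, \(b_1 \approx 10\) corresponds to a prior mean of roughly 0.17, while \(b_2 = 250\) pushes it below 0.01, i.e., a much stronger pull toward zero in that step.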

Fig. 5 Change of recall with the hyperparameter c, the strength of the assumed popularity bias to be removed in Eq. (22), over the Hetrec validation set

Fig. 6 Change of recall with the hyperparameter \(\alpha \), the weight of popularity after recombining with the popularity-free decoupled estimate in Eq. (41), on the Hetrec validation set

To illustrate the behavior of the method with different values of c and \(\alpha \), we plot recall as a function of these parameters in Figs. 5 and 6. We observe that the best value for c is slightly above 1, meaning that in Eq. (22) we attribute an even stronger effect of popularity on a user's interaction with an item than the fraction of users consuming that item would suggest. Furthermore, the best performance is achieved with a low value of \(\alpha \), which assigns little weight to popularity when it is reintroduced in Eq. (41). Both of these findings confirm the importance of decoupling popularity from taste and the power of the two main constants in Eqs. (22) and (41) that control the strength of the effect.

6 Conclusions

In this work, we addressed the task of implicit feedback recommendation using item-based nearest neighbor methods. We reviewed the classic method and developed a probabilistic explanation of its inner mechanics. We proposed a new interpretation of the prediction process that treats similarity as a conditional probability, and incorporated the uncertainty of the observed similarities by using Bayesian credible intervals. Further, we presented an interpretation of the classic ensemble prediction formula that treats the prediction as a logical operation over multiple independent predictors. Finally, we developed a novel way of modeling user behavior as a combination of popularity and personal taste and used Bayesian inference to extract the personal taste component. We tested our claims extensively over multiple datasets to assess their validity.

The overall goal of this paper was to explore interpretations of the classic method in a probabilistic context, possibly influencing or inspiring future methods, and to gain an increased understanding of and insight into the mechanics behind these methods. As a result, our proposed solutions also translated into an improvement in recommendation accuracy. In future work, our new algorithm can be evaluated in an ensemble combined with sequential recommenders based, for example, on recurrent neural networks, complementing them by modeling the longer-term user taste for recommendation.