1 Introduction

Collaborative filtering aims at identifying interesting information items (e.g. movies, books, websites) for a set of users, given their user profiles. Unlike its counterpart, content-based filtering (Belkin and Croft 1992), it utilizes other users’ preferences to perform predictions, thus making direct analysis of content features unnecessary.

User profiles can be explicitly obtained by asking users to rate items that they know. However, these explicit ratings are hard to gather in a real system (Claypool et al. 2001). It is therefore highly desirable to infer user preferences from implicit observations of user interactions with a system. These implicit interest functions usually generate frequency-counted profiles, such as “playback times of a music file” or “visiting frequency of a web-site”.

So far, academic research into frequency-counted user profiles for collaborative filtering has been limited. A large body of research on collaborative filtering focuses by default on rating-based user profiles (Adomavicius and Tuzhilin 2005; Marlin 2004). Research started with memory-based approaches to collaborative filtering (Herlocker et al. 1999; Sarwar et al. 2001; Wang et al. 2006; Xue et al. 2005) and later moved to model-based approaches (Hofmann 2004; Jin et al. 2006; Marlin 2004).

Although these rating-based collaborative filtering algorithms lay a solid foundation for collaborative filtering research, they are specifically designed for rating prediction, making them difficult to apply in many real situations where frequency-counted user profiling is demanded. Most importantly, the purpose of a recommender system is to suggest to a user items that he or she might be interested in. The user’s decision on whether to accept a suggestion (i.e. to review or listen to a suggested item) is a binary one. As already demonstrated in (Deshpande and Karypis 2004; McLaughlin and Herlocker 2004), directly using predicted ratings as ranking scores may not accurately model this common scenario.

This motivated us to conduct a formal study on probabilistic item ranking for collaborative filtering. We start with the Probability Ranking Principle of information retrieval (Robertson 1997) and introduce the concept of “binary relevance” into collaborative filtering. We directly model how likely an item is to be relevant to a given user (profile), and for the given user we aim at presenting a list of items in rank order of their predicted relevance. To achieve this, we first establish an item ranking framework by employing the log-odds ratio of relevance and then derive two ranking models from it, namely an item-based relevance model and a user-based relevance model. We then draw an analogy between the classic text retrieval model (Robertson and Walker 1994) and our models, effectively decoupling the estimations of frequency counts and (non-)relevance counts from implicit user preference data. Because data sparsity makes the probability estimations less reliable, we finally extend the basic log-odds ratio of relevance by viewing the probabilities of relevance and non-relevance in the models as parameters and applying Bayesian inference to enforce different prior knowledge and smoothing in the probability estimations. This proves to be effective on two real data sets.

The remainder of the paper is organized as follows. We first describe related work and establish the log-odds ratio of relevance ranking for collaborative filtering. The resulting two different ranking models are then derived and discussed. After that, we provide an empirical evaluation of the recommendation performance and the impact of the parameters of our two models, and finally conclude our work.

2 Related work

2.1 Rating prediction

In the memory-based approaches, all rating examples are stored as-is in memory (in contrast to learning an abstraction), forming a heuristic implementation of the “Word of Mouth” phenomenon. In the rating prediction phase, similar users and/or items are sorted based on the memorized ratings. Relying on the ratings of these similar users and/or items, a prediction of an item rating for a test user can be generated. Examples of memory-based collaborative filtering include user-based methods (Breese et al. 1998; Herlocker et al. 1999; Resnick et al. 1994), item-based methods (Deshpande and Karypis 2004; Sarwar et al. 2001) and unified methods (Wang et al. 2008; Wang et al. 2006). The advantage of the memory-based methods over their model-based alternatives is that fewer parameters have to be tuned; however, the data sparsity problem is not handled in a principled manner.

In the model-based approaches, training examples are used to generate an “abstraction” (model) that is able to predict the ratings for items that a test user has not rated before. In this regard, many probabilistic models have been proposed. For example, to consider user correlation, Pennock et al. (2000) proposed a method called personality diagnosis (PD), treating each user as a separate cluster and assuming Gaussian noise applied to all ratings. It computes the probability that a test user is of the same “personality type” as other users and, in turn, the probability of his or her rating of a test item can be predicted. On the other hand, to model item correlation, Breese et al. (1998) utilize a Bayesian Network model, in which the conditional probabilities between items are maintained. Some researchers have tried mixture models, explicitly assuming some hidden variables embedded in the rating data. Examples include the aspect models (Hofmann 2004; Jin et al. 2006), the cluster model (Breese et al. 1998) and the latent factor model (Canny 2002). These methods require some assumptions about the underlying data structures, and the resulting ‘compact’ models solve the data sparsity problem to a certain extent. However, the need to tune an often significant number of parameters has hindered the practical use of these methods. For instance, in the aspect models (Hofmann 2004; Jin et al. 2006), an EM iteration (called "fold-in") is usually required to find the hidden user clusters and/or hidden item clusters for any new user.

2.2 Item ranking

Memory-based approaches are commonly used for rating prediction, but they can be easily extended for the purpose of item ranking. For instance, a ranking score for a target item can be calculated by a summation over its similarity towards other items that the target user liked (i.e. in the user preference list). Taking this item-based view, we formally have the following basic ranking score:

$$ o_{u_k}(i_m) = \sum_{i_{m'} \in L_{u_k}} s_I(i_{m'},i_m) $$
(1)

where \(u_k\) and \(i_m\) denote the target user and item respectively, and \(i_{m^{\prime}}\,{\in}\,L_{u_k}\) denotes any item in the preference list of user \(u_k\). \(s_I\) is the similarity measure between two items; in practice, cosine similarity and Pearson’s correlation are generally employed. To specifically target the item ranking problem, Deshpande and Karypis (2004) proposed an alternative, TFxIDF-like similarity measure, which is shown as follows:

$$ s_I(i_{m'},i_m)=\frac{Freq(i_{m'},i_m)}{Freq(i_{m^{\prime}})\times Freq(i_m)^{\alpha}} $$
(2)

where Freq denotes the frequency counts of an item \(Freq(i_{m^{\prime}})\) or co-occurrence counts for two items \(Freq(i_{m^{\prime}},i_m).\) α is a free parameter, taking a value between 0 and 1. On the basis of empirical observations, they also introduced two normalization methods to further improve the ranking.
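As a rough illustration of Eqs. 1 and 2, the following Python sketch computes the ranking score from binary preference lists. The function names, the toy profiles in the usage note and the default value of `alpha` are assumptions made for this example only; the normalization methods mentioned above are not included.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(profiles):
    """Count item frequencies and pairwise co-occurrences.

    `profiles` maps a user id to the set of items in that user's
    preference list (hypothetical input format).
    """
    freq = Counter()
    cofreq = Counter()
    for items in profiles.values():
        freq.update(items)
        for a, b in combinations(sorted(items), 2):
            cofreq[(a, b)] += 1
            cofreq[(b, a)] += 1
    return freq, cofreq


def similarity(i_prime, i, freq, cofreq, alpha=0.5):
    """Eq. 2: Freq(i', i) / (Freq(i') * Freq(i)^alpha)."""
    if freq[i_prime] == 0 or freq[i] == 0:
        return 0.0
    return cofreq[(i_prime, i)] / (freq[i_prime] * freq[i] ** alpha)


def ranking_score(target, preference_list, freq, cofreq, alpha=0.5):
    """Eq. 1: sum the similarities between the target item and every
    item in the target user's preference list."""
    return sum(similarity(i_prime, target, freq, cofreq, alpha)
               for i_prime in preference_list)
```

For example, with the profiles `{"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"c"}}` and `alpha=1.0`, a user who liked `a` and `b` obtains a score of 0.25 + 0.25 = 0.5 for the target item `c`.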

In Wang et al. (2006), we proposed a language modelling approach for the item ranking problem in collaborative filtering. The idea is to view an item (or its presence in a user profile) as the output of a generative process associated with each user profile. Using a linear smoothing technique (Zhai and Lafferty 2001), we have the following ranking formula:

$$ o_{u_k}(i_m) = \sum_{i_{m^{\prime}} \in L_{u_k}} \hbox{ln} \left(\lambda P(i_{m^{\prime}}|i_m) + (1-\lambda)P(i_{m^{\prime}})\right) + \hbox{ln}\,P(i_m) $$
(3)

where the ranking score of a target item is essentially a combination of its popularity (expressed by the prior probability \(P(i_m)\)) and its co-occurrence with the items in the preference list of the target user (expressed by the conditional probability \(P(i_{m^{\prime}}|i_m)\)). λ ∈ [0, 1] is used as a linear smoothing parameter to further smooth the conditional probability from a background model (\(P(i_{m^{\prime}})\)).
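To make Eq. 3 concrete, here is a minimal Python sketch. It assumes, purely for illustration, that \(P(i_m)\) is estimated as the fraction of user profiles containing the item and \(P(i_{m^{\prime}}|i_m)\) as the fraction of profiles containing \(i_m\) that also contain \(i_{m^{\prime}}\); the text above does not prescribe these estimators. It also assumes every item involved occurs in at least one profile, so that no logarithm of zero is taken.

```python
import math


def lm_ranking_score(target, preference_list, profiles, lam=0.5):
    """Eq. 3 with simple presence-based probability estimates.

    `profiles` maps a user id to a set of items (hypothetical format).
    """
    n_users = len(profiles)

    def p_item(i):
        # Assumed estimator: fraction of profiles that contain item i.
        return sum(1 for items in profiles.values() if i in items) / n_users

    def p_cond(i_prime, i):
        # Assumed estimator: among profiles containing i, the fraction
        # that also contain i_prime.
        with_i = [items for items in profiles.values() if i in items]
        if not with_i:
            return 0.0
        return sum(1 for items in with_i if i_prime in items) / len(with_i)

    score = math.log(p_item(target))  # popularity prior, ln P(i_m)
    for i_prime in preference_list:
        smoothed = lam * p_cond(i_prime, target) + (1 - lam) * p_item(i_prime)
        score += math.log(smoothed)
    return score
```

Only presence or absence enters this score, which is the limitation discussed in the next paragraph.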

Nevertheless, our formulations in Wang et al. (2006) only take the information about presence/absence of items into account when modelling implicit user preference data, completely ignoring other useful information such as frequency counts (i.e. the number of visiting/playing times). We shall see that the probabilistic relevance framework proposed in this paper effectively extends the language modelling approaches of collaborative filtering. It not only allows us to make use of frequency counts for modelling implicit user preferences but also leaves room to model non-relevance in a formal way. Both prove to be crucial to the accuracy of recommendation in our experiments.

3 A probabilistic relevance ranking framework

The task of information retrieval aims to rank documents on the basis of their relevance (usefulness) towards a given user need (query). The Probability Ranking Principle (PRP) of information retrieval (Robertson 1997) implies that ranking documents in descending order by their probability of relevance produces optimal performance under a “reasonable” assumption, i.e. the relevance of a document to a user information need is independent of other documents in the collection (van Rijsbergen 1979).

By the same token, our task for collaborative filtering is to find items that are relevant (useful) to a given user interest (implicitly indicated by a user profile). The PRP applies directly when we view a user profile as a query to rank items accordingly. To this end, we introduce the concept of “relevancy” into collaborative filtering. By analogy with the relevance models in text retrieval (Lafferty et al. 2003; Robertson and Sparck Jones 1976; Taylor et al. 2003), the top-N recommendation items can then be generated by ranking items in order of their probability of relevance to a user profile or the underlying user interest.

To estimate the probability of relevance between an item and a user (profile), let us first define a sample space of relevance, \(\Upphi_R\), and let R be a random variable over the relevance space \(\Upphi_R\). R is either ‘relevant’ r or ‘non-relevant’ \(\bar{r}.\) Secondly, let U be a discrete random variable over the sample space of user id’s: \( \Upphi_U= \{u_{1}, \ldots , u_K\}\) and let I be a random variable over the sample space of item id’s: \( \Upphi_I= \{i_{1}, \ldots , i_M\} ,\) where K is the number of users and M the number of items in the collection. In other words, U refers to the user identifiers and I refers to the item identifiers.

We then denote P as a probability function on the joint sample space \(\Upphi_U \times \Upphi_I \times \Upphi_R\). The PRP now states that we can solve the ranking problem by estimating the probability of relevance \(P(R=r|U,I)\) and non-relevance \(P(R=\bar{r}|U,I).\) The relevance ranking of items in the collection \(\Upphi_I\) for a given user \(U=u_k\) can be formulated as the log odds of the relevance:

$$ o_{u_k}(i_m) = \hbox{ln}\frac{{P(r| u_k,i_m )}}{{P(\bar{r}| u_k,i_m )}} $$
(4)

For simplicity, the propositions \(R=r, R=\bar{r}, U=u_k\) and \(I=i_m\) are denoted as \(r, \bar{r}, u_k\) and \(i_m\), respectively.

3.1 Item-based relevance model

Two different models can be derived if we apply the Bayes’ rule differently. This section introduces the item-based relevance model, leaving the derivations of the user-based relevance model in Sect. 3.2.

By factorizing \(P(\bullet|u_k,i_m)\) as \(P(u_k|i_m,\bullet)P(\bullet|i_m)/P(u_k|i_m)\), the following log-odds ratio can be obtained from Eq. 4:

$$ o_{u_k}(i_m) = \hbox{ln} \frac{{P(u_k |i_m ,r)}} {{P(u_k |i_m ,\bar{r})}} + \hbox{ln} \frac{{P(r|i_m )}} {{P(\bar{r} |i_m )}} $$
(5)
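For readers who want the intermediate step, the factorization can be written out as follows. This is only a restatement of Bayes’ rule in the notation above; the normalizer \(P(u_k|i_m)\) appears in both numerator and denominator and cancels in the ratio:

$$ \begin{aligned} \hbox{ln}\frac{P(r|u_k,i_m)}{P(\bar{r}|u_k,i_m)} &= \hbox{ln}\frac{P(u_k|i_m,r)P(r|i_m)/P(u_k|i_m)}{P(u_k|i_m,\bar{r})P(\bar{r}|i_m)/P(u_k|i_m)} \\ &= \hbox{ln}\frac{P(u_k|i_m,r)}{P(u_k|i_m,\bar{r})} + \hbox{ln}\frac{P(r|i_m)}{P(\bar{r}|i_m)} \end{aligned} $$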

Notice that, in the ranking model shown in Eq. 5, the target user is defined in the user id space. For a given new user, we do not have any observations about his or her relevancy towards an unknown item. This makes the probability estimations unsolvable. In this regard, we need to build a feature representation of a new user by his or her user profile so as to relate the user to other users that have been observed from the whole collection.

This paper considers implicit user profiling: user profiles are obtained by implicitly observing user behavior, for example, the web sites visited, the music files played etc., and a user is represented by his or her preferences towards all the items. More formally, we treat a user (profile) as a vector over the entire item space, which is denoted as a bold letter \(\mathbf{l}:=(l^1,\ldots ,l^{m^\prime},\ldots ,l^{M}),\) where \(l^{m^\prime}\) denotes an item frequency count, e.g., the number of times a user played or visited item \(i_{m^{\prime}}.\) Note that we deliberately use the item index m′ for the items in the user profile, as opposed to the target item index m. For each user \(u_k\), the user profile vector is instantiated (denoted as \({\mathbf{l}}_k\)) by assigning this user’s item frequency counts to it: \(l^{m^\prime}=c_k^{m^\prime},\) where \(c_k^{m^\prime}\in\{0,1,2,\ldots\}\) denotes the number of times the user \(u_k\) played or visited item \(i_{m^{\prime}}.\) Changing the user representation in Eq. 5, we have the following:

$$ o_{u_k}(i_m)=\hbox{ln} \frac{{P({\mathbf{l}}_{k} |i_m ,r)}} {{P({\mathbf{l}}_{k} |i_m ,\bar{r})}} + \hbox{ln} \frac{{P(r|i_m )}} {{P(\bar{r} |i_m )}} = \sum_{\forall m'} \hbox{ln} \frac{{P(l^{m^\prime}=c_k^{m^\prime} |i_m ,r)}} {{P(l^{m^\prime} = c_k^{m^\prime} |i_m ,\bar{r})}} + \hbox{ln} \frac{{P(r|i_m )}} {{P(\bar{r} |i_m )}} $$
(6)

where we have assumed that the frequency counts of items in the target user profile are conditionally independent, given relevance or non-relevance (Footnote 1). Although this conditional independence assumption does not hold in many real situations, it has been empirically shown to be a competitive approach (e.g., in text classification (Eyheramendy et al. 2003)). It is worth noting that we only ignore the item dependency in the profile of the target user, while for all other users, we do consider their dependence. In fact, how to utilise the correlations between items is crucial to the item-based approach.

For the sake of computational convenience, we intend to focus on the items (\(i_{m^{\prime}},\) where \(m^{\prime} \in \{1,\ldots,M\}\)) that are present in the target user profile (\(c_k^{m^\prime} > 0\)). By splitting items in the user profile into two groups, i.e. presence and absence, we have:

$$ o_{u_k}(i_m) = \sum_{\forall m^{\prime}:c_k^{m^\prime} > 0} \hbox{ln} \frac{{P(l^{m^\prime}=c_k^{m^\prime} |i_m ,r)}} {{P(l^{m^\prime}=c_k^{m^\prime} |i_m ,\bar{r})}} + \sum_{\forall m^{\prime}:c_k^{m^\prime}=0} \hbox{ln} \frac{{P(l^{m^\prime} = 0|i_m ,r)}} {{P(l^{m^\prime} = 0 |i_m ,\bar{r})}} + \hbox{ln} \frac{{P(r|i_m )}} {{P(\bar{r} |i_m )}} $$
(7)

Subtracting

$$ \sum_{\forall m^{\prime}:c_k^{m^\prime} > 0} \hbox{ln} \frac{{P(l^{m^\prime} = 0|i_m ,r)}} {{P(l^{m^\prime} =0 |i_m ,\bar{r})}}, $$
(8)

from the first term and adding it to the second (where \(\hbox{ln} x-\hbox{ln}y=\hbox{ln} \frac{x}{y}\)) gives

$$ o_{u_k}(i_m) = \left(\sum_{\forall m':c_k^{m^\prime} > 0} \hbox{ln} \frac{{P(l^{m^\prime}=c_k^{m^\prime} |i_m ,r)P(l^{m^\prime} = 0|i_m ,\bar{r})}} {{P(l^{m^\prime}=c_k^{m^\prime} |i_m ,\bar{r})P(l^{m^\prime} = 0 |i_m , r)}} \right) +\left( \sum_{\forall m'} \hbox{ln} \frac{{P(l^{m^\prime} = 0|i_m ,r)}} {{P(l^{m^\prime} = 0 |i_m ,\bar{r})}} \right)+\hbox{ln} \frac{{P(r|i_m )}} {{P(\bar{r} |i_m )}} $$
(9)

where the first term only deals with those items that are present in the user profile. \(P(l^{m^\prime}=c_k^{m^\prime} |i_m ,r)\) is the probability that item \(i_{m^{\prime}}\) occurs \(c_k^{m^\prime}\) times in a profile of a user who likes item \(i_m\) (i.e. item \(i_m\) is relevant to this user). In other words, given the evidence that a user likes item \(i_m\), it is the probability that this user plays item \(i_{m^{\prime}}\) exactly \(c_k^{m^\prime}\) times.

In summary, we have the following ranking formula:

$$ o_{u_k}(i_m)= W_{u_k,i_m}+X_{i_m}+Y_{i_m} $$
(10)

where

$$ W_{u_k,i_m} = \left( \sum_{\forall m':c_k^{m^\prime} > 0} \hbox{ln} \frac{{P(l^{m^\prime}=c_k^{m^\prime} |i_m ,r)P(l^{m^\prime} = 0|i_m ,\bar{r})}} {{P(l^{m^\prime}=c_k^{m^\prime} |i_m ,\bar{r})P(l^{m^\prime} = 0 |i_m , r)}} \right) $$
(11)
$$ X_{i_m}=\sum_{\forall m^{\prime}} \hbox{ln} \frac{{P(l^{m^\prime} = 0|i_m ,r)}} {{P(l^{m^\prime} = 0 |i_m ,\bar{r})}} $$
(12)
$$ Y_{i_m} = \hbox{ln} \frac{{P(r|i_m )}} {{P(\bar{r} |i_m)}} $$
(13)

From the final ranking score, we observe that the relevance ranking of a target item in the item-based model is a combination of the evidence that is dependent on the target user profile (\(W_{u_k,i_m}\)) and that of the target item itself (\(X_{i_m}+Y_{i_m}\)). However, we shall see in Sect. 3.2 that, due to the asymmetry between users and items, the final ranking of the user-based model (Eq. 27) only requires the “user profile”-dependent evidence.

3.1.1 Probability estimation

Let us look at the weighting function \(W_{u_k,i_m}\) (Eq. 11) first. Item occurrences within user profiles (either \(P(l^{m^\prime}=c_k^{m^\prime} |i_m,r)\) or \(P(l^{m^\prime}=c_k^{m^\prime} |i_m,\bar{r})\)) can be modeled by a Poisson distribution. Yet, an item occurring in a user profile does not necessarily mean that this user likes this item: randomness is another explanation, particularly when the item occurs only a few times. Thus, a better model would be a mixture of two Poisson models, i.e. a linear combination of a Poisson model coping with items that are “truly” liked by the user and a Poisson model dealing with some background noise. To achieve this, we introduce a hidden random variable \(E^{m^\prime}\in\{e,\bar{e}\}\) for each of the items in the user profile, describing whether the presence of the item in a user profile is due to the fact that the user truly liked it (\(E^{m^\prime}=e\)), or because the user accidentally selected it (\(E^{m^\prime}=\bar{e}\)). A graphical model describing the probabilistic relationships among the random variables is illustrated in Fig. 1a. More formally, for the relevance case, we have

$$ \begin{aligned} P(l^{m^\prime}=c_k^{m^\prime}|i_m, r) &= P(l^{m^\prime}=c_k^{m^\prime}|e)P(e|i_m, r)+P(l^{m^\prime}=c_k^{m^\prime}|\bar{e})P(\bar{e}|i_m, r)\\ &= \frac{\lambda_{1}^{(c_k^{m^\prime})}\exp(-\lambda_{1})}{(c_k^{m^\prime})!}p+ \frac{\lambda_{0}^{(c_k^{m^\prime})}\exp(-\lambda_{0})}{(c_k^{m^\prime})!}(1-p) \end{aligned} $$
(14)

where λ1 and λ0 are the two Poisson means, which can be regarded as the expected item frequency counts in the two different cases (e and \(\bar{e}\)) respectively. \(p \equiv P(e|i_m, r)\) denotes the probability that the user indeed likes item \(i_{m^{\prime}},\) given the condition that he or she liked another item \(i_m\). A straightforward method to obtain the parameters of the Poisson mixtures is to apply the Expectation-Maximization (EM) algorithm (Dempster et al. 1977). To illustrate this, Fig. 1b plots the histogram of the item frequency distribution in the Last.FM data set as well as its estimated Poisson mixtures by applying the EM algorithm.

Fig. 1 A Poisson mixture model for modelling the item occurrences in user profiles. (a) A graphical model of the Poisson mixtures. (b) An estimation of the Poisson mixtures for the Last.FM data set in the relevance case (λ0 = 0.0028, λ1 = 6.4691 and p = 0.0046)
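The two-Poisson mixture of Eq. 14 and a textbook EM fit for it can be sketched in Python as follows. This is a generic EM routine written for illustration, not the procedure used to produce Fig. 1b; the function names, the fixed iteration count and the requirement that the caller supplies strictly positive starting values are all assumptions.

```python
import math


def poisson_pmf(c, lam):
    """Poisson probability of count c with mean lam > 0 (log-space)."""
    return math.exp(-lam + c * math.log(lam) - math.lgamma(c + 1))


def mixture_pmf(c, lam1, lam0, p):
    """Eq. 14: p * Poisson(c; lam1) + (1 - p) * Poisson(c; lam0)."""
    return p * poisson_pmf(c, lam1) + (1 - p) * poisson_pmf(c, lam0)


def em_two_poisson(counts, lam1, lam0, p, n_iter=100):
    """Fit (lam1, lam0, p) to a list of counts by plain EM.

    lam1 models "truly liked" items (event e), lam0 the background
    noise (event e-bar). Starting values must be chosen by the caller.
    """
    n = len(counts)
    for _ in range(n_iter):
        # E-step: posterior probability that each count came from e.
        resp = []
        for c in counts:
            liked = p * poisson_pmf(c, lam1)
            noise = (1 - p) * poisson_pmf(c, lam0)
            resp.append(liked / (liked + noise))
        # M-step: re-estimate the mixing weight and the two means.
        total = sum(resp)
        p = total / n
        lam1 = sum(g * c for g, c in zip(resp, counts)) / total
        lam0 = sum((1 - g) * c for g, c in zip(resp, counts)) / (n - total)
    return lam1, lam0, p
```

As noted further below, a good fit of the mixture does not guarantee that the two fitted components correspond to "liked" and "accidental" occurrences.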

The same can be applied to the non-relevance case. Incorporating the Poisson mixtures for both cases into Eq. 11 gives

$$ \begin{aligned} &W_{u_k,i_m} = \sum_{\forall m':c_k^{m^\prime} > 0} W_{i_m',i_m} \\=& \sum_{\forall m':c_k^{m^\prime} > 0} \hbox{ln} \frac{\left(\lambda_{1}^{(c_k^{m^\prime})}\exp(-\lambda_{1})p+\lambda_{0}^{(c_k^{m^\prime})}\exp(-\lambda_{0})(1-p)\right) \left(\exp(-\lambda_{1})q+\exp(-\lambda_{0})(1-q)\right)} {\left(\lambda_{1}^{(c_k^{m^\prime})}\exp(-\lambda_{1})q+\lambda_{0}^{(c_k^{m^\prime})}\exp(-\lambda_{0})(1-q)\right) \left(\exp(-\lambda_{1})p+\exp(-\lambda_{0})(1-p)\right) } \\=& \sum_{\forall m':c_k^{m^\prime} > 0} \hbox{ln} \frac{\left(p+(\lambda_{0}/\lambda_1)^{(c_k^{m^\prime})}\exp(\lambda_1-\lambda_{0})(1-p)\right) \left(\exp(\lambda_0-\lambda_{1})q+(1-q)\right)} {\left(q+(\lambda_{0}/\lambda_1)^{(c_k^{m^\prime})}\exp(\lambda_1-\lambda_{0})(1-q)\right) \left(\exp(\lambda_0-\lambda_{1})p+(1-p)\right)} \end{aligned} $$
(15)

where, similarly, \(q\equiv P(e|i_m, \bar{r})\) denotes the probability of the true preference of an item in the non-relevance case, while \(W_{i_m^{\prime},i_m}\) denotes the ranking score obtained from the target item and the item in the user profile.

For each of the item pairs \((i_m^{\prime},i_m),\) we need to estimate four parameters (p, q, λ0 and λ1), making the model difficult to apply in practice. Furthermore, it should be emphasised that the component distributions estimated by the EM algorithm may not necessarily correspond to the two reasons that we mentioned for the presence of an item in a user profile, even if the estimated mixture distribution may fit the data well.

In this regard, this paper takes an alternative approach, approximating the ranking function by a much simpler function. In text retrieval, a similar two-Poisson model has been proposed for modeling within-document term frequencies (Harter 1975). To make it applicable in practice, Robertson and Walker (1994) introduced an approximation method, resulting in the widely-used BM25 weighting function for query terms. Following the same way of thinking, we can see that the weighting function for each of the items in the target user profile \(W_{i_m^{\prime},i_m}\) (Eq. 15) has the following characteristics: (1) Function \(W_{i_m^{\prime},i_m}\) increases monotonically with respect to the item frequency count \(c_k^{m^\prime},\) and (2) it reaches its upper bound, governed by log(p(1 − q)/q(1 − p)), when \(c_k^{m^\prime}\) tends to infinity (Sparck et al. 2000, Sparck et al. 2000). Roughly speaking, as demonstrated in Fig. 2, the parameters λ0 and λ1 adjust the rate of the increase (see Fig. 2a), while the parameters p and q mainly control the upper bound (see Fig. 2b).

Fig. 2 The relationship between weighting function \(W_{{i_m'},{i_m}}\) and its four parameters λ0, λ1, p and q. We plot ranking score \(W_{{i_m'},{i_m}}\) against various item frequency counts \(c_k^{m^\prime}\) from 0 to 20. (a) We fix λ0 = 0.02, p = 0.02 and q = 0.010, and vary λ1 ∈ {0.03, 0.04, 0.1, 0.4, 5}. (b) We fix λ0 = 0.02, λ1 = 0.04 and p = 0.02, and vary q ∈ {0.020, 0.018, 0.016, 0.014, 0.012, 0.010}
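The two characteristics can be checked numerically with a small Python sketch of the last line of Eq. 15, evaluated for a single profile item. The function names are hypothetical; the parameter values in the usage note are those quoted in the caption of Fig. 2a.

```python
import math


def two_poisson_weight(c, lam1, lam0, p, q):
    """Last line of Eq. 15 for one profile item with frequency count c."""
    t = (lam0 / lam1) ** c * math.exp(lam1 - lam0)
    s = math.exp(lam0 - lam1)
    num = (p + t * (1 - p)) * (s * q + (1 - q))
    den = (q + t * (1 - q)) * (s * p + (1 - p))
    return math.log(num / den)


def weight_limit(lam1, lam0, p, q):
    """Value approached as c grows, assuming lam1 > lam0 so that
    (lam0 / lam1) ** c vanishes."""
    s = math.exp(lam0 - lam1)
    return math.log(p * (s * q + (1 - q)) / (q * (s * p + (1 - p))))
```

With λ0 = 0.02, p = 0.02 and q = 0.010, evaluating `two_poisson_weight` for counts 0 to 20 gives a weight of zero at c = 0 that rises towards `weight_limit`, which itself never exceeds ln(p(1 − q)/(q(1 − p))) when p ≥ q and λ1 > λ0.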

Therefore, it is intuitively desirable to approximate these two characteristics separately. Following the discussion in (Robertson and Walker 1994), we choose the function \(c_{k}^{m^\prime}/(k_3+c_{k}^{m^\prime})\) (where \(k_3\) is a free parameter), which increases from zero to an asymptotic maximum, to model the monotonic increase with respect to the item frequency counts. Since the probabilities q and p cannot be directly estimated, a simple alternative is to use the probabilities of the presence of the item, i.e. \(P({l^{m^\prime} > 0} |i_m ,r)\) and \(P({l^{m^\prime} > 0} |i_m,\bar{r})\) to approximate them respectively. In summary, we have the following ranking function:

$$ W_{u_k,i_m} \approx \left( \sum_{\forall m':c_k^{m^\prime} > 0} \frac{c_k^{m^\prime}}{k_3+c_k^{m^\prime}}\hbox{ln} \frac{{P({l^{m^\prime} > 0} |i_m ,r)P({l^{m^\prime}=0} |i_m ,\bar{r})}} {{P({l^{m^\prime} > 0} |i_m ,\bar{r})P({l^{m^\prime}=0} |i_m ,r)}} \right) $$
(16)

where the free parameter \(k_3\) is equivalent to the normalization parameter of within-query frequencies in the BM25 formula (Robertson and Walker 1994) (also see Appendix A), if we treat a user profile as a query. \(P({l^{m^\prime} > 0} |i_m,r)\) (or \(P({l^{m^\prime} > 0} |i_m,\bar{r})\)) is the probability that item \(i_{m^{\prime}}\) occurs in a profile of a user who is relevant (or non-relevant) to item \(i_m\). Equation 16 essentially decouples frequency counts \(c_k^{m^\prime}\) and presence (absence) probabilities (e.g. \(P({l^{m^\prime} > 0} |i_m,r)\)), thus largely simplifying the computation in practice.
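A minimal Python sketch of Eq. 16 follows, showing how the frequency-dependent saturation factor and the presence-based log-odds are combined. The dictionaries of presence probabilities are assumed to be given (their estimation is a separate step), and all names and the default `k3` are illustrative choices.

```python
import math


def saturation(c, k3):
    """c / (k3 + c): rises from 0 towards 1 as the count c grows."""
    return c / (k3 + c)


def approx_item_weight(profile_counts, presence_rel, presence_nonrel, k3=1.0):
    """Eq. 16 for one target item.

    profile_counts:  item -> frequency count in the target user profile
    presence_rel:    item -> P(l > 0 | i_m, r)      (assumed given)
    presence_nonrel: item -> P(l > 0 | i_m, r-bar)  (assumed given)
    """
    score = 0.0
    for item, c in profile_counts.items():
        if c <= 0:
            continue  # only items present in the profile contribute
        theta = presence_rel[item]
        gamma = presence_nonrel[item]
        log_odds = math.log(theta * (1 - gamma) / (gamma * (1 - theta)))
        score += saturation(c, k3) * log_odds
    return score
```

For a single profile item with count 3, presence probabilities 0.5 (relevant) and 0.2 (non-relevant) and `k3=1.0`, the weight is 0.75 · ln 4.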

Next, we consider the probability estimations of presence (absence) of items in user profiles. To handle data sparseness, different from the Robertson-Sparck Jones probabilistic retrieval (RSJ) model (Robertson and Sparck Jones 1976), we propose to use Bayesian inference (Gelman et al. 2003) to estimate the presence (absence) probabilities. Since we have two events, either an item is present (\(l^{m^\prime} > 0\)) or absent (\(l^{m^\prime} = 0\)), we assume that the probability follows the Bernoulli distribution. That is, we define \(\theta_{m^{\prime},m}\equiv P({l^{m^\prime} > 0} |i_m,r)\), where \(\theta_{m^{\prime},m}\) is regarded as the parameter of a Bernoulli distribution. For simplicity, we treat the parameter as a random variable and estimate its value by maximizing an a posteriori probability. Formally we have

$$ \hat{\theta}_{m^{\prime},m}=\mathop{\hbox{argmax}}\limits_{\theta_{m',m}} p(\theta_{m',m}|r_{m',m},R_{m};{\alpha_r,\beta_r}) $$
(17)

where \(R_m\) denotes the number of user profiles that are relevant to an item \(i_m\), and among these user profiles, \(r_{m^{\prime},m}\) denotes the number of the user profiles where an item \(i_{m^{\prime}}\) is present. This establishes a contingency table for each item pair (shown in Table 1). In addition, we choose the Beta distribution as the prior (since it is the conjugate prior for the Bernoulli distribution), which is denoted as \(Beta(\alpha_r,\beta_r)\). Using the conjugate prior, the posterior probability after observing some data is again a Beta distribution with updated parameters.

$$ p(\theta_{m',m}|r_{m',m},R_{m};{\alpha_r,\beta_r}) \propto \theta_{m',m}^{r_{m',m}+\alpha_r-1}(1-\theta_{m',m})^{R_{m}-r_{m',m}+\beta_r-1} $$
(18)

Maximizing an a posteriori probability in Eq. 18 (i.e. taking the mode) gives the estimation of the parameter (Gelman et al. 2003)

$$ \hat{\theta}_{m',m}= \frac{r_{m^{\prime},m}+\alpha_r -1}{R_{m}+\alpha_r+\beta_r-2} $$
(19)
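The MAP estimate of Eq. 19 is a one-line computation; the Python sketch below also includes the unnormalized log-posterior of Eq. 18 so that the mode can be checked numerically. The function names are hypothetical, and the sketch assumes hyper-parameters of at least 1 with counts that keep the estimate strictly between 0 and 1.

```python
import math


def map_estimate(present, total, alpha, beta):
    """Eq. 19: mode of the Beta posterior.

    present: number of relevant profiles containing the item (r_{m',m})
    total:   number of relevant profiles (R_m)
    """
    return (present + alpha - 1) / (total + alpha + beta - 2)


def log_posterior(theta, present, total, alpha, beta):
    """Logarithm of the right-hand side of Eq. 18 (up to a constant)."""
    return ((present + alpha - 1) * math.log(theta)
            + (total - present + beta - 1) * math.log(1 - theta))
```

With `alpha = beta = 1` the estimate reduces to the maximum-likelihood ratio `present / total`; with `alpha = beta = 1.5` an item never seen among 10 relevant profiles still receives the non-zero estimate 0.5 / 11.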

Following the same reasoning, we obtain the probability of item occurrences in the non-relevance case.

$$ P(l^{i_{m^{\prime}}} > 0|i_m,\bar{r})\equiv\hat{\gamma}_i= \frac{n_{m^{\prime}}-r_{m^{\prime},m}+\alpha_{\bar{r}} -1}{K-R_{m}+\alpha_{\bar{r}}+\beta_{\bar{r}}-2} $$
(20)

where we used \(\hat{\gamma}_i\) to denote \(P(l^{i_{m^{\prime}}} > 0|i_m,\bar{r}).\) \(\alpha_{\bar{r}}\) and \(\beta_{\bar{r}}\) are again the parameters of the conjugate prior (\(Beta(\alpha_{\bar{r}},\beta_{\bar{r}})\)), while \(n_{m^{\prime}}\) denotes the number of user profiles in which item \(i_{m^{\prime}}\) is present (see Table 1). Writing \(\hat{\theta}_i\) as shorthand for \(\hat{\theta}_{m^{\prime},m}\) and substituting Eqs. 19 and 20 into Eq. 16, we have

$$ \begin{aligned} W_{u_k,i_m}\approx&\sum_{\forall m': c_k^{m^\prime} > 0} \frac{c_k^{m^\prime}}{k_3+c_k^{m^\prime}}\hbox{ln} \frac{\hat \theta_i (1-\hat\gamma_i) } {\hat\gamma_i (1-\hat\theta_i)} \\ =&\sum_{\forall m': c_k^{m^\prime} > 0}\frac{c_k^{m^\prime}}{k_3+c_k^{m^\prime}} \hbox{ln} \frac{({r_{m',m}+\alpha_r -1}) ((K-R_{m})-(n_{m'}-r_{m',m})+\beta_{\bar{r}} -1) } {(n_{m'}-r_{m',m}+\alpha_{\bar{r}} -1) (R_{m}-r_{m',m}+\beta_r -1)} \\ \end{aligned} $$
(21)
Table 1 Contingency table of relevance vs. occurrence: item model

The four hyper-parameters \((\alpha_r,\alpha_{\bar r},\beta_r,\beta_{\bar{r}})\) can be treated as pseudo frequency counts. Varying choices for them lead to different estimators (Zaragoza et al. 2003). In the information retrieval domain (Robertson and Sparck Jones 1976; Robertson and Walker 1994), adding an extra 0.5 count for each probability estimation has been widely used to avoid zero probabilities. This choice corresponds to setting the small constant values \(\alpha_r=\alpha_{\bar{r}}=\beta_r=\beta_{\bar{r}}=1.5.\) We shall see in the experiments that collaborative filtering needs relatively larger pseudo counts for the non-relevance and/or absence estimation (\(\alpha_{\bar{r}},\) \(\beta_r\) and \(\beta_{\bar{r}}\)). This can be explained by the fact that using absence to model non-relevance is noisy, so more smoothing is needed. If we define a free parameter v and set it equal to \(\alpha_r-1\), we have the generalized Laplace smoothing estimator. Alternatively, the prior can be fitted to a distribution of the given collection (Zhai and Lafferty 2001).
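Putting Eq. 21 together, a minimal Python sketch of the item-based weight computed directly from contingency counts might look as follows. The argument names and the layout of `pair_counts` are assumptions for this example; the defaults of 1.5 reproduce the extra 0.5 pseudo count mentioned above and are not the tuned values from the experiments.

```python
import math


def item_pair_log_odds(r, R, n, K, a_r=1.5, b_r=1.5, a_nr=1.5, b_nr=1.5):
    """Log term of Eq. 21 for one item pair (i_m', i_m).

    r: relevant profiles containing i_m'   (r_{m',m})
    R: profiles relevant to i_m            (R_m)
    n: profiles containing i_m'            (n_{m'})
    K: total number of user profiles
    a_nr, b_nr stand for the non-relevance hyper-parameters.
    """
    num = (r + a_r - 1) * ((K - R) - (n - r) + b_nr - 1)
    den = (n - r + a_nr - 1) * (R - r + b_r - 1)
    return math.log(num / den)


def item_based_weight(profile_counts, pair_counts, R, K, k3=1.0, **prior):
    """Eq. 21: sum over the items present in the target user profile.

    profile_counts: item -> frequency count c in the target profile
    pair_counts:    item -> (r, n) for that item and the target item
    """
    score = 0.0
    for item, c in profile_counts.items():
        if c > 0:
            r, n = pair_counts[item]
            score += c / (k3 + c) * item_pair_log_odds(r, R, n, K, **prior)
    return score
```

For instance, with r = 3, R = 10, n = 20 and K = 100 the default prior gives a log-odds of ln(257.25 / 131.25) = ln 1.96, which agrees with plugging the estimates of Eqs. 19 and 20 into Eq. 16.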

Applying the Bayesian inference similarly, we obtain \(X_{i_m}\) as follows:

$$ \begin{aligned} X_{i_m}=&\sum_{i_{m'}} \hbox{ln} \frac{{P(l^{i_{m^{\prime}}} = 0|i_m ,r)}} {{P(l^{i_{m^{\prime}}}=0 |i_m ,\bar{r})}}\\ = &\sum_{i_{m'}} \hbox{ln} \frac{(K-R_{m}+\alpha_{\bar{r}} + \beta_{\bar{r}} -2)(R_{m}-r_{m^{\prime},m}+\beta_r -1)} {( R_{m} +\alpha_r + \beta_r -2) (K-R_{m}-(n_{m^{\prime}}-r_{m^{\prime},m})+\beta_{\bar{r}}-1)} \end{aligned} $$
(22)

For the last term, the popularity ranking \(Y_{i_m},\) we have

$$ \begin{aligned} Y_{i_m}=\hbox{ln} \frac{{P(r|i_m )}} {{P(\bar{r} |i_m )}} = \hbox{ln} \frac{R_{m}}{K-R_{m}} \qquad\qquad\qquad\quad \\ \end{aligned} $$
(23)

Notice that in the initial stage, we do not have any relevance observation of item \(i_m\). We may assume that if a user played the item frequently (say, more than t times), the item is relevant to this user’s interest. By doing this, we can construct the contingency table and thus estimate the probabilities.
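The thresholding heuristic and the two item-dependent terms can be sketched in Python as below. The profile format, the function names and the default threshold are assumptions for illustration. The absence term is computed from Eq. 12 using the estimates of Eqs. 19 and 20, and, as a simplification, the sum only runs over items that occur in at least one profile.

```python
import math
from collections import Counter


def contingency_counts(profiles, target, t=1):
    """Build the counts behind the contingency table for one target item.

    profiles: user -> {item: frequency count}
    A profile counts as relevant to `target` if it played the target
    more than t times (the heuristic described above).
    Returns K, R_m, n (item -> n_{m'}) and r (item -> r_{m',m}).
    """
    K = len(profiles)
    n, r = Counter(), Counter()
    R = 0
    for counts in profiles.values():
        present = [item for item, c in counts.items() if c > 0]
        n.update(present)
        if counts.get(target, 0) > t:
            R += 1
            r.update(present)
    return K, R, n, r


def absence_log_odds(K, R, n, r, a_r=1.5, b_r=1.5, a_nr=1.5, b_nr=1.5):
    """X term (Eq. 12) from the MAP estimates of Eqs. 19 and 20."""
    x = 0.0
    for item in n:
        theta = (r[item] + a_r - 1) / (R + a_r + b_r - 2)
        gamma = (n[item] - r[item] + a_nr - 1) / (K - R + a_nr + b_nr - 2)
        x += math.log((1 - theta) / (1 - gamma))
    return x


def popularity_log_odds(K, R):
    """Y term (Eq. 23); undefined when R is 0 or equals K."""
    return math.log(R / (K - R))
```

Note that Eq. 23 is not smoothed, so a target item with no relevant profiles (or one relevant to every profile) needs separate handling.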

3.2 User-based relevance model

Applying the Bayes’ rule differently results in the following formula from Eq. 4:

$$ o_{u_k}(i_m) = \hbox{ln} \frac{{P( i_m |u_k,r)}} {{P(i_m|u_k ,\bar{r})}} + \hbox{ln} \frac{{P(r|u_k )}} {{P(\bar{r} |u_k )}} $$
(24)

Similarly, using frequency counts over a set of users \( (l^1, \ldots , l^{k^\prime}, \ldots ,l^K)\) to represent the target item i m , we get

$$ \begin{aligned} S_{u_k}(i_m) =& \sum_{\forall k': c_{k'}^{m} > 0} \hbox{ln} \frac{{P(l^{k'}=c_{k'}^{m} |u_k ,r)P(l^{k'} = 0|u_k ,\bar{r})}} {{P(l^{k'}=c_{k'}^{m} |u_k ,\bar{r})P(l^{k'} = 0 |u_k , r)}}\\ &+\sum_{\forall k'} \hbox{ln} \frac{{P(l^{k'}=0|u_k ,r)}} {{P(l^{k'} =0 |u_k ,\bar{r})}} +\hbox{ln} \frac{{P(r|u_k )}} {{P(\bar{r} |u_k )}} \\ \end{aligned} $$
(25)

Since the last two terms in the formula are independent of the target item, they can be discarded. Thus we have

$$ S_{u_k}(i_m)\propto_{u_k}\sum_{\forall k': c_{k'}^{m} > 0} \hbox{ln} \frac{{P(l^{k'}=c_{k'}^{m} |u_k ,r)P(l^{k'} = 0|u_k ,\bar{r})}} {{P(l^{k'}=c_{k'}^{m} |u_k ,\bar{r})P(l^{k'} = 0 |u_k , r)}} $$
(26)

where \(\propto_{u_k}\) denotes the same rank order with respect to \(u_k\).

Following the same steps (the approximation to two-Poisson distribution and the MAP probability estimation) as discussed in the previous section gives

$$ \begin{aligned} S_{u_k}(i_m)\propto_{u_k} &\sum_{\forall k': c_{k'}^{m} > 0} \frac{c_{k'}^{m}}{{\mathcal{K}}+c_{k'}^{m}}\hbox{ln} \frac{{P({l^{k'} > 0} |u_k ,r)P({l^{k'}=0} |u_k ,\bar{r})}} {{P({l^{k'} > 0} |u_k ,\bar{r})P({l^{k'}=0} |u_k ,r)}} \\ = &\sum_{\forall k': c_{k'}^{m} > 0} \frac{c_{k'}^{m}}{{\mathcal{K}}+ c_{k'}^{m}} \hbox{ln}\frac{(r_{k',k}+\alpha_r-1)(M-n_{k'}-R_{k}+r_{k',k}+\beta_{\bar{r}}-1)} {(n_{k'}-r_{k',k}+\alpha_{\bar{r}}-1)(R_{k}-r_{k',k}+\beta_r -1)} \end{aligned} $$
(27)

where \({\mathcal{K}}=k_1((1-b)+bL_m).\) \(k_1\) is the normalization parameter of the frequency counts for the target item, \(L_m\) is the normalized item popularity (how many times the item \(i_m\) has been “used”, i.e. the popularity of this item divided by the average popularity in the collection), and b ∈ [0, 1] denotes the mixture weight. Notice that if we treat an item as a document, the parameter \(k_1\) is equivalent to the normalization parameter of within-document frequencies in the BM25 formula (see Appendix A). Table 2 shows the contingency table of user pairs.

Table 2 Contingency table of relevance vs occurrence: user model
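The final form of Eq. 27 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function and argument names are hypothetical, and the reading of the contingency counts (`r` as the number of items used by both users \(u_{k'}\) and \(u_k\), `n` as the number of items used by \(u_{k'}\), `R` as the number of items used by \(u_k\), and `M` as the total number of items) is an assumption, since Table 2 is not reproduced here.

```python
import math


def saturation(count, k1, b, norm_popularity):
    """Frequency saturation c / (K + c), with K = k1 * ((1 - b) + b * L_m)."""
    K = k1 * ((1.0 - b) + b * norm_popularity)
    return count / (K + count)


def pair_weight(r, n, R, M, alpha_r, alpha_nr, beta_r, beta_nr):
    """Smoothed log-odds weight from the second line of Eq. 27.

    The meaning of r, n, R and M is an assumption (see the lead-in).
    The hyperparameters must keep both products strictly positive.
    """
    numerator = (r + alpha_r - 1) * (M - n - R + r + beta_nr - 1)
    denominator = (n - r + alpha_nr - 1) * (R - r + beta_r - 1)
    return math.log(numerator / denominator)


def user_based_score(item_counts, pair_stats, R, M, k1, b, norm_popularity,
                     alpha_r, alpha_nr, beta_r, beta_nr):
    """Sum the weighted saturation terms over users k' who used the target item.

    item_counts: {k': c}, frequency counts of the target item per user k'.
    pair_stats:  {k': (r, n)}, assumed contingency counts for the pair (k', k).
    """
    score = 0.0
    for other_user, count in item_counts.items():
        if count <= 0:
            continue  # Eq. 27 sums only over users with a positive count
        r, n = pair_stats[other_user]
        weight = pair_weight(r, n, R, M, alpha_r, alpha_nr, beta_r, beta_nr)
        score += saturation(count, k1, b, norm_popularity) * weight
    return score
```

Ranking the candidate items for \(u_k\) then amounts to computing this score for each item and sorting in descending order. Only the order matters, because Eq. 27 is rank-equivalent to the full log-odds.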

3.3 Discussion

Previous studies on collaborative filtering, particularly memory-based approaches, make a distinction between user-based (Breese et al. 1998; Herlocker et al. 1999; Resnick et al. 1994) and item-based approaches (Deshpande and Karypis 2004; Sarwar et al. 2001). Our probabilistic relevance models were derived with an information retrieval view on collaborative filtering. They demonstrate that the user-based (relevance) and item-based (relevance) models are equivalent from a probabilistic point of view, since they have been derived from the same generative relevance model. The only difference lies in the choice of independence assumptions in the derivations, leading to the two different factorizations. Statistically, however, they are not equivalent, because the different factorizations lead to different probability estimations: in the item-based relevance model, the item-to-item relevancy is estimated, while in the user-based one, the user-to-user relevancy is required instead. We shall see shortly in our experiments that the probability estimation is one of the important factors influencing recommendation performance.

4 Experiments

4.1 Data sets

The standard data sets used in the evaluation of collaborative filtering algorithms (e.g. MovieLens and Netflix) are rating-based, and are therefore not suitable for testing our method, which uses implicit user profiles. This paper adopts two implicit user profile data sets.

The first data set comes from a well-known social music web site: \({\tt Last.FM}.\) It was collected from the play-lists of the users in the community by using a plug-in in the users’ media players (for instance, Winamp, iTunes, or XMMS). The plug-ins send the title (song name and artist name) of every song a user plays to the Last.FM server, which updates the user’s musical profile with the new song. For our experiments, the triple {userID, artistID, Freq} is used.

The second data set was collected from a well-known collaborative tagging Web site, \({\tt del.icio.us}.\) Unlike other studies focusing on directly recommending contents (Web sites), here we intend to find relevant tags on the basis of user profiles, as this is a crucial step in such systems. For instance, tag suggestion is needed to help users assign tags to new contents, and it is also useful when constructing a personalized “tag cloud” for the purpose of exploratory search (Wang et al. 2007). The Web site was crawled between May and October 2006. We collected a number of the most popular tags, found which users were using these tags, and then downloaded the whole profiles of these users. We extracted the triples {userID, tagID, Freq} from each of the user profiles. User IDs are randomly generated to keep the users anonymous. Table 3 summarizes the basic characteristics of the data sets (see Footnote 2).

Table 3 Characteristics of the test data sets

4.2 Experiment protocols

For 5-fold cross-validation, we randomly divided each data set into a training set (80% of the users) and a test set (20% of the users). Results are obtained by averaging over 5 different runs (samplings of the training/test sets). The training set was used to estimate the model. The test set was used for evaluating the accuracy of the recommendations for new users, whose user profiles are not in the training set. For each test user, 5, 10, or 15 of the user’s items were put into the user profile list. The remaining items were used to test the recommendations.
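The protocol above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names `split_users` and `split_profile` are hypothetical, and the random seeds and shuffling strategy are choices made here for reproducibility.

```python
import random


def split_users(user_ids, test_fraction=0.2, seed=0):
    """Randomly assign users to a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


def split_profile(items, n_given, seed=0):
    """Keep n_given items of a test user as the observed profile.

    The remaining items are held out as ground truth for evaluation.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_given], shuffled[n_given:]
```

Repeating `split_users` with five different seeds and averaging the resulting precision values would correspond to the five runs described above, with `n_given` set to 5, 10, or 15.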

In information retrieval, the effectiveness of the document ranking is commonly measured by precision and recall (Baeza-Yates and Ribeiro-Neto 1999). Precision measures the proportion of retrieved documents that are indeed relevant to the user’s information need, while recall measures the fraction of all relevant documents that are successfully retrieved. In the case of collaborative filtering, we are, however, only interested in examining the accuracy of the top-N recommended items, while paying less attention to finding all the relevant items. Thus, our experiments here only consider the recommendation precision, which measures the proportion of recommended items that are ground truth items. Note that the items in the profiles of the test user represent only a fraction of the items that the user truly liked. Therefore, the measured precision underestimates the true precision.
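As a concrete reading of this measure, the precision of a top-N list can be computed as below. The function name and the choice to divide by N (rather than by the number of items actually returned) are assumptions of this sketch.

```python
def precision_at_n(ranked_items, held_out_items, n):
    """Fraction of the top-n recommended items found among the held-out items."""
    top_n = ranked_items[:n]
    hits = sum(1 for item in top_n if item in held_out_items)
    return hits / n
```

Because the held-out items are only part of what the user actually likes, a recommended item that counts as a miss here may still be a good recommendation, which is why the measured value is a lower bound.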

4.3 Performance

We choose the state-of-the-art item ranking algorithms that have been discussed in Section 2.2 as our baselines. For the method proposed in (Deshpande and Karypis 2004), we adopt their implementation, the top-N suggest recommendation library (see Footnote 3), which is denoted as \({\tt SuggestLib}.\) We also implement the language modelling approach to collaborative filtering of (Wang et al. 2006) and denote this approach as \({\tt LM\hbox{-}LS}\), while its variant using Bayes’ smoothing (i.e., a Dirichlet prior) is denoted as \({\tt LM\hbox{-}BS}.\) To make a fair comparison, the parameters of the algorithms are set to the optimal ones.

We set the parameters of our two models to the optimal ones and compare them with these strong baselines. The item-based relevance model is denoted as \({\tt BM25\hbox{-}\tt {Item}}\) while the user-based relevance model is denoted as \({\tt BM25\hbox{-}\tt{User}}.\) Results are shown in Figs. 3 and 4 for different numbers of returned items. Let us first compare the performance of the \({\tt BM25\hbox{-}\tt{Item}}\) and \({\tt BM25\hbox{-}\tt{User}}\) models. For the \({\tt Last.FM}\) data set (Fig. 3), the item-based relevance model consistently performs better than the user-based relevance model. This confirms a previous observation that item-to-item similarity (relevancy) is in general more robust than user-to-user similarity (Sarwar et al. 2001). However, if we look at the \({\tt del.icio.us}\) data (Fig. 4), the performance gain from the item-based relevance model is no longer clear: we obtain mixed results, and the user-based model even outperforms the item-based one when the number of items in the user preferences is set to 15 (see Fig. 4c). We think this is because the characteristics of the data set play an important role in the probability estimations in the models. In the \({\tt Last.FM}\) data set, the number of users is larger than the number of items (see Table 3). This means that we have more observations from the user side about item-to-item relevancy, and fewer observations from the item side about user-to-user relevancy. Thus, in the \({\tt Last.FM}\) data set, the probability estimation for the item-based relevance model is more reliable than that of the user-based relevance model. But in the \({\tt del.icio.us}\) data set (see Table 3), the number of items is larger than the number of users. Thus we have more observations about user-to-user relevancy from the item side, causing a significant improvement for the user-based relevance model.

Fig. 3 Precision of different methods in the \({\tt Last.FM}\) data set

Fig. 4 Precision of different methods in the \({\tt del.icio.us}\) data set

Since the item-based relevance model in general outperforms the user-based relevance model, we next compare the item-based relevance model with the other methods (shown in Tables 4 and 5). From the tables, we can see that the item-based relevance model performs consistently better than the \({\tt SuggestLib}\) method over all the configurations. A Wilcoxon signed-rank test (Hull 1993) is used to verify the significance. We also observe that in most of the configurations our item-based model significantly outperforms the language modelling approaches, both the linear smoothing and the Bayesian smoothing variants. We believe that the effectiveness of our model is due to the fact that, unlike the alternatives, it naturally integrates frequency counts and probability estimation of non-relevance into the ranking formula.

Table 4 Comparison with the other approaches. Precision is reported in the \({\tt Last.FM}\) data set
Table 5 Comparison with the other approaches. Precision is reported in the \({\tt del.icio.us}\) data set
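For readers unfamiliar with the Wilcoxon signed-rank test used above, the following is a minimal pure-Python sketch of the statistic and an exact two-sided p-value for paired scores (for example, per-run precision of two methods). It is an assumption of this sketch that zero differences are dropped and tied magnitudes receive average ranks; the exact enumeration is only practical for a small number of pairs, and the paper does not state which variant or software was used.

```python
from itertools import product


def signed_rank_statistic(xs, ys):
    """Return (W_plus, W_minus, ranks) for paired samples xs and ys."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, ranks


def exact_two_sided_p(xs, ys):
    """Exact two-sided p-value by enumerating all sign assignments."""
    w_plus, w_minus, ranks = signed_rank_statistic(xs, ys)
    observed = min(w_plus, w_minus)
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= observed:
            extreme += 1
    return extreme / 2 ** len(ranks)
```

For larger samples, a library routine with a normal approximation is the usual choice.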

4.4 Parameter estimation

This section tests the sensitivity of the parameters, using the \({\tt del.icio.us}\) data set. Recall that for both the item-based relevance model (shown in Eq. 10) and the user-based relevance model (shown in Eq. 27), we have the frequency smoothing parameter \(k_1\) (and \(b\)) or \(k_3\), and the co-occurrence smoothing parameters \(\alpha\) and \(\beta\). We first test the sensitivity of the frequency smoothing parameters. Figure 5 shows recommendation precision against the parameters \(k_1\) and \(b\) of the user-based relevance model, while Fig. 6 shows recommendation precision when varying the parameter \(k_3\) of the item-based relevance model. The optimal values in the figures demonstrate that both the frequency smoothing parameters (\(k_1\) and \(k_3\)) and the length normalization parameter \(b\), inspired by the BM25 formula, indeed improve the recommendation performance. We also observe that the optimal values of these parameters vary little across data sets and sparsity setups.

Fig. 5 Parameters in the user-based relevance model

Fig. 6 The smoothing parameter of frequency counts \(k_3\) in the item-based relevance model
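A sensitivity test of this kind amounts to a sweep over a parameter grid. The sketch below assumes a hypothetical `evaluate(k1, b)` callback that returns the mean recommendation precision for one parameter setting; neither the callback nor the grid values come from the paper.

```python
from itertools import product


def grid_search(evaluate, k1_values, b_values):
    """Return (best_score, best_k1, best_b) over the Cartesian grid."""
    best = None
    for k1, b in product(k1_values, b_values):
        score = evaluate(k1, b)
        if best is None or score > best[0]:
            best = (score, k1, b)  # keep the first setting reaching the maximum
    return best
```

Recording every `(k1, b, score)` triple instead of only the best one would give the data needed for a surface plot over the two parameters.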

Next we fix the frequency smoothing parameters to the optimal ones and test the co-occurrence smoothing parameters for both models. Figures 7 and 8 plot the recommendation precision against the smoothing parameters. More precisely, Figs. 7a and 8a plot the smoothing parameter for the relevance part, \(v_1=\alpha_r-1\), while Figs. 7b and 8b plot that of the non-relevance or absence parts; all of the latter are set to be equal (\(v_2=\alpha_{\bar{r}}-1=\beta_r -1=\beta_{\bar r}-1\)) in order to minimize the number of parameters while still retaining comparable performance. From the figures, we can see that the optimal smoothing parameters (pseudo counts) of the relevance part \(v_1\) are relatively small compared to those of the non-relevance part. For the user-based relevance model, the optimal pseudo counts of the non-relevance estimations are in the range shown in Fig. 7b, while for the item-based relevance model they are in the range of [50,100] (Fig. 8b). This is because the non-relevance estimation is not as reliable as the relevance estimation, and thus more smoothing is required.

Fig. 7 The relevance/non-relevance smoothing parameters in the user-based relevance model

Fig. 8 The relevance/non-relevance smoothing parameters in the item-based relevance model
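To make the role of the two pseudo counts explicit, the co-occurrence weight of the user-based model in Eq. 27 can be rewritten with the tied parameters of this section. Substituting \(v_1=\alpha_r-1\) and \(v_2=\alpha_{\bar{r}}-1=\beta_r-1=\beta_{\bar r}-1\) is a restatement of Eq. 27 under that tying, not an additional result:

$$ \hbox{ln}\frac{(r_{k',k}+\alpha_r-1)(M-n_{k'}-R_{k}+r_{k',k}+\beta_{\bar{r}}-1)} {(n_{k'}-r_{k',k}+\alpha_{\bar{r}}-1)(R_{k}-r_{k',k}+\beta_r -1)} = \hbox{ln}\frac{(r_{k',k}+v_1)(M-n_{k'}-R_{k}+r_{k',k}+v_2)} {(n_{k'}-r_{k',k}+v_2)(R_{k}-r_{k',k}+v_2)} $$

Written this way, \(v_1\) is added only to the count in the relevance cell, whereas \(v_2\) is added to each of the three remaining cells, which is consistent with the relevance part needing less smoothing than the non-relevance and absence parts.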

5 Conclusions

This paper proposed a probabilistic item ranking framework for collaborative filtering, which is inspired by the classic probabilistic relevance model of text retrieval and its variants (Robertson and Sparck Jones 1976; Robertson and Walker 1994; Sparck Jones et al. 2000). We have derived two different models in the relevance framework in order to generate top-N item recommendations. We conclude from the experimental results that the proposed models are indeed effective, and significantly improve the performance of the top-N item recommendations.

In the current setting, we fix a threshold when treating frequency counts as relevance observations. In the future, we may also consider graded relevance with respect to the number of times a user played an item. To do this, we may weight (by sampling) the importance of the user profiles according to the number of times the user played/reviewed an item when we construct the contingency table. In the current models, the hyperparameters are obtained by cross-validation. In the future, it is worthwhile investigating the evidence approximation framework (Bishop 2006), by which the hyperparameters can be estimated from the whole collection; alternatively, we can take a full Bayesian approach that integrates over the hyperparameters and the model parameters by adopting variational methods (Jordan 1999).

It has been seen in this paper that relevance is a good concept to explain the correspondence between user interest and information items. We have set up a close relationship between the probabilistic models of text retrieval and those of collaborative filtering. This provides a flexible framework in which to try out more of the techniques that have been used in text retrieval on the related problem of collaborative filtering. For instance, relevance observations can be easily incorporated in the framework once we have relevance feedback from users. An interesting observation is that, unlike in text retrieval, relevance feedback for a given user in collaborative filtering does not depend on this user’s “query” (a user profile) only. It instead has a rather global impact, and affects the representation of the whole collection; relevance feedback from one user could influence the ranking order for the other users. It is also worthwhile investigating query expansion by including more relevant items as query items or re-calculating (re-constructing) the contingency table according to the relevance information.

Finally, a combination of the two relevance models is of interest (Wang et al. 2008, 2006). This has some analogies with the “unified model” idea in information retrieval (Robertson et al. 1982). However, there are also some differences: in information retrieval, based on explicit features of items and explicit queries, simple user relevance feedback relates to the current query only, and a unified model is required to achieve the global impact which we have already identified in the present (non-unified) models for collaborative filtering. These subtle differences make the exploration of the unified model ideas particularly attractive.