1 Introduction and background

Probabilistic topic models, such as the popular Latent Dirichlet Allocation (LDA) (Blei et al. 2003), assume that each collection of documents exhibits a hidden thematic structure. The intuition is that each document may exhibit multiple topics, where each topic is characterized by a probability distribution over the words of a fixed-size dictionary. This representation of the data in the latent-topic space offers several advantages from a modeling perspective, and topic modeling techniques have been applied in several contexts. Example scenarios range from traditional problems (such as dimensionality reduction and classification) to novel areas (such as the generation of personalized recommendations).

Traditional LDA-based approaches propose a data generation process that is based on a “bag-of-words” assumption, i.e. one in which the order of the items in a document can be neglected. This assumption fits textual data, where probabilistic topic models are able to detect recurrent co-occurrence patterns, which are used to define the topic space. However, there are several real-world applications where data can be “naturally” interpreted as sequences, such as biological data, web navigation logs, customer purchase histories, etc. Ignoring the intrinsic sequentiality of the data may result in poor modeling: according to the bag-of-words assumption, co-occurrences are modeled independently for each word, via a probability distribution over the dictionary in which some words exhibit a higher likelihood of appearing than others. Sequential data, on the other hand, may express causality and dependency, and different topics can be used to characterize different dependency likelihoods. The focus here is the context in which the current user acts and expresses preferences, i.e., the environment, characterized by side information, in which the observations hold. Our claim is that the context can be enriched by the sequential information, and that the latter allows a more refined modeling. In practice, a sequence expresses a context which provides valuable information for the modeling.

The above observation is particularly noteworthy when data express preferences made by users, and the ultimate objective is to model a user’s behavior in order to provide accurate recommendations. The analysis of sequential patterns has important applications in modern recommender systems (RSs), which increasingly focus on an accurate balance between personalization and contextualization techniques. For example, in Internet-based streaming services for music or video (such as Last.fm and Videolectures.net), the context of the user interaction with the system can easily be interpreted by analyzing the content previously requested. The assumption here is that the current item (and/or its genre) influences the next choice of the user. In particular, if a specific user is in the “mood” for classical music (as observed in the current choice), it is unlikely that the immediately subsequent choice will depart from the aforementioned mood in favor of a song of a different genre. Being able to capture such properties and to exploit them in the recommendation strategy can greatly improve the accuracy of the recommendations.

Recommender systems have greatly benefited from probabilistic modeling techniques based on LDA. Recent works have in fact empirically shown that probabilistic latent topic models represent the state of the art in the generation of accurate personalized recommendations (Barbieri and Manco 2011; Barbieri et al. 2011, 2012). More generally, probabilistic techniques offer some renowned advantages: notably, they can be tuned to optimize a variety of loss functions; moreover, optimizing the likelihood allows modeling a distribution over rating values, which can be used to determine the confidence of the model in providing a recommendation; finally, they allow prior knowledge to be included in the generative process, thus enabling a more effective modeling of the underlying data distribution. Notably, when preferences are implicitly modeled through selection (that is, when no rating information is available), simple LDA best models the probability that an item is actually selected by a given user (Barbieri and Manco 2011).

Following the research direction outlined above, in this paper we study the effects of “contextual” information in the probabilistic modeling of preference data. We focus on the case where the context can be inferred from the analysis of sequence data, and we propose some topic models which explicitly make use of dependency information. The issue has been dealt with in related papers (see, e.g., Wallach 2006). Here, we summarize and extend the approaches in the literature, by covering different ways of modeling dependency within preference data. Furthermore, we concentrate on the effects of such modeling on recommendation accuracy, as it explicitly reflects accurate modeling of user behavior.

In short, the contributions of the paper can be summarized as follows.

  1. We propose a unified probabilistic framework to model dependency in preference data, and instantiate the framework according to different assumptions on the sequentiality of the underlying generative process.

  2. We study and experimentally compare the proposed models, and highlight their relative advantages and weaknesses.

  3. We study how to adapt the proposed frameworks to support a recommendation scenario. In particular, for each of the proposed models, we provide the corresponding ranking functions that can be used to generate personalized and context-aware recommendation lists.

  4. We finally show that the proposed sequential modeling of preference data better captures the underlying data, as it allows more accurate recommendations in terms of precision and recall.

The paper is structured as follows. In Sect. 2 we introduce sequential modeling according to different dependency assumptions, and specify in Sect. 3 the corresponding item ranking functions for supporting recommendations. The experimental evaluation of the proposed approaches is then presented in Sect. 4, in which we measure the performance of the approaches in a recommendation scenario. In Sect. 5 we qualitatively compare the models studied in this paper with the current literature. Section 6 concludes the paper with a summary of the findings and a discussion of possible extensions.

2 Modeling sequence data

In a general setting, we consider a set \(\mathcal{I}=\{1, \ldots, N\}\) of tokens, representing the vocabulary of possible events that can be observed. Example events are words that can be observed in a document, or items that can be purchased by a customer. A corpus \(\mathbf{W}=\{\mathbf{w}_1,\ldots,\mathbf{w}_M\}\) is a collection of traces, where \(\mathbf{w}_{d} = [ w_{d,1} . w_{d,2} . \cdots. w_{d,N_{d}-1}.w_{d,N_{d}} ]\) is the sequence of tokens for trace d, and \(w_{d,j}\in\mathcal{I}\). The set \(\mathcal{I}_{d}\subseteq\mathcal{I}\) denotes all the tokens in \(\mathbf{w}_d\). We also assume that each token is characterized by a latent factor, called topic, which triggers the underlying event. That is, a topic set \(\mathbf{Z}=\{\mathbf{z}_1,\ldots,\mathbf{z}_M\}\) is associated with the data, where, again, \(\mathbf{z}_{d} = [z_{d,1} . z_{d,2} . \cdots. z_{d,N_{d}-1}.z_{d,N_{d}}]\) is a latent topic sequence and \(z_{d,j}\in\{1,\ldots,K\}\) is the latent topic associated with token \(w_{d,j}\). By assuming that Φ and Θ are the distribution functions governing the likelihood of W and Z (with respective priors β and α), we can express the complete likelihood as:

$$ P(\mathbf{W},\mathbf{Z},\boldsymbol{\varTheta},\boldsymbol{\varPhi}| \boldsymbol{\alpha}, \boldsymbol{\beta}) = P(\mathbf{W}|\mathbf{Z}, \boldsymbol{\varPhi})\, P(\boldsymbol{\varPhi}|\boldsymbol{\beta})\, P(\mathbf{Z}|\boldsymbol{\varTheta})\, P(\boldsymbol{\varTheta}|\boldsymbol{\alpha}) $$
(1)

where P(Φ|β) and P(Θ|α) are specified according to the modeling assumptions. In particular, in the standard LDA setting where all tokens are independent and exchangeable, we have:

$$ \begin{aligned}[c] P(\mathbf{w}_d| \mathbf{z}_d, \boldsymbol{\varPhi}) & = \prod _{j=1}^{N_d} P(w_{d,j}|z_{d,j}, \boldsymbol{\varPhi}) ,\qquad P(w|k,\boldsymbol{\varPhi}) = \prod _{s=1}^N \varphi_{k,s}^{\delta _{s,w}} \\ P(\mathbf{z}_d|\boldsymbol{\theta}_d) & = \prod _{j=1}^{N_d} P(z_{d,j}|\boldsymbol{ \theta}_d) ,\qquad P(z|\boldsymbol{\theta}_d) = \prod _{k=1}^K \vartheta _{d,k}^{\delta_{k,z}} \\ P(\boldsymbol{\varTheta}|\boldsymbol{\alpha}) & = \prod _{d=1}^{M} P(\boldsymbol{\theta}_d | \boldsymbol{\alpha}) ,\qquad P(\boldsymbol{\theta}_d |\boldsymbol{\alpha}) = \frac{\varGamma(\sum^K_{k=1}\alpha_k )}{\prod^K_{k=1}\varGamma (\alpha_k)} \prod^K_{k=1} \vartheta^{\alpha_k -1}_{d,k} \\ P(\boldsymbol{\varPhi}|\boldsymbol{\beta}) & = \prod_{k=1}^{K} P(\boldsymbol{\varphi}_k |\boldsymbol{\beta}_k) ,\qquad P( \boldsymbol{\varphi}_k |\boldsymbol{\beta}_k) = \frac {\varGamma(\sum^N_{s=1}\beta_{k,s}) }{\prod^N_{s=1} \varGamma (\beta_{k,s}) } \prod^N_{s=1} \varphi^{\beta_{k,s}-1}_{k,s} \end{aligned} $$
(2)

Here, δ a,b represents the Kronecker delta function, returning 1 when a=b and 0 otherwise. Figure 1(a) graphically describes the generative process. As usual, the joint topic-data probability can be obtained by marginalizing over the Φ and Θ components:

$$P(\mathbf{W},\mathbf{Z}|\boldsymbol{\alpha}, \boldsymbol{\beta}) = \int _{\boldsymbol{\varPhi}} \int_{\boldsymbol{\varTheta}} P(\mathbf{W}|\mathbf{Z}, \boldsymbol{\varPhi}) P(\boldsymbol{\varPhi }|\boldsymbol{\beta}) P(\mathbf{Z}|\boldsymbol{ \varTheta}) P(\boldsymbol{\varTheta}|\boldsymbol {\alpha}) d\boldsymbol{\varPhi} d\boldsymbol{ \varTheta} $$
Fig. 1 Graphical models
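To make the generative semantics of Eq. (2) concrete, the following sketch samples a small synthetic corpus from the LDA process. It is a minimal illustration, assuming symmetric hyper parameters, a Poisson-distributed trace length (as in the generative descriptions given later) and illustrative names such as sample_lda_corpus.

import numpy as np

def sample_lda_corpus(M=100, N=50, K=5, alpha=0.1, beta=0.01, xi=20, seed=0):
    rng = np.random.default_rng(seed)
    # token distributions phi_k ~ Dirichlet(beta), one per topic
    Phi = rng.dirichlet(np.full(N, beta), size=K)
    corpus, topics = [], []
    for d in range(M):
        theta_d = rng.dirichlet(np.full(K, alpha))    # trace-topic mixture
        n_d = max(1, rng.poisson(xi))                 # trace length
        z_d = rng.choice(K, size=n_d, p=theta_d)      # latent topics z_{d,j}
        w_d = [rng.choice(N, p=Phi[z]) for z in z_d]  # tokens w_{d,j}
        corpus.append(w_d); topics.append(z_d)
    return corpus, topics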

In the following, we model further assumptions on both w d and z d , which explicitly reject the exchangeability assumption and instead rely on the idea of sequential dependency. We concentrate on three basic models, which in a sense subsume the core of sequential modeling. Here, a sequence can be modeled as a stationary first-order Markov chain:

  • A Markovian process naturally models the sequential nature of the data, where dependencies among past and future tokens reflect changes over time that are still governed by similar features;

  • The chain is stationary, as a fixed number of tokens is likely to frequently appear in sequences;

  • The order of the chain is 1 because it is more likely that two subsequent tokens share some features than two tokens distant in time.

We now analyze each model in turn.

Token-bigram model

In this model, we assume that w d represents a first-order Markov chain, where each token w d,j depends on the most recently observed token w d,j−1 . This is essentially the same model proposed in Wallach (2006), Cadez et al. (2000), and the probability of a trace changes from Eq. (2) to

$$ P(\mathbf{w}_d|\mathbf{z}_d, \boldsymbol{\varPhi}) = \prod ^{N_d}_{j=1} P(w_{d,j}| w_{d,j-1}, z_{d,j}, \boldsymbol{\varPhi}) $$
(3)

In practice, a token w d,j is generated according to a multinomial distribution \(\boldsymbol{\phi}_{z_{d,j},w_{d,j-1}}\) which depends on both the current topic z d,j and the previous token w d,j−1. (Notice that when j=1, the previous token is empty and the multinomial resolves to \(\boldsymbol{\phi}_{z_{d,j}}\), representing the initial state of the Markov chain.) The conjugate prior for ϕ can be defined as:

$$ P(\boldsymbol{\varPhi}|\boldsymbol{\beta}) = \prod^K_{k=1} \prod^N_{r=0} P(\boldsymbol{ \phi}_{k,r}|\boldsymbol{\beta}_{k,r}) = \prod^K_{k=1} \prod ^N_{r=0} \frac{\varGamma(\sum^N_{s=1}\beta _{k,r.s}) }{\prod^N_{s=1} \varGamma(\beta_{k,r.s}) } \prod ^N_{s=1} \varphi^{\beta_{k,r.s}-1}_{k,r.s} $$

Since the Markovian process does not affect the topic sampling, both P(z d |θ d ) and P(Θ|α) are defined as in Eq. (2). The generative model, depicted in Fig. 1(b), can be described as follows:

  • For each trace d∈{1,…,M} sample the topic-mixture components θ d ∼ Dirichlet(α) and the sequence length N d ∼ Poisson(ξ)

  • For each topic k∈1,…,K and token r∈{0,…,N}

    • sample token selection components ϕ k,r ∼ Dirichlet(β k,r )

  • For each trace d∈{1,…,M} and j∈{1,…,N d }

    • sample a topic z d,j ∼ Discrete(θ d )

    • sample a token \(w_{d,j} \sim \mathit{Discrete}(\boldsymbol{\phi}_{z_{d,j},w_{d,j-1}})\)

Notice that we explicitly assume the existence of a family {β k,r } with k={1,…,K} and r={0,…,N} of Dirichlet coefficients, and of a special token r=0 which represents the previous token of the first token of each trace. As shown in Wallach (2006), different modeling strategies (e.g., shared priors β k,r.s =β s ) can affect the accuracy of the model.
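For concreteness, the token-bigram generative process can be sketched as follows; tokens are coded 1..N so that r=0 can serve as the start-of-trace token, and the symmetric hyper parameters and function names are illustrative assumptions.

import numpy as np

def sample_token_bigram_corpus(M=100, N=50, K=5, alpha=0.1, beta=0.01, xi=20, seed=0):
    rng = np.random.default_rng(seed)
    # one token distribution per (topic k, previous token r), with r in {0,...,N}
    Phi = rng.dirichlet(np.full(N, beta), size=(K, N + 1))
    corpus = []
    for d in range(M):
        theta_d = rng.dirichlet(np.full(K, alpha))   # theta_d ~ Dirichlet(alpha)
        n_d = max(1, rng.poisson(xi))                # N_d ~ Poisson(xi)
        w_d, prev = [], 0                            # r = 0: start of trace
        for _ in range(n_d):
            z = rng.choice(K, p=theta_d)             # z_{d,j} ~ Discrete(theta_d)
            w = rng.choice(N, p=Phi[z, prev])        # w_{d,j} ~ Discrete(phi_{z, prev})
            w_d.append(w + 1)                        # tokens coded 1..N
            prev = w + 1
        corpus.append(w_d)
    return corpus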

By algebraic manipulations, the joint token-topic distribution can be simplified into:

$$ P(\mathbf{W},\mathbf{Z}|\alpha, \beta)= \Biggl(\prod _{d=1}^M \frac{\Delta (\mathbf{n}_{d,(\cdot)} + \boldsymbol{\alpha } )}{ \Delta(\boldsymbol{\alpha})} \Biggr) \Biggl(\prod _{k=1}^K\prod_{r =0}^N \frac{\Delta (\mathbf{n}^{k}_{(\cdot),r} + \boldsymbol{\beta }_{k,r} )}{ \Delta(\boldsymbol{\beta}_{k,r})} \Biggr) $$
(4)

The latter is the basis for developing a stochastic EM strategy (Bishop 2006, Sect. 11.1.6), where the E step consists of a collapsed Gibbs sampling procedure (Heinrich 2008; Bishop 2006) for estimating Z, and the M step estimates both the predictive distributions Θ and Φ and the hyper parameters α and β given Z. Within Gibbs sampling, topics are iteratively sampled according to the probability:

$$ P(z_{d,j} = k | \mathbf{Z}_{-(d,j)},\mathbf{W}) \propto \bigl( n^k_{d,(\cdot)} + \alpha_k -1 \bigr) \cdot \frac {n^k_{(\cdot),r.s} + \beta_{k,r.s} -1}{ \sum_{s'=1}^N \bigl( n^k_{(\cdot),r.s'} + \beta_{k,r.s'} \bigr) -1} $$
(5)

relative to the topic to associate with the j-th token of the d-th trace, where w d,j−1 =r and w d,j =s.

Given Z, the parameters Θ and Φ can be estimated according to the following equations:

$$ \vartheta_{d,k} = \frac {n^{k}_{d,(\cdot)} + \alpha_{k}}{ \sum_{k'=1}^K (n^{k'}_{d,(\cdot)} + \alpha_{k'})} ,\qquad \varphi_{k,r.s} = \frac {n^{k}_{(\cdot),r.s} + \beta_{k,r.s}}{ \sum_{s'=1}^N (n^{k}_{(\cdot),r.s'} + \beta_{k,r.s'}) } $$
(6)

The estimation of the hyper parameters is discussed in Sect. 2.2.
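The following sketch puts Eqs. (5) and (6) together into a compact collapsed Gibbs sampler for the token-bigram model. It assumes symmetric priors, tokens coded 1..N (with r=0 for the start of a trace) and illustrative names; removing the current assignment from the counts before sampling plays the role of the −1 terms in Eq. (5).

import numpy as np

def gibbs_token_bigram(corpus, N, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    M = len(corpus)
    ndk = np.zeros((M, K))            # n^k_{d,(.)}
    nkrs = np.zeros((K, N + 1, N))    # n^k_{(.),r.s}
    nkr = np.zeros((K, N + 1))        # row sums of nkrs
    Z = []
    for d, w_d in enumerate(corpus):  # random initialization of the topics
        z_d = rng.integers(K, size=len(w_d))
        Z.append(z_d)
        prev = 0
        for w, z in zip(w_d, z_d):
            ndk[d, z] += 1; nkrs[z, prev, w - 1] += 1; nkr[z, prev] += 1
            prev = w
    for _ in range(iters):
        for d, w_d in enumerate(corpus):
            prev = 0
            for j, w in enumerate(w_d):
                z = Z[d][j]           # remove the current assignment
                ndk[d, z] -= 1; nkrs[z, prev, w - 1] -= 1; nkr[z, prev] -= 1
                # Eq. (5): topic proportions times smoothed bigram emission
                p = (ndk[d] + alpha) * (nkrs[:, prev, w - 1] + beta) / (nkr[:, prev] + N * beta)
                z = rng.choice(K, p=p / p.sum())
                Z[d][j] = z
                ndk[d, z] += 1; nkrs[z, prev, w - 1] += 1; nkr[z, prev] += 1
                prev = w
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)   # Eq. (6)
    phi = (nkrs + beta) / (nkrs + beta).sum(axis=2, keepdims=True)
    return Z, theta, phi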

Topic-bigram model

A different approach can be taken by assuming that sequentiality regards topics, rather than tokens. That is, we can still consider tokens independent of each other, each related to a latent topic. However, since topics represent the ultimate factors underlying a token appearing in the sequence, correlation between topics can better model the evolution of the underlying themes. Assuming a first-order Markovian dependency, the probability of a sequence of latent topics in Eq. (2) can be redefined as:

$$ P(\mathbf{z}_d|\boldsymbol{\theta}_d) = \prod ^{N_d}_{j=1} P(z_{d,j}|z_{d,j-1}, \boldsymbol{\theta}_d) $$
(7)

The difference here is in the distribution generating z d,j , which is a multinomial \(\boldsymbol{\theta}_{d,z_{d,j-1}}\) parameterized by both a trace d and a previously sampled topic z d,j−1. The conjugate Dirichlet distributions can be expressed as:

$$ P(\boldsymbol{\varTheta}|\boldsymbol{\alpha}) = \prod^M_{d=1} \prod^K_{h=0} \frac{\varGamma(\sum^K_{k=1}\alpha_{h.k} )}{\prod^K_{k=1}\varGamma(\alpha_{h.k})} \prod ^K_{k=1} \vartheta^{\alpha _{h.k} -1}_{d,h.k} $$
(8)

P(w d |z d ,Φ) and P(Φ|β) are still defined as in Eq. (2). Again, the generative process is shown in Fig. 1(c) and described below.

  • For each trace d∈{1,…,M} and topic h∈{0,…,K} sample topic-mixture components θ d,h ∼ Dirichlet(α h ) and the sequence length N d ∼ Poisson(ξ)

  • For each topic k=1,…,K

    • sample token selection components φ k ∼ Dirichlet(β k )

  • For each d∈{1,…,M} and j∈{1,…,N d } sequentially:

    • sample a topic \(z_{d,j} \sim \mathit{Discrete}(\boldsymbol{\theta }_{d,z_{d,j-1}})\)

    • sample a token \(w_{d,j} \sim\mathit{Discrete}(\boldsymbol{\phi }_{z_{d,j}})\)

Here, h=0 is a special topic that precedes the first topic of each trace.
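A sketch of the topic-bigram generative process, analogous to the one given for the token-bigram model; the (K+1)×K transition matrix per trace, the symmetric hyper parameters and the names are illustrative assumptions.

import numpy as np

def sample_topic_bigram_corpus(M=100, N=50, K=5, alpha=0.1, beta=0.01, xi=20, seed=0):
    rng = np.random.default_rng(seed)
    Phi = rng.dirichlet(np.full(N, beta), size=K)      # phi_k over tokens
    corpus = []
    for d in range(M):
        # one topic-transition row theta_{d,h} per previous topic h in {0,...,K}
        Theta_d = rng.dirichlet(np.full(K, alpha), size=K + 1)
        n_d = max(1, rng.poisson(xi))
        w_d, h = [], 0                                 # h = 0: start-of-trace topic
        for _ in range(n_d):
            z = rng.choice(K, p=Theta_d[h])            # z_{d,j} | z_{d,j-1}
            w_d.append(rng.choice(N, p=Phi[z]))        # w_{d,j} ~ Discrete(phi_z)
            h = z + 1                                  # shift: topics stored as 1..K
        corpus.append(w_d)
    return corpus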

The joint token-topic distribution becomes:

$$ P(\mathbf{W},\mathbf{Z}|\alpha, \beta)= \Biggl(\prod _{d=1}^M \prod_{h=0}^K \frac{\Delta (\mathbf{n}^{h}_{d,(\cdot)} + \boldsymbol{\alpha }_{h} )}{ \Delta(\boldsymbol{\alpha}_{h})} \Biggr) \Biggl( \prod_{k=1}^K \frac{\Delta (\mathbf{n}^{k}_{(\cdot)} + \boldsymbol{\beta }_k )}{ \Delta(\boldsymbol{\beta}_k)} \Biggr) $$
(9)

and the corresponding collapsed Gibbs sampler works by iteratively sampling a topic k relative to token w d,j =s of trace d according to the following:

(10)

Also, the multinomial parameters can be estimated according to the following equations:

$$ \vartheta_{d,h.k} = \frac {n^{h.k}_{d,(\cdot)} + \alpha_{h.k}}{ \sum_{k'=1}^K (n^{h.k'}_{d,(\cdot)} + \alpha_{h.k'})} ,\qquad \varphi_{k,s} = \frac {n^{k}_{(\cdot),s} + \beta_{k,s}}{ \sum_{s'=1}^N (n^{k}_{(\cdot),s'} + \beta_{k,s'}) } $$
(11)

Token-bitopic model

In the last model, we still relate tokens to past events. However, the events we are interested in are the recent latent topics which triggered the past tokens. The generative model is shown in Fig. 1(d). Again, the topic selection probability is defined as in Eq. (2), whereas the token selection probability can be defined in terms of the multinomial \(\boldsymbol{\phi}_{z_{d,j},z_{d,j-1}}\) (and its related conjugate):

$$ P(\mathbf{w}_d|\mathbf{z}_d, \boldsymbol{\varPhi}) = \prod ^{N_d}_{j=1} P(w_{d,j}| z_{d,j}, z_{d,j-1}, \boldsymbol{\varPhi}) $$
(12)
$$ P(\boldsymbol{\varPhi}|\boldsymbol{\beta}) = \prod^K_{h=0} \prod^K_{k=1} P(\boldsymbol{\phi}_{h.k}|\boldsymbol{\beta}_{h.k}) = \prod^K_{h=0} \prod^K_{k=1} \frac{\varGamma(\sum^N_{s=1}\beta_{h.k,s}) }{\prod^N_{s=1} \varGamma(\beta_{h.k,s}) } \prod^N_{s=1} \varphi^{\beta_{h.k,s}-1}_{h.k,s} $$
(13)

These assumptions are at the basis of the following generative process.

  • For each trace d∈{1,…,M} sample topic-mixture components θ d ∼ Dirichlet(α) and the sequence length N d ∼ Poisson(ξ)

  • For each topic pair h.k, where h∈{0,…,K} and k∈{1,…,K}

    • sample token selection components φ h.k ∼ Dirichlet(β h.k )

  • For each d∈{1,…,M} and j∈{1,…,N d } in sequence:

    • sample a topic z d,j ∼ Discrete(θ d )

    • sample a token \(w_{d,j} \sim\mathit{Discrete}(\boldsymbol{\phi }_{z_{d,j}, z_{d,j-1}})\)

Once again, h=0 is the special topic which precedes the first topic of each trace. As usual, by algebraic manipulations, the joint token-topic distribution can be expressed as

$$ P(\mathbf{W},\mathbf{Z}|\alpha, \beta)= \Biggl(\prod _{d=1}^M \frac{\Delta (\mathbf{n}_{d,(\cdot)} + \boldsymbol{\alpha } )}{ \Delta(\boldsymbol{\alpha})} \Biggr) \Biggl(\prod _{h=0}^K\prod_{k=1}^K \frac{\Delta (\mathbf{n}^{h.k}_{(\cdot)} + \boldsymbol{\beta }_{h.k} )}{ \Delta(\boldsymbol{\beta}_{h.k})} \Biggr) $$
(14)

which induces the following inference steps:

E step::

for the token w d,j =s at position j in trace d, sample a topic k according to the following probability:

(15)
M Step::

estimate multinomial probabilities according to the following equations:

$$ \vartheta_{d,k} = \frac {n^{k}_{d,(\cdot)} + \alpha_{k}}{ \sum_{k'=1}^K (n^{k'}_{d,(\cdot)} + \alpha_{k'})} ,\qquad \varphi_{h.k,s} = \frac {n^{h.k}_{(\cdot),s} + \beta_{h.k,s}}{ \sum_{s'=1}^N (n^{h.k}_{(\cdot),s'} + \beta_{h.k,s'}) } $$
(16)

2.1 Log-likelihoods

A crucial component in the inference and estimation steps is the computation of the data likelihood. In general, the likelihood function is defined as:

$$ \begin{aligned}[t] P(\mathbf{W}) & = \prod _{d=1}^M P(\mathbf{w}_d) = \prod _{d=1}^M P(w_{d,1}. \cdots. w_{d,N_d}) \\ & = \prod_{d=1}^M \sum _{k=1}^K P(w_{d,1}. \cdots. w_{d,N_d},z_{d,N_d} = k)\end{aligned} $$

Now, each model differs in the way the \(P(w_{d,1}. \cdots. w_{d,N_{d}},z_{d,N_{d}})\) component is defined.

Token-bigram

Bayes' rule and the first-order Markov assumption over tokens simplify the above probability into:

$$ \log P(\mathbf{W}) = \sum_{d=1}^M \log \Biggl(\prod_{j=1}^{N_d}\sum _k \vartheta_{d,k} \varphi_{k,w_{d,j-1}.w_{d,j}} \Biggr) $$
(17)

Topic-bigram

By algebraic manipulations (see Bishop 2006, Sect. 13.2 for details), we obtain

The result is a recursive equation which can be simplified into the following γ function:

$$ \gamma_k(\mathbf{w}_{d};1) = \varphi_{k,w_{d,1}}; \qquad \gamma_k(\mathbf{w}_d; j) = \varphi_{k,w_{d,j}} \sum_h \gamma _h(\mathbf{w}_{d};j-1) \vartheta_{d,h. k} $$

Substituting into the likelihood yields:

$$ \log P(\mathbf{W}) = \sum_{d=1}^M \log \biggl(\sum_k \gamma _k( \mathbf{w}_d;N_d) \biggr) $$
(18)
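The γ recursion and Eq. (18) translate directly into code. The sketch below assumes Theta[d] is a (K+1)×K matrix of transition probabilities θ_{d,h.k} (row 0 reserved for the start-of-trace topic, which the recursion as written does not use) and Phi a K×N matrix of token probabilities; tokens are coded 0..N−1, and all names are illustrative.

import numpy as np

def topic_bigram_loglik(corpus, Theta, Phi):
    loglik = 0.0
    for d, w_d in enumerate(corpus):
        gamma = Phi[:, w_d[0]].copy()                  # gamma_k(w_d; 1)
        for j in range(1, len(w_d)):
            # gamma_k(w_d; j) = phi_{k, w_{d,j}} * sum_h gamma_h(w_d; j-1) theta_{d,h.k}
            gamma = Phi[:, w_d[j]] * (gamma @ Theta[d][1:, :])
        loglik += np.log(gamma.sum())                  # Eq. (18)
    return loglik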

Token-bitopic

The term \(P(w_{d,1}. \cdots. w_{d,N_{d}}|z_{d,N_{d}}=k)\) can be decomposed according to the assumption of independence among topics:

where \(w_{d, N_{d}} = s\). Again, the latter yields the following recursive equations

$$ \gamma_k(\mathbf{w}_d,1) = \varphi_{w_{d,1},\epsilon.k}; \qquad \gamma_k(\mathbf{w}_d,j) = \sum_h \gamma_{h}( \mathbf{w}_d,j-1) \vartheta_{d,h}\varphi_{w_{d,j},h.k} $$

where ϵ is a special topic, referring to the beginning of the trace. The likelihood can hence be expressed as:

$$ \log P(\mathbf{W}) = \sum_{d=1}^M \log \biggl(\sum_k \gamma _k( \mathbf{w}_d;N_d)\vartheta_{d,k} \biggr) $$
(19)

2.2 Estimating the hyper parameters

We consider asymmetric Dirichlet priors over the trace-topic distributions and a symmetric prior over the topic distributions. This modeling strategy has been reported to achieve important advantages over the symmetric version (Wallach et al. 2009a). For the token-bigram and token-bitopic models, we adopted the procedure for updating the prior α described in Heinrich (2008), Minka (2000). The topic-bigram model requires a different formulation of the latter. Given a state of the Markov chain Z, the optimal α hyper parameters can be computed by maximizing the likelihood of the observed pseudo-counts \(n^{h.k}_{d,(\cdot)}\) via the fixed-point iteration method:

$$ \alpha^{new}_{h.k}=\alpha_{h.k} \frac{\sum^M_{d=1} \varPsi (n^{h.k}_{d,(\cdot)} + \alpha_{h.k} ) - M\varPsi(\alpha_{h.k}) }{\sum^M_{d=1} \varPsi (n^{h.(\cdot)}_{d,(\cdot)} + \sum^K_{k'=1}\alpha_{h.k'} ) - M\varPsi(\sum^K_{k'=1}\alpha_{h.k'}) } $$
(20)

where Ψ(⋅) indicates the digamma function.
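A sketch of the fixed-point update of Eq. (20), assuming the topic-transition pseudo-counts are available as an array counts[d, h, k] = n^{h.k}_{d,(·)}; scipy's digamma is used for Ψ, and all names are illustrative.

import numpy as np
from scipy.special import digamma

def update_alpha(counts, alpha, n_iters=50, tol=1e-6):
    # counts: (M, K+1, K) array; alpha: (K+1, K) array, one row per previous topic h
    M = counts.shape[0]
    for _ in range(n_iters):
        num = digamma(counts + alpha).sum(axis=0) - M * digamma(alpha)
        row_tot = counts.sum(axis=2, keepdims=True) + alpha.sum(axis=1)[None, :, None]
        den = digamma(row_tot).sum(axis=0) - M * digamma(alpha.sum(axis=1, keepdims=True))
        new_alpha = alpha * num / den                  # Eq. (20)
        if np.max(np.abs(new_alpha - alpha)) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha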

3 Application to recommender systems

The general framework introduced above has a natural interpretation when dealing with users’ preference data: the set of users defines the corpus, each user is considered as a trace, the items purchased are considered as tokens and, finally, the topics correspond, intuitively, to the reason why the users purchased particular products. In the following, we assume that a user can be denoted by a unique index d, and a previous history is given by w d of size N d . We are interested in providing a ranking for s, the (N d +1)-th choice \(w_{d,N_{d}+1}\).

LDA

Following Barbieri and Manco (2011), we adopt the following ranking function:

$$ \mathit{rank}(s,d) =\sum^K_{k=1} P(s|z_{d,N_d+1}=k) P(z_{d,N_d+1}=k| \boldsymbol{\theta}_d) = \sum^K_{k=1} \varphi_{k,s} \cdot \vartheta_{d,k} $$

It has been shown (Barbieri and Manco 2011) that LDA, equipped with the above ranking function, significantly outperforms the main approaches to modeling user preferences. Hence, it is a natural baseline against which to measure the performance of the other approaches proposed in this paper.

Token-bigram model

The dependency of the current selection on the previous history can be made explicit, thus yielding the following upgrade of the LDA ranking function:

$$ \mathit{rank}(s,d) =\sum _{k=1}^K P(s|z_{d,N_d+1}=k, \mathbf{w}_d) P(z_{d,N_d+1}=k| \boldsymbol{ \theta}_d) =\sum^K_{k=1} \varphi_{k,r.s} \cdot \vartheta_{d,k} $$

where \(r = w_{d,N_{d}}\) is the last item selected by user d in her current history.
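In code, this ranking function amounts to a dot product between the user's topic mixture and the slice of Φ indexed by the last observed item; a minimal sketch, with φ stored as in the estimation step (shape K×(N+1)×N, r=0 for an empty history) and illustrative names:

import numpy as np

def rank_token_bigram(theta_d, phi, r, candidates):
    # theta_d: (K,) topic mixture of user d; r: last selected item (1..N, 0 if none)
    scores = {s: float(np.dot(theta_d, phi[:, r, s - 1])) for s in candidates}
    return sorted(scores, key=scores.get, reverse=True)   # best-ranked items first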

Topic-bigram model

This situation resembles the forward-backward algorithm for hidden Markov models (Bishop 2006, Sect. 13.2.2). In practice, we need to build a recursive chain of probabilities, representing a hypothetical random walk among the hidden topics. As above, we can define the following rank:

which requires solving \(P(w_{d,1}. \cdots. w_{d,N_{d}+1},z_{d,N_{d}+1})\). As shown in the previous section, the latter can be computed recursively by exploiting the γ function. Hence, the ranking function can be formulated as:

$$ \mathit{rank}(s,d) \propto\sum_{k=1}^K \gamma_k(\mathbf{w}_d.s,N_d+1) $$

Token-bitopic model

Since in this case item selection depends on the previous topics, by exploiting the γ function, we can define the following:

4 Experimental evaluation

In this section we study the behavior of the proposed models, compared to some baseline models. In particular, we study two main aspects.

  • In a general setting, we study how the proposed methods perform in terms of quality. We measure quality as a function of the likelihood, as explained in the next section.

  • In a more specific setting, we compare the models in the envisaged recommendation scenario. Here, the quality of a model is measured indirectly, in terms of the accuracy of the recommendations it supports. This is explained in Sect. 4.2.

4.1 Perplexity

Topic models are typically evaluated by either measuring performance on some secondary task, such as document classification or information retrieval, or by estimating the probability of unseen held-out traces given some training traces. Notably, a better model will give rise to a higher probability of held-out traces, on average.

Since log-likelihoods are usually large negative numbers, perplexity is used instead (Heinrich 2008; Blei et al. 2003), the latter being defined as the reciprocal geometric mean of the token likelihoods in the test corpus given the data used to train the model:

$$\mathit{Perp}(\mathbf{W}_{\mathit{Test}} | \mathbf{W}_{\mathit {Train}}) = \exp \biggl\{- \frac{ \sum_{d=1}^{N_{\mathit{Test}}} \log P(\mathbf{w}_d|\mathbf {W}_{\mathit{Train}})}{ \sum_{d=1}^{N_{\mathit{Test}}} n_d } \biggr\} $$

Evaluating P(w d |W Train ) is a little tricky, as exact inference would require integrating over all possible model parameters. In Wallach et al. (2009b) the authors discuss some methods for accurate inference using a point estimate. In our experiments we adopted the evaluation method based on document completion. This method offers the advantage of providing unbiased estimates, as it infers the missing parameters on one part of the document and then evaluates the perplexity on the remaining part. In short, the evaluation methodology can be summarized as follows (a code sketch of the procedure is given after the list):

  • For each w d ∈ W Test :

    1. Let \(\mathbf{w}_{d}^{(1)}\) and \(\mathbf{w}_{d}^{(2)}\) be an arbitrary split of w d .

    2. For s=1,…,S:

      (a) sample \(\mathbf{z}^{(1,s)} \sim P(\mathbf{z}^{(1,s)}| \mathbf{w}_{d}^{(1)},\mathbf{W}_{train},\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\varPhi})\) using the Gibbs sampling equations;

      (b) estimate \(\boldsymbol{\theta}^{(s)}_{d}\) from z (1,s);

    3. Approximate P(w d |W Train ) with \(\frac{1}{S} \sum_{s}P(\mathbf{w}_{d}^{(2)} | \boldsymbol{\theta }^{(s)}_{d},\boldsymbol{\varPhi})\), where the latter is computed by exploiting the formulas in Sect. 2.1.
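As a concrete illustration of the document-completion procedure, the following sketch infers θ_d from the first half of each held-out trace through a few Gibbs sweeps with Φ held fixed, and computes the perplexity on the second half. It assumes a plain LDA-style model with a symmetric scalar α and tokens coded 0..N−1; all names are illustrative.

import numpy as np

def completion_perplexity(test_corpus, Phi, alpha, n_sweeps=50, seed=0):
    rng = np.random.default_rng(seed)
    K, N = Phi.shape
    total_loglik, total_tokens = 0.0, 0
    for w_d in test_corpus:
        half = len(w_d) // 2
        w1, w2 = w_d[:half], w_d[half:]                # completion split
        z = rng.integers(K, size=len(w1))
        ndk = np.bincount(z, minlength=K).astype(float)
        for _ in range(n_sweeps):                      # resample z on the first part
            for j, w in enumerate(w1):
                ndk[z[j]] -= 1
                p = (ndk + alpha) * Phi[:, w]
                z[j] = rng.choice(K, p=p / p.sum())
                ndk[z[j]] += 1
        theta_d = (ndk + alpha) / (ndk + alpha).sum()  # point estimate of theta_d
        total_loglik += sum(np.log(theta_d @ Phi[:, w]) for w in w2)
        total_tokens += len(w2)
    return np.exp(-total_loglik / total_tokens)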

Following Wallach (2006), in the experiments we use a dataset obtained by drawing 150 Psychological Review abstracts from the data made available by Griffiths and Steyvers. The abstracts were drawn among the documents containing at least 54 tokens. Also, we preprocessed the data as specified in Wallach (2006), by remapping all numbers to the special token NUMBER, and all items with frequency 1 in the training set, or appearing as tokens in the test set but not in the training set, to UNSEEN. The result of the cleaning process is a vocabulary of 860 items. Starting from the cleaned dataset, we generated several random splits, choosing 100 documents as training data and keeping the remaining ones as test data. The splits roughly maintained a 67–33 % proportion of the tokens.

In the following we report the results obtained by the three proposed models, compared with LDA. We also compare the models with the DCMLDA model (Doyle and Elkan 2009). The latter is a modification of LDA that accounts for the tendency of tokens to appear in bursts, that is, if a token appears once in a trace, it is more likely to appear again. DCMLDA does not model sequentiality; however, burstiness can also be interpreted as non-independence between tokens, and in this respect it is interesting to see how the proposed models compare to it. It is worth noticing, however, that burstiness is not necessarily an alternative to sequentiality, as the approaches proposed in this paper can easily be adapted to model a combination of burstiness and sequentiality.

Figure 2(a) reports the average perplexity on the test data, with error bars on the perplexity values. Figure 2(b) analyzes the pairwise comparisons: each of the three methods proposed here is compared with the baselines, and the difference in perplexity (average and standard error) is plotted.

Fig. 2 Performance on Psychreview data

DCMLDA exhibits the best perplexity, as a result of the customized fitting of token probabilities to a specific document. As a matter of fact, the documents we are investigating here seem to naturally comply with the burstiness assumption.

Also, TokenBitopic seems to worsen its performance as the number of topics increases. This behavior is worth further explanation. The model conditions the probability of appearance of a token on a pair of latent factors. In a sense, this makes the model comparable to a “fresh” LDA model where the number of latent factors is quadratic in K: in practice, a TokenBitopic model with K=4 can be deemed similar to an LDA model with K=16 topics, where each pair of latent factors is associated with a specific latent factor of the quadratic LDA model. In Fig. 2(c) we compare the two models: they show the same tendency.

The remaining models clearly outperform LDA. The TokenBigram model, however, requires further explanation. Both the sampling process and the item selection probabilities rely on the frequencies of bigrams. Zero-frequency bigrams appearing in the test set compromise the evaluation just like zero-frequency items. We chose to treat them by associating them with a default frequency. Figure 2(d) shows how this affects the evaluation: here, NoP corresponds to keeping the original frequency, whereas P3 associates a frequency which implicitly corresponds to flattening all the zero-frequency bigrams to a default UNSEEN bigram. The latter is the setting reported in Fig. 2(a). The approaches P1 and P2 correspond to intermediate solutions, where the default frequency of the (implicit) UNSEEN bigram is lowered.

Finally, Fig. 2(e) reports the running times of the training algorithms on the training data. Although the TopicBigram model requires fewer parameters than the TokenBitopic approach, the learning time of the former is considerably larger. This is mainly due to the larger number of hyper parameters (K×K vs. K) and to the complexity of the M step for the update of the hyper parameters α.

4.2 Recommendation accuracy

In this section we present an empirical evaluation of the proposed models which focuses on the recommendation problem. Given the past observed preferences of a user, the goal of a recommender system is to provide her with personalized (and contextualized) recommendations about previously non-purchased items that meet her interests. We evaluate the proposed techniques by measuring their predictive abilities on two datasets, namely IPTV1 and IPTV2. These data were collected by analyzing the pay-per-view movies purchased by the users of two European IPTV providers over a period of several months (Cremonesi and Turrin 2009; Bambini et al. 2011). The original data have been preprocessed by removing users with fewer than 10 purchases. We perform a chronological split of the data by selecting the final 20 % of each user’s purchases as test data, and using the remaining data for training. The main features of the datasets are summarized in Table 1.

Table 1 Summary statistics on real-life recommendation datasets

The two datasets exhibit a substantial difference in the frequencies of bigrams, as shown in Fig. 3: in particular, IPTV2 exhibits frequencies that differ by an order of magnitude. Hence, by comparing the results of the proposed algorithms on the two datasets, we can characterize the effects of sparsity on their performance.

Fig. 3 Distributions of bigrams on real-life datasets

Testing protocol

Let W Train and W Test denote training and test data, respectively. To evaluate the capabilities of the considered approaches in generating accurate recommendations, we check whether an actual token would be included in a hypothetical recommendation list of H items generated according to the model. More specifically, the following protocol is adopted, which is justified and detailed in Barbieri and Manco (2011):

  • For each user u, let \(\mathbf{w}'_{u}\) be the trace associated with u in W Train , and w u the trace in W Test (with n u =|w u |). For each token w u,n ∈ w u :

    • generate the candidate list \(\mathcal{C}_{u}\) by randomly drawing c items i≠w u,n such that \(i \notin\mathcal{I}_{\mathbf{w}'_{u}}\);

    • add w u,n to \(\mathcal{C}_{u}\) and sort the list according to the scoring function provided by the RS;

    • record the position of w u,n in the ordered list: if it belongs to the top-H items, we have a hit; otherwise, we have a miss.

Recall and precision relative to u can hence be defined based on the number of hits. Recall is the number of hits relative to the expected number of relevant items (which are all the items in w u ). Precision represents the probability that a top-ranked item is actually a hit, and hence corresponds to the recall weighted by the size H of the recommendation list. In formulas:

$$ \mathit{Recall}(u,H)= \frac{\#\mathit{hits}}{n_u},\qquad \mathit{Precision}(u,H)= \frac{\#\mathit{hits}}{H\times n_u}= \frac {\mathit{recall}(u,H)}{H} $$
(21)

The final precision and recall values are obtained by averaging over all users. All the considered models were run varying the number of topics. We perform 5000 Gibbs sampling iterations, discarding the first 1000 (burn-in period), with a sample lag of 30. The length of the candidate random list is set to 250 for IPTV1 and 1000 for IPTV2.
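A sketch of this testing protocol and of Eq. (21) is given below; score(u, i) stands for any of the ranking functions of Sect. 3 and is assumed rather than defined here, as are the remaining names.

import numpy as np

def evaluate(test_traces, train_items, all_items, score, c=250, H=20, seed=0):
    rng = np.random.default_rng(seed)
    recalls = []
    for u, w_u in test_traces.items():
        hits = 0
        for w in w_u:
            # candidate list: c random items never purchased by u in training
            unseen = [i for i in all_items if i not in train_items[u] and i != w]
            candidates = list(rng.choice(unseen, size=min(c, len(unseen)), replace=False))
            candidates.append(w)
            ranked = sorted(candidates, key=lambda i: score(u, i), reverse=True)
            hits += int(w in ranked[:H])               # hit if w enters the top-H
        recalls.append(hits / len(w_u))
    recall = float(np.mean(recalls))
    return recall, recall / H                          # Eq. (21): precision = recall / H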

In the evaluation, we compare the bigram models with some baseline methods from the current literature. These include the aforementioned DCMLDA model, and a version of LDA where, for each user, the tokens represent (unordered) bigrams rather than single item occurrences. This is in practice a preprocessing of the data, which produces a different representation of the dataset upon which the standard LDA model is trained. Clearly, the ranking function has to be tuned accordingly.

We also provide two further baselines. The first one is a simple bigram model where the probability of occurrence of an item is modeled as \(P(w_{n}) = \lambda f_{w_{n}} + (1 - \lambda)f_{w_{n}|w_{n-1}} \). Here, f i  is the relative frequency of i in the training set, whereas f i|j represents the same frequency conditioned on a preceding occurrence of j in the sequence. The λ parameter weighs the importance of the two components, and is tuned according to the frequency of i, as low-frequency items typically do not provide a reliable estimate of the sequential part.
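A minimal sketch of this SimpleBigram baseline follows; the exact schedule for λ is not fixed above, so the frequency-based weighting used here is an assumption, and all names are illustrative.

from collections import Counter

def simple_bigram_scores(train_traces, lam_scale=100.0):
    uni, bi = Counter(), Counter()
    for w_u in train_traces:
        uni.update(w_u)
        bi.update(zip(w_u[:-1], w_u[1:]))
    total = sum(uni.values())

    def score(prev_item, item):
        f_i = uni[item] / total                                  # relative frequency f_i
        f_cond = bi[(prev_item, item)] / uni[prev_item] if uni[prev_item] else 0.0
        # assumed schedule: rely more on the unigram part when the history is rare
        lam = lam_scale / (lam_scale + uni[prev_item])
        return lam * f_i + (1 - lam) * f_cond                    # lambda*f_i + (1-lambda)*f_{i|j}
    return score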

Finally, we also compare the proposed models to a baseline rooted in matrix factorization (Koren et al. 2009; Menon and Elkan 2011). The basic idea here is to exploit matrix factorization for ranking, e.g., by providing an estimate of the probability of item appearance (Menon and Elkan 2010). There are some issues to consider when applying matrix factorization to the case at hand. In our context, matrix factorization is aimed at modeling item occurrence rather than an explicit rating. In this respect, non-occurrence of an item has a bivalent interpretation, either as unknown (the user has not considered the item yet) or as negative (she does not prefer it at all). Thus, the traditional approaches based on explicit preferences (such as Salakhutdinov and Mnih 2007) cannot be applied. We experimented with several specific techniques, including Hu et al. (2008), Sindhwani et al. (2010) and the standard SVD model. In the following, we report the results of SVD, which still outperforms all the other methods, confirming the findings in Cremonesi et al. (2010), Barbieri and Manco (2011).

Results

Figure 4 summarizes the results in recommendation accuracy achieved over the two considered datasets. For each model, the optimal number of topics is given in brackets.

Fig. 4 Recommendation accuracy

On both datasets, the proposed models improve over the baselines. Concerning IPTV1, both TopicBigram and TokenBigram achieve a significant margin with respect to the other competitors. On IPTV2, TokenBigram outperforms TopicBigram, which is still the runner-up performer.

In summary, the results suggest that:

  • The underlying assumption of TokenBiTopic does not yield a remarkable increase in the predictive capabilities of the model. In practice, the topic structure of the TokenBiTopic model can be “simulated” by an LDA model with a quadratic number of topics. As a result, the model seems more prone to overfitting.

  • Contextual information, with particular reference to sequence modeling, provides a substantial contribution to recommendation accuracy. This is proven not only by the models proposed in this paper: even the SimpleBigram baseline achieves remarkable accuracy. In particular, when the recommendation list is relatively small, the latter achieves an accuracy comparable to TokenBitopic. As a matter of fact, all the sequential approaches seem to provide a better estimate of the selection probability for the user’s next choice.

  • There is a strict correlation between the frequencies exhibited by bigrams and the performance of the TokenBigram model. IPTV2 exhibits more frequent bigrams, and hence is more likely to boost the performance of the TokenBigram model. Conversely, the TopicBigram model exhibits a better capability of generalizing the dependency between the previous hidden context and the next choice. Geometrically, while the TokenBigram model focuses exclusively on a restricted area of the topic space, induced by considering only the previous item, the TopicBigram model is able to identify larger homogeneous regions within the topic space and to estimate the connections (transition probabilities) between them.

  • Among the competitors, DCMLDA is rather weak. This is somewhat surprising, considering that DCMLDA exhibits the best perplexity in the previous set of experiments. A viable explanation of this dichotomy can be found in the nature of the sequential data explored here, which does not necessarily support burstiness: notably, in a movie rental scenario, once a movie is rented by a user, it is unlikely to be rented again in the future.

  • LDABigram does not provide a substantial improvement either. Again, this is somewhat unexpected, as bigrams can be considered contextual information as well. It seems that, when bigrams are introduced without an ordering relationship, the resulting ranking function is weakened.

In order to analyze the stability of the results, we performed some further experiments. First, we analyzed the robustness of the previous experiment with regard to different training/test splits. Figure 5 shows the precision/recall results on three further batches where each user sequence is split at 50 %, 60 % and 70 % of its length, respectively. In these plots, both TokenBigram and SimpleBigram tend to provide stable results, especially on IPTV2. All the other methods seem to suffer from the shrinking of the training partition.

Fig. 5 Precision and Recall for different training splits

In a second batch of experiments, we analyze the robustness of the results with regard to random variations of the datasets. To this purpose, we repeat the above experiment on several random samples of the original dataset, where each sample includes 50 % of the whole user set. Training and test sets for each sample are obtained by splitting each sequence with the standard 80–20 percentages. Figure 6 shows the average recall, as well as the intervals of variability. It is worth noticing that the TopicBigram model exhibits the highest variations (especially on IPTV1). Nevertheless, the results of Fig. 4 are confirmed, witnessing the robustness of the proposed methods.

Fig. 6 Recall on random selections of users

Finally, in Fig. 7 we compare the performance with regard to the number of latent factors, with the recommendation list size fixed at 20. TokenBiTopic exhibits a wide range of variability on IPTV1, and tends to improve with an increasing number of topics. The other models are stable and, in general, do not show a large variance. On IPTV2, TopicBigram shows a progressive increase; however, the slope progressively decreases, and hence we can expect a maximum at around 50 topics. As for the competitors, SVD degrades as the number of latent factors increases: a clear sign of overfitting (as also well known from the literature). It is worth noticing that, albeit more stable, other matrix factorization approaches based on regularization (not reported here) are still weaker than SVD.

Fig. 7 Recall(20) and Precision(20) of the considered approaches varying the number of topics

The results presented above experimentally show the effectiveness of sequential topic models in predicting future user choices. However, these models significantly increase the number of parameters to be learned, which implies an increase in the learning time. In Fig. 8 we plot the learning time (5000 Gibbs sampling iterations) for different numbers of topics. Again, TopicBigram exhibits a quadratic behavior, due to the Markovian dependency among topics.

Fig. 8 Learning time of the models on IPTV1 and IPTV2 (first row); influence of the hyper parameters (second row)

The last two plots in Fig. 8 highlight the contribution of asymmetric priors to the learning process. As expected, asymmetric priors significantly improve the accuracy. However, the learning time is greatly affected, as learning these parameters requires a further iterative fixed-point procedure embedded in the main algorithm, as explained in Sect. 2.2.

5 Related work

The generative process, which is common to many extensions of Latent Dirichlet Allocation (Blei 2011), is strongly based on a “bag-of-words” assumption. Even if this assumption may sound unrealistic, the resulting models work remarkably well in practice. Latent Dirichlet Allocation and similar models combine the structure-discovery power of dimensionality reduction approaches, such as latent semantic indexing (Deerwester 1988), with the modeling of informative priors, which are estimated by Bayesian inference techniques. The definition of the topic space and the projection of each document into this space provide an effective tool to infer the semantic concepts of each document or, more generally, of each entity. In particular, these approaches support three main tasks (Griffiths et al. 2007): topic extraction, word sense disambiguation and prediction.

Among all the different contexts in which these approaches have achieved significant results, in this paper we consider the application of probabilistic topic models to the recommendation problem (Hofmann 2004). As mentioned above, this choice is motivated by some interesting recent findings (Barbieri and Manco 2011), which can be summarized as follows: (i) the item-selection probability computed for each user is a key component for generating accurate item-ranking functions; (ii) among all competitors, LDA provides the best results measured in precision and recall of the recommendation list. These promising results motivate us to explore extensions of topic models which may provide a better representation of the inherent sequential correlation between items, and thus better predictive performance. In the following, we briefly review state-of-the-art probabilistic approaches to sequence data modeling, mainly focusing on topic-based approaches.

A simple approach to modeling sequential data within a probabilistic framework has been proposed in Cadez et al. (2000). In their work, the authors present a framework based on mixtures of Markov models for clustering and modeling web site navigation logs, which is applied to clustering and visualizing user behavior on a web site. Albeit simple, the proposed model suffers from the limitation that a single latent topic underlies all the observations in a single sequence. This approach has been superseded by other methods based on latent semantic indexing and LDA. In Wallach (2006), Wang and Wei (2007), for example, the authors propose extensions of the LDA model which assume a first-order Markov chain for the word generation process. In the resulting Token-Bigram model (see Sect. 2) and Topical n-grams, the current word depends on the current topic and on the previous word observed in the sequence.

The n-gram modeling can be extended by considering different kinds of dependencies between the hidden states of the model. These kinds of dependencies are formalized by exploiting Hidden Markov Models (HMMs) (Bishop 2006, Chap. 13), which are a general reference framework both for modeling sequence data and for natural language processing (Manning and Schütze 1999). HMMs assume that sequential data are generated by a Markov chain of latent variables, with each observation conditioned on the state of the corresponding latent variable. The resulting likelihood can be interpreted as an extension of a mixture model in which the choice of the mixture component for each observation is not made independently, but depends on the component chosen for the previous observation. In Gruber et al. (2007), the authors explore this direction and propose a Hidden Topic Markov Model (HTMM) for text documents. HTMM defines a Markov chain over the latent topics of the document. The corresponding generative process, depicted in Fig. 9(a), assumes that all words in the same sentence share the same topic, while successive sentences can either rely on the previous topic or introduce a new one. The topics in a document form a Markov chain with a transition probability that depends on a binary topic transition variable ψ. When ψ=1, a new topic is drawn for the n-th sentence; otherwise, the previous topic is reused.

Fig. 9 HTMM, Collocation and Composite Graphical Model for the generation of a document

The LDA Collocation Model (Griffiths et al. 2007) introduces a new set of random variables (bigram status) which denote whether a bigram can be formed with the previous word token. More specifically, as represented in Fig. 9(b), the generative process specifies for each word both a topic and a collocation status. The collocation status allows a more flexible modeling than the Token-Bigram model, which always generates bigrams; moreover, in this formulation the distribution over bigrams does not depend on the topic. The introduction of the collocation status enriches the generative semantics of the model, and this idea can be applied to all the approaches proposed in Sect. 2.

All the previously discussed models approach the problem of sequence modeling by inferring the underlying latent topics and then generating a sequence of words according to the corresponding distributions. This perspective does not take into account the fact that words in a text document may exhibit both syntactic and semantic correlations. A Composite Model, which captures both semantic and syntactic roles, has been proposed in Griffiths et al. (2005). The graphical model for the generation of a document, given in Fig. 9(c), clarifies this concept. The semantic/syntactic dependencies among words are modeled by employing two different latent variables, namely Z and C; while the semantic layer follows a simple LDA model, the syntactic one is instantiated by modeling transitions between the set of classes C through a hidden Markov model. One of these classes corresponds to the semantic class and, when it is observed, enables the generation of the word according to the current topic. The other classes capture word co-occurrences that are due to syntactic aspects of the modeled language.

Textual documents exhibit a natural sequential structure: people develop documents by building upon a main semantic concept and by interleaving several segments/subsections, which express related topics, in a coherent logical flow. As described above, HTMM models topic cohesion at the level of phrases (words within the same sentence share the same latent topic), but does not directly model a smooth evolution between the topics of the different segments that frame a document. Sequential LDA (Du et al. 2010) is a variant of LDA which models a sequential dependency between sub-topics: the topic of the current segment is closely related to the topics of its antecedent and subsequent segments. This smooth evolution of the topic flow is modeled by using a Poisson-Dirichlet process.

The sequential structure is not limited to words, but can also affect sentiments. Dependency-Sentiment-LDA (Li et al. 2010) builds on the assumption that sentiments are expressed in a coherent way. Conjunctive words, such as “and” or “but”, can be used to detect sentiment transitions, and the sentiment of a word depends on the sentiment of the preceding one.

6 Conclusion and future work

In this paper we studied three extensions of the LDA model which relax the bag-of-words assumption by hypothesizing that the current observation depends on previous information. For each of the proposed models we provided a Gibbs sampling parameter estimation procedure, and we carried out an experimental evaluation studying the models from both a model-fitting and an applicative perspective. In particular, the proposed models provide a better framework for modeling contextual information in a recommendation scenario, when the data exhibit an intrinsic temporal dependency.

We believe that the models and results presented in this paper open two interesting research directions. On the one hand, it would be interesting to generalize the notion of “contextual information”: in this paper, the context was represented by temporal dependency. However, there are other observable features that can contribute to the likelihood of observing an item in a user’s trace, such as geographical location, tags, etc.

On the other hand, the interactions of a user in a social network have an increasing impact on her behavior. Analyzing the influence of the neighbors in a network (Barbieri et al. 2013) can help to better evaluate both the temporal dependencies and the likelihood of an item being selected.