Probabilistic topic models for sequence data
Abstract
Probabilistic topic models are widely used in different contexts to uncover the hidden structure in large text corpora. One of the main (and perhaps strongest) assumptions of these models is that the generative process follows a bag-of-words assumption, i.e. each token is independent from the previous one. We extend the popular Latent Dirichlet Allocation model by exploiting three different conditional Markovian assumptions: (i) the token generation depends on the current topic and on the previous token; (ii) the topic associated with each observation depends on the topic associated with the previous one; (iii) the token generation depends on the current and previous topic. For each of these modeling assumptions we present a Gibbs Sampling procedure for parameter estimation. Experimental evaluation over real-world data shows the performance advantages, in terms of recall and precision, of the sequence-modeling approaches.
Keywords
Keywords: Recommender systems · Collaborative filtering · Probabilistic topic models · Performance

Notations
M : number of traces
N : number of distinct tokens
K : number of topics
W : collection of traces, W = {w_1, …, w_M}
N_d : number of tokens in trace d
w_d : token trace d, \(\mathbf{w}_{d} = \{w_{d,1}, w_{d,2}, \ldots, w_{d,N_{d}-1}, w_{d,N_{d}}\}\)
w_{d,j} : j-th token in trace d
Z : collection of topic traces, Z = {z_1, …, z_M}
z_d : topics for trace d, \(\mathbf{z}_{d} = \{z_{d,1}, z_{d,2}, \ldots, z_{d,N_{d}-1}, z_{d,N_{d}}\}\)
z_{d,j} : j-th topic in trace d
\(n^{k}_{d,s}\) : number of times token s has been associated with topic k in trace d
\(\mathbf{n}_{d,(\cdot)}\) : vector \(\mathbf{n}_{d,(\cdot)} = \{ n^{1}_{d,(\cdot)}, \ldots, n^{K}_{d,(\cdot)}\}\)
\(n^{k}_{d,(\cdot)}\) : number of times topic k has been associated with trace d in the whole data
\(\mathbf{n}^{k}_{(\cdot),r}\) : vector \(\mathbf{n}^{k}_{(\cdot),r} = \{ n^{k}_{(\cdot),r.1}, \ldots, n^{k}_{(\cdot),r.N}\}\)
\(n^{k}_{(\cdot),r.s}\) : number of times topic k has been associated with the token pair r.s in the whole data
\(\mathbf{n}^{k}_{(\cdot)}\) : vector \(\mathbf{n}^{k}_{(\cdot)} = \{ n^{k}_{(\cdot),1}, \ldots, n^{k}_{(\cdot),N}\}\)
\(n^{k}_{(\cdot),s}\) : number of times token s has been associated with topic k in the whole data
\(\mathbf{n}^{k}_{d,(\cdot)}\) : vector \(\mathbf{n}^{k}_{d,(\cdot)} = \{ n^{k.1}_{d,(\cdot)}, \ldots, n^{k.K}_{d,(\cdot)}\}\)
\(n^{h.k}_{d,(\cdot)}\) : number of times the topic pair h.k has been associated with trace d
\(n^{h.(\cdot)}_{d,(\cdot)}\) : number of times a topic pair beginning with topic h has been associated with trace d
\(\mathbf{n}^{h.k}_{(\cdot)}\) : vector \(\mathbf{n}^{h.k}_{(\cdot)} = \{n^{h.k}_{(\cdot),1}, \ldots, n^{h.k}_{(\cdot),N}\}\)
\(n^{h.k}_{(\cdot),s}\) : number of times the topic pair h.k has been associated with token s in the whole data
α : (LDA, Token-Bigram and Token-Bitopic models) hyperparameters of the topic Dirichlet distribution, α = {α_1, …, α_K}; (Topic-Bigram model) set of hyperparameter vectors α = {α_0, …, α_K}
α_h : hyperparameters of the topic Dirichlet distribution, α_h = {α_{h.1}, …, α_{h.K}}
β : (LDA and Topic-Bigram model) set of hyperparameters for the token Dirichlet distributions, β = {β_1, …, β_K}; (Token-Bigram model) β = {β_{k,r}}, with k ∈ {1,…,K} and r ∈ {0,…,N}; (Token-Bitopic model) β = {β_{h.k}}, with h ∈ {0,…,K} and k ∈ {1,…,K}
β_k : hyperparameters of the token Dirichlet distribution, β_k = {β_{k,1}, …, β_{k,N}}
β_{k,s} : hyperparameters of the token Dirichlet distribution, β_{k,s} = {β_{k,s.1}, …, β_{k,s.N}}
β_{h.k} : hyperparameters of the token Dirichlet distribution, β_{h.k} = {β_{h.k,1}, …, β_{h.k,N}}
Θ : matrix of parameters θ_d
θ_d : mixing proportion of topics for trace d
ϑ_{d,k} : mixing coefficient of topic k for trace d
ϑ_{d,h.k} : mixing coefficient of the topic sequence h.k for trace d
Φ : (LDA and Topic-Bigram model) matrix of parameters φ_k = {φ_{k,s}}; (Token-Bigram model) matrix of parameters φ_k = {φ_{k,r.s}}; (Token-Bitopic model) matrix of parameters φ_{h.k} = {φ_{h.k,s}}
φ_{k,s} : mixing coefficient of topic k for token s
φ_{k,r.s} : mixing coefficient of topic k for the token sequence r.s
φ_{h.k,s} : mixing coefficient of the topic sequence h.k for token s
Z_{−(d,j)} : Z ∖ {z_{d,j}}
Δ(q) : Dirichlet delta, \(\Delta(\boldsymbol{q}) = \frac{\prod_{p=1}^{P} \varGamma(q_{p})}{\varGamma ( \sum_{p=1}^{P} q_{p} )}\)
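Since the Dirichlet delta recurs throughout the collapsed Gibbs derivations, it is convenient to evaluate it in log-space. A minimal sketch (the function name is ours):

```python
import math

def log_dirichlet_delta(q):
    """log Δ(q) = Σ_p log Γ(q_p) − log Γ(Σ_p q_p), computed in log-space
    so that large count vectors do not overflow."""
    return sum(math.lgamma(qp) for qp in q) - math.lgamma(sum(q))
```

For example, Δ([2, 2]) = Γ(2)Γ(2)/Γ(4) = 1/6, so the log value is −log 6.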
1 Introduction and background
Probabilistic topic models, such as the popular Latent Dirichlet Allocation (LDA) (Blei et al. 2003), assume that each collection of documents exhibits a hidden thematic structure. The intuition is that each document may exhibit multiple topics, where each topic is characterized by a probability distribution over the words of a fixed-size dictionary. This representation of the data in the latent-topic space offers several advantages from a modeling perspective, and topic modeling techniques have been applied in different contexts. Example scenarios range from traditional problems (such as dimensionality reduction and classification) to novel areas (such as the generation of personalized recommendations).
Traditional LDA-based approaches propose a data generation process that is based on a “bag-of-words” assumption, i.e. such that the order of the items in a document can be neglected. This assumption fits textual data, where probabilistic topic models are able to detect recurrent co-occurrence patterns, which are used to define the topic space. However, there are several real-world applications where data can be “naturally” interpreted as sequences, such as biological data, web navigation logs, customer purchase histories, etc. Ignoring the intrinsic sequentiality of the data may result in poor modeling: according to the bag-of-words assumption, co-occurrences are modeled independently for each word, via a probability distribution over the dictionary in which some words exhibit a higher likelihood of appearing than others. Sequential data, on the other hand, may express causality and dependency, and different topics can be used to characterize different dependency likelihoods. The focus here is the context in which the current user acts and expresses preferences, i.e., the environment, characterized by side information, in which the observations hold. Our claim is that the context can be enriched by the sequential information, and the latter allows a more refined modeling. In practice, a sequence expresses a context which provides valuable information for the modeling.
The above observation is particularly noteworthy when data express preferences made by users, and the ultimate objective is to model a user’s behavior in order to provide accurate recommendations. The analysis of sequential patterns has important applications in modern recommender systems (RSs), which are significantly focusing on an accurate balance between personalization and contextualization techniques. For example, in Internet-based streaming services for music or video (such as Last.fm^{1} and Videolectures.net^{2}), the context of the user’s interaction with the system can easily be interpreted by analyzing the content previously requested. The assumption here is that the current item (and/or its genre) influences the user’s next choice. In particular, if a specific user is in the “mood” for classical music (as observed from the current choice), it is unlikely that the immediately subsequent choice will depart from the aforementioned mood in favor of a song of a different genre. Being able to capture such properties and to exploit them in the recommendation strategy can greatly improve the accuracy of the recommendations.
Recommender systems have greatly benefited from probabilistic modeling techniques based on LDA. Recent works have in fact empirically shown that probabilistic latent topic models represent the state of the art in the generation of accurate personalized recommendations (Barbieri and Manco 2011; Barbieri et al. 2011, 2012). More generally, probabilistic techniques offer some renowned advantages: notably, they can be tuned to optimize a variety of loss functions; moreover, optimizing the likelihood allows one to model a distribution over rating values which can be used to determine the confidence of the model in providing a recommendation; finally, they make it possible to include prior knowledge in the generative process, thus allowing a more effective modeling of the underlying data distribution. Notably, when preferences are implicitly modeled through selection (that is, when no rating information is available), simple LDA best models the probability that an item is actually selected by a user (Barbieri and Manco 2011).
Following the research direction outlined above, in this paper we study the effects of “contextual” information in the probabilistic modeling of preference data. We focus on the case where the context can be inferred from the analysis of the sequence data, and we propose some topic models which explicitly make use of dependency information. As a matter of fact, the issue has been dealt with in similar papers (e.g. Wallach 2006). Here, we summarize and extend the approaches in the literature, by covering different ways of modeling dependency within preference data. Furthermore, we concentrate on the effects of such modeling on recommendation accuracy, as it explicitly reflects accurate modeling of user behavior. In summary, our contributions are the following.
 1.
We propose a unified probabilistic framework to model dependency in preference data, and instantiate the framework in accordance to different assumptions on the sequentiality of the underlying generative process.
 2.
We study and experimentally compare the proposed models, and highlight relative advantages and weaknesses.
 3.
We study how to adapt the proposed frameworks to support a recommendation scenario. In particular, for each of the proposed models, we provide the relative ranking functions that can be used to generate personalized and context-aware recommendation lists.
 4.
We finally show that the proposed sequential modeling of preference data better models the underlying data, as it allows more accurate recommendations in terms of precision and recall.
The paper is structured as follows. In Sect. 2 we introduce sequential modeling according to different dependency assumptions, and specify in Sect. 3 the corresponding item ranking functions for supporting recommendations. The experimental evaluation of the proposed approaches is then presented in Sect. 4, in which we measure the performance of the approaches in a recommendation scenario. In Sect. 5 we qualitatively compare the models studied in this paper with the current literature. Section 6 concludes the paper with a summary of the findings and a discussion of possible extensions.
2 Modeling sequence data
- A Markovian process naturally models the sequential nature of the data, where dependencies among past and future tokens reflect changes over time that are still governed by similar features;
- The chain is stationary, as a fixed number of tokens is likely to appear frequently in sequences;
- The order of the chain is 1, because it is more likely that two subsequent tokens share some features than that two tokens distant in time do.^{3}
Token-bigram model

1. For each trace d ∈ {1,…,M}, sample the topic-mixture components θ_d ∼ Dirichlet(α) and the sequence length N_d ∼ Poisson(ξ).
2. For each topic k ∈ {1,…,K} and token r ∈ {0,…,N}, sample the token selection components φ_{k,r} ∼ Dirichlet(β_{k,r}).
3. For each trace d ∈ {1,…,M} and position j ∈ {1,…,N_d}:
   (a) sample a topic z_{d,j} ∼ Discrete(θ_d);
   (b) sample a token \(w_{d,j} \sim \mathit{Discrete}(\boldsymbol{\phi}_{z_{d,j},w_{d,j-1}})\).
Notice that we explicitly assume the existence of a family {β _{ k,r }} with k={1,…,K} and r={0,…,N} of Dirichlet coefficients, and of a special token r=0 which represents the previous token of the first token of each trace. As shown in Wallach (2006), different modeling strategies (e.g., shared priors β _{ k,r.s }=β _{ s }) can affect the accuracy of the model.
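As a concrete illustration, the generative process above can be simulated directly. The following is a minimal sketch with toy sizes and symmetric hyperparameters chosen purely for illustration (variable names are ours); token index 0 is reserved for the start-of-trace token:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N, xi = 5, 3, 10, 8          # traces, topics, dictionary size, mean length
alpha = np.full(K, 0.1)            # symmetric topic hyperparameters (toy choice)
beta = np.full(N, 0.01)            # symmetric token hyperparameters (toy choice)

# token selection components: one distribution per (topic k, previous token r),
# with r = 0 acting as the special start-of-trace token
phi = rng.dirichlet(beta, size=(K, N + 1))   # shape (K, N+1, N)

traces = []
for d in range(M):
    theta = rng.dirichlet(alpha)             # topic mixture theta_d
    n_d = max(1, rng.poisson(xi))            # sequence length N_d
    w, prev = [], 0                          # prev = 0 is the start token
    for _ in range(n_d):
        z = rng.choice(K, p=theta)           # z_{d,j} ~ Discrete(theta_d)
        s = rng.choice(N, p=phi[z, prev])    # w_{d,j} ~ Discrete(phi_{z, w_{d,j-1}})
        w.append(s + 1)                      # tokens indexed 1..N; 0 is reserved
        prev = s + 1
    traces.append(w)
```

The topic-bigram and token-bitopic variants differ only in the conditioning of the two draws inside the loop.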
Topic-bigram model

1. For each trace d ∈ {1,…,M} and topic h ∈ {0,…,K}, sample the topic-mixture components θ_{d,h} ∼ Dirichlet(α_h) and the sequence length N_d ∼ Poisson(ξ).
2. For each topic k ∈ {1,…,K}, sample the token selection components φ_k ∼ Dirichlet(β_k).
3. For each trace d ∈ {1,…,M} and position j ∈ {1,…,N_d}, sequentially:
   (a) sample a topic \(z_{d,j} \sim \mathit{Discrete}(\boldsymbol{\theta}_{d,z_{d,j-1}})\);
   (b) sample a token \(w_{d,j} \sim \mathit{Discrete}(\boldsymbol{\phi}_{z_{d,j}})\).
Token-bitopic model

1. For each trace d ∈ {1,…,M}, sample the topic-mixture components θ_d ∼ Dirichlet(α) and the sequence length N_d ∼ Poisson(ξ).
2. For each topic pair h.k, with h ∈ {0,…,K} and k ∈ {1,…,K}, sample the token selection components φ_{h.k} ∼ Dirichlet(β_{h.k}).
3. For each trace d ∈ {1,…,M} and position j ∈ {1,…,N_d}, in sequence:
   (a) sample a topic z_{d,j} ∼ Discrete(θ_d);
   (b) sample a token \(w_{d,j} \sim \mathit{Discrete}(\boldsymbol{\phi}_{z_{d,j}, z_{d,j-1}})\).
 E step:
 for the token w _{ d,j }=s at position j in trace d, sample a topic k according to the following probability:
 M Step:
estimate the multinomial probabilities according to the following equations:
$$ \vartheta_{d,k} = \frac{n^{k}_{d,(\cdot)} + \alpha_{k}}{\sum_{k'=1}^{K} \bigl( n^{k'}_{d,(\cdot)} + \alpha_{k'} \bigr)} ,\qquad \varphi_{h.k,s} = \frac{n^{h.k}_{(\cdot),s} + \beta_{h.k,s}}{\sum_{s'=1}^{N} \bigl( n^{h.k}_{(\cdot),s'} + \beta_{h.k,s'} \bigr)} $$ (16)
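The M-step estimates in Eq. (16) are simple smoothed normalized counts. A sketch with numpy, assuming the count arrays (traces × topics, and topic pairs × tokens) have been accumulated during sampling; function and variable names are ours:

```python
import numpy as np

def estimate_theta(n_dk, alpha):
    """theta_{d,k} = (n^k_{d,(.)} + alpha_k) / sum_k' (n^{k'}_{d,(.)} + alpha_{k'})"""
    num = n_dk + alpha                     # alpha broadcasts over traces
    return num / num.sum(axis=1, keepdims=True)

def estimate_phi(n_hks, beta):
    """phi_{h.k,s} = (n^{h.k}_{(.),s} + beta_{h.k,s}) / sum_s' (n^{h.k}_{(.),s'} + beta_{h.k,s'})"""
    num = n_hks + beta
    return num / num.sum(axis=-1, keepdims=True)
```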
2.1 Log-likelihoods
Token-bigram
Topic-bigram
Token-bitopic
2.2 Estimating the hyperparameters
3 Application to Recommender Systems
The general framework introduced above has a natural interpretation when dealing with users’ preference data: the set of users defines the corpus, each user is considered as a trace, the items purchased are considered as tokens and, finally, the topics correspond, intuitively, to the reason why the users purchased particular products. In the following, we assume that a user can be denoted by a unique index d, and a previous history is given by w _{ d } of size N _{ d }. We are interested in providing a ranking for s, the (N _{ d }+1)th choice \(w_{d,N_{d}+1}\).
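For instance, under LDA the score of a candidate next item s can be taken as \(\sum_{k} \vartheta_{d,k}\varphi_{k,s}\), while the Token-Bigram variant conditions on the last observed token. A minimal numpy sketch (function name and array layout are ours):

```python
import numpy as np

def rank_next_items(theta_d, phi, last_token=None):
    """Score each candidate token by its predicted selection probability.

    theta_d : (K,) topic mixture for user d
    phi     : (K, N) token components for LDA, or (K, N+1, N) for the
              token-bigram variant, indexed by the previous token
    Returns token indices sorted by decreasing score.
    """
    if last_token is None:                 # LDA: P(s) = sum_k theta_k * phi_{k,s}
        scores = theta_d @ phi
    else:                                  # token-bigram: condition on the last token
        scores = theta_d @ phi[:, last_token, :]
    return np.argsort(-scores)
```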
LDA
Token-bigram model
Topic-bigram model
Token-bitopic model
4 Experimental evaluation

- In a general setting, we study how the proposed methods perform in terms of quality. We measure quality as a function of the likelihood, as explained in the next section.
- In a more specific setting, we compare the models in the envisaged recommendation scenario. Here, the quality of a model is measured indirectly, in terms of the accuracy of the recommendations it boosts. This is explained in Sect. 4.2.
4.1 Perplexity
Topic models are typically evaluated by either measuring performance on some secondary task, such as document classification or information retrieval, or by estimating the probability of unseen held-out traces given some training traces. Notably, a better model will give rise to a higher probability of held-out traces, on average.
For each \(\mathbf{w}_{d} \in \mathbf{W}^{Test}\):
1. Let \(\mathbf{w}_{d}^{(1)}\) and \(\mathbf{w}_{d}^{(2)}\) be an arbitrary split of w_d.
2. For s = 1,…,S:
   (a) sample \(\mathbf{z}^{(1,s)} \sim P(\mathbf{z}^{(1,s)} \mid \mathbf{w}_{d}^{(1)},\mathbf{W}_{Train},\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\varPhi})\) using the Gibbs Sampling equations;
   (b) estimate \(\boldsymbol{\theta}^{(s)}_{d}\) from z^{(1,s)}.
3. Approximate \(P(\mathbf{w}_{d} \mid \mathbf{W}_{Train})\) with \(\frac{1}{S} \sum_{s}P(\mathbf{w}_{d}^{(2)} \mid \boldsymbol{\theta}^{(s)}_{d},\boldsymbol{\varPhi})\), where the latter is computed by exploiting the formulas in Sect. 2.1.
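The final averaging step can be sketched as follows, here for the bag-of-words (LDA) likelihood of the held-out half; the Gibbs sampling that produces the θ estimates is assumed to have run already, and all names are ours:

```python
import numpy as np

def heldout_log_prob(w2, theta_samples, phi):
    """Approximate log P(w_d^(2) | W_train) by averaging over S estimates
    of theta_d obtained from Gibbs samples on the first half of the trace.

    w2            : list of token indices in the held-out half
    theta_samples : (S, K) array, one theta estimate per Gibbs sample
    phi           : (K, N) token components
    """
    # P(w2 | theta, phi) = prod_j sum_k theta_k * phi_{k, w_j}
    probs = [np.prod(theta @ phi[:, w2]) for theta in theta_samples]
    return np.log(np.mean(probs))
```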
Following Wallach (2006), in the experiments we use a dataset composed of 150 Psychological Review abstracts drawn from the data made available by Griffiths and Steyvers.^{4} The drawing was made among those documents containing at least 54 tokens. Also, we preprocessed the data as specified in Wallach (2006), by remapping all numbers to the special token NUMBER, and all items with frequency 1 in the training set, or appearing as tokens in the test set but not in the training set, to UNSEEN. The result of the cleaning process is a vocabulary of 860 items. Starting with the cleaned dataset, we performed several random splits, choosing 100 documents as training data and keeping the remaining ones as test data. The splits roughly maintained a 67–33 % proportion on the tokens.
In the following we report the results obtained by the three proposed models. The results are compared with LDA. We also compare the models with the DCMLDA model (Doyle and Elkan 2009). The latter is a modification of LDA that accounts for the tendency of tokens to appear in bursts: if a token appears once in a trace, it is more likely to appear again. DCMLDA does not model sequentiality; however, burstiness can also be interpreted as non-independence between tokens, and in this respect it is interesting to see how the proposed models compare to it. It is worth noticing, however, that burstiness is not necessarily an alternative to sequentiality, as the approaches proposed in this paper can easily be adapted to model a combination of the two.
DCMLDA exhibits the best perplexity, as a result of the customized fitting of token probabilities to a specific document. As a matter of fact, the documents we are investigating here seem to naturally comply with the burstiness assumption.
Also, Token-Bitopic seems to worsen its performance as the number of topics increases. This behavior is worth further explanation. The model conditions the probability of appearance of a token on a pair of latent factors. In a sense, this makes the model comparable to a “fresh” LDA model where the number of latent factors is quadratic in K: in practice, a Token-Bitopic model with K=4 can be deemed similar to an LDA model with K=16 topics, with each pair of latent factors associated with a specific latent factor in the quadratic LDA model. In Fig. 2(c) we compare the two models: they show the same tendency.
For the rest, the models clearly outperform LDA. However, the Token-Bigram model requires further explanation. Both the sampling process and the item selection probabilities rely on the frequencies of bigrams. Zero-frequency bigrams appearing in the test set compromise the evaluation just like zero-frequency items. We chose to treat them by associating them with a default frequency. Figure 2(d) shows how this affects the evaluation: here, NoP corresponds to keeping the original frequency, whereas P3 associates a frequency which implicitly corresponds to flattening all the zero-frequency bigrams to a default UNSEEN bigram. The latter is the one reported in Fig. 2(a). The approaches P1 and P2 correspond to intermediate solutions, where the default frequency of the (implicit) UNSEEN bigram is lowered.^{5}
Finally, Fig. 2(e) reports the running times of the training algorithms on the training data. Although the Topic-Bigram model requires fewer parameters than the Token-Bitopic approach, the learning time of the former is considerably larger. This is mainly due to the larger number of hyperparameters (K×K vs. K) and to the complexity of the M step for the update of the hyperparameters α.
4.2 Recommendation accuracy
Summary statistics on reallife recommendation datasets
                             IPTV1                    IPTV2
                       Training    Test        Training     Test
Users                  16,237      16,153      64,334       63,878
Items                  759         731         2802         2777
Evaluations            314,042     78,557      1,224,790    306,271
Avg # evals (user)     19          5           19           5
Avg # evals (item)     414         107         437          110
Min # evals (user)     4           1           4            1
Min # evals (item)     5           1           5            1
Max # evals (user)     252         15          497          17
Max # evals (item)     2284        1527        9606         3167
Avg time between two evals:
  per user             13 days                 6 days
  per item             9 hours                 23 hours
Testing protocol

For each user u, let \(\mathbf{w}'_{u}\) be the trace associated with u in W_{Train}, and w_u the trace in W_{Test} (with \(n_{u}=|\mathbf{w}_{u}|\)). For each token w_{u,n} ∈ w_u:
1. generate the candidate list \(\mathcal{C}_{u}\) by randomly drawing c items i ≠ w_{u,n} such that \(i \notin\mathcal{I}_{\mathbf{w}'_{u}}\);
2. add w_{u,n} to \(\mathcal{C}_{u}\) and sort the list according to the scoring function provided by the RS;
3. record the position of w_{u,n} in the ordered list: if it belongs to the top-H items we have a hit, otherwise we have a miss.
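One step of the protocol above can be sketched as follows (function name and defaults are ours; the scoring function is whatever the RS under evaluation provides):

```python
import random

def hit_at_h(score, test_token, train_items, all_items, c=100, H=10,
             rng=random.Random(0)):
    """Rank the true next token against c random unseen decoys and report
    whether it lands in the top-H positions (hit) or not (miss).

    score : callable mapping a token to its ranking score for this user
    """
    decoys = rng.sample([i for i in all_items
                         if i != test_token and i not in train_items], c)
    candidates = decoys + [test_token]
    candidates.sort(key=score, reverse=True)
    return candidates.index(test_token) < H   # True = hit, False = miss
```

Averaging the hits over all test tokens yields the recall at H; precision follows by rescaling.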

In the evaluation, we compare the bigram models with some baseline methods from the current literature. These include the aforementioned DCMLDA model, and a version of LDA where, for each user, the tokens represent (unordered) bigrams rather than single item occurrences. This is in practice a preprocessing of the data, which produces a different representation of the dataset upon which the standard LDA model is trained. Clearly, the ranking function has to be tuned accordingly.
We also provide two further baselines. The first one is a simple bigram model where the probability of occurrence of an item is modeled as \(P(w_{n}) = \lambda f_{w_{n}} + (1 - \lambda)f_{w_{n}\mid w_{n-1}} \). Here, f_i is the relative frequency of i in the training set, whereas \(f_{i \mid j}\) represents the same frequency conditioned on a preceding occurrence of j in the sequence. The λ parameter weighs the importance of the two components, and is tuned proportionally to the frequency of i, as low-frequency items typically do not provide reliable estimates of the sequential part.
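The interpolated baseline score can be written directly (a sketch; a fixed λ is used here for illustration, whereas the paper tunes it per item frequency):

```python
def simple_bigram_score(w_n, w_prev, unigram_freq, bigram_freq, lam=0.5):
    """P(w_n) = lam * f_{w_n} + (1 - lam) * f_{w_n | w_{n-1}}.

    unigram_freq : dict token -> relative frequency in the training set
    bigram_freq  : dict (prev, token) -> conditional frequency f_{token | prev}
    lam          : interpolation weight (fixed here for illustration)
    """
    return (lam * unigram_freq.get(w_n, 0.0)
            + (1 - lam) * bigram_freq.get((w_prev, w_n), 0.0))
```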
Finally, we also compare the proposed models to a baseline rooted in matrix factorization (Koren et al. 2009; Menon and Elkan 2011). The basic idea here is to exploit matrix factorization for ranking, e.g., by providing an estimate of the probability of the item appearance (Menon and Elkan 2010). There are some issues to consider when applying matrix factorization to the case at hand. In our context, matrix factorization is aimed at modeling item occurrence rather than an explicit rating. In this respect, the non-occurrence of an item has a bivalent interpretation, either as unknown (the user has not considered the item yet) or negative (she does not prefer it at all). Thus, the traditional approaches based on explicit preference (such as Salakhutdinov and Mnih 2007) cannot be applied. We experimented with several specific techniques, including Hu et al. (2008), Sindhwani et al. (2010) and the standard SVD model. In the following, we report the results of SVD^{6}, which still outperforms all the other methods, confirming the findings in Cremonesi et al. (2010), Barbieri and Manco (2011).
Results
On both datasets, the proposed models improve on the baselines. Concerning IPTV1, both Topic-Bigram and Token-Bigram achieve a significant margin with respect to the other competitors. On IPTV2, Token-Bigram outperforms Topic-Bigram, which is still the runner-up performer.
- The underlying assumption within Token-Bitopic does not yield a remarkable increase in the predictive capabilities of the model. In practice, the topic structure of the Token-Bitopic model can be “simulated” by an LDA model with a quadratic number of topics. As a result, the model seems more prone to overfitting.
- Contextual information, with particular reference to sequence modeling, provides a substantial contribution to recommendation accuracy. This is proven not only by the models proposed in this paper: even the Simple-Bigram baseline achieves remarkable accuracy. In particular, when the recommendation list is relatively small, the latter achieves an accuracy comparable to Token-Bitopic. As a matter of fact, all the sequential approaches seem to provide a better estimate of the selection probability for the user’s next choice.
- There is a strict correlation between the frequencies exhibited by bigrams and the performance of the Token-Bigram model. IPTV2 exhibits more frequent bigrams, and hence is more likely to boost the performance of the Token-Bigram model. Conversely, the Topic-Bigram model exhibits a better capability of generalizing the dependency between the previous hidden context and the next choice. Geometrically, while the Token-Bigram model focuses exclusively on a restricted area of the topic space, induced by considering only the previous item, the Topic-Bigram model is actually able to identify larger homogeneous regions within the topic space and to estimate the connections (transition probabilities) between them.
- Among the competitors, DCMLDA is rather weak. This is somehow surprising, considering that DCMLDA exhibits the best perplexity in the previous set of experiments. A viable explanation of this dichotomy can be found in the nature of the sequential data explored here, which does not necessarily support burstiness: notably, in a movie rental scenario, once a movie is rented by a user, it is unlikely to be rented again in the future.
- LDA-Bigram does not provide a substantial improvement either. Again, this is in some sense unexpected, as bigrams can be considered contextual information as well. It seems that, when bigrams are introduced without an ordering relationship, the resulting ranking function is weakened.
The last two plots in Fig. 8 highlight the contribution of asymmetric priors to the learning process. As expected, asymmetric priors significantly improve the accuracy. However, the learning time is greatly affected, as learning these parameters requires a further iterative fixed-point procedure embedded in the main algorithm, as explained in Sect. 2.2.
5 Related work
The generative process, which is common to many extensions of Latent Dirichlet Allocation (Blei 2011), is strongly based on a “bag-of-words” assumption. Even if this assumption may sound unrealistic, it works remarkably well in practice. Latent Dirichlet Allocation and similar models combine the structure-discovery power of dimensionality reduction approaches, such as latent semantic indexing (Deerwester 1988), with informative prior modeling, estimated by Bayesian inference techniques. The definition of the topic space, and of the projection of each document into this space, provides an effective tool to infer the semantic concept of each document or, more generally, entity. In particular, these approaches support three main tasks (Griffiths et al. 2007): topic extraction, word sense disambiguation and prediction.
Among all the different contexts in which these approaches have achieved significant results, in this paper we consider the application of probabilistic topic models to the recommendation problem (Hofmann 2004). As mentioned above, this choice is motivated by some interesting recent findings (Barbieri and Manco 2011), which can be summarized as follows: (i) the item-selection probability computed for each user is a key component for generating accurate item-ranking functions; (ii) among all competitors, LDA provides the best results measured in precision and recall of the recommendation list. These promising results motivate us to explore extensions of topic models which may provide a better representation of the inherent sequential correlation between items, and thus better predictive performance. In the following, we briefly review state-of-the-art probabilistic approaches to sequence data modeling, mainly focusing on topic approaches.
A simple approach to modeling sequential data within a probabilistic framework was proposed in Cadez et al. (2000). In that work, the authors present a framework based on mixtures of Markov models for clustering and modeling web site navigation logs, which is applied to clustering and visualizing user behavior on a web site. Albeit simple, the proposed model suffers from the limitation that a single latent topic underlies all the observations in a single sequence. This approach has been overtaken by other methods based on latent semantic indexing and LDA. In Wallach (2006), Wang and Wei (2007), for example, the authors propose extensions of the LDA model which assume a first-order Markov chain for the word generation process. In the resulting Token-Bigram model (see Sect. 2) and Topical n-grams, the current word depends on the current topic and the previous word observed in the sequence.
The LDA Collocation Model (Griffiths et al. 2007) introduces a new set of random variables (for bigram status) which denote whether a bigram can be formed with the previous word token. More specifically, as represented in Fig. 9(b), the generative process specifies for each word both a topic and a collocation status. The collocation status allows more flexible modeling than the Token-Bigram model, which always generates bigrams; moreover, in this formulation, the distribution on bigrams does not depend on the topic. The introduction of the collocation status enriches the generative semantics of the model, and this idea can be applied to all the approaches proposed in Sect. 2.
All the previously discussed models approach the problem of sequence modeling by inferring the underlying latent topics and then generating a sequence of words according to this distribution. This perspective does not take into account the fact that words in a text document may exhibit both syntactic and semantic correlations. A Composite Model, which captures both semantic and syntactic roles, was proposed in Griffiths et al. (2005). The graphical model for the generation of a document, given in Fig. 9(c), clarifies this concept. The semantic/syntactic dependencies among words are modeled by employing two different latent variables, namely Z and C; while the semantic layer follows a simple LDA model, the syntactic one is instantiated by modeling transitions between the set of classes C through a hidden Markov model. One of these classes corresponds to the semantic class and, when it is observed, enables the generation of the word according to the current topic. The other classes capture word co-occurrences that are due to syntactic aspects of the modeled language.
Textual documents exhibit a natural sequential structure: people develop documents by building upon a main semantic concept, and by interleaving several segments/subsections, which express related topics, in a coherent logical flow. HTMM models topic cohesion at the level of phrases (words within the same sentence share the same latent topic), but does not directly model a smooth evolution between the topics of the different segments that frame a document. Sequential LDA (Du et al. 2010) is a variant of LDA which models a sequential dependency between subtopics: the topic of the current segment is closely related to the topics of its antecedent and subsequent segments. This smooth evolution of the topic flow is modeled by using a Poisson-Dirichlet process.
The sequential structure is not limited to words: it can also affect sentiments. Dependency-Sentiment-LDA (Li et al. 2010) builds on the assumption that sentiments are expressed in a coherent way. Conjunctive words, such as “and” or “but”, can be used to detect sentiment transitions, and the sentiment of a word depends on the sentiment of the previous one.
6 Conclusion and future work
In this paper we studied three extensions of the LDA model which relax the bag-of-words assumption by hypothesizing that the current observation depends on previous information. For each of the proposed models we provided a Gibbs Sampling parameter estimation procedure, and an experimental evaluation was accomplished by studying the models from both a model-fitting and an applicative perspective. In particular, the proposed models provide a better framework for modeling contextual information in a recommendation scenario, when the data exhibit an intrinsic temporal dependency.
We believe that the models and results presented in this paper open two interesting research directions. On one side, it would be interesting to generalize the notion of “contextual information”: in this paper, a context was represented by temporal dependency. However, there are other observable features that can contribute to the likelihood of observing an item in a user’s trace, such as geographical location, tags, etc.
On the other side, the interactions of a user within a social network have an increasing impact on her behavior. Analyzing the influence of the neighbors in a network (Barbieri et al. 2013) can help to better evaluate both the temporal dependencies and the likelihood of an item being selected.
Footnotes
 3.
It is also worth noticing that higher-order dependencies introduce an impractical computational overhead, as the number of parameters grows exponentially with the order of the chain (Bishop 2006, Chap. 13).
 5.
Clearly, this is where nonparametric methods should be used to provide a gradual step into the TokenBigram model. Integrating nonparametric techniques into the TokenBigram model would better handle cases with scarcer data, and it would automatically solve the treatment of zero-frequency items.
 6.
Based on the SVDLIBC implementation, http://tedlab.mit.edu/~dr/SVDLIBC/. The other matrix factorization methods were obtained from the GraphLab library, http://graphlab.org/.
Acknowledgements
We would like to thank Charles Elkan for kindly providing the Matlab code for the DCM-LDA model.
References
 Bambini, R., Cremonesi, P., & Turrin, R. (2011). A recommender system for an IPTV service provider: a real large-scale production environment. In F. Ricci, L. Rokach, B. Shapira, & P. Kantor (Eds.), Recommender systems handbook (pp. 299–331). Berlin: Springer.
 Barbieri, N., Bonchi, F., & Manco, G. (2013). Cascade-based community detection. In Proceedings of the 6th ACM international conference on web search and data mining (WSDM’13) (pp. 33–42).
 Barbieri, N., Costa, G., Manco, G., & Ortale, R. (2011). Modeling item selection and relevance for accurate recommendations: a Bayesian approach. In Proceedings of the 5th ACM conference on recommender systems (RecSys’11) (pp. 21–28).
 Barbieri, N., & Manco, G. (2011). An analysis of probabilistic methods for top-n recommendation in collaborative filtering. In Proceedings of the European conference on machine learning and knowledge discovery in databases (ECML-PKDD’11) (pp. 172–187).
 Barbieri, N., Manco, G., Ortale, R., & Ritacco, E. (2012). Balancing prediction and recommendation accuracy: hierarchical latent factors for preference data. In Proceedings of the 12th SIAM international conference on data mining (SDM’12).
 Bishop, C. (2006). Pattern recognition and machine learning. New York: Springer.
 Blei, D. M. (2011). Introduction to probabilistic topic models. Communications of the ACM.
 Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
 Cadez, I., Heckerman, D., Meek, C., Smyth, P., & White, S. (2000). Visualization of navigation patterns on a web site using model-based clustering. In Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’00) (pp. 280–284).
 Cremonesi, P., Koren, Y., & Turrin, R. (2010). Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the 4th ACM conference on recommender systems (RecSys’10) (pp. 39–46).
 Cremonesi, P., & Turrin, R. (2009). Analysis of cold-start recommendations in IPTV systems. In Proceedings of the 3rd ACM conference on recommender systems (RecSys’09) (pp. 233–236).
 Deerwester, S. (1988). Improving information retrieval with latent semantic indexing. In C. L. Borgman & E. Y. H. Pai (Eds.), Proceedings of the 51st ASIS annual meeting (ASIS’88) (Vol. 25).
 Doyle, G., & Elkan, C. (2009). Accounting for burstiness in topic models. In Proceedings of the 26th international conference on machine learning (ICML’09) (p. 36).
 Du, L., Buntine, W. L., & Jin, H. (2010). Sequential latent Dirichlet allocation: discover underlying topic structures within a document. In Proceedings of the 10th IEEE international conference on data mining (ICDM’10) (pp. 148–157).
 Griffiths, T., Steyvers, M., Blei, D., & Tenenbaum, J. (2005). Integrating topics and syntax. In Advances in neural information processing systems (NIPS’05).
 Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114.
 Gruber, A., Weiss, Y., & Rosen-Zvi, M. (2007). Hidden topic Markov models. Journal of Machine Learning Research, 2, 162–170.
 Heinrich, G. (2008). Parameter estimation for text analysis (Tech. rep.). University of Leipzig.
 Hofmann, T. (2004). Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1), 89–115.
 Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE international conference on data mining (ICDM’08) (pp. 263–272).
 Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. IEEE Computer, 42(8), 30–37.
 Li, F., Huang, M., & Zhu, X. (2010). Sentiment analysis with global topics and local dependency. In Proceedings of the 24th AAAI conference on artificial intelligence (AAAI’10).
 Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge: MIT Press.
 Menon, A., & Elkan, C. (2010). Predicting labels for dyadic data. Data Mining and Knowledge Discovery, 21(2), 327–343.
 Menon, A., & Elkan, C. (2011). Link prediction via matrix factorization. In Proceedings of the European conference on machine learning and knowledge discovery in databases (ECML-PKDD’11) (pp. 437–452).
 Minka, T. P. (2000). Estimating a Dirichlet distribution (Tech. rep.). Microsoft Research. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf.
 Salakhutdinov, R., & Mnih, A. (2007). Probabilistic matrix factorization. In Proceedings of the 21st annual conference on neural information processing systems (NIPS’07).
 Sindhwani, V., Bucak, S., Hu, J., & Mojsilovic, A. (2010). One-class matrix completion with low-density factorizations. In Proceedings of the 10th IEEE international conference on data mining (ICDM’10) (pp. 1055–1060).
 Wallach, H., Mimno, D., & McCallum, A. (2009a). Rethinking LDA: why priors matter. In Advances in neural information processing systems (NIPS’09) (pp. 1973–1981).
 Wallach, H., Murray, I., Salakhutdinov, R., & Mimno, D. (2009b). Evaluation methods for topic models. In Proceedings of the 26th international conference on machine learning (ICML’09).
 Wallach, H. M. (2006). Topic modeling: beyond bag-of-words. In Proceedings of the 23rd international conference on machine learning (ICML’06) (pp. 977–984).
 Wang, X., McCallum, A., & Wei, X. (2007). Topical n-grams: phrase and topic discovery, with an application to information retrieval. In Proceedings of the 7th IEEE international conference on data mining (ICDM’07) (pp. 697–702).