1 Introduction

Users express their preferences in their consumption behaviors, through the products they purchase, the social media postings they like, the songs they listen to, the online videos they watch, etc. These behaviors are leaving increasingly greater traces of data that could be analyzed to model user preferences. Modeling these preferences has important applications, such as estimating consumer demand, profiling customer segments, or supporting product recommendation.

There are diverse forms of expression of preferences yielding different types of observations. Most of the previous works deal with ordinal preference, where the objective is to model the observed interactions between users and items [1]. In this scenario, a user’s preference for an item is commonly expressed along some ordinal scale, e.g., higher rating indicating greater liking or preference.

In this work, we are interested in another category, namely: sequential preference, where the objective is to model the sequential effect between adjacent items in a sequence. In this scenario, preference is expressed in terms of which other items may be preferred after consuming an item. For instance, a user’s stream of tweets may reveal which topics tend to follow a topic, e.g., commenting on politics upon reading morning news followed by more professional postings during working hours. The sequence of songs one listens to may express a preference for which genre follows another, e.g., more upbeat tempo during a workout followed by slower music while cooling down. Similarly, sequential preferences may also manifest in the books one reads, the movies one watches, etc.

Problem. Given a set of item sequences, we seek a probabilistic model for sequential preferences, so as to estimate the likelihood of future items in any particular sequence. Each sequence (e.g., a playlist, a stream of tweets) is assumed to have been generated by a single user.

To achieve this goal, we turn to probabilistic models for general sequences. While several such models have been studied in the literature (see Sect. 2), here we build on the foundation of the well-accepted Hidden Markov Model (HMM) [16], which has been shown to be effective in various applications, including speech and handwriting recognition. We review HMM in Sect. 3. Briefly, it models a number of hidden states. To generate each sequence, we move from one state to another based on transition probability. Each item in the sequence is sampled from the corresponding state’s emission probability.

While HMM is fundamentally sound as a basic model for sequences, we identify two significant factors, as yet unexploited, which would contribute towards greater effectiveness for modeling sequential preferences. First, the generation of an item from a state’s emission in HMM is only dependent on the state. As we are concerned with user-generated sequences, the selection of items may also be affected by the user’s preferences. Due to the sparsity of information on individual users, however, we stop short of modeling individual emissions. Rather, we model latent groups, whereby users in the same group share similar preferences over items, i.e., emissions. Second, the transition to the next state in HMM is only dependent on the previous state. We posit that the context in which a transition is about to take place also plays a role. For example, in the scenario of musical playlists, let us suppose that a particular state represents the genre of soft rock. There are different songs in this genre. If a user likes the artist of the current song, she may wish to listen to more songs by the same artist. Otherwise, she may wish to change to a different genre altogether. In this case, the artist is an observed feature of the context that may influence the transition dynamically.

Contributions. In this work, we make the following contributions. First, we develop a probabilistic model for sequences, whereby transitions from one state to another state may be dynamically influenced by the context features, and emissions are influenced by latent groups of users. We develop this model systematically in Sect. 4, and describe how to learn the model parameters, as well as to generate item predictions in Sect. 5. Second, we evaluate these models comprehensively in Sect. 6 over varied datasets. Experiments on a synthetic dataset investigate the contributions of our innovations on a dataset with known parameters. Experiments on publicly available real-life sequence datasets (song playlists from Yes.com and hashtag sequences from Twitter.com) further showcase accuracy improvements in predicting the next item in sequences.

2 Related Work

Here, we survey the literature on modeling various types of user preferences.

Ordinal Preferences. First, we look at ordinal preferences, which model a user’s preference for an item in terms of rating or ranking. The most common framework is matrix factorization [11, 17, 20], where the observed user-by-item rating matrix is factorized into a number of latent factors, so as to enable prediction of missing values. Another framework is restricted Boltzmann machines [21] based on neural networks. Meanwhile, latent semantic analysis [8, 9] models the association among users, items, and ratings via multinomial probabilities. These works are orthogonal to ours, as the main interactions they seek to model are user-to-item ratings/rankings, rather than item-to-item sequences.

Sequential Preferences. Our work falls into sequential preferences, which models sequences of items, so as to enable prediction of future items. As mentioned in Sect. 1, our contribution is in factoring dynamic context-biased transition and user-biased emission. To make the effects of these dynamic factors clear, we build on the foundation of HMM [16], and focus our comparisons against this base platform. Aside from HMM, there could potentially be different ways to tackle this problem such as probabilistic automata [7] and recurrent neural networks [14], which are beyond the scope of this paper. Other works deal with sequences, but with different objectives. Markov decision processes [2, 22, 23] are concerned with how to make use of the transitions to arrive at an “optimal policy”: a plan of actions to maximize some utility function. Sequential pattern mining [15] finds frequent sequential patterns, but these require exact matches of items in sequences. [4, 13] model sequences in terms of Euclidean distances in metric embedding space. Aside from different objectives, these works also model explicit transitions among items, in contrast to our modeling of latent states.

Hybrid Models. Efforts to integrate ordinal and sequential preferences combine the “long-term” (items a user generally likes) and “short-term” preferences (items frequently consumed within a session). [27] models the problem as random walks in a session-based temporal graph. [26] designs a two-layer representation model for items: the first layer models interaction with previous item and the second layer models interaction with the user. [6, 18] conduct joint factorization of user-by-item rating matrix and item-by-item transition matrix. It is not the focus of our current work to incorporate ordinal preferences directly, or to rely on full personalization by associating each user with an individual parameter.

Temporal Models. Aside from the notion of sequence, there are other temporal factors affecting recommendation. [19] assumes that users may change their ordinal preferences over time. [3] models the scenario where users “lose interest” over time. [10] takes into account the life stage of a consumer, e.g., products for babies of different ages, while [28] intends to model evolutions that advance “forward” in event sequences without going “backward”. [25] seeks to predict not what, but rather when to recommend an item. [5] considers how changes in social relationships over time may affect a user’s receptiveness or interest to change. In these and other cases, the key relationship being modeled is that between user and time, which is orthogonal to our focus in modeling item sequences.

3 Preliminaries

Towards capturing sequential preferences, our model builds upon HMM. The standard HMM assumes a series of discrete time steps \(t=1,2,\ldots \), where an item \(Y_t\) can be observed at step t. To model the sequential effect in this series of observed items, HMM employs a Markov chain over a latent finite state space across the time steps. As illustrated in Fig. 1, at each time step t a latent state \(X_t\) is transitioned from the previous state \(X_{t-1}\) in a Markovian manner, i.e., \(P(X_t|X_{t-1},X_{t-2},\ldots ,X_1)\equiv P(X_t|X_{t-1})\), known as the transition probability.

Fig. 1. A standard HMM for sequential preferences

Formally, consider an HMM with a set of observable items \(\mathcal {Y}\) and a set of latent states \(\mathcal {X}\). It can be fully specified by a triplet of parameters \(\theta =(\pi , A, B)\), such that \(\forall x,u \in \mathcal {X},y\in \mathcal {Y},t\in \{1,2,\ldots \}\),

  • \(\pi \) is the initial state distribution with \(\pi _{x} \triangleq P(X_1 = x)\);

  • A is the transition matrix with \(A_{xu} = P(X_{t} = u | X_{t-1} = x)\);

  • B is the emission matrix with \(B_{xy} = P(Y_{t} = y | X_{t} = x)\).

Given a sequence of items \(Y_1,\ldots ,Y_t\), the optimal parameters \(\theta ^*\) can be learned by maximum likelihood (Eq. 1). Note that we can easily extend the likelihood function to accommodate multiple sequences, but for simplicity we only demonstrate with a single sequence throughout the technical discussion. Moreover, given \(\theta ^*\) and a sequence of items \(Y_1,\ldots ,Y_t\), the next item \(y^*\) can be predicted by maximum a posteriori probability (Eq. 2). Both learning and prediction can be efficiently solved using the forward-backward algorithm [16].

$$\begin{aligned} \theta ^*&= \textstyle {{\mathrm{arg\,max}}}_\theta \,P(Y_1,...,Y_t;\theta ) \end{aligned}$$
(1)
$$\begin{aligned} y^*&= \textstyle {{\mathrm{arg\,max}}}_y\,P(Y_{t+1}=y|Y_1,\ldots ,Y_t;\theta ^*) \end{aligned}$$
(2)
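As an illustration of Eq. 2, the following is a minimal sketch of next-item prediction for a standard HMM using only the forward pass. The function name `predict_next_item` and the toy values of `pi`, `A`, and `B` are assumptions made for this example and do not come from the experiments; items and states are represented by 0-based integer indices.

```python
def predict_next_item(pi, A, B, items):
    """Return (best_item, probs) for Y_{t+1} given observed item indices.

    pi[x]   = P(X_1 = x)
    A[x][u] = P(X_t = u | X_{t-1} = x)
    B[x][y] = P(Y_t = y | X_t = x)
    """
    n_states = len(pi)
    n_items = len(B[0])
    # Forward pass: alpha[x] = P(Y_1..Y_t, X_t = x)
    alpha = [pi[x] * B[x][items[0]] for x in range(n_states)]
    for y in items[1:]:
        alpha = [
            B[x][y] * sum(alpha[u] * A[u][x] for u in range(n_states))
            for x in range(n_states)
        ]
    # One-step look-ahead: joint weight of each next state, then of each next item
    nxt = [sum(alpha[u] * A[u][x] for u in range(n_states)) for x in range(n_states)]
    scores = [sum(nxt[x] * B[x][y] for x in range(n_states)) for y in range(n_items)]
    total = sum(scores)
    probs = [s / total for s in scores]  # normalise to P(Y_{t+1} = y | Y_1..Y_t)
    best = max(range(n_items), key=lambda y: probs[y])
    return best, probs


# Toy parameters (hypothetical): 2 states, 3 items
pi = [0.6, 0.4]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
best, probs = predict_next_item(pi, A, B, [0, 0, 2])
```

For long sequences the unnormalised forward values underflow; rescaling `alpha` at every step (or working in log space) is the usual remedy and does not change the arg max.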

4 Proposed Models

In a standard HMM, item emission probabilities are invariant across users, and state transition probabilities are independent of contexts at different times. However, these assumptions often deviate from real-world scenarios, in which different users and contexts may have important bearing on emissions and transitions. In this section, we model dynamic emissions and transitions respectively, and ultimately jointly, to better capture sequential preferences.

4.1 Modeling Dynamic User-Biased Emissions (SEQ-E)

It is often attractive to consider personalized preferences [18], where different user sequences may exhibit different emissions even though they share a similar transition. For instance, while two users both transit from soft rock to hard rock in their respective playlist, they might still choose songs of different artists in each genre. As another example, two users both transit from spring to summer in their apparel purchases, but still prefer different brands in each season. However, a fully personalized model catered to every individual user is often impractical due to inadequate training data for each user. We hypothesize that there exist different groups such that users across groups manifest different emission probabilities, whereas users in the same group share the same emission probabilities.

Fig. 2. Sequential models with dynamic user groups and contexts

In Fig. 2(a), we introduce a variable \(G_u\) to represent the group assignment of each user u. For simplicity, our technical formulation presents a single sequence and hence only one user. Thus, we omit the user notation u when no ambiguity arises. Assuming a set of groups \(\mathcal {G}\), the new model can be formally specified by the parameters \((\pi ,\sigma ,A,B)\), such that \(\forall x \in \mathcal {X}, y\in \mathcal {Y}, g\in \mathcal {G}, t\in \{1,2,\ldots \}\),

  • \(\pi \) and A are the same as in a standard HMM;

  • \(\sigma \) is the group distribution with \(\sigma _g=P(G=g)\);

  • B is the new emission tensor with \(B_{gxy} = P(Y_{t} = y | X_{t} = x, G = g)\).

4.2 Modeling Dynamic Context-Biased Transitions (SEQ-T)

In standard HMM, the transition matrix is invariant over time. In real-world applications, this assumption may not hold. The transition probability may change depending on contexts that vary with time. Consider modeling a playlist of songs, where the transitions between genres are captured. The transition probabilities could be influenced by characteristics of the current song (e.g., artist, lyrics and sentiment). A fan of the current artist may break her usual pattern of genre transition and stick to genres by the same artist for the next few songs. As another example, a user purchasing apparels throughout the year may follow seasonal transitions. If satisfied with certain qualities (e.g., material and style) of past purchases, she may buy more such apparels out of season to secure discounts, breaking the usual seasonal pattern. We call such characteristics context features.

It is infeasible to differentiate transition probabilities by individual context features directly, which would blow up the parameter space and thus pose serious computational and data sparsity obstacles. Instead, we propose to model a single context factor that directly influences the next transition. The context factor, being latent, manifests itself through the observable context features.

As illustrated in Fig. 2(b), consider a set of context features \(F=\{F^1,F^2,\ldots \}\). As feature values vary over time, let \(F_t = \left( F_t^1,F_t^2,\ldots \right) \) denote the feature vector at time t. Each feature \(F^i\) takes values in a set \(\mathcal {F}^i\), i.e., \(F_t^i \in \mathcal {F}^i,\forall i \in \{1,...,|F|\},t\in \{1,2,\ldots \}\). Similarly, let \(R_{t}\) denote the latent context factor at time t, and \(\mathcal {R}\) denote the set of context factor levels, i.e., \(R_t \in \mathcal {R},\forall t \in \{1,2,\ldots \}\). Finally, the model can be specified by the parameters \((\pi ,\rho ,A,B,C)\), such that \(\forall x,u\in \mathcal {X},r\in \mathcal {R},i\in \{1,\ldots ,|F|\},f\in \mathcal {F}^i, t\in \{1,2,\ldots \}\),

  • \(\pi \) and B are the same as in a standard HMM;

  • \(\rho \) is the distribution of the latent context factor with \(\rho _r=P(R_t=r)\);

  • C is the feature probability matrix with \(C_{rif} = P(F_t^i = f | R_t = r)\);

  • A is the new transition tensor with \(A_{rxu} = P(X_{t}=u | X_{t-1}=x, R_{t-1} = r)\).
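To make the role of the latent context factor concrete, the sketch below marginalises it out given the observed features at one time step. Conditioning only on the current state and current features, the graphical structure gives \(P(R_t=r|F_t)\propto \rho _r\prod _i C_{rif_t^i}\) and \(P(X_{t+1}=u|X_t=x,F_t)=\sum _r P(R_t=r|F_t)A_{rxu}\). The function names and all numeric values are assumptions chosen for illustration.

```python
def context_posterior(rho, C, features):
    """P(R_t = r | F_t), proportional to rho[r] * prod_i C[r][i][f_i]."""
    weights = []
    for r in range(len(rho)):
        w = rho[r]
        for i, f in enumerate(features):
            w *= C[r][i][f]
        weights.append(w)
    z = sum(weights)
    return [w / z for w in weights]


def effective_transition(rho, C, A, features):
    """P(X_{t+1} = u | X_t = x, F_t) = sum_r P(R_t = r | F_t) * A[r][x][u]."""
    post = context_posterior(rho, C, features)
    n_states = len(A[0])
    return [
        [sum(post[r] * A[r][x][u] for r in range(len(rho))) for u in range(n_states)]
        for x in range(n_states)
    ]


# Hypothetical toy parameters: 2 context levels, 2 binary features, 2 states
rho = [0.3, 0.7]
C = [  # C[r][i][f]
    [[0.1, 0.9], [0.8, 0.2]],
    [[0.7, 0.3], [0.4, 0.6]],
]
A = [  # A[r][x][u]
    [[0.9, 0.1], [0.1, 0.9]],  # level 0: tends to stay in the same state
    [[0.2, 0.8], [0.8, 0.2]],  # level 1: tends to switch state
]
post = context_posterior(rho, C, [1, 0])
T_eff = effective_transition(rho, C, A, [1, 0])
```

With these values the features favour level 0, so the effective transition matrix leans towards self-transition even though level 1 has the larger prior.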

4.3 Joint Model (SEQ*)

As discussed, user groups and context features can dynamically bias the emission and transition probabilities, respectively. Here, we consider both users and contexts in a joint model, as shown in Fig. 2(c). Accounting for all the parameters defined earlier, the joint model is specified by a six-tuple \(\theta = (\pi ,\sigma ,\rho ,A,B,C)\). The algorithm for learning and inference will be discussed in the next section.

5 Learning and Prediction

We now present efficient learning and prediction algorithms for the joint model. Note that the user-biased and context-biased models are degenerate cases of the joint model: the former assumes one context factor level (i.e., \(|\mathcal {R}|=1\)) and no features (i.e., \(F=\emptyset \)), whereas the latter assumes one user group (i.e., \(|\mathcal {G}|=1\)).

5.1 Parameter Learning

The goal of learning is to optimize the parameters \(\theta =(\pi ,\sigma ,\rho ,A,B,C)\) through maximum likelihood, given the observed items and features. Consider a sequence of \(T > 1\) time steps. Let \(\underline{Y}\triangleq (Y_1,\ldots ,Y_T)\) be a shorthand, and similarly for \(\underline{F}, \underline{X},\underline{R}\). The optimal parameters can then be obtained as follows.

$$\begin{aligned} \textstyle \theta ^*={{\mathrm{arg\,max}}}_{\theta }\log P(\underline{Y},\underline{F};\theta ) \end{aligned}$$
(3)

We demonstrate with one sequence for simpler notations. The algorithm can be trivially extended to enable multiple sequences as briefly described later.

Expectation Maximization (EM). We apply the EM algorithm to solve the above optimization problem. Each iteration consists of two steps below.

  • E-step. Given parameters \(\theta '\) from the last iteration (or random ones in the first iteration), calculate the expectation of the log likelihood function:

    $$\begin{aligned} Q(\theta |\theta ') = \textstyle \sum _{\underline{X},G,\underline{R}}P(\underline{X},G,\underline{R}|\underline{Y},\underline{F};\theta ') \log P(\underline{Y},\underline{F},\underline{X},G,\underline{R};\theta ) \end{aligned}$$
    (4)
  • M-step. Update the parameters \(\theta ={{\mathrm{arg\,max}}}_{\theta }Q(\theta |\theta ')\).

Given the graphical model in Fig. 2(c), the joint probability \(P(\underline{Y},\underline{F},\underline{X},G,\underline{R})\) can be factorized as

$$\begin{aligned} P(G)P(X_1)\cdot \prod _{t=1}^{T} \left( P(Y_t|G, X_t)P(R_{t})\prod _{i=1}^{|F|}P(F_{t}^i | R_{t})\right) \cdot \prod _{t=1}^{T-1}P(X_{t+1} | X_{t}, R_{t}). \end{aligned}$$
(5)
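The factorization in Eq. 5 can be checked numerically. The following minimal sketch evaluates the complete-data log probability for one fully specified assignment of group, states, and context levels. The function name `joint_log_prob`, the 0-based index conventions `A[r][x][u]`, `B[g][x][y]`, `C[r][i][f]`, and the values of A, B, and C are assumptions for this example.

```python
import math


def joint_log_prob(theta, g, xs, rs, ys, fs):
    """log P(Y, F, X, G, R) following the three factors of Eq. 5."""
    pi, sigma, rho, A, B, C = theta
    T = len(ys)
    logp = math.log(sigma[g]) + math.log(pi[xs[0]])   # P(G) P(X_1)
    for t in range(T):
        logp += math.log(B[g][xs[t]][ys[t]])          # P(Y_t | G, X_t)
        logp += math.log(rho[rs[t]])                  # P(R_t)
        for i, f in enumerate(fs[t]):
            logp += math.log(C[rs[t]][i][f])          # P(F_t^i | R_t)
    for t in range(T - 1):
        logp += math.log(A[rs[t]][xs[t]][xs[t + 1]])  # P(X_{t+1} | X_t, R_t)
    return logp


# Hypothetical toy parameters: 2 groups, 2 states, 2 levels, 2 items, 1 binary feature
theta = (
    [0.8, 0.2],                                            # pi
    [0.9, 0.1],                                            # sigma
    [0.3, 0.7],                                            # rho
    [[[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.9, 0.1]]],  # A[r][x][u]
    [[[0.7, 0.3], [0.2, 0.8]], [[0.4, 0.6], [0.5, 0.5]]],  # B[g][x][y]
    [[[0.6, 0.4]], [[0.1, 0.9]]],                          # C[r][i][f]
)
lp = joint_log_prob(theta, g=0, xs=[0, 1], rs=[1, 0], ys=[0, 1], fs=[[1], [0]])
```

Note that the context level at the final step contributes only through \(P(R_T)\) and the feature terms, since no transition follows it.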

Maximizing the expectation \(Q(\theta |\theta ')\) is equivalent to maximizing the following, assuming that \(Y_t=y_t\) and \(F_t^i=f_t^i\) are observed, \(\forall t\in \{1,\ldots ,T\}, i\in \{1,\ldots ,|F|\}\).

$$\begin{aligned}&\textstyle \sum _{x \in \mathcal {X}}P(X_1=x|\underline{Y}, \underline{F}; \theta ')\log \pi _{x} + \sum _{g \in \mathcal {G}}P(G=g|\underline{Y}, \underline{F}; \theta ')\log \sigma _g \nonumber \\ +&\textstyle \sum _{t=1}^T \sum _{r \in \mathcal {R}}P(R_{t}=r|\underline{Y}, \underline{F}; \theta ')\log \rho _{r}\nonumber \\ +&\textstyle \sum _{t = 1}^{T-1}\sum _{x \in \mathcal {X}}\sum _{u\in \mathcal {X}}\sum _{r \in \mathcal {R}} P(R_{t}=r, X_{t}=x, X_{t+1}=u|\underline{Y}, \underline{F}; \theta ')\log A_{rxu} \nonumber \\ +&\textstyle \sum _{t = 1}^{T}\sum _{x \in \mathcal {X}}\sum _{g \in \mathcal {G}}P(X_t=x, G=g| \underline{Y}, \underline{F}; \theta ') \log B_{gx{y_t}}\nonumber \\ +&\textstyle \sum _{t = 1}^{T}\sum _{i=1}^{|F|}\sum _{r \in \mathcal {R}}P(R_{t}=r | \underline{Y}, \underline{F}; \theta ')\log C_{rif_t^i} \end{aligned}$$
(6)

The optimization problem is further constrained by laws of probability, such that \(\sum _{x\in \mathcal {X}}\pi _{x} = 1, \sum _{g\in \mathcal {G}}\sigma _{g} = 1, \sum _{r \in \mathcal {R}}\rho _{r} = 1, \sum _{u\in \mathcal {X}}A_{rxu} = 1, \sum _{y \in \mathcal {Y}} B_{gxy} = 1\) and \(\sum _{f\in \mathcal {F}^i}C_{rif} = 1\). Applying Lagrange multipliers, we can derive the following updating rules.

$$\begin{aligned} \pi _x&= \frac{P(X_1 = x | \underline{Y}, \underline{F}; \theta ')}{1} = \frac{\sum _{g\in \mathcal {G}}\sum _{r\in \mathcal {R}}\gamma _{gxr}(1)}{1},\\ \sigma _g&= \frac{P(G=g| \underline{Y}, \underline{F}; \theta ')}{1} = \frac{\sum _{x\in \mathcal {X}}\sum _{r\in \mathcal {R}}\gamma _{gxr}(1)}{1},\nonumber \\ \rho _r&= \frac{\sum _{t=1}^{T}P(R_{t} = r| \underline{Y}, \underline{F}; \theta ')}{\sum _{t=1}^{T}\sum _{k\in \mathcal {R}}P(R_{t} = k| \underline{Y}, \underline{F}; \theta ')} = \frac{\sum _{g\in \mathcal {G}}\sum _{x\in \mathcal {X}}\sum _{t=1}^{T}\gamma _{gxr}(t)}{T},\nonumber \\ A_{rxu}&= \frac{\sum _{t=1}^{T-1}P(R_{t} = r,X_{t} = x,X_{t+1} = u| \underline{Y},\underline{F}; \theta ')}{\sum _{t=1}^{T-1}P(R_{t} = r,X_{t}=x| \underline{Y}, \underline{F}; \theta ')} = \frac{\sum _{t=1}^{T-1}\sum _{g\in \mathcal {G}}\xi _{gxur}(t)}{\sum _{t=1}^{T-1}\sum _{g\in \mathcal {G}}\gamma _{gxr}(t)},\nonumber \\ B_{gxy}&= \frac{\sum _{t=1}^{T}P(X_{t}=x, G =g | \underline{Y}, \underline{F};\theta ')I(y_{t} =y)}{\sum _{t=1}^{T}P(X_{t} =x, G = g | \underline{Y}, \underline{F};\theta ')} =\frac{\sum _{t=1}^{T}\sum _{r\in \mathcal {R}}\gamma _{gxr}(t)I(y_{t}=y)}{\sum _{t=1}^{T}\sum _{r\in \mathcal {R}}\gamma _{gxr}(t)},\nonumber \\ C_{rif}&= \frac{\sum _{t=1}^{T} P(R_{t} =r| \underline{Y}, \underline{F}; \theta ')I(f_{t}^i =f)}{\sum _{t=1}^{T}P(R_{t} = r| \underline{Y}, \underline{F}; \theta ')} =\frac{\sum _{t=1}^{T}\sum _{g\in \mathcal {G}} \sum _{x\in \mathcal {X}}\gamma _{gxr}(t)I(f_t^i =f)}{\sum _{t=1}^{T}\sum _{g\in \mathcal {G}}\sum _{x\in \mathcal {X}}\gamma _{gxr}(t)},\nonumber \end{aligned}$$
(7)

where \(I(\cdot )\) is an indicator function and

$$\begin{aligned} \gamma _{gxr}(t)&\triangleq P(G = g, X_{t} = x, R_{t} = r | \underline{Y}, \underline{F};\theta '), \end{aligned}$$
(8)
$$\begin{aligned} \xi _{gxur}(t)&\triangleq P(G = g, X_{t} = x, X_{t+1} = u, R_{t} = r| \underline{Y}, \underline{F};\theta ') . \end{aligned}$$
(9)

Note that, to account for multiple sequences, in each updating rule we need to respectively sum up the denominator and numerator over all the sequences.
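The updating rules in Eq. 7 reduce to normalised sums of the posteriors \(\gamma \) and \(\xi \). As a minimal sketch, the function below updates \(\rho \), A, and B for a single sequence, taking the posteriors as given inputs. The name `m_step`, the nested-list layout `gamma[t][g][x][r]` and `xi[t][g][x][u][r]`, and the toy posteriors are assumptions for illustration; zero denominators are not handled.

```python
def m_step(gamma, xi, ys, n_items):
    """gamma[t][g][x][r] for t = 0..T-1; xi[t][g][x][u][r] for t = 0..T-2."""
    T, G = len(gamma), len(gamma[0])
    X, R = len(gamma[0][0]), len(gamma[0][0][0])

    # rho_r: expected fraction of time steps spent at context level r
    rho = [
        sum(gamma[t][g][x][r] for t in range(T) for g in range(G) for x in range(X)) / T
        for r in range(R)
    ]

    # A_rxu: expected transitions x -> u under level r, over expected visits to (x, r)
    A = [[[0.0] * X for _ in range(X)] for _ in range(R)]
    for r in range(R):
        for x in range(X):
            den = sum(gamma[t][g][x][r] for t in range(T - 1) for g in range(G))
            for u in range(X):
                num = sum(xi[t][g][x][u][r] for t in range(T - 1) for g in range(G))
                A[r][x][u] = num / den

    # B_gxy: expected emissions of item y from (g, x), over expected visits to (g, x)
    B = [[[0.0] * n_items for _ in range(X)] for _ in range(G)]
    for g in range(G):
        for x in range(X):
            den = sum(gamma[t][g][x][r] for t in range(T) for r in range(R))
            for y in range(n_items):
                num = sum(
                    gamma[t][g][x][r] for t in range(T) for r in range(R) if ys[t] == y
                )
                B[g][x][y] = num / den
    return rho, A, B


# Hypothetical posteriors for T = 2, one group, two states, one context level
gamma = [[[[0.5], [0.5]]], [[[0.6], [0.4]]]]
xi = [[[[[0.4], [0.1]], [[0.2], [0.3]]]]]
rho, A, B = m_step(gamma, xi, ys=[0, 1], n_items=2)
```

To handle multiple sequences, the numerators and denominators would be accumulated over all sequences before dividing.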

Inference. To efficiently apply the updating rules, we must solve the inference problems for \(\gamma _{gxr}(t)\) and \(\xi _{gxur}(t)\) in Eqs. 8 and 9. Towards these two goals, similar to the forward-backward algorithm [16] for the standard HMM, we first need to support the efficient computation of the following probabilities.

$$\begin{aligned} \alpha _{gxr}(t)&= P(Y_{1}, \ldots , Y_{t}, F_{1}, ..., F_{t}, X_{t} = x, G = g, R_{t} = r; \theta ')\end{aligned}$$
(10)
$$\begin{aligned} \beta _{gxr}(t)&= P(Y_{t+1}, ..., Y_{T}, F_{t+1}, ..., F_{T} | X_{t} = x, G = g, R_{t} = r; \theta ') \end{aligned}$$
(11)

Letting \(\theta '=(\pi ',\sigma ',\rho ',A',B',C')\) and \(C'(r,t)=\prod _{i=1}^{|F|}C'_{rif_{t}^i}\), both probabilities can be computed recursively, as follows.

$$\begin{aligned} \alpha _{gxr}(t)&={\left\{ \begin{array}{ll} \pi '_{x}\sigma '_{g}\rho '_{r}C'(r,1)B'_{gxy_1}, &{} t = 1\\ \rho '_{r}C'(r,t)B'_{gxy_{t}}\,{\sum }_{u\in \mathcal {X}}{\sum }_{k\in \mathcal {R}}\alpha _{guk}(t-1)A'_{kux}, &{} \text {else} \end{array}\right. }\end{aligned}$$
(12)
$$\begin{aligned} \beta _{gxr}(t)&={\left\{ \begin{array}{ll} 1, &{} t = T\\ {\sum }_{k\in \mathcal {R}}\,\rho '_{k}C'(k,t+1)\, {\sum }_{u\in \mathcal {X}}B'_{guy_{t+1}}A'_{rxu}\beta _{guk}(t+1), &{} \text {else} \end{array}\right. } \end{aligned}$$
(13)

Subsequently, \(\gamma _{gxr}(t)\) and \(\xi _{gxur}(t)\) can be further computed.

$$\begin{aligned} \xi _{gxur}(t)&= \frac{\alpha _{gxr}(t)A'_{rxu}B'_{guy_{t+1}}\sum _{k\in \mathcal {R}}\beta _{guk}(t+1)\rho '_k C'(k,t+1)}{\sum _{h\in \mathcal {G}}\sum _{v\in \mathcal {X}}\sum _{k\in \mathcal {R}}\alpha _{hvk}(T)} \end{aligned}$$
(14)
$$\begin{aligned} \gamma _{gxr}(t)&= {\left\{ \begin{array}{ll} \frac{\alpha _{gxr}(T)}{\sum _{h\in \mathcal {G}}\sum _{v\in \mathcal {X}}\sum _{k\in \mathcal {R}}\alpha _{hvk}(T)} &{} t = T \\ {\sum }_{u \in \mathcal {X}}^{}\,\xi _{gxur}(t) &{} \text {else} \end{array}\right. } \end{aligned}$$
(15)
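The recursions above can be sketched compactly as follows. This is an illustrative implementation, not the authors' code: it assumes 0-based time indices, takes the base case of the backward pass as \(\beta (T)=1\) (the value implied by the definition in Eq. 11), and obtains \(\gamma \) as \(\alpha \beta \) divided by the sequence likelihood, which equals the sum of \(\xi \) over the next state. The values of A, B, and C are hypothetical.

```python
def feature_lik(C, r, f_t):
    """C'(r, t): product over features of C[r][i][f_t[i]]."""
    p = 1.0
    for i, f in enumerate(f_t):
        p *= C[r][i][f]
    return p


def forward_backward(theta, ys, fs):
    pi, sigma, rho, A, B, C = theta
    G, X, R, T = len(sigma), len(pi), len(rho), len(ys)
    idx = [(g, x, r) for g in range(G) for x in range(X) for r in range(R)]
    # e[t][r] = rho_r * C'(r, t), shared by all (g, x)
    e = [[rho[r] * feature_lik(C, r, fs[t]) for r in range(R)] for t in range(T)]

    # Forward pass (Eq. 12)
    alpha = [dict() for _ in range(T)]
    for g, x, r in idx:
        alpha[0][g, x, r] = pi[x] * sigma[g] * e[0][r] * B[g][x][ys[0]]
    for t in range(1, T):
        for g, x, r in idx:
            s = sum(alpha[t - 1][g, u, k] * A[k][u][x]
                    for u in range(X) for k in range(R))
            alpha[t][g, x, r] = e[t][r] * B[g][x][ys[t]] * s

    # Backward pass (Eq. 13), assuming beta = 1 at the last step
    beta = [dict() for _ in range(T)]
    for key in idx:
        beta[T - 1][key] = 1.0
    for t in range(T - 2, -1, -1):
        for g, x, r in idx:
            beta[t][g, x, r] = sum(
                e[t + 1][k] * B[g][u][ys[t + 1]] * A[r][x][u] * beta[t + 1][g, u, k]
                for u in range(X) for k in range(R)
            )

    lik = sum(alpha[T - 1].values())  # P(Y, F)
    gamma = [{key: alpha[t][key] * beta[t][key] / lik for key in idx}
             for t in range(T)]
    return alpha, beta, gamma, lik


theta = (
    [0.8, 0.2], [0.9, 0.1], [0.3, 0.7],
    [[[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.9, 0.1]]],  # A[r][x][u]
    [[[0.7, 0.3], [0.2, 0.8]], [[0.4, 0.6], [0.5, 0.5]]],  # B[g][x][y]
    [[[0.6, 0.4]], [[0.1, 0.9]]],                          # C[r][i][f]
)
alpha, beta, gamma, lik = forward_backward(theta, ys=[0, 1, 1], fs=[[1], [0], [1]])
```

A useful sanity check is that the sum of \(\alpha _{gxr}(t)\beta _{gxr}(t)\) over g, x, and r gives the same likelihood at every t, so each \(\gamma (t)\) sums to one.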

5.2 Item Prediction

Once the parameters are learnt, we can predict the next item of a user given her existing sequence of items \(\{Y_{1}, Y_{2}, ..., Y_{t}\}\) and context features \(\{F_{1}, F_{2}, ..., F_{t}\}\). In particular, her next item \(y^*\) can be chosen by maximum a posteriori estimation:

$$\begin{aligned} y^*&=\textstyle {{\mathrm{arg\,max}}}_y\,P(Y_{t+1}=y|Y_{1}, \ldots , Y_{t}, F_{1},...,F_{t}) \nonumber \\&=\textstyle {{\mathrm{arg\,max}}}_y\, P(Y_{1}, \ldots , Y_{t},Y_{t+1}=y, F_{1},...,F_{t}) \nonumber \\&=\textstyle {{\mathrm{arg\,max}}}_y\, P(Y_{1}, \ldots , Y_{t},Y_{t+1}=y, F_{1},...,F_{t},F_{t+1})/P(F_{t+1}) \nonumber \\&=\textstyle {{\mathrm{arg\,max}}}_y \sum _{g\in \mathcal {G}}\sum _{x\in \mathcal {X}}\sum _{r\in \mathcal {R}}\alpha _{gxr}(t+1). \end{aligned}$$
(16)

While we do not observe features at time \(t+1\), we can adopt any value for \(F_{t+1}\) in the above: since \(P(F_{t+1})\) does not depend on y, the choice does not affect the prediction. Instead of picking the best candidate item, we can rank all the candidates and suggest the top-K items.
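A minimal sketch of the ranking step in Eq. 16 is given below. It assumes the forward values \(\alpha _{gxr}(t)\) for the last observed step are already available as a nested list `alpha_t[g][x][r]`, and it drops the factor \(\sum _r \rho _r C(r,t+1)\), which is constant across candidate items. The function name and the toy numbers are assumptions for illustration.

```python
def rank_next_items(alpha_t, A, B, top_k):
    """Score each candidate y by sum_{g,x} B[g][x][y] * sum_{u,k} alpha_t[g][u][k] * A[k][u][x]."""
    G, X, R = len(B), len(B[0]), len(A)
    n_items = len(B[0][0])
    scores = [0.0] * n_items
    for g in range(G):
        for x in range(X):
            # Weight of arriving in state x at step t+1 for group g
            reach = sum(alpha_t[g][u][k] * A[k][u][x]
                        for u in range(X) for k in range(R))
            for y in range(n_items):
                scores[y] += reach * B[g][x][y]
    ranked = sorted(range(n_items), key=lambda y: -scores[y])
    return ranked[:top_k], scores


# Hypothetical values: 1 group, 2 states, 1 context level, 3 items
alpha_t = [[[0.03], [0.01]]]
A = [[[0.5, 0.5], [0.1, 0.9]]]
B = [[[0.8, 0.1, 0.1], [0.1, 0.2, 0.7]]]
top, scores = rank_next_items(alpha_t, A, B, top_k=2)
```

The scores are unnormalised joint probabilities, which is sufficient for ranking; dividing by their sum would give the posterior over the next item.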

5.3 Complexity Analysis

We conduct a complexity analysis for learning the joint model \(\text {SEQ*}\). Consider one sequence of length T with \(|\mathcal {X}|\) states, \(|\mathcal {Y}|\) items, \(|\mathcal {G}|\) user groups, \(|\mathcal {R}|\) context factor levels, |F| features and \(|\mathcal {F}|\) values for each feature. For this one sequence, the complexity of one iteration of the EM is contributed by three main steps:

  • Step 1: Calculate \(\alpha , \beta \): \(O\left( T|\mathcal {G}||\mathcal {X}||\mathcal {R}|^2(|\mathcal {X}|+|F|) \right) \). Because \(\rho '_{r}, C'(r,t)\) in Eq. 12 are independent of gxuk while \(\rho '_{k},C'(k,t+1)\) in Eq. 13 are independent of gxur, we can further simplify this to: \(O\left( T|\mathcal {R}|(|\mathcal {G}||\mathcal {X}|^2|\mathcal {R}|+|F|) \right) \).

  • Step 2: Calculate \(\xi , \gamma \) using \(\alpha , \beta \): \(O\left( T|\mathcal {G}||\mathcal {X}|^2 |\mathcal {R}|^2 |F| \right) \). As \(\rho '_k C'(k,t+1)\) in Eq. 14 is independent of gxur, we reduce it to: \(O\left( T|\mathcal {R}|(|\mathcal {G}||\mathcal {X}|^2|\mathcal {R}| + |F|) \right) \).

  • Step 3: Update \(\theta \) using \(\gamma , \xi \): \(O \left( T|\mathcal {G}||\mathcal {X}||\mathcal {R}|(|\mathcal {X}| + |F|) \right) \). As y in \(B_{gxy}\) of Eq. 7 is independent of gxr, we first compute the denominator, and update a normalized score to y in the \(B_{gxy}\) while computing the numerator. Likewise, i and f in \(C_{rif}\) are independent of gxr. Thus, we have: \(O \left( T|\mathcal {R}|(|\mathcal {G}||\mathcal {X}|^2 + |F|) \right) \).

The overall complexity of \(\text {SEQ*}\) is \(O\left( T|\mathcal {R}|(|\mathcal {G}||\mathcal {X}|^2|\mathcal {R}| + |F|) \right) \) for one sequence, one iteration. The complexities of lesser models are (by substitution):

  • \(\text {HMM}\) with \(|\mathcal {G}| = |\mathcal {R}| = 1, |F| = |\mathcal {F}|=0\): \(O\left( T|\mathcal {X}|^2 \right) \)

  • \(\text {SEQ-E}\) with \(|\mathcal {R}| = 1, |F| = |\mathcal {F}|=0\): \(O\left( T|\mathcal {G}||\mathcal {X}|^2 \right) \)

  • \(\text {SEQ-T}\) with \(|\mathcal {G}|=1\): \(O \left( T|\mathcal {R}|(|\mathcal {X}|^2|\mathcal {R}| + |F|) \right) \)

The result implies that the running times of our proposed models are quadratic in the number of states and context factor levels, while linear in all the other variables. HMM is also quadratic in the number of states. Compared to HMM with the same number of states, our joint model incurs a quadratic increase in complexity only in the number of context factor levels (which is typically small), and merely a linear increase in the number of groups and context features.
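To get a feel for these growth rates, the dominant term \(T|\mathcal {R}|(|\mathcal {G}||\mathcal {X}|^2|\mathcal {R}| + |F|)\) can be evaluated for different configurations. The helper below is only an illustrative operation count under that formula, not a runtime measurement; the function name and the example sizes are assumptions.

```python
def em_cost(T, n_states, n_groups=1, n_levels=1, n_features=0):
    """Dominant term of one EM iteration on one sequence: T|R|(|G||X|^2|R| + |F|)."""
    return T * n_levels * (n_groups * n_states ** 2 * n_levels + n_features)


hmm = em_cost(T=10, n_states=5)                                # |G| = |R| = 1, |F| = 0
seq_e = em_cost(T=10, n_states=5, n_groups=2)                  # |R| = 1, |F| = 0
seq_t = em_cost(T=10, n_states=5, n_levels=2, n_features=11)   # |G| = 1
seq_star = em_cost(T=10, n_states=5, n_groups=2, n_levels=2, n_features=11)
```

Doubling the number of context factor levels roughly quadruples the state-dependent term, whereas doubling the number of groups only doubles it.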

6 Experiments

The objective of experiments is to evaluate effectiveness. We first look into a synthetic dataset to investigate whether context-biased transition and user-biased emission could have been simulated by increasing the number of HMM’s states. Next, we experiment with two real-life, publicly available datasets, to investigate whether the models result in significant improvements over the baseline.

6.1 Setup

We elaborate on the general setup here, and describe the specifics of each dataset later in the appropriate sections. Each dataset consists of a set of sequences. We create random splits with an 80:20 ratio of training versus testing. In this sequential preference setting, a sequence (a user) is in either training or testing, but not in both. This is different from a fully personalized ordinal preference setting (a different framework altogether), where a user would be represented in both sets.

Task. For each sequence in the testing set, given the sequence save the last item, we seek to predict the last item. Each method generates a top-K recommendation, which is evaluated against the held-out ground-truth last item.

Comparative Methods. Since we build our dynamic context and user factors upon HMM, it is the most appropriate baseline. To investigate the contribution of user-biased emission and context-biased transition separately, we compare the two models SEQ-E and SEQ-T respectively against the baseline. To see their contributions jointly, we further compare SEQ* against the baseline. In addition, we include the result of the frequency-based method \(\text {FREQ}\) as a reference, which simply chooses the most popular items in the training data.

Metrics. We rely on two conventional metrics for top-K recommendation. Inspired by a similar evaluation task in [24], the first metric we use is Recall@K.

$$\begin{aligned} Recall@K = \frac{\mathrm {number\ of\ sequences\ with\ the\ ground\ truth\ item\ in\ the\ top}\ K}{\mathrm {total\ number\ of\ sequences\ in\ the\ testing\ set}} \end{aligned}$$

In the experiments, we primarily study top \(1\,\%\) recommendation, i.e., Recall@1 %, but will present results for several other K’s as well. If we assume the ground truth item to be the only true answer, average precision can be measured similarly (dividing by K) and would show the same trend as recall. However, it is not clear that the other items in the top-K would really be rejected by a user [24]. Instead of precision, we therefore rely on another metric.
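The Recall@K definition above amounts to counting, over the test sequences, how often the held-out item appears among the first K recommendations. A minimal sketch follows; the function name and the toy ranked lists are assumptions for illustration.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of test sequences whose ground-truth item is within the top k."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Three hypothetical test sequences with item ids as integers
ranked_lists = [[3, 1, 2], [0, 2, 1], [2, 0, 3]]
targets = [1, 1, 2]
r_at_2 = recall_at_k(ranked_lists, targets, k=2)
```

Here the ground-truth item is in the top 2 for the first and third sequences only, so the value is 2/3.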

The second metric is Mean Reciprocal Rank or MRR, defined as follows.

$$\begin{aligned} MRR = \frac{1}{|S_\text {test}|} \times \sum _{s \in S_\text {test}}^{} \frac{1}{\mathrm {rank\ of\ target\ item\ for\ sequence}\ s} \end{aligned}$$

We prefer a method that places the ground-truth item higher in the top-K recommendation list. Because the contribution of a very low rank is vanishingly small, we cut the list off at 200, i.e., ranks \(>\) 200 contribute zero to MRR. A recommendation list longer than 200 is unlikely in realistic scenarios.
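A minimal sketch of MRR with a cut-off is shown below. It assumes that the cut-off keeps ranks 1 through `cutoff` and that any target ranked beyond it (or absent from the list) contributes zero; the function name and toy data are assumptions for illustration.

```python
def mean_reciprocal_rank(ranked_lists, targets, cutoff=200):
    """Average of 1 / rank of the ground-truth item, truncated at `cutoff`."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:cutoff]
        if target in top:
            total += 1.0 / (top.index(target) + 1)  # ranks are 1-based
    return total / len(targets)


ranked_lists = [[3, 1, 2], [0, 2, 1], [2, 0, 3]]
targets = [1, 1, 2]
mrr = mean_reciprocal_rank(ranked_lists, targets)
mrr_cut2 = mean_reciprocal_rank(ranked_lists, targets, cutoff=2)
```

In the toy data the targets sit at ranks 2, 3, and 1, so the untruncated value is (1/2 + 1/3 + 1)/3, while a cut-off of 2 zeroes out the second sequence.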

For each dataset, we create five random training/testing splits. For each “fold”, we run the models ten times with different random initializations (but with common seeds across comparative methods for parity). For each method, we average the Recall@K and MRR across the fifty readings. All comparisons are verified by one-sided paired-sample Student’s t-test at 0.05 significance level.

6.2 Synthetic Dataset

We begin with experiments on a synthetic dataset, for two reasons. First, one advantage of a synthetic dataset is the knowledge of the actual parameters (e.g., transition and emission probabilities), which allows us to verify our model’s ability to recover these parameters. Second, we seek to verify whether the effects of context-biased transition and user-biased emission could have been simulated by increasing the number of hidden states of traditional sequence model HMM.

Dataset. We define a synthetic dataset with the following configuration: 2 groups (\(|\mathcal {G}| = 2\)), 2 states (\(|\mathcal {X}| = 2\)), 2 context factor levels (\(|\mathcal {R}|\) = 2), 4 items (\(|\mathcal {Y}| = 4\)), 4 features (|F| = 4) each with 2 feature values (present or absent).

The complete set of synthetic parameters is specified in the supplementary material. Here, we discuss the key ideas. A six-tuple \(\theta = (\pi ,\sigma ,\rho ,A,B,C)\) is specified as follows: \(\pi = [0.8, 0.2]\), \(\sigma = [0.9, 0.1]\), \(\rho = [0.3, 0.7]\). The transition tensor A is such that we induce self-transition to the same state for the first context factor level, and switching to the other state for the second context factor level. The emission tensor B is such that the four (state, group) combinations each tend to generate one of the four items. The feature matrix C is such that each context factor level is mainly associated with two of the four features.

We then generate 10 thousand sequences, each of length 10 (\(T = 10\)). For each sequence, we first draw a group according to \(\sigma \). At time \(t = 1\), we draw the first hidden state \(X_1\) from \(\pi \), followed by drawing the first item \(Y_1\) from B. We also draw a context factor level from \(\rho \) and generate features via C. For time \(t = 2, \ldots , 10\), we follow the same process, but each hidden state is now drawn from A according to the previous state and context factor level at time \(t-1\).
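The generative process just described can be sketched as follows. The values of \(\pi \), \(\sigma \), and \(\rho \) are those given above; the entries of A, B, and C are placeholders invented for this example that only follow the qualitative description (the actual values are in the supplementary material and are not reproduced here). The helper names are likewise assumptions.

```python
import random


def sample(dist, rng):
    """Draw an index according to a discrete distribution."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def generate_sequence(theta, T, rng):
    pi, sigma, rho, A, B, C = theta
    g = sample(sigma, rng)                 # one group per sequence
    x = sample(pi, rng)                    # first hidden state
    items, feats = [], []
    for t in range(T):
        items.append(sample(B[g][x], rng))                 # emit item from (group, state)
        r = sample(rho, rng)                               # context factor level at step t
        feats.append([sample(c, rng) for c in C[r]])       # one value per feature
        if t < T - 1:
            x = sample(A[r][x], rng)       # next state depends on current state and level
    return g, items, feats


theta = (
    [0.8, 0.2], [0.9, 0.1], [0.3, 0.7],
    # A[r][x][u] (placeholder): level 0 self-transitions, level 1 switches
    [[[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.9, 0.1]]],
    # B[g][x][y] (placeholder): each (group, state) pair favours one of four items
    [[[0.85, 0.05, 0.05, 0.05], [0.05, 0.85, 0.05, 0.05]],
     [[0.05, 0.05, 0.85, 0.05], [0.05, 0.05, 0.05, 0.85]]],
    # C[r][i] = [P(absent), P(present)] (placeholder): each level favours two features
    [[[0.1, 0.9], [0.1, 0.9], [0.9, 0.1], [0.9, 0.1]],
     [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]],
)
rng = random.Random(0)
data = [generate_sequence(theta, T=10, rng=rng) for _ in range(100)]
```

Only 100 sequences are drawn here to keep the example quick; the dataset described above uses 10 thousand.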

Fig. 3. Performance of comparative methods on Synthetic Data for Recall@1 and MRR

Results. We run the four comparative methods on this synthetic dataset, fixing the context factor levels and groups to 2 for the relevant methods, while varying the number of states. Figure 3(a) shows the results in terms of Recall@1, i.e., the ability of each method to recommend the ground truth item as the top prediction. There are several crucial observations. First, the proposed model SEQ* outperforms the rest, attaining recall close to 85 %, while the baseline HMM hovers around 65 %. SEQ* also outperforms SEQ-T and SEQ-E.

Second, as we increase the number of states, most models initially increase in performance and then converge. Evidently, increasing the number of states alone does not lift the baseline HMM to the same level of performance as SEQ* or SEQ-T, indicating the effect of context-biased transition. Meanwhile, though SEQ-E and HMM perform similarly (neither can model the context factor), SEQ* is slightly better than SEQ-T, indicating the contribution of user-biased emission. Figure 3(b) shows the results for MRR, with similar trends and observations.
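For concreteness, the two metrics reported above can be computed as in the following sketch. It assumes the standard definitions: Recall@K is the fraction of test cases whose ground-truth item is ranked within the top K, and MRR is the mean of the reciprocal rank of the ground-truth item. The function names and the optimistic handling of ties are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np


def ranks_of_truth(scores, truth):
    """Return the 1-based rank of each ground-truth item.

    scores: (n, num_items) predicted scores, higher is better.
    truth:  (n,) index of the ground-truth item for each test case.
    Ties are resolved optimistically (an assumption of this sketch).
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth)
    true_scores = scores[np.arange(len(truth)), truth]
    # Rank = 1 + number of items scored strictly higher than the truth.
    return 1 + (scores > true_scores[:, None]).sum(axis=1)


def recall_at_k(scores, truth, k):
    """Fraction of test cases whose ground-truth item is in the top k."""
    return float(np.mean(ranks_of_truth(scores, truth) <= k))


def mrr(scores, truth):
    """Mean reciprocal rank of the ground-truth item."""
    return float(np.mean(1.0 / ranks_of_truth(scores, truth)))
```

Here `scores` would hold, for each test position, the model's predicted probability of every item being the next one in the sequence.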

6.3 Real-Life Datasets

We now investigate the performance of the comparative methods on real-life, publicly available datasets covering two different domains: song playlists from online radio station Yes.com, and hashtag sequences from users’ Twitter streams.

Playlists from Yes.com. We utilize the yes_small dataset (footnote 1) collected by [4]. The dataset includes about 430 thousand playlists, involving 3168 songs. Notably, the majority of playlists are shorter than 30 songs. To keep the playlist lengths relatively balanced, we filter out playlists with fewer than two songs and retain up to the first thirty songs in each playlist. Finally, we have 250 thousand playlists (sequences) consisting of 3168 unique songs (items).

Features. We study the effect of features on the context-biased transition model SEQ-T. Each song may have tags. There are 250 unique tags. We group tags with similar meanings (e.g., “male vocals” and “male vocalist”). As the first feature, we use a binary feature of whether the current song and the previous song share at least one tag. For additional features, we use the most popular tags. Note that we never assume knowledge of the tags of the song to be predicted. Figure 4(a) shows the performance of SEQ-T, with two context factor levels, for varying numbers of features. Figure 4(a) has dual vertical axes for Recall@1 % (left) and MRR (right) respectively. The trends for both metrics are similar: performance initially goes up and then stabilizes. In subsequent experiments, we use eleven features (similarity feature and ten most popular tags).
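A minimal sketch of how such binary features could be assembled is given below. Several details are assumptions made for illustration only: the popular-tag features are taken to indicate whether the current song carries each tag, the similarity feature is set to 0 for the first song of a playlist, ties in tag popularity are broken alphabetically, and the names `top_tags` and `context_features` are hypothetical.

```python
from collections import Counter


def top_tags(song_tags, n=10):
    """Return the n most frequent tags (ties broken alphabetically).

    song_tags: dict mapping a song id to the set of its tags.
    """
    counts = Counter(tag for tags in song_tags.values() for tag in tags)
    return sorted(counts, key=lambda tag: (-counts[tag], tag))[:n]


def context_features(playlist, song_tags, popular):
    """Build one binary feature vector per position of a playlist.

    Feature 0: current and previous song share at least one tag
               (assumed 0 for the first song).
    Features 1..n: whether the current song carries each popular tag.
    Only the current and previous songs are inspected, never the next one.
    """
    rows = []
    for t, song in enumerate(playlist):
        cur = song_tags.get(song, set())
        prev = song_tags.get(playlist[t - 1], set()) if t > 0 else set()
        shared = int(bool(cur & prev))
        rows.append([shared] + [int(tag in cur) for tag in popular])
    return rows
```

With `n=10`, each position receives eleven features, matching the configuration used in the subsequent experiments.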

Context Factor. We then vary the number of context factor levels of SEQ-T (with eleven features). Figure 4(b) shows that for this dataset, there is not much gain from increasing the number of context factor levels beyond two. Therefore, for greater efficiency, subsequently we experiment with two context factor levels.

Fig. 4. Effects of features, context factor on SEQ-T & groups on SEQ-E on Yes.com

Latent Groups. We turn to the effect of latent groups on the user-biased emission model SEQ-E. Figure 4(c) shows the effect of increasing latent groups. More groups lead to better performance. Because of the diversity among sequences, having more groups increases the flexibility in modeling emissions while still sharing transitions. For the subsequent comparison to the baseline, we experiment with two latent groups; as Figure 4(c) shows, the results with a higher number of groups would be even better.

Table 1. Performance of comparative methods on Yes.com for Recall@K

Comparison to Baseline. We now compare the proposed models SEQ-T, SEQ-E, and SEQ* to the baseline HMM. Table 1 shows a comparison in terms of Recall@K for 5, 10, and 15 states. In addition to Recall@1 % (corresponding to top 31), we also show results for Recall@50 and Recall@100. The symbol \(\dagger \) denotes statistical significance due to the effect of context-biased transition, i.e., the outperformance of SEQ-T over HMM, and that of SEQ* over SEQ-E, are significant. The symbol \(\S \) denotes statistical significance due to the effect of user-biased emission, i.e., the outperformance of SEQ-E over HMM, and that of SEQ* over SEQ-T, are significant. Finally, our overall model SEQ* is significantly better than the baseline HMM in all cases. The absolute improvement of SEQ* over HMM, in percentage points, is shown in the Imp. column. For all models, more states generally translate to better performance; with more states, the improvements are somewhat smaller but still significant. Table 2 shows a comparison in terms of MRR, where similar observations hold.

Table 2. Performance of comparative methods on Yes.com for MRR
Table 3. Performance of comparative methods on Twitter.com for Recall@K
Table 4. Performance of comparative methods on Twitter.com for MRR

Hashtag Sequences from Twitter.com. We conduct similar experiments on the Twitter dataset (footnote 2) [12]. There are 130 thousand users. In our scenario, each sequence corresponds to the hashtags of a user. The average sequence length in our dataset is 19. If a tweet has multiple hashtags, we retain the most popular one, so as to maintain the sequence among tweets. Similarly to the treatment of stop words and infrequent words in document modeling, we filter out hashtags that are too popular (frequency \(\ge \) 25000) or relatively infrequent (frequency \(\le \) 1000). Finally, we obtain 114 thousand sequences involving 2121 unique hashtags. Similarly to Yes.com, we run the models for two levels of context factor and two latent groups, but with seven features extracted from the tweet of the current hashtag (not the one to be predicted): number of retweets, number of hashtags, time intervals to the previous one and two tweets, time interval to the next tweet, and edit distances with the previous one and two observations.
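The hashtag preprocessing described above might look roughly as follows. The order of operations (pick one hashtag per tweet first, then apply the frequency filter), the use of global frequencies counted over all hashtags before selection, the alphabetical tie-breaking, and the function name `build_hashtag_sequences` are all assumptions of this sketch rather than details confirmed by the text.

```python
from collections import Counter


def build_hashtag_sequences(user_tweets, low=1000, high=25000):
    """Turn each user's tweets into one hashtag sequence.

    user_tweets: dict mapping a user to a time-ordered list of tweets,
                 each tweet being a list of hashtags.
    Hashtags with frequency <= low or >= high are filtered out.
    """
    # Global hashtag frequencies over all tweets of all users.
    freq = Counter(tag
                   for tweets in user_tweets.values()
                   for tweet in tweets
                   for tag in tweet)
    sequences = {}
    for user, tweets in user_tweets.items():
        # Keep one hashtag per tweet: the most popular one.
        picked = [max(tweet, key=lambda tag: (freq[tag], tag))
                  for tweet in tweets if tweet]
        # Drop hashtags that are too popular or too infrequent.
        kept = [tag for tag in picked if low < freq[tag] < high]
        if kept:
            sequences[user] = kept
    return sequences
```

The thresholds default to the values stated in the text; smaller values are convenient when trying the function on toy data.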

The task is essentially predicting the next hashtag in a sequence. In brief, Tables 3 and 4 support that the improvements due to context-biased transition (\(\dagger \)) and user-biased emission (\(\S \)) are mostly significant. Importantly, the overall improvements by SEQ* over the baseline HMM (Imp. column) are consistent and hold up across 5, 10, and 15 states for both Recall@K and MRR.

Computational efficiency is not the main focus of our experiments, so we comment only briefly on the running times. For the Twitter dataset, the average learning time per iteration on an Intel Xeon CPU X5460 3.16 GHz with 32 GB RAM for our models with 15 states, 2 groups, and 2 context factor levels is 2, 3, and 6 min for \(\text {SEQ-E}\), \(\text {SEQ-T}\) and \(\text {SEQ*}\) respectively. HMM requires less than a minute.

7 Conclusion

In this work, we develop a generative model for sequences, which models two types of dynamic factors. First, the transition from one state to the next may be affected by a context factor. This results in the SEQ-T model, with context-biased transition. Second, we seek to incorporate how different latent user groups may have preferences for certain items. This results in the SEQ-E model, with user-biased emission. Finally, we unify these two factors into a joint model SEQ*. Experiments on both synthetic and real-life datasets support the case that these dynamic factors contribute to statistically significantly better performance than the baseline HMM in terms of top-K recommendation for sequences.