1 Introduction

Most existing probabilistic latent topic models, such as Latent Dirichlet Allocation (LDA) (Blei et al. 2001, 2003), are unsupervised models which analyze a high-dimensional term space and discover a low-dimensional topic space (Blei et al. 2003; Steyvers and Griffiths 2007; Blei and Lafferty 2009; Blei 2012). They have been employed for tackling text mining problems (Sun et al. 2012) including document classification (Jameel and Lam 2013b; Rubin et al. 2012; Li et al. 2015) and document retrieval (Wei and Croft 2006; Wang et al. 2007; Chen 2009; Yi and Allan 2009; Egozi et al. 2011; Andrzejewski and Buttler 2011; Wang et al. 2011, 2013a; Lu et al. 2011; Yi and Allan 2008; Cao et al. 2007a; Park and Ramamohanarao 2009; Duan et al. 2012). These models can achieve better performance by detecting the latent topic structure and establishing a relationship between the latent topics and the goal of the problem. One limitation of unsupervised topic models for document classification is that the topic model itself does not consider the class labels of documents during inference. The advantages of considering this information in latent topic models have been discussed in Zhu et al. (2012a) and Blei and McAuliffe (2008). Another limitation of latent topic models is that they do not exploit the word order structure of documents. Some works attempt to integrate class label information into a topic model for solving document classification, for example, supervised Latent Dirichlet Allocation (sLDA) (Blei and McAuliffe 2008), multi-class supervised Latent Dirichlet Allocation (mcLDA) (Wang et al. 2009), supervised Hierarchical Dirichlet Processes (Zhang et al. 2013; Storkey and Dai 2014), and the maximum margin supervised topic model MedLDA (Zhu et al. 2012a). These models have been shown to improve document classification performance (Zhu et al. 2013a; Jiang et al. 2012; Zhu et al. 2014). However, one common limitation of the above models is that they do not make use of the word order structure in text documents, which could interact with the class label information for solving the document classification task. The technical challenges in considering the word order structure in a supervised topic model are considerable. First, the mathematical derivation of the Gibbs sampling equations needs to be revised from that of the unigram models because our classification model considers distributions over bigrams; this revision rests on a theoretical distinction: bag-of-words models assume exchangeability in the probability space, whereas models which maintain the order of words in the document relax this strong assumption (Aldous 1985). Second, the form of input data to the model changes from the traditional word-document co-occurrence matrix to full documents with word order.

Likewise, unsupervised topic models such as the Topical N-Gram model (TNG) (Wang et al. 2007; Wang and McCallum 2005) and Latent Dirichlet Allocation (LDA) have been used in developing document retrieval models (Wang et al. 2007; Wei and Croft 2006). However, they have not been explored for document retrieval learning, which can essentially be cast as a learning-to-rank problem (Hang 2011). Learning-to-rank models make use of the available relevance judgment information of a document for a query in the training process; the task is then to predict a desired ordering of documents. Several learning-to-rank models have been introduced, for example, Wang et al. (2014), Zong and Huang (2014), Yu et al. (2014) and Niu et al. (2014), but none of them considers the similarity between the document and the query under a low-dimensional topic space within the topic model itself.

The main idea in both of our models is to conduct posterior regularization (Ganchev et al. 2010) in a Bayesian inference parameter learning setup (Zhu et al. 2014). In posterior regularization using Bayesian inference, we intend to find a new desired posterior which is regularized using a regularization model. In our framework, the regularization comes from a maximum margin classifier whose main role is to predict the relevant class of the data. The idea is that for points which are difficult to classify, the classifier receives an extra classification signal from the topic model that helps assign the point to its correct class. Such hard points are mainly located at the margin of the classifier, or would generally be mis-classified by the classifier without any latent topic information. The result of this posterior regularization is a new posterior obtained from the topic model.

1.1 Our main contributions

We propose two topic models that build upon previous works on topic models with word order (Wallach 2006, 2008; Noji et al. 2013; Jameel and Lam 2013b, c; Kawamae 2014; Wang et al. 2007), which discuss in detail the challenges, motivation, and advantages of such models for solving various text mining tasks. One of the main advantages is that such models can better capture the semantic fabric of the document, which is lost when the order of words in the document is discarded. In particular, our models incorporate the notion of side-information within the latent topic model itself, whereas none of the existing topic models with word order considers it. Side-information is mainly handled by a maximum margin classifier which is tightly integrated into the topic model. Topic models with word order have been shown to produce more interpretable latent topics than unigram models (Wang et al. 2007; Jameel and Lam 2013b, c; Lindsey et al. 2012), and they have also been shown to perform better on other quantitative tasks (Jameel and Lam 2013b). However, such models fail to take advantage of side-information to produce more discriminative and interpretable latent topics. Our hybrid models can accomplish this goal. Our first model is a low-dimensional latent topic model for document classification. Class label information and word order structure are integrated into our supervised topic model with maximum margin learning, enabling more effective interaction among such information for solving document classification. The mathematical derivation of the Gibbs sampling equations is quite complex due to the Markovian assumption on the order of words in our model. Since our classification model considers distributions over bigrams, the framework described in Jiang et al. (2012) and Zhu et al. (2012a) needs considerable changes due to the exchangeability assumption (Heath and Sudderth 1976; Aldous 1985). We adopt the collapsed Gibbs sampling framework (Shao and Ibrahim 2000), with considerable changes from Jiang et al. (2012), because it collapses out the nuisance variables and speeds up inference (Porteous et al. 2008). The design and the study of the interplay between the side-information and word order is an interesting contribution: our model provides insights into how word order interacts with side-information in a topic model. The implementation of the model is also challenging, since the input is not a word co-occurrence matrix but full documents with word order.

Another contribution is that we propose a new supervised topic model for document retrieval learning which can be regarded as a pointwise model for tackling the learning-to-rank task. Available relevance assessments and word order structure are integrated into the topic model itself. We jointly model the similarity between the query and the document under a low-dimensional topic space in a maximum margin framework. The main motivation for proposing this model is that in the document retrieval learning setting, our model, apart from using the usual query-dependent features such as similarity metrics between the query and the document, and query-independent features (Qin et al. 2010) such as PageRank (Brin and Page 1998), can also use a topic similarity feature which measures the similarity between the query and the document in the latent topic space. Fundamentally, even if the words of the query and the document do not overlap, when their low-dimensional representations are semantically close or identical in their latent topic assignments, we get a signal that they are describing the same thematic content. We conduct extensive experiments on several publicly available benchmark datasets, and show that our model improves upon the state-of-the-art models. One major difference between our model and existing learning-to-rank models is that the latter do not consider latent topic information in the learning framework. Our pointwise learning-to-rank model lays a foundation for future research on document retrieval learning, for example, further development of pairwise and listwise probabilistic latent topic models for document retrieval learning. Note that we develop both our document retrieval learning and classification models based on the design paradigm of Jiang et al. (2012) and Zhu et al. (2012a). An important point is that these methods have shown superior performance to two-stage heuristic methods which first compute latent topic vector representations and then feed these vectors to a separate prediction model. In order to adapt the classification model for solving the document retrieval learning problem, a new design is needed. First, the discriminant function, along with the formulations that follow from it, needs to be redesigned to handle the document retrieval learning task. Second, the relevance judgment associated with each query-document pair is also considered in our model. Third, the prediction task on unseen query-document pairs needs to be reformulated, as the prediction rule of the classification model does not directly carry over to the document retrieval learning task.

1.2 Our previous works

Recently, in Jameel and Lam (2013b) we presented a topic model inspired by the Bigram Topic Model (BTM) (Wallach 2006). This model relaxes the bag-of-words assumption and generates collocations just like the LDA-Collocation Model (LDACOL) (Griffiths et al. 2007). It differs from the new models proposed in this paper in that it is unsupervised, whereas here we incorporate side-information. Our temporal model proposed in Jameel and Lam (2013c) also generates more interpretable latent topics with word order; however, this model does not consider side-information and cannot solve the document retrieval learning task. Our nonparametric topic model proposed in Jameel and Lam (2013a) also differs significantly from the models proposed in this paper: although it maintains the order of words and shows promising empirical performance, it does not incorporate side-information and it is a nonparametric topic model. Recently, we also proposed a nonparametric topic model where the order of words is maintained (Jameel et al. 2015). This model introduced a new non-exchangeable metaphor known as the Chinese Restaurant Franchise with Buddy Customers (CRF-BC). It is significantly different from the models proposed in this work in that the CRF-BC model does not incorporate side-information; also, it is well suited for generating collocations and is nonparametric.

2 Related work

Unsupervised and supervised topic models have been applied to the document classification task (Blei et al. 2003; Blei and McAuliffe 2008; Wang et al. 2013b). An advantage that supervised topic models have over unsupervised ones is that they consider the available side-information as response variables in the topic model itself. This helps discover a more predictive low-dimensional representation of the data for better classification (Zhu et al. 2012a). Blei and McAuliffe (2008) proposed the supervised Latent Dirichlet Allocation (sLDA) model, which captures a real-valued document rating as a regression response and relies upon a maximum-likelihood based mechanism for parameter estimation. Wang et al. (2009) proposed multi-class sLDA (mcLDA), which directly captures discrete labels of documents as a classification response. The Discriminative LDA (DiscLDA) (Lacoste-Julien et al. 2008) also performs classification, using a different mechanism than sLDA. Different from the above models, Zhu et al. (2012a) proposed the Maximum Entropy Discrimination LDA model, known as MedLDA, which directly minimizes a margin-based loss derived from an expected prediction rule. The MedLDA model uses a variational inference method for parameter estimation; subsequently, Markov Chain Monte Carlo techniques were proposed in Zhu et al. (2013a, b, c) and Jiang et al. (2012). Ramage et al. (2009) proposed a supervised topic model which jointly models available class labels and text content by defining a one-to-one correspondence between latent topics and class label information, which allows their model to directly learn word-tag correspondences in the topic model itself. What has not been studied in supervised topic modeling is the role that the word order structure in the text content could play along with the side-information in the document classification task. Our proposed supervised topic model falls in the class of parametric topic models, where the number of latent topics has to be supplied by the user. Recently, Kawamae (2014) presented a nonparametric supervised n-gram topic model based on a Pitman–Yor process prior (Pitman and Yor 1997) for phrase extraction, which takes advantage of labels during the training process. However, it cannot perform document retrieval learning as our model can. Moreover, Bartlett et al. (2010) stated that nonparametric models with Pitman–Yor process priors cannot scale to large datasets. There are other supervised nonparametric topic modeling approaches (Perotte et al. 2011; Storkey and Dai 2014; Lakshminarayanan and Raich 2011; Xie and Passonneau 2012; Liao et al. 2014; Acharya et al. 2013), but these models too cannot perform the document retrieval learning task. In addition, such nonparametric topic models are computationally very expensive (Wallach et al. 2009).

Unsupervised topic models have also been used to perform document classification. As mentioned above, they do not make use of the available side-information in the topic model itself. The LDA model is one example, and it has been shown to achieve better performance than Support Vector Machines (SVM) (Joachims 1998; Cortes and Vapnik 1995; Vapnik 2000). In (Rubin et al. 2012), the authors presented a model that maintains the order of words in documents, which helps achieve better classification results. In (Li and McCallum 2006), the authors presented an unsupervised hierarchical topic model which generates super- and sub-topics, and showed better classification performance than the comparative methods. The model is represented by a directed acyclic graph, which has the capability to capture correlations between two levels of topics. Topic models have also been used on datasets other than text documents for classification under the unsupervised setting (Bicego et al. 2010; Pinoli et al. 2014).

It has been shown in the past that considering the order of words in documents helps improve both the quantitative and qualitative performance of probabilistic topic models. For example, Wallach (2008) observed that word order is an important component in many applications such as natural language processing, speech recognition, and text compression, and that bag-of-words models might therefore not be very suitable for such applications. Wallach proposed the Bigram Topic Model (BTM), an extension of the LDA model. The BTM adopts a Markovian assumption on the order of words in documents and has been shown to perform better than the LDA model in predictive tasks. However, the BTM has a limitation in that it only generates bigrams, which may not be desirable for some tasks. Griffiths et al. (2007) proposed the LDA collocation model (LDACOL), which can generate either unigrams or bigrams based on the context information. In the LDACOL model, however, only the first term has a topic assignment whereas the second term does not, a limitation addressed in the topical n-gram model (TNG) (Wang and McCallum 2005; Wang et al. 2007). Some improvements to the BTM have been proposed in Noji et al. (2013). All these works suggest that word order plays an important role in topic models. In terms of qualitative results, the topical words appear more interpretable (Lindsey et al. 2012), and in terms of quantitative results, word order has been shown to improve many applications such as document classification (Jameel and Lam 2013b) and information retrieval (Wang et al. 2007).

Learning-to-rank models have been extensively investigated and can be categorized into pointwise, pairwise, and listwise approaches (Liu 2009). One early work used bag-of-features representations to train an SVM model for document retrieval learning, which can be regarded as a pointwise approach to the learning-to-rank task (Nallapati 2004). This approach makes a binary relevance prediction; documents are then ranked based on the confidence scores given by the discriminative classifier. Subsequently, other discriminative learning-to-rank models have been proposed, such as those which handle multi-class relevance assessments (Busa-Fekete et al. 2013; Li et al. 2007). Many state-of-the-art learning-to-rank models have been proposed recently. For example, Gao and Yang (2014) presented a novel semi-supervised listwise learning-to-rank model which is extended to an adaptive ranker for domains where no training data is available. In (Lai et al. 2013), the authors presented a sparse learning-to-rank model for information retrieval. Dang et al. (2013) proposed a two-stage learning-to-rank framework to address the problem of sub-optimal ranking when many relevant documents are excluded from the ranking list by bag-of-words retrieval models. In (Tan et al. 2013), the authors proposed a model which directly optimizes the ranking measure without resorting to any upper bounds or approximations. However, a major difference between these learning-to-rank models and our proposed document retrieval learning model is that our model considers latent topic information unified within a discriminative framework.

In the past, a few proposals have been made to conduct document retrieval using a low-dimensional latent semantic space. In (Li and Xu 2014), the authors summarize many of those works. The main motivation for incorporating semantic information into the document retrieval task is to compute the similarity between latent factors based on the semantic content of the document. In (Bai et al. 2010), the authors proposed a discriminative model called supervised semantic indexing which can be trained on labeled data. Their model can compute query-document and document-document similarity in the semantic space, but their focus is primarily on traditional document retrieval rather than learning-to-rank using an extensive set of feature values. Gao et al. (2011) and Jagarlamudi and Gao (2013) proposed topic models which jointly consider the query and the title of the document to conduct the document retrieval task using a language modeling framework. Their motivation for considering title fields is mainly that queries (Broder 2002) as well as titles are mostly short in nature, so a short document title can be more informative than the entire document for a query. One difference between our model and their framework is that their model is not designed to solve the learning-to-rank task using feature instances. Our model jointly learns the query and document pair along with the associated relevance label in the latent topic space.

Our document retrieval learning framework is also closely related to some works in posterior regularization. The objective of the posterior regularization framework is to restrict the space of the model parameters on unlabeled data as a way to guide the model towards some desired behaviour. In (Ganchev et al. 2010), the authors proposed a framework which incorporates side-information into parameter estimation in the form of linear constraints on posterior expectations. Recently, Zhu et al. (2012b, 2014) introduced Bayesian posterior regularization under an information theoretic formulation and applied their framework to the infinite latent SVM. Earlier, the same authors had extended Zellner's optimization view described in Zellner (1988) to propose a regularized Bayesian framework for the multi-task learning problem (Zhu et al. 2011), mainly by adding a convex function to the optimization framework proposed by Zellner. Models such as MedLDA (Zhu et al. 2009, 2012a) and some of its extensions are based on such frameworks (Zhu et al. 2013a; Jiang et al. 2012).

Relational topic models, such as the one described in Chang and Blei (2009), incorporate side-information in the form of connections in information networks. Such connections can be social network friendships, as used in Yuan et al. (2013), or scholarly citation networks. In (Tang et al. 2011), the authors proposed a topic model with supervised information for advertising. These models are not designed to handle document retrieval learning, which can be cast as a learning-to-rank problem. Also, our model incorporates the latent topic structure of the BTM model to better capture latent semantic information, and the supervision signal is used in the maximum margin framework.

3 Background

In this section, we first present a brief background that helps understand our proposed models described later. We start with a basic topic model known as Latent Dirichlet Allocation (LDA) (Blei et al. 2003) and present the details of its main components. Then we present the optimization-based view of the posterior distribution obtained from LDA. This optimization framework will then be extended to incorporate loss functions from a maximum-margin classifier. We present an example of a supervised topic model that makes use of this optimization framework of LDA by extending it to incorporate posterior constraints in Bayesian inference, leading to what is known as the regularized Bayesian inference framework.

3.1 Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a generative probabilistic topic model for collections of discrete data such as text document collections. The model assumes that documents exhibit multiple latent topics; therefore, each document is a mixture of a number of topics. LDA represents a latent topic as a probability distribution over words taken from a vocabulary set. A document is denoted by \(d \in \left\{ 1,\ldots ,D\right\} \) where \(D\) is the total number of documents in the collection. Let \(\varvec{W}=\left\{ \varvec{w}^d\right\} _{d=1}^{D}\) denote all the words in all the documents in the collection, where each \(\varvec{w}^d\) denotes the words in the document \(d\). \(N^d\) is the number of words in the document \(d\). \(w_n^d\) is the word at the position \(n\) in the document \(d\). \(K\) is the total number of latent topics, which is specified by the user. \(z_n^d\) is the topic assignment of the word \(w_n^d\). \(\varvec{Z}=\left\{ \varvec{z}^d\right\} _{d=1}^{D}\) are the topic assignments to all the words. \(\varvec{\varTheta }=\left\{ \varvec{\theta }^d\right\} _{d=1}^{D}\) are the topic distributions for all documents. Let \(\varvec{\varPhi }=\left\{ \varvec{\phi }_k\right\} _{k=1}^{K}\) denote the word-topic distributions. Let \(V\) denote the number of words in the vocabulary. Let \(\varvec{\alpha }\) be the vector of hyperparameter values for the document-topic distributions, and \(\varvec{\beta }\) the vector of hyperparameter values for the word-topic distributions.

The LDA model describes the generative procedure of each document in the collection. Each document is generated from a mixture of topics that pervades the document. Each of those topics is in turn responsible for generating the words without giving importance to the order of the occurrence of the words in those documents.

The generative process of the LDA model is written as:

1. Draw the topic proportion \(\varvec{\theta }^d\) for each document \(d\) from Dirichlet(\(\varvec{\alpha }\)),
2. Draw \(\phi _k\) for each topic \(k\) from Dirichlet(\(\varvec{\beta }\)),
3. For each word \(w_n^d\) in the document \(d\):
   (a) Draw a topic assignment \(z_n^d|\varvec{\theta }^d\) from Multinomial(\(\varvec{\theta }^d\)),
   (b) Draw the observed word \(w_n^d|z_n^d,\varvec{\varPhi }\) from Multinomial(\(\phi _{z_n^d}\)).
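For concreteness, the generative process above can be simulated with a short script. The sketch below is only an illustration with our own variable names (e.g., `n_topics`, `vocab_size`); the model itself is fit by posterior inference over Eq. (2) rather than by forward simulation.

```python
import numpy as np

def simulate_lda(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Illustrative forward simulation of the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Step 2: one word distribution phi_k per topic, drawn from Dirichlet(beta)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    corpus = []
    for d in range(n_docs):
        # Step 1: topic proportions theta^d for document d, drawn from Dirichlet(alpha)
        theta = rng.dirichlet(np.full(n_topics, alpha))
        doc = []
        for n in range(doc_len):
            z = rng.choice(n_topics, p=theta)     # Step 3(a): z_n^d ~ Multinomial(theta^d)
            w = rng.choice(vocab_size, p=phi[z])  # Step 3(b): w_n^d ~ Multinomial(phi_{z_n^d})
            doc.append(w)
        corpus.append(doc)
    return corpus, phi

# Example: a tiny synthetic collection with K = 3 topics and V = 50 vocabulary terms
docs, phi = simulate_lda(n_docs=5, doc_len=20, n_topics=3, vocab_size=50, alpha=0.1, beta=0.01)
```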

The probability of a document collection \({\mathbb {D}}\) in LDA is given as:

$$ p\left( {\mathbb {D}}|\varvec{\alpha }, \varvec{\beta }\right) = \prod _{d=1}^{D}\int P\left( \varvec{\theta }^d|\varvec{\alpha }\right) \left( \prod _{n=1}^{N^d} \sum _{z_{n}^d}P\left( z_{n}^d|\varvec{\theta }^d\right) P\left( w_{n}^d|z_{n}^d, \varvec{\beta }\right) \right) {\text {d}}\varvec{\theta }^d $$
(1)

The posterior distribution inferred by the LDA model can be written as:

$$ P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta }\right) = \frac{P_0\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta }\right) P\left( \varvec{W}|\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) }{P\left( \varvec{W}|\varvec{\alpha },\varvec{\beta }\right) } $$
(2)

where \(P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta })\) is the posterior distribution of the model. Let the prior distribution be denoted by \(P_0(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta })\); it is defined as:

$$ P_0\left( \varvec{\varTheta },\varvec{\varPhi },\varvec{Z}|\varvec{\alpha },\varvec{\beta }\right) =\left( \prod _{d=1}^{D}P\left( \varvec{\theta }^d|\varvec{\alpha }\right) \prod _{n=1}^{N^d}P\left( z_{n}^d|\varvec{\theta }^d\right) \right) \prod _{k=1}^{K}P\left( \varvec{\phi }_k|\varvec{\beta }\right) $$
(3)

\(P(\varvec{W}|\varvec{\varTheta },\varvec{Z},\varvec{\varPhi })\) is the likelihood. \(P(\varvec{W}|\varvec{\alpha },\varvec{\beta })\) is the marginal probability distribution.

3.2 Learning using Bayesian inference

Equation 2 presented in Sect. 3.1 can be further translated into an information theoretic optimization problem (Jiang et al. 2012; Zhu et al. 2012a, 2013a, 2014). An advantage of this paradigm is that it can easily be extended to incorporate regularization terms on the desired posterior distribution obtained using Bayes' theorem. It leads to a learning model where the posterior distribution is directly regularized using a model which considers side-information. The regularizer can be obtained from the maximum-margin learning principle and then integrated into the Bayesian learning paradigm, leading to regularized Bayesian inference using maximum-margin learning. In principle, this hybrid model could achieve better prediction performance than using a topic model or a maximum-margin classifier alone, because it inherits the prediction power from both maximum margin learning and topic models. It is well known that maximum margin classifiers show strong generalization performance (Burges 1998), and topic models have also shown good performance on the document classification task (Rubin et al. 2012; Li and McCallum 2006). Therefore, we can expect the hybrid model to inherit the advantages of both. When conducting posterior inference, we can directly regularize the posterior distribution, which leads to a new posterior regularized by a constraint. Some supervised topic models, such as MedLDA (Zhu et al. 2012a) and Monte Carlo MedLDA (Jiang et al. 2012), are based on this paradigm.

According to the findings described in Zellner (1988), Eq. (2) can be transformed to an optimization problem which can be written as follows:

$$\begin{aligned} \begin{aligned}&\underset{P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) \in {\mathbb {P}}}{\text {minimize}}\quad {\text {KL}}\left[ P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta })||P_0(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta }\right) \right] -{\mathbb {E}}_P \left[ \log {\text {P}}\left( \varvec{W}|\varvec{Z},\varvec{\varPhi }\right) \right] \\&{\text {subject \, to}} \quad \, P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) \in {\mathbb {P}}, \end{aligned} \end{aligned}$$
(4)

where \({\mathbb {P}}\) is the probability distribution space, and \({\text {KL}}(P||P_0)\) is the Kullback–Leibler divergence from \(P\) to \(P_0\). The above optimization interpretation will be useful in our later discussion where we will show how this technique can be used to derive a new maximum margin learning framework using a topic model. We present how the posterior distribution can be transformed into the optimization problem depicted in Eq. (4) in the "Appendix".
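As a brief sketch of why this optimization view recovers standard Bayesian inference (the full argument is deferred to the "Appendix"), observe that for any candidate distribution \(P\),

$$ {\text {KL}}\left[ P||P_0\right] -{\mathbb {E}}_P\left[ \log {\text {P}}\left( \varvec{W}|\varvec{Z},\varvec{\varPhi }\right) \right] = {\text {KL}}\left[ P||P_{\text {Bayes}}\right] -\log P\left( \varvec{W}|\varvec{\alpha },\varvec{\beta }\right) , $$

where \(P_{\text {Bayes}}\) denotes the posterior on the left-hand side of Eq. (2) and the arguments of \(P\) and \(P_0\) are as in Eq. (4). Since \(\log P(\varvec{W}|\varvec{\alpha },\varvec{\beta })\) does not depend on \(P\), the unconstrained minimizer of Eq. (4) is exactly the Bayes posterior; the feasible set \({\mathbb {P}}\), and the regularization terms added later, are what move the solution away from it.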

3.3 Maximum Entropy Discrimination LDA (MedLDA)

As mentioned above, our proposed model can be regarded as a supervised topic model where the class label information is incorporated into the topic model itself. Supervised topic models have been used for both classification and regression tasks. One example of a supervised topic model is supervised LDA (sLDA) (Blei and McAuliffe 2008), which extends LDA via the likelihood principle. Another recent supervised topic model is MedLDA (Zhu et al. 2009, 2012a; Jiang et al. 2012), whose graphical model is presented in Fig. 1. Note that in this model \(\beta \) is not used explicitly, but it can be used as a prior to make the model fully Bayesian (Zhu et al. 2012a). MedLDA combines a maximum margin learning algorithm based on Support Vector Machines (SVM) for label prediction with a topic model based on LDA for modeling the semantic content of the words.

Fig. 1
figure 1

Graphical representation of the MedLDA model

The class label for the document \(d\) is denoted by \(y^d\), which takes one of the values in \({\texttt{Y}} =\left\{ 1,\ldots , M\right\} \). Let \(\overline{\varvec{z}}^d\) denote a \(K\)-dimensional vector with each element \({\overline{z}}_k^d=\frac{1}{N^d}\sum _{n=1}^{N^d} {\mathbb {I}}(z_{n}^d=k)\), where \({\mathbb {I}}(.)\) is an indicator function which equals 1 if the predicate holds and 0 otherwise. \(\varvec{f}(y,\varvec{\overline{z}}^d)\) is an \(MK\)-dimensional vector whose elements from position \((y-1)K+1\) to \(yK\) are \(\overline{\varvec{z}}^d\) and the rest are all 0. Let \(\varvec{\eta }\) denote the parameters of the maximum margin classification model. Let \(C\) be a regularization constant, \(\xi ^d\) be the slack variable, and \(l^d(y)\) be the loss function for the label \(y\); all of these are positive. \(\varvec{\xi }\) denotes the vector of nonnegative auxiliary parameters, usually referred to as the slack variables. Consider Zellner's interpretation shown in Eq. (4). In the regularized Bayesian framework, a convex function is added to the optimization problem described above (Zhu et al. 2011). One choice of such a convex function borrows ideas from a maximum margin classifier, and the resulting problem can be written as:

$$\begin{aligned} \underset{P(\varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }) \in {\mathbb {P}}, \varvec{\xi }}{\text {minimize}}\quad&{\text {KL}}[P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W}, \varvec{\alpha },\varvec{\beta })||P_0(\varvec{\varTheta },\varvec{Z}, \varvec{\varPhi }|\varvec{\alpha },\varvec{\beta })]-{\mathbb {E}}_P[\log {\text {P}}(\varvec{W}|\varvec{Z},\varvec{\varPhi })] + B(\varvec{\xi }) \nonumber \\ {\text {subject \, to }}\quad \,&P(\varvec{\eta }, \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }) \in {\mathbb {P}}(\varvec{\xi }), \end{aligned}$$
(5)

where \(B(\varvec{\xi })\) is a convex function, usually the hinge loss function of the maximum margin classifier, \(\varvec{\eta }\) denotes the parameters of the maximum margin classifier, and \({\mathbb {P}}(\varvec{\xi })\) is the subspace of probability distributions that satisfies a set of constraints. Note that, as stated in Sect. 3.2, we can add a loss function to the optimization view of the Bayes' theorem obtained from LDA. Thus the interpretation given by Zellner can easily be used to develop supervised topic models for prediction tasks.

For MedLDA, a maximum margin based topic model for label prediction, the soft-margin formulation can be written as:

$$\begin{aligned} \underset{p \left( \varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) \in {\mathbb {P}},\xi }{{\text {minimize}}}\quad&{\text {KL}}\left[ P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta })||P_0(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta })\right] -{\mathbb {E}}_P\left[ \log {\text {P}}\left( \varvec{W}|\varvec{Z},\varvec{\varPhi }\right) \right] + \frac{C}{D} \sum _{d=1}^{D} \xi ^d \nonumber \\ {\text {subject \, to }} \quad&{\mathbb {E}}_P\left[ \varvec{\eta }^{\intercal } \left( \varvec{f} \left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y,\varvec{\overline{z}}^d\right) \right) \right] \ge l^d(y) - \xi ^d, \xi ^d \ge 0, \forall d, \forall y, \end{aligned}$$
(6)

One can see from the above equation that MedLDA conducts regularized Bayesian inference of the same form as depicted in Eq. (5). Therefore, MedLDA is a hybrid topic model which takes advantage of both the topic model and the maximum margin learning framework. Equation (6) can also be written as:

$$\begin{aligned} \begin{aligned}&\underset{P\left( \varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) \in {\mathbb {P}},\xi }{{\text {minimize}}}\quad {\text {KL}}\left[ P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta }\right) ||P_0\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta }\right) \right] - {\mathbb {E}}_P\left[ \log {\text {P}}\left( \varvec{W}|\varvec{Z},\varvec{\varPhi }\right) \right] \\&\quad +\frac{C}{D} \sum _d \max _y\left( l^d\left( y\right) -{\mathbb {E}}_P\left[ \varvec{\eta }^{\intercal } \left( \varvec{f}\left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y,\varvec{\overline{z}}^d\right) \right) \right] \right) \end{aligned} \end{aligned}$$

The component \(\frac{1}{D} \sum _d \max _y\left( l^d\left( y\right) -{\mathbb {E}}_P\left[ \varvec{\eta }^{\intercal }\left( \varvec{f}\left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y,\varvec{\overline{z}}^d\right) \right) \right] \right) \) is the hinge loss, which is an upper bound of the prediction error on the training data.
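As an illustration of these quantities, the sketch below computes \(\overline{\varvec{z}}^d\), the feature map \(\varvec{f}(y,\overline{\varvec{z}}^d)\), and the per-document hinge loss for a single draw of \((\varvec{\eta },\varvec{z}^d)\) (in the model these are averaged over the posterior). All names are our own, classes are 0-indexed, and the 0/16 loss used as a default is only a placeholder choice; this mirrors the definitions above rather than the authors' implementation.

```python
import numpy as np

def mean_topic_vector(z_d, n_topics):
    """\bar{z}^d: empirical topic proportions of the sampled assignments in document d."""
    return np.bincount(z_d, minlength=n_topics) / len(z_d)

def feature_map(y, z_bar, n_classes):
    """f(y, \bar{z}^d): an M*K vector whose y-th block of length K is \bar{z}^d
    (with 0-indexed classes, block y occupies positions yK .. (y+1)K - 1)."""
    K = len(z_bar)
    f = np.zeros(n_classes * K)
    f[y * K:(y + 1) * K] = z_bar
    return f

def hinge_loss(eta, z_bar, y_true, n_classes,
               loss=lambda y, y_true: 16.0 * (y != y_true)):
    """max_y [ l^d(y) - eta^T (f(y_true, z_bar) - f(y, z_bar)) ] for one document."""
    f_true = feature_map(y_true, z_bar, n_classes)
    values = [loss(y, y_true) - eta @ (f_true - feature_map(y, z_bar, n_classes))
              for y in range(n_classes)]
    return max(values)
```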

One characteristic of MedLDA is that it conducts posterior regularization, where the posterior distribution obtained using a topic model is regularized with maximum margin constraints. This leads to a posterior which is mainly helpful in classifying those points which lie on the margin of the classifier or are mis-classified. The latent topic information supplied by the topic model helps classify such hard instances, which the maximum margin classifier alone would find difficult. This mechanism makes the model different from two-stage approaches in which one first computes the latent topic information using a topic model and then uses it as an added feature in the classification task. A two-stage approach to prediction may suffer from error propagation from one stage to the next, which is mitigated in single-stage models such as MedLDA.

4 Supervised topic model with word order for document classification

4.1 Model description

We propose a document classification model based on a latent topic model that integrates the class label information and the word order structure into the topic model itself. This enables interaction among such information for more effective modeling for document classification. There are two main components. One component is a topic model with word order. The other component is the maximum margin model. One fundamental difference between MedLDA and our proposed model is that our model exploits the word order structure of a document. The design of the above two components leads to a latent topic representation that is more discriminative and advantageous for the supervised document classification learning problem.

The document content modeling component of our model is primarily a bigram topic model which captures dependencies between consecutive words. Each topic is characterized by a distribution over bigrams. The goal of our model is to generate a latent topic representation that is suitable for the classification task. We adopt the same notation as in Sect. 3. In our model, word generation is defined by the conditional distribution \(P(w_n^d|w_{n-1}^d, z_n^d)\). The word-topic distribution denoted by \(\varvec{\varPhi }\) is different from MedLDA: \(\varvec{\varPhi }=\left\{ \varvec{\phi }_{kv}\right\} _{v,k=1}^{V,K}\) denotes the word-topic distributions. We depict the graphical model of our model in Fig. 2. Note that we show the hyperparameter \(\beta \) explicitly in the graphical model. The generative process of our model is depicted below:

1. Draw a Multinomial distribution \(\phi _{zw}\) from a Dirichlet prior \(\beta \) for each topic \(z\) and each word \(w\),
2. For each document \(d\):
   (a) Draw a topic proportion \(\varvec{\theta }^d\) for the document \(d\) from Dirichlet(\(\varvec{\alpha }\)), where Dirichlet(\(\varvec{\alpha }\)) is the Dirichlet distribution with parameter \(\varvec{\alpha }\),
   (b) For each word \(w_n^d\):
       (i) Draw a topic \(z_n^d\) from Multinomial(\(\varvec{\theta }^d\)),
       (ii) Draw the word \(w_n^d\) from the distribution over words for the context defined by the topic \(z_n^d\) and the previous word \(w_{n-1}^d\), i.e., from Multinomial(\(\phi _{w_{n-1}^d z_n^d}\)),
3. Draw the class label parameter \(\varvec{\eta }\) from Normal(\(0,\varvec{\eta }_0\)), where \(\varvec{\eta }_0\) is the hyperparameter for \(\varvec{\eta }\); this is sampled \(M\) times, where \(M\) is the number of classes considered in the classification problem,
4. Draw a class label \(y^d|(\varvec{z}^d,\varvec{\eta })\) according to Eqs. (8)–(10).
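A minimal sketch of the word-level steps 2(b)(i)-(ii) above is given below; it highlights the only generative difference from LDA, namely that the word distribution is indexed by both the current topic and the previous word. Variable names are our own, the handling of the first word of a document is glossed over, and the classifier part (steps 3-4) is omitted.

```python
import numpy as np

def simulate_bigram_doc(doc_len, theta, phi, first_word, rng):
    """phi has shape (K, V, V): phi[z, w, v] is the probability of word w given
    topic z and previous word v (cf. Multinomial(phi_{w_{n-1}^d z_n^d}))."""
    words, topics = [first_word], []
    for n in range(1, doc_len):
        z = rng.choice(len(theta), p=theta)                   # z_n^d ~ Multinomial(theta^d)
        w = rng.choice(phi.shape[1], p=phi[z, :, words[-1]])  # w_n^d | w_{n-1}^d, z_n^d
        topics.append(z)
        words.append(w)
    return words, topics
```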

Let \(\varvec{b}^d\) denote \(\{b_{n,n+1}^d\}_{n=1}^{N^d-1}\), where \(b_{n,n+1}^d\) denotes the words at the positions \(n\) and \(n+1\) in the document \(d\) written as \(b_{n,n+1}^d=(w_n^d,w_{n+1}^d)\). \(\varvec{W}=\{\varvec{b}^d\}_{d=1}^{D}\) is the word order information. The prior distribution defined in the model is expressed as:

$$\begin{aligned} P_0(\varvec{\varTheta },\varvec{\varPhi },\varvec{Z})=\left( \prod _{d=1}^{D}P(\varvec{\theta }^d|\varvec{\alpha }) \prod _{n=1}^{N^d}P\left( z_{n}^d|\varvec{\theta }^d\right) \right) \prod _{k=1}^{K}\prod _{v=1}^{V} P\left( \varvec{\phi }_{kv}|\varvec{\beta }\right) \end{aligned}$$
(7)

In our model, the objective is to infer the joint distribution \(P(\varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta })\), where \(\varvec{\eta }\) is a random variable representing the parameter of the classification model. In addition, the discriminant function is defined as:

$$ F\left( y,\varvec{\eta },\varvec{z};\varvec{b}^d\right) = \varvec{\eta }^{\intercal } \varvec{f}\left( y;\varvec{\overline{z}}^d\right) $$
(8)

The above latent function cannot be directly used for prediction tasks for an observed input document as it involves random variables. Therefore, we take the expectation and define the effective discriminant function as follows:

$$ F\left( y;\varvec{b}^d\right) = {\mathbb {E}}_{p\left( \varvec{\eta },\varvec{z}| \varvec{b}^d\right) } \left[ F\left( y,\varvec{\eta },\varvec{z};\varvec{b}^d\right) \right] $$
(9)

The prediction rule incorporating the word order structure in the classification task is:

$$ \hat{y}=\underset{y}{{{\text {argmax}}}}\quad F\left( y;\varvec{b}^d\right) $$
(10)

Let \(C\) be a regularization constant, \(\xi ^d\) be the slack variable and \(l^d(y)\) be the loss function for the label \(y\); all of which are positive. The soft-margin framework for our model can be written as:

$$\begin{aligned} \begin{aligned}&\underset{P\left( \varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }\right) \in \mathbb {P}, \varvec{\xi }}{\text {minimize}}\quad {\text {KL}}\left[ P\left( \varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{W},\varvec{\alpha },\varvec{\beta }\right) ||P_0\left( \varvec{\eta },\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }|\varvec{\alpha },\varvec{\beta }\right) \right] - \mathbb {E}_P\left[ \log {\text {P}}\left( \varvec{W}|\varvec{Z},\varvec{\varPhi }\right) \right] \\&\quad + \frac{C}{D} \sum _d \max _y\left( l^d\left( y\right) -\mathbb {E}_P\left[ \varvec{\eta }^{\intercal } \left( \varvec{f}\left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y,\varvec{\overline{z}}^d\right) \right) \right] \right) \\&{\text {subject \, to }}\quad \mathbb {E}_P\left[ \varvec{\eta }^{\intercal }\left( \varvec{f}\left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y,\varvec{\overline{z}}^d\right) \right) \right] \ge l^d(y) - \xi ^d, \xi ^d \ge 0, \forall d, \forall y, \end{aligned} \end{aligned}$$
(11)
Fig. 2
figure 2

Graphical representation of our proposed document classification model

4.2 Posterior inference

We use collapsed Gibbs sampling for posterior inference, taking the word order structure in the document into account. The collapsed Gibbs sampler collapses out the nuisance parameters and speeds up the posterior inference (Shafiei and Milios 2006). Eq. (11) can be solved in two steps in an alternating manner. The first step is to estimate \(P(\varvec{\eta })\) given \(P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi })\). In the second step, we estimate \(P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi })\) given \(P(\varvec{\eta })\). We can estimate \(P(\varvec{\eta })\) using the algorithm described in Jiang et al. (2012), which makes use of Lagrange multipliers; however, our topic modeling component is different and thus the distribution \(P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi })\) needs to be estimated differently. We define \(\varvec{\kappa }\) as follows:

$$ \varvec{\kappa }=\sum _{d=1}^{D} \sum _{y^d} \lambda ^d_{y^d} \varvec{\Delta } \varvec{f}\left( y^d,\mathbb {E}\left[ \varvec{\overline{z}}^d\right] \right), $$
(12)

where \(\varvec{\kappa }\) is the mean of the classifier parameters \(\varvec{\eta }\). When we attach a \(*\) to \(\kappa \), it denotes the optimum solution. We describe an outline of the estimation of topical bigrams below.

First, we can factorize the topic model component and the maximum margin parameter component as follows:

$$ P\left( \varvec{\eta },\varvec{\varTheta },\varvec{\varPhi },\varvec{Z}\right) = P\left( \varvec{\eta }\right) P\left( \varvec{\varTheta },\varvec{\varPhi },\varvec{Z}\right) $$
(13)

Let \(\varvec{\Delta }\varvec{f}\left( y^d,\varvec{\overline{z}}^{d}\right) \) be defined as follows:

$$ \varvec{\Delta }\varvec{f}\left( y^d,\varvec{\overline{z}}^{d}\right) =\varvec{f}\left( y^d,\varvec{\overline{z}}^d\right) -\varvec{f}\left( y, \varvec{\overline{z}}^d\right) $$
(14)

Based on Eq. (13), the formulation for the optimum solution is given as follows:

$$\begin{aligned} P\left( {\varvec{\varTheta },\varvec{Z},\varvec{\varPhi }}\right) \propto P\left( \varvec{\varTheta },\varvec{Z},\varvec{\varPhi },\varvec{W}\right) {\text {e}}^{\varvec{\kappa ^{\left( *\right) }}^{\intercal } \sum _{d=1}^{D}\sum _{y^d} \left( \lambda ^d_{y^d}\right) ^{*}\varvec{\Delta }\varvec{f}\left( y^d,\varvec{\overline{z}}^{d}\right) } \end{aligned}$$
(15)

where \(\lambda _{y^d}^{d}\) is the Lagrange multiplier. The problem now is to efficiently draw samples from \(P(\varvec{\varTheta },\varvec{Z},\varvec{\varPhi })\) and also compute the expectation statistics of the maximum margin classifier used in our model. In order to simplify the integrals, we can take advantage of conjugate priors. We can integrate out the intermediate variables \(\varvec{\varTheta },\varvec{\varPhi }\) and build a Markov chain whose equilibrium distribution is the resulting marginal distribution \(P(\varvec{Z})\).

Let \(Z\) be a normalization constant. We get the following marginalized posterior distribution for our model after integrating out \(\varvec{\varTheta },\varvec{\varPhi }\):

$$\begin{aligned} P\left( \varvec{Z}\right) =\frac{P\left( \varvec{W},\varvec{Z}|\varvec{\alpha },\varvec{\beta }\right) }{Z} {\text {e}}^{\varvec{\kappa ^{\left( *\right) }}^{\intercal } \sum _{d=1}^{D} \sum _{y} \left( \lambda ^d_y\right) ^{*}\varvec{\Delta }\varvec{f}\left( y,\varvec{\overline{z}}^{d}\right) } \end{aligned}$$
(16)

The original BTM proposed in Wallach (2006) used an EM algorithm for approximate inference, but we use a collapsed Gibbs sampler. Therefore, in order to solve the first component on the right hand side of the above equation, collapsed Gibbs sampling for the model has to be implemented. The second component can be solved using any existing SVM implementation with some modifications based on the formulations used in our model.

Let \(m_{zwv}\) be the number of times the word \(w\) is generated by the topic \(z\) when preceded by the word \(v\). \(q_{dz}\) is the number of times a word is assigned to the topic \(z\) in the document \(d\). The element \(\kappa _{y^dk}\) represents the contribution of the topic \(k\) in classifying a data point to the class \(y^d\). The transition probability along with the maximum margin constraint can be expressed as:

$$\begin{aligned} \begin{aligned} P\left( z_n^d|\varvec{W},\varvec{Z}_{\lnot n},\alpha ,\beta \right) =&\left( \frac{\alpha _{z_n^d}+q_{dz_n^d}-1}{\sum _{z=1}^{K} \left( \alpha _z+q_{dz}\right) -1} \times {\text {e}}^{\frac{1}{N^d} \sum _{y} \left( \lambda ^d_y\right) ^{*}\left( \varvec{\kappa }^{*}_{{y_d}k}-\varvec{\kappa }^{*}_{yk}\right) } \right) \\&\times \frac{\beta _{w_n^d}+m_{z_n^d w_{n}^dw_{n-1}^d}-1}{\sum _{v=1}^{V}\left( \beta _v+m_{z_n^d w_{n}^dv}\right) -1} \end{aligned} \end{aligned}$$
(17)

Note that all the counts used above exclude the current case, i.e., the word being visited during sampling. When we use a \(\lnot \) sign in the subscript of a variable, it means that the case corresponding to the subscripted index is removed from the calculation of the count. In the above equation, the \(-1\) terms arise from the chain rule expansion of the Gamma function. The posterior estimates of the model can be written as:

$$\begin{aligned} \begin{aligned} P\left( z_n^d|\varvec{W},\varvec{Z}_{\lnot n},\alpha ,\beta \right) =&\left( \frac{\alpha _{z_n^d}+q_{dz_n^d}}{\sum _{z=1}^{K} \left( \alpha _z+q_{dz}\right) } \times {\text {e}}^{\frac{1}{N^d} \sum _{y} \left( \lambda ^d_y\right) ^{*} \left( \varvec{\kappa }^{*}_{{y_d}k}-\varvec{\kappa }^{*}_{yk}\right) } \right) \\&\times \frac{\beta _{w_n^d}+m_{z_n^d w_{n}^dw_{n-1}^d}}{\sum _{v=1}^{V} \left( \beta _v+m_{z_n^d w_{n}^dv}\right) } \end{aligned} \end{aligned}$$
(18)
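The sampling step in Eq. (18) can be sketched as follows. The counts `q` and `m` are assumed to already exclude the word being resampled, `lam` holds the optimal Lagrange multipliers \((\lambda ^d_y)^*\), and `kappa` holds the classifier means \(\varvec{\kappa }^*\) arranged per class and topic; all names and the data layout are our own, and the snippet mirrors the equation rather than the authors' implementation.

```python
import numpy as np

def topic_weights(d, w, w_prev, q, m, alpha, beta, lam, kappa, y_d, n_classes, N_d):
    """Normalized probabilities of Eq. (18) for every topic z at one word position.
    q[d, z]   : number of words in document d assigned to topic z
    m[z, w, v]: number of times word w is generated by topic z when preceded by v
    """
    K = q.shape[1]
    weights = np.empty(K)
    for z in range(K):
        doc_term = (alpha[z] + q[d, z]) / (alpha.sum() + q[d].sum())
        word_term = (beta[w] + m[z, w, w_prev]) / (beta.sum() + m[z, w].sum())
        # Max-margin term: contribution of topic z to separating class y_d from the others
        margin = sum(lam[d, y] * (kappa[y_d, z] - kappa[y, z]) for y in range(n_classes))
        weights[z] = doc_term * np.exp(margin / N_d) * word_term
    return weights / weights.sum()

# Usage: probs = topic_weights(d, w, w_prev, q, m, alpha, beta, lam, kappa, y_d, M, N_d)
#        z_new = rng.choice(len(probs), p=probs)
```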

4.3 Prediction for unseen documents

Our prediction framework for unseen documents follows a strategy similar to that used in many other topic model works (Jiang et al. 2012; Yao et al. 2009). Let the unseen document be denoted as \(d^{\text {new}}\). We consider the notion of word order. The input for the prediction task is unlabeled test data, and the output is the predicted label for the new document \(d^{\text {new}}\). We compute the point estimate of the topics in the matrix \(\varvec{\varPhi }\) from the training data; this matrix is used in the prediction task. When the unseen document is given to the model, we need to determine the latent dimensions \(\varvec{z}^{d^{\text {new}}}\) for this unseen document. This is computed using the MAP estimate of \(\varvec{\varPhi }\), denoted \(\hat{\varvec{\varPhi }}\). Specifically, we compute \(z_n^{d^{\text {new}}}\) in each new document \(d^{\text {new}}\) as follows:

$$ P\left( z_n^{d^{\text {new}}}|\varvec{z}_{\lnot n}^{d^{\text {new}}}\right) \propto \hat{\phi }_{\left( z_n^{d^{\text {new}}},w_n^{d^{\text {new}}},w_{n-1}^{d^{\text {new}}}\right) } \left( \alpha _{z_n^{d^{\text {new}}}}+q_{dz_n^{d^{\text {new}}}}\right) $$
(19)

Expectation statistics computation can be derived in a similar manner as the classifier described in Jiang et al. (2012).
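A sketch of the resampling step in Eq. (19) is given below. `phi_hat` is the point estimate \(\hat{\varvec{\varPhi }}\) obtained from training (indexed here as topic, word, previous word), `q_new[z]` counts the current topic assignments in the unseen document excluding the word being resampled, and `alpha` is the document-topic hyperparameter; the names and the exact data layout are our own assumptions.

```python
import numpy as np

def resample_unseen_topic(w, w_prev, phi_hat, alpha, q_new, rng):
    """Draw a new topic for one word of an unseen document according to Eq. (19)."""
    K = phi_hat.shape[0]
    weights = np.array([phi_hat[z, w, w_prev] * (alpha[z] + q_new[z]) for z in range(K)])
    return rng.choice(K, p=weights / weights.sum())
```

After a few sweeps of this step, the resulting topic assignments yield \(\overline{\varvec{z}}^{d^{\text {new}}}\), which is then used in the prediction rule of Eq. (10).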

5 Document classification experiments

5.1 Experimental setup

We conduct extensive experiments on document classification using some benchmark test collections and compare with many related methods. In addition, we present some high quality topical words showing how our model generates interpretable topics. In all our experiments for topic models, we run the sampler for 1000 iterations. We have also removed stopwords and performed stemming using Porter's stemmer. Text pre-processing and vector space generation were done using the Gensim package. Fivefold cross validation is used as in Zhu et al. (2012a). In each fold, the macro-average across the classes is computed. Each model is run five times. We take the average of the results obtained over all the runs and all the folds.

We use four datasets, namely, the 20 Newsgroups dataset, the OHSUMED-23 dataset, the TechTC-300 Test Collection for Text Categorization, and the Reuters-21578 text categorization collection. For OHSUMED-23, as adopted in Joachims (1998), we used the first 20,000 documents. We present the details about the datasets in Table 1. In the table, the first column presents the names of the datasets. The second column gives the total number of classes in the dataset. The third column presents the total number of documents in the entire dataset. The fourth column shows the average number of documents in each class. The fifth column presents the average length of the documents in the entire dataset. One can see that we have used both small and large document collections.

Table 1 Details about different datasets used in the document classification experiments

In all our datasets, we used a validation set for determining the number of topics. The validation set consisted of approximately 20 % of the documents, the training set comprised approximately 60 % of the documents, and the test set consisted of approximately 20 % of the documents. We use Precision, Recall and F-measure to measure the classification performance; the definitions of these metrics for the classification task can be found in Jameel and Lam (2013b). We solve the multiclass classification problem by decomposing it into binary classification problems, one per class. However, this procedure also introduces the problem of unbalanced data, as stated in Nallapati (2004). We therefore adopted under-sampling, in which the majority class is down-sampled so that both classes have an equal number of samples (Nallapati 2004); empirical evidence suggests that such a method generally produces better results, as pointed out by Zhang and Mani (2003). We used the training set to train the model and varied the number of topics from 10 to 100 in steps of 10 as in Jameel and Lam (2013b). The trained model was then validated on the validation set. We performed this procedure in each fold and computed the average F-measure. The number of topics which produced the best F-measure is the output of the validation process. We then used the test set to evaluate the models with the number of topics obtained from the validation process. We set the loss function \(l^d(y)\) to a constant value of 16, just as in Jiang et al. (2012). For simplicity, we assume a symmetric bigram Dirichlet prior and set the value of \(\beta \) to 0.01. The settings for the other hyperparameters remain the same as in Jiang et al. (2012) for fair comparison. As in Wang and McCallum (2006), we also found little variation in results with different hyperparameter values. Hyperparameter values of the other topic models (supervised and unsupervised) are the same as used in their respective works and their publicly shared implementations, which ensures that we use the best configurations for each of the models. In Jiang et al. (2012), the authors conduct extensive experimentation to find the best \(C\) value; we use the same \(C\) value for fair comparison and found that different values of \(C\) did not have much effect on the results.
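The under-sampling step described above can be sketched as follows (our own illustration of the procedure, not the authors' code):

```python
import random

def undersample(pos_docs, neg_docs, seed=0):
    """Balance a one-vs-rest binary split by randomly down-sampling the larger class."""
    random.seed(seed)
    n = min(len(pos_docs), len(neg_docs))
    return random.sample(pos_docs, n), random.sample(neg_docs, n)
```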

We chose a wide range of comparative methods as follows: (1) Gibbs MedLDA (Zhu et al. 2013a), denoted gMedLDA; (2) variational MedLDA (Zhu et al. 2009), denoted vMedLDA; (3) supervised LDA (sLDA) (Blei and McAuliffe 2008); (4) Discriminative LDA (DiscLDA) (Lacoste-Julien et al. 2008); (5) LDA (Blei et al. 2003); (6) LDA+SVM, used in the same way as described in Zhu et al. (2012a); (7) the Bigram Topic Model (BTM) (Wallach 2006); (8) BTM+SVM, following the same procedure as for LDA+SVM; (9) the LDA-Collocation model (LDACOL) (Griffiths et al. 2007); (10) LDACOL+SVM; (11) the Topical N-gram model (TNG) (Wang et al. 2007); (12) TNG+SVM; (13) the recently proposed model NTSeg (Jameel and Lam 2013b); (14) NTSeg+SVM; (15) SVM (Joachims 1998). The features for the linear SVM are the same as those in Zhu et al. (2013a).

5.2 Quantitative results

We present our main classification results in Tables 2, 3, 4 and 5. We observe that our model has outperformed all the comparative methods. In all datasets, our F-measure results are statistically significant based on the sign test with a \(p\) value \(<\)0.05 against each of the comparative methods. Maintaining the word order and considering extra side-information helps improve classification results to a great extent. Since we capture the inherent word order semantics in the document, just like other structured unsupervised topic models, we obtain improvements over the comparative methods.

Table 2 Table depicting precision, recall and F-measure values for different models in the 20 Newsgroups dataset
Table 3 Table depicting precision, recall and F-measure values for different models in the OHSUMED-23 dataset
Table 4 Table depicting precision, recall and F-measure values for different models in the TechTC300 dataset
Table 5 Table depicting precision, recall and F-measure values for different models in the Reuters dataset

In Table 6 we present the numbers of topics obtained during the validation process. These numbers of topics were subsequently used on the test set to compute the final results depicted in Tables 2, 3, 4 and 5.

Table 6 Table depicting the number of latent topics \(K\) obtained using the validation process, which was used in the test set for different models in different datasets

In Tables 7, 8, 9, and 10, we study the effect of the number of topics on document classification performance, as measured by F-measure, when we vary the number of topics from 10 to 100 for the topic models in different datasets. Starting from \(K=10\) in the 20 Newsgroups dataset, we see that our model does not perform very well in the beginning; nevertheless, it still outperforms the other topic models. Our model performs very well for \(K \ge 70\). Similarly, in the OHSUMED-23 dataset, our model does not perform well for \(K \le 60\), although it still outperforms the other topic models, and then gains good improvement as we increase the number of latent topics. The performance of the unsupervised n-gram topic models also cannot be disregarded. One observation is that the recently proposed unsupervised n-gram topic model NTSeg has done well compared to the other unsupervised topic models in the 20 Newsgroups dataset; a similar pattern is observed in the OHSUMED-23 dataset. In TechTC-300, all the models show poor performance, which indicates that the dataset has examples which the topic models find difficult to classify. In Reuters too, our model shows good performance as the number of latent topics is varied from 10 to 100. This suggests that considering the word order can offer some contribution to document classification performance. Our model can outperform the other comparative methods because it inherits the advantages of both n-gram unsupervised topic models and supervised topic models. Note that, as exemplified in Jameel and Lam (2013b) and many other works which follow word order, the computational complexity of models that follow word order is generally higher than that of their bag-of-words counterparts. Nevertheless, models incorporating word order structure have shown superior performance to bag-of-words models (Jameel and Lam 2013b). Several attempts have been made recently to speed up the inference procedures for both supervised and unsupervised topic models, such as Zhu et al. (2013b, c) and Porteous et al. (2008).

Table 7 The effect of the number of topics on document classification measured by F-measure in the 20 Newsgroups dataset
Table 8 The effect of the number of topics on document classification measured by F-measure in the OHSUMED-23 dataset
Table 9 The effect of the number of topics on document classification measured by F-measure in the TechTC-300 dataset
Table 10 The effect of the number of topics on document classification measured by F-measure in the Reuters-21578 dataset

5.3 Examples of topical words

We present some high probability topical words in topics and compare our model with some related n-gram and supervised topic models, including BTM (Wallach 2006), LDACOL (Griffiths et al. 2007), TNG (Wang et al. 2007), PDLDA (Lindsey et al. 2012), NTSeg (Jameel and Lam 2013b), and MedLDA (Zhu et al. 2012a). We present the top five most representative words from a topic describing a semantically similar theme from each model. We chose documents from the comp.graphics class to present the list of topical words in this experiment, as adopted in Zhu et al. (2012a).

The objective of presenting a list of topical words for comparison is to show whether the words in each topic give some insight about the topic. Obviously, ambiguous words will not convey the topic to a reader, and we can then infer that the topic model is unable to generate interpretable latent topics. Note that many works related to topic models present some top-k words from selected topics, but this analysis cannot be regarded as a strong indication of the superiority of one topic model over another. This is why quantitative analysis, which we have already presented and in which our model performed better than the comparative methods, is so important.

From the results shown in Tables 11 and 12, we can make two observations. First, our model generates more fine-grained topical words compared to the other topic models. Second, our model generates more interpretable latent topics compared to the other models. Words such as “video memory”, “simple routing”, and “package zip” appear to make sense to a reader. For example, “package zip” is a bigram which might describe zipping the contents of a file. Overall, most of the bigrams in the topic generated by our model suggest that our model has generated words related to the domain “computer graphics”. The other models instead generate ambiguous n-grams or unigrams which do not offer much understanding to the user; for instance, the bigrams generated by the BTM model do not suggest that the topic describes “computer graphics”, as words such as “compgraph path”, “xref compgraph”, etc. are not very insightful to a reader.

Table 11 Top five probable words from a topic from comp.graphics class of 20 Newsgroups dataset
Table 12 Top five probable words from a topic from comp.graphics class of 20 Newsgroups dataset

6 Topic model for document retrieval learning

6.1 Model description

We also investigate a supervised low-dimensional latent topic model for document retrieval learning. Suppose that relevance assessments of documents for some queries are available for training. Our goal is to learn a model that can predict the relevance of an unseen test query-document pair, and rank the documents based on the predicted relevance score. This problem setting is similar to the pointwise learning-to-rank problem. Manual relevance assessments are modeled as a response variable in our topic model. In addition, the word order structure of the text content is also considered. The main motivation for considering the word order is to capture the inherent semantics of the document, which is lost when the order of words in the document is broken. Similar to our proposed document classification model, there are two main components in our document retrieval learning model. One component is a topic model which measures the goodness of fit of the text content of documents and queries. Queries are modeled as short documents, in a similar manner as in Wu and Zhong (2013) and Salton et al. (1975). Our topic model considers the word order structure in documents and queries. The second component predicts the relevance labels within a maximum margin framework, which makes our retrieval learning model pointwise. The dataset can be represented as (\((d,q),y_{(d,q)}\)), composed of query-document pairs \((d,q)\) along with the relevance assessment label \(y_{(d,q)}\), which signifies the relevance of the document \(d\) to the query \(q\). Let \(c(d,q)\) be the total number of query-document pairs in the training set, \(D\) the number of documents in the training set, and \(Q\) the number of queries in the training set. As adopted in Nallapati (2004), the confidence scores obtained from the discriminant function are used to rank documents in our proposed model. Let the words in the document \(d\) be represented by \(\varvec{w}^d\) and the words in the query \(q\) by \(\varvec{w}^q\). Let the set of topics used in the document \(d\) be represented by \(\varvec{z}^d\), and the set of topics in the query \(q\) by \(\varvec{z}^q\).
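To make the data layout concrete, the following minimal sketch (in Python, with hypothetical field names) shows one way to represent the training instances described above; it is illustrative only and not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QueryDocPair:
    """One training instance: a query-document pair with its relevance label."""
    query_words: List[str]   # w^q, word sequence of the query (order preserved)
    doc_words: List[str]     # w^d, word sequence of the document (order preserved)
    relevance: int           # y_(d,q), graded relevance judgment

# A training set is simply a list of such pairs; c(d,q) is its size.
training_set: List[QueryDocPair] = [
    QueryDocPair(["heart", "attack", "treatment"],
                 ["patients", "with", "acute", "myocardial", "infarction"],
                 relevance=2),
]
```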

There are several fundamental differences between our document retrieval learning framework and previously proposed supervised topic models. In our model, each input data instance consists of a pair of a document and a query instead of a single document. In contrast to other supervised topic models such as Jiang et al. (2012) and Zhu et al. (2009, 2012a), the property of the feature vector is also different: in our retrieval learning model, the feature vector includes different query-dependent and query-independent features which are useful for conducting the learning-to-rank task.

We first describe a new discriminant function suited for handling the document retrieval learning problem. The discriminant function of our model is designed as follows:

$$ F(y,\varvec{\eta },(d,q))= \varvec{\eta }^{\intercal } \varvec{f}( y,(d,q)) $$
(20)

where \(\varvec{\eta }\) represents the model parameters, which are essentially feature weights, and \(\varvec{f}( y,(d,q))\) is a vector of features designed to be useful for retrieval. The new definitions of \(\varvec{\eta }\) and \(\varvec{f}( y,(d,q))\) make the model suitable for handling the document retrieval task. Some examples of features are depicted in Table 13. Note that, just as in the LETOR learning-to-rank datasets (Qin et al. 2010), these features are computed for the entire dataset \({\mathbb {D}}\) before generating the training, test and validation sets. \(c(w_n^d,d)\) is the number of times the word \(w_n^d\) appears in the document \(d\), \(N^q\) is the number of words in the query \(q\), \(|.|\) denotes the size function, and idf is the inverse document frequency. The first six features have also been used in Nallapati (2004), where readers can find the motivation behind their design. Some minor refinements to some of these six features were made in Xu and Li (2007) and Qin et al. (2010), and we use these refined features in our experimental setup. The last feature, called the topic similarity feature, is a similarity measure between the topics of the query and the document in the low-dimensional topic space generated by our model. Let \(\varvec{Z}^d=\left\{ \varvec{z}^d\right\} _{d=1}^{D}\) be the topic assignments to all the words of the training documents; \(\varvec{Z}^q=\left\{ \varvec{z}^q\right\} _{q=1}^{Q}\) be the topic assignments to all the words in the training queries; \(\varvec{\varTheta }^d=\left\{ \varvec{\theta }^d\right\} _{d=1}^{D}\) be the topic distributions for all training documents; \(\varvec{\varTheta }^q=\left\{ \varvec{\theta }^q\right\} _{q=1}^{Q}\) be the topic distributions for all training queries; and \(\varvec{\varPhi }=\left\{ \varvec{\phi }_{kv}\right\} _{v,k=1}^{V,K}\) be the word-topic distribution. In order to compute the topic similarity in the low-dimensional topic space between the document and the query, we make use of the topic-document and topic-query distributions \(\varvec{\varTheta }^d\) and \(\varvec{\varTheta }^q\). Under these distributions, each document or query is represented as a \(K \times 1\) low-dimensional vector in the latent topic space, whose entries are essentially \(P(z|d)\) or \(P(z|q)\), where \(d\) is a document and \(q\) is a query. Each value in this vector can be considered as a weight for the corresponding latent topic (Hazen 2010), or simply the contribution of a topic to a document. Consider a document \(d\) associated with a query \(q\); each is thus represented by its own low-dimensional latent topic vector. Let the \(K \times 1\) latent topic vector of the document \(d\) be denoted by \(v^d\) and that of the query \(q\) by \(v^q\). We compute the cosine similarityFootnote 10 between these two vectors. The intuitive idea is that if the two vectors are close to each other in the latent topic space, i.e. if they are semantically related even though they do not share the same words, they tend to have a high cosine similarity value in the latent topic space. In fact, works such as Liu et al. (2009) and Maas et al. (2011) have also used cosine similarity between words and documents in the latent topic space. Other similarity metrics such as KL-divergence could also be used.

Table 13 Features used in our discriminant function in our document retrieval learning model
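As a concrete illustration of the topic similarity feature, the following sketch (Python, hypothetical function and variable names) computes the cosine similarity between the \(K\)-dimensional topic proportions of a document and a query; it is a minimal sketch under the assumption that \(\varvec{\theta }^d\) and \(\varvec{\theta }^q\) have already been estimated.

```python
import numpy as np

def topic_similarity(theta_d: np.ndarray, theta_q: np.ndarray) -> float:
    """Cosine similarity between document and query topic proportions.

    theta_d, theta_q: K-dimensional vectors holding P(z|d) and P(z|q).
    """
    denom = np.linalg.norm(theta_d) * np.linalg.norm(theta_q)
    if denom == 0.0:
        return 0.0
    return float(np.dot(theta_d, theta_q) / denom)

# Example: a document and a query concentrated on the same topics score high.
theta_d = np.array([0.7, 0.2, 0.1])
theta_q = np.array([0.6, 0.3, 0.1])
print(topic_similarity(theta_d, theta_q))  # close to 1.0
```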

Unlike the classification model, where we took the expectation, the effective discriminant function is obtained directly from Eq. (20) as follows:

$$ F(y,(d,q))= F(y,\varvec{\eta },(d,q)) $$
(21)

The prediction rule, in which our objective is to find the label with the highest discriminant score, is given in Eq. (22):

$$ \hat{y}=\underset{y}{{\text {argmax }}} \quad F(y,(d,q)) $$
(22)
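A minimal sketch of the prediction rule in Eq. (22) (Python, hypothetical names): given the learned weights and the label-dependent feature map, the predicted relevance label is simply the one with the highest discriminant score.

```python
import numpy as np

def predict_label(eta, feature_fn, pair, labels):
    """Return argmax_y eta^T f(y, (d, q)) over the candidate relevance labels.

    eta: learned weight vector; feature_fn(y, pair): feature vector f(y, (d, q)).
    """
    scores = {y: float(np.dot(eta, feature_fn(y, pair))) for y in labels}
    return max(scores, key=scores.get)
```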

The following maximum margin constraints are imposed:

$$ F(y_{(d,q)},(d,q))-F(y,(d,q)) \ge l_{(d,q)}(y)- \xi _{(d,q)}, \forall y \in Y, \forall (d,q) $$
(23)

where \(l_{(d,q)}(y)\) is a non-negative loss function, \(\xi _{(d,q)}\) are non-negative slack variables which account for inseparable data instances, and \(C\) is a positive regularization constant. The soft-margin framework for our model is described below:

$$\begin{aligned} \begin{aligned}&\underset{P\left( \varvec{\varTheta }^d,\varvec{\varTheta }^q,\varvec{Z}^d,\varvec{Z}^q,\varvec{\varPhi }\right) \in {\mathbb {P}},\xi ,\varvec{\eta }}{{\text {minimize }}} {\text {KL }}\left[ P\left( \varvec{\varTheta }^d,\varvec{\varTheta }^q,\varvec{Z}^d,\varvec{Z}^q,\varvec{\varPhi }\right) ||P_0\left( \varvec{\varTheta }^d,\varvec{\varTheta }^q,\varvec{Z}^d,\varvec{Z}^q,\varvec{\varPhi }\right) \right] \\&\quad -\mathbb {E}_P\left[ \log P\left( \varvec{W}^d,\varvec{W}^q|\varvec{\varTheta }^d,\varvec{\varTheta }^q,\varvec{Z}^d,\varvec{Z}^q,\varvec{\varPhi }\right) \right] + \frac{C}{c(d,q)} \sum _{(d,q)} \xi _{(d,q)} \\&{\text {subject\, to }}\quad \left[ \varvec{\eta }^{\intercal } \left( \varvec{f}(y_{(d,q)},(d,q))-\varvec{f}(y,(d,q))\right) \right] \ge l_{(d,q)}(y) - \xi _{(d,q)}, \xi _{(d,q)} \ge 0, \forall (d,q),\forall y \end{aligned} \end{aligned}$$
(24)

6.2 Posterior inference

In order to proceed with the derivation of the collapsed Gibbs sampling, we need to define a joint distribution for the words and the topics along with the regularization effects due to the maximum margin posterior constraints. In this model too, we need to alternately find the optimal solution of the maximum margin classifier and solve the topic model component. But unlike the posterior inference of the classification model, we can directly adopt the implementation of an existing SVM algorithm to find the optimal solution of the classifier. Let \(\varvec{\eta }^{(*)}\) denote the optimal parameter weights. The joint distribution is written as:

$$\begin{aligned} \begin{aligned} P\left( \varvec{Z}^d,\varvec{W}^d,\varvec{Z}^q,\varvec{W}^q|\varvec{\alpha },\varvec{\beta }\right) =&P\left( \varvec{W}^d|\varvec{Z}^d,\varvec{\beta }\right) \times P\left( \varvec{W}^q|\varvec{Z}^q,\varvec{\beta }\right) \times P\left( \varvec{Z}^d|\varvec{\alpha }\right) \times P\left( \varvec{Z}^q|\varvec{\alpha }\right) \\&\times \, {\text {e}}^{\varvec{\eta ^{(*)}}^{\intercal } \sum _{(d,q)} \sum _{y=1}^{M} \left( \lambda ^y_{(d,q)}\right) ^{*}\left( \varvec{f}(y_{(d,q)},(d,q))-\varvec{f}(y,(d,q))\right) } \end{aligned} \end{aligned}$$
(25)

After some algebraic manipulations, we arrive at the following update equation:

$$\begin{aligned} \begin{aligned} P\left( z_n^d,z_n^q|\mathbf {W}^d,{\mathbf {W}}^q,{\mathbf {Z}}_{\lnot n}^d,{\mathbf {Z}}_{\lnot n}^q,\alpha ,\beta \right) =&\left( \frac{\alpha _{z_n^d}+m_{z_n^d w_n^d}-1}{\sum _{z=1}^{K} \left( \alpha _z+m_{z}\right) -1} \times \frac{\alpha _{z_n^q}+m_{z_n^qw_n^q}-1}{\sum _{z=1}^{K} \left( \alpha _z+m_{z}\right) -1} \right. \\&\left. \times \, {\text {e}}^{\frac{1}{\left( N^d+N^q\right) } \varvec{\eta }^{(*)\intercal } \sum _{y=1}^{M} \left( \lambda ^y_{(d,q)}\right)^{*} \left( \varvec{f}\left( y_{(d,q)},(d,q)\right) -\varvec{f}(y,(d,q))\right) } \right) \\&\times \frac{\beta _{w_n^d}+m_{z_n^d w_n^d w_{n-1}^d}-1}{\sum _{v=1}^{V}\left( \beta _v+m_{z_n^d w_n^d v}\right) -1} \times \frac{\beta _{w_n^q}+m_{z_n^q w_n^q w_{n-1}^q}-1}{\sum _{v=1}^{V}\left( \beta _v+m_{z_n^q w_n^q v}\right) -1} \end{aligned} \end{aligned}$$
(26)

where \(m_{zwv}\) is the number of times the word \(w\) is generated by the topic \(z\) when preceded by the word \(v\); it applies to a document or a query when super-scripted by \(d\) or \(q\), respectively. \(m_{zw}\) is the number of times the word \(w\) has been sampled in the topic \(z\), and likewise applies to a document or a query when super-scripted by \(d\) or \(q\), respectively.
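The update in Eq. (26) can be hard to parse, so the following simplified Python sketch shows the shape of one collapsed Gibbs step for a single document word. It drops the query part, treats the max-margin exponential factor as a precomputed per-topic weight, and uses hypothetical count arrays, so it is illustrative rather than the authors' implementation.

```python
import numpy as np

def sample_topic_for_doc_word(w, w_prev, d,
                              n_dz,           # n_dz[d, z]: topic counts in document d
                              n_zwv,          # n_zwv[z, w, v]: word w generated by topic z after word v
                              alpha, beta, V,
                              margin_weight): # margin_weight[z]: exp(...) factor from the SVM constraints
    """Draw a topic for word w (preceded by w_prev) in document d.

    Counts are assumed to already exclude the current assignment of this word.
    """
    K = n_dz.shape[1]
    weights = np.empty(K)
    for z in range(K):
        doc_part = alpha + n_dz[d, z]                          # document-topic term
        bigram_part = (beta + n_zwv[z, w, w_prev]) / \
                      (V * beta + n_zwv[z, :, w_prev].sum())   # bigram word-topic term
        weights[z] = doc_part * bigram_part * margin_weight[z]
    weights /= weights.sum()
    return np.random.choice(K, p=weights)
```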

One can argue that asymmetric priors may work better, especially on short documents such as queries; many previous works on short documents have assumed asymmetric priors in their topic models, such as Yan et al. (2013) and Hasler et al. (2014). Our model is flexible enough to accommodate asymmetric priors, but in this paper we only test it with symmetric priors for simplicity. In Nallapati (2004), the author discussed some shortcomings of discriminative models for IR, in particular out-of-vocabulary words, and suggested a few ways of dealing with them. We follow those strategies in this paper.

6.3 Ranking unseen query-document pairs

The prediction task on test data using the prediction rule given in Eq. (22) can be realized as follows. Let (\(q^{\text {new}},d^{\text {new}}\)) be an unseen test query-document pair for which we need to predict the relevance label. The task is to compute the latent topic representations of \(q^{\text {new}}\) and \(d^{\text {new}}\) using the topic space learned from the training data. These latent components for the unseen query and document can be obtained from \(\hat{\varvec{\varPhi }}\), the maximum a posteriori estimate of \(P(\varvec{\varPhi })\) computed during training. Suppose there are \(J\) samples from a proposal distribution; \(\hat{\varvec{\varPhi }}\) is obtained from these samples using the following equation:

$$ \hat{\phi }_{zwv} \propto \frac{1}{J} \sum _{j=1}^{J} \left( \beta _{w_n^{d}}+m_{z_n^{d} w_n^d w_{n-1}^d}^{(j)}\right) \times \left( \beta _{w_n^{q}}+m_{z_n^{q} w_n^q w_{n-1}^q}^{(j)}\right) $$
(27)

where the counts are assigned in the jth sample. The latent components for the unseen document and the query can be computed as follows.

$$\begin{aligned} \begin{aligned}&P\left( z_n^{d^{\text {new}}},z_n^{q^{\text {new}}}|{\mathbf {W}}^{d^{\text {new}}},{\mathbf {W}}^{q^{\text {new}}},{\mathbf {Z}}_{\lnot n}^{d^{\text {new}}},{\mathbf {Z}}_{\lnot n}^{q^{\text {new}}},\alpha ,\beta \right) \propto \hat{\phi }_{z_n^{d^{\text {new}}} w_n^{d^{\text {new}}}w_{n-1}^{d^{\text {new}}}} \left( \alpha _{z_n^{d^{\text {new}}}}+m_{z_n^{d^{\text {new}}}}\right) \\&\quad \times \,\hat{\phi }_{z_n^{q^{\text {new}}} w_n^{q^{\text {new}}}w_{n-1}^{q^{\text {new}}}} \left( \alpha _{z_n^{q^{\text {new}}}}+m_{z_n^{q^{\text {new}}}}\right) \end{aligned} \end{aligned}$$
(28)

where the count for the word being sampled is excluded. We then compute the similarity between the query and the document in the latent topic space. Note that \(y_{(d,q)}\) can be dropped during the prediction step. The maximum margin prediction of labels for unseen vectors follows the standard maximum margin formulation (Yu and Kim 2012); note that this formalism is different from the expectation-based maximum margin classifier discussed previously for document classification. Once the similarity score is computed, it can be used in Eq. (20) to compute the prediction score, and documents can be ranked based on this confidence score.
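To make the ranking step concrete, the following minimal sketch (Python, hypothetical names) scores candidate documents for a test query with the learned weight vector \(\varvec{\eta }\) over the features of Table 13 (including the topic similarity feature) and sorts them by this confidence score; it is illustrative only.

```python
import numpy as np

def rank_documents(eta, feature_fn, query, candidate_docs):
    """Rank candidate documents for a query by the discriminant score eta^T f.

    eta: learned weight vector (one weight per retrieval feature).
    feature_fn(query, doc): returns the feature vector of Table 13 for the pair,
                            with the topic similarity feature as its last entry.
    """
    scores = [float(np.dot(eta, feature_fn(query, doc))) for doc in candidate_docs]
    order = np.argsort(scores)[::-1]          # highest confidence first
    return [(candidate_docs[i], scores[i]) for i in order]
```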

7 Retrieval learning experiments

7.1 Experimental setup

We conduct document retrieval learning experiments using benchmark text collections. We will show the performance of our model by conducting extensive quantitative analysis. In addition, we will also present some high probability topical words from topics, and show how our model is able to generate better topical words leading to more interpretable topics. In all our experiments, we run the Gibbs sampler of our model for 1000 iterations. We removed stopwords, and performed stemming using Porter’s stemmer.

We use four test collections for our experiments. The first is the benchmark OHSUMED test collection (latest versionFootnote 11) from the LETOR (Qin et al. 2010) dataset. This dataset consists of 45 comprehensive features along with query-document pairs and their relevance judgments, and it has been used extensively in evaluating several learning-to-rank algorithms. We obtained the raw documents and queries of this dataset from the webFootnote 12 in order to get the word order. The dataset contains the document-id along with the list of features, which helps us relate each set of features in LETOR OHSUMED to its document. Our proposed topic similarity feature is treated as one additional feature on top of the existing 45 features. In each of the five folds, approximately 60 % of the query-document pairs are in the training set, 20 % in the validation set, and the rest in the test set. For a particular fold, the queries involved in the training, validation, and test sets are different. The validation set is used by the comparative learning-to-rank models for parameter tuning and for determining the number of iterations. Our second collection is AQUAINT, used in TREC HARD.Footnote 13 Basic details about this dataset can be found in Allan (2005). Note that we only consider document-level relevance assessments in AQUAINT and leave out the passage-level judgments. The third dataset is WT2G,Footnote 14 along with the standard relevance judgments and topics (401-450) obtained from the TREC site. The fourth dataset is the Category B English documents from the ClueWeb09 collection, which we obtained from the authors of Asadi and Lin (2013). In order to create the training, test and validation sets for AQUAINT and WT2G, we adopted the strategies popularly used for learning-to-rank problems, choosing the same percentage of query-document pairs in the training, test and validation sets in each fold as in the LETOR OHSUMED dataset. The features used for the AQUAINT and WT2G datasets are given in Table 13. Note that only the number of features differs between the datasets that we generated (WT2G and AQUAINT) and LETOR OHSUMED; we list the number of features used in the document retrieval learning experiments in Table 14. Based on our proposed model, we also investigate a variant, called Variant 1, which we test empirically. In this variant we ignore the word order structure in queries but maintain it in documents, the reason being that queries are mostly short, so the role of word order might not be very significant. In addition, we also compare with another variant, named Variant 2, where word order is maintained in neither queries nor documents. We use NDCG@5 and NDCG@10 as our metrics, similar to the metrics used in Cai et al. (2011). NDCG is well suited for our task because it is defined with an explicit position discount factor and can leverage judgments in multiple ordered categories (Järvelin and Kekäläinen 2002).

Table 14 Number of features in each dataset used in document retrieval learning experiments
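For reference, a minimal sketch (Python) of how NDCG@k is typically computed from graded relevance judgments, following the position-discounted definition of Järvelin and Kekäläinen (2002); details such as the gain function vary between implementations, so treat this as an illustration rather than the exact evaluation script used here.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance grades."""
    rel = np.asarray(relevances[:k], dtype=float)
    gains = 2.0 ** rel - 1.0                        # common graded-gain choice
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(gains / discounts))

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the system ranking normalized by the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: relevance grades of documents in ranked order for one query.
print(ndcg_at_k([2, 0, 1, 0, 2], k=5))
```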

In order to determine the number of topics \(K\), the parameter \(C\), and the constant loss function \(l_{(d,q)}(y)\) in our model, we use the validation set. We first train our model on the training set and measure NDCG@5 and NDCG@10 performance on the validation set; the number of topics and the model parameters can thus be determined automatically from the validation process. We then evaluate our model on the test set. We varied the number of topics from 50 to 300 in steps of 10, the values of \(C\) in multiples of 10, and \(l_{(d,q)}(y)\) from 1 to 20 in steps of 1. We again set a weak \(\beta \) prior of 0.01 and use symmetric Dirichlet priors for our model. We found that varying the value of the hyperparameter does not drastically affect the results, which is consistent with Wang and McCallum (2006). We also found experimentally that different values of \(C\) do not significantly change the performance of the model. The experimental results are averaged over the five folds for all the models. Each model is run only once in each fold.
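A minimal sketch (Python, hypothetical train and evaluation functions) of the validation procedure just described: sweep the grid of \(K\), \(C\), and the constant loss, and keep the setting with the best validation NDCG. This is an assumption about how such a sweep is commonly organized, not the authors' code.

```python
from itertools import product

def select_parameters(train_fn, ndcg_fn, train_set, valid_set):
    """Grid search over K, C and the constant loss l, scored by validation NDCG@10."""
    topic_grid = range(50, 301, 10)   # number of topics K
    c_grid = [10, 100, 1000]          # C varied in multiples of 10 (illustrative values)
    loss_grid = range(1, 21)          # constant loss l_(d,q)(y)
    best_setting, best_score = None, -1.0
    for K, C, l in product(topic_grid, c_grid, loss_grid):
        model = train_fn(train_set, K=K, C=C, loss=l, beta=0.01)
        score = ndcg_fn(model, valid_set, k=10)
        if score > best_score:
            best_setting, best_score = (K, C, l), score
    return best_setting
```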

We compare the performance of our model with a range of comparative methods, including popular learning-to-rank models in RankLibFootnote 15 such as MART (Friedman 2001), RankNet (Burges et al. 2005), AdaRank (Xu and Li 2007), Coordinate Ascent (Metzler and Croft 2007), LambdaRank (Quoc and Le 2007), LambdaMART (Wu et al. 2010), ListNet (Cao et al. 2007b), and Random Forests (Breiman 2001), a popular pointwise learning-to-rank model. In addition, we used Ranking SVM (Joachims 2002)Footnote 16 and \({\texttt{SVM }}^{MAP}\) (Yue et al. 2007).Footnote 17 The first six features in Table 13 are also used in these comparative methods, as in Nallapati (2004), for learning (the first 45 features in the case of LETOR OHSUMED). Note that the seventh feature (the 46th in the case of LETOR OHSUMED) involves latent topic information which cannot be used in the comparative methods. In order to conduct the experiments for the comparative learning-to-rank models, we followed the standard learning-to-rank experimental procedure for each comparative method. Some models have standard published parameter values; for example, for LETOR OHSUMED, the values for Ranking SVM Footnote 18 and \({\texttt{SVM} }^{MAP}\) Footnote 19 are available online.

We present detailed parameter settings obtained from the validation dataset in each fold for our model in Tables 15, 16, 17, 18 and 19. In addition, we also present parameter settings for our Variant 1 and Variant 2 models in Tables 20, 21, 22, 23, 24, and Tables 25, 26, 27, 28, and 29, respectively.

Table 15 Values for different parameters obtained using the validation set for our model in Fold 1
Table 16 Values for different parameters obtained using the validation set for our model in Fold 2
Table 17 Values for different parameters obtained using the validation set for our model in Fold 3
Table 18 Values for different parameters obtained using the validation set for our model in Fold 4
Table 19 Values for different parameters obtained using the validation set for our model in Fold 5
Table 20 Values for different parameters obtained using the validation set for Variant 1 in Fold 1
Table 21 Values for different parameters obtained using the validation set for Variant 1 in Fold 2
Table 22 Values for different parameters obtained using the validation set for Variant 1 in Fold 3
Table 23 Values for different parameters obtained using the validation set for Variant 1 in Fold 4
Table 24 Values for different parameters obtained using the validation set for Variant 1 in Fold 5
Table 25 Values for different parameters obtained using the validation set for Variant 2 in Fold 1
Table 26 Values for different parameters obtained using the validation set for Variant 2 in Fold 2
Table 27 Values for different parameters obtained using the validation set for Variant 2 in Fold 3
Table 28 Values for different parameters obtained using the validation set for Variant 2 in Fold 4
Table 29 Values for different parameters obtained using the validation set for Variant 2 in Fold 5

Note that we do not choose any unsupervised topic model for comparison, primarily because such models cannot make use of relevance judgment information during training. They are therefore always at a disadvantage when compared with the learning-to-rank methods and with our model, which explicitly uses relevance labels during training. Also, supervised topic models such as sLDA cannot be directly used for comparison, as significant changes would be required for them to handle the document retrieval learning problem. In addition, the learning-to-rank models have already shown state-of-the-art results in this task, and thus they can be regarded as strong comparative methods. Our model does not directly use word proximity features in the learning setup (MacDonald et al. 2013). Instead, it uses word order to find the best model to fit the data, as it has been shown in the literature that topic models with word order improve model selection (Jameel and Lam 2013b; Kawamae 2014). Such proximity features have indeed helped improve learning-to-rank performance, but in this work our objective is to demonstrate the robustness of our model.

7.2 Quantitative results

We present the results obtained from all the test collections in Tables 30, 31, 32, and 33. From the results, we can see that our model outperforms all the comparative methods. The improvements are statistically significant according to the Wilcoxon signed rank test (with 95 % confidence) against each of the comparative methods on all the datasets, except NDCG@5 in the ClueWeb-2009 dataset where Variant 2 has also done better. Our results show that the latent topic information generated by our model, which is then used to compute query-document similarity, plays a significant role. Word order also plays a role, in that we are able to detect better topics than unigram models.

Table 30 NDCG@5 and NDCG@10 values for different models in the LETOR OHSUMED dataset
Table 31 NDCG@5 and NDCG@10 values for different models in the AQUAINT dataset
Table 32 NDCG@5 and NDCG@10 values for different models in the WT2G dataset
Table 33 NDCG@5 and NDCG@10 values for different models in the ClueWeb-2009 Category B English dataset

In the OHSUMED collection, we find that our main proposed model, in which word order is maintained in both queries and documents, performs better than the other models. Looking closely at the NDCG@5 results, our model performs considerably better, with statistically significant improvements over the comparative models. Variant 2 does not perform better than Variant 1 at NDCG@5, which brings out the importance of word order in the retrieval learning task. Models such as SVM-MAP and RankNet also perform well in this collection, mainly due to their mechanism of optimizing a different objective function. The Coordinate Ascent model also performs well, but does not outperform our main proposed model. At NDCG@10, we see improvement in the Variant 1 and Variant 2 models and the performance gap narrows, but they still do not outperform our model; the improvement of our model remains statistically significant. Other models such as Ranking SVM, Coordinate Ascent, RankNet, and SVM-MAP also perform well on this dataset. In the AQUAINT collection, we notice consistently superior performance of our model compared with the comparative models, with improvements that are statistically significant. We also find that the gap between our model and Variant 2, especially at NDCG@5, is reduced. Models such as SVM-MAP and RankNet also perform well on this dataset, and the difference between Variant 1 and Variant 2 is small here. We see some interesting results on the WT2G dataset: many models do well and are quite close in performance to our model, especially at NDCG@5, while at NDCG@10 our model consistently does better. In the ClueWeb-2009 dataset, however, Variant 2 matches the performance of our model, and even at NDCG@10 many models are close to ours. This suggests that spam and noisy pages have some impact on our model, and that maintaining word order may not be a good way to model collections containing noisy documents; the bag-of-words model can also do well in noisy collections.

We have seen from the results obtained in these experiments that considering the order of words in both queries and documents simultaneously helps improve the performance of document retrieval learning using topic models, and that relaxing the order of words in either queries or documents does not help improve the results. The good performance is primarily because our model is able to capture the semantic dependencies in text and matches words based on word proximity. We also found that noise has an impact on our model; therefore, in collections which are very noisy and contain many spam pages, the bag-of-words model can also be adopted.

One interesting facet to consider is the effect of the number of topics in the document retrieval learning experiment for our models. In order to study this effect, we varied the number of topics in the training set in each fold. We used the same set of parameters obtained in each fold in each dataset as shown earlier, except for the number of topics, which we specify manually in this set of experiments. After training the model on the training set, we used the test set directly to measure the effect of the number of topics, and we report results averaged over all five folds. In Table 34, we vary the number of topics from 50 to 290 in steps of 20 and present the results for our model. In the OHSUMED dataset we can see that, as we increase the number of topics, the results improve up to a certain number of topics and then begin to deteriorate as the number of topics keeps increasing. This gives us an insight into the dependence between the number of topics and the retrieval learning results for our models. However, we do not find any clearly noticeable pattern when the number of topics is varied; what we do observe is that the effect of varying the number of topics is not large, and most of the values are very close to each other in all datasets.

Table 34 NDCG@5 (denoted as N@5), and NDCG@10 (denoted as N@10) results obtained from our model when we vary the number of topics from 50 to 290

In addition, we present the results obtained from Variant 1 in Table 35 for the different datasets. We observe that the effect of the number of topics is not very noticeable in this model either. We make a similar observation for Variant 2 in Table 36.

Table 35 NDCG@5 (denoted as N@5), and NDCG@10 (denoted as N@10) results obtained from Variant 1 when we vary the number of topics from 50 to 290
Table 36 NDCG@5 (denoted as N@5), and NDCG@10 (denoted as N@10) results obtained from Variant 2 when we vary the number of topics from 50 to 290

It is quite interesting to see that our model outperforms some powerful learning-to-rank models. Our model performs consistently well with more features (in LETOR OHSUMED) and with fewer features (in WT2G and AQUAINT), which shows that the generalization ability of our proposed model is robust. The results suggest that incorporating topic similarity helps improve document retrieval performance. One reason why topic models help is that we compare the similarity between the document and the query based on latent factors rather than just the words (Wei and Croft 2006; Sordoni et al. 2013). Hence, this feature which our model computes is extremely important for the document retrieval learning task.

7.3 Investigation on topic enhancements for comparative models

In this section, we present results where we add the latent topic feature as one of the features, in addition to the existing list of features, in a two-stage approach. Our motivation is to study whether the latent topic feature obtained from either LDA or BTM can help improve the performance of the comparative models. The results of our model and its variants remain the same as in the previous experiment described in Sect. 7.2.

7.3.1 Employing LDA

In this set of experiments, for all the comparative methods, we manually append a latent topic similarity feature. The procedure is to first conduct latent topic modeling using the LDA model on the set of documents used in the learning-to-rank experiments. Then we use the existing method described in Wei and Croft (2006) to compute the query-document topic similarity. We obtain a score for each number of latent topics (\(K\)), which we vary from 10 to 100. Then we create the training, test and validation datasets based on the same split as used in the previous experiment. We use the validation set to tune the parameters of the comparative models, and we choose the number of topics \(K\) which gives the best NDCG@5 and NDCG@10 across all topic numbers on the validation set.
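A minimal sketch (Python, hypothetical variable names) of this two-stage enhancement: an LDA-based query-document score in the spirit of Wei and Croft (2006) is computed and appended as an extra feature column for the comparative learning-to-rank models. It illustrates the procedure only and is not the authors' exact code.

```python
import numpy as np

def lda_query_likelihood(query_word_ids, theta_d, phi):
    """P_LDA(q|d) = prod_i sum_z P(q_i|z) P(z|d), using LDA estimates.

    theta_d: K-dimensional topic proportions of document d.
    phi: K x V matrix with P(w|z).
    """
    score = 1.0
    for w in query_word_ids:
        score *= float(np.dot(theta_d, phi[:, w]))
    return score

def append_topic_feature(feature_matrix, topic_scores):
    """Append the topic similarity score as one more column of features."""
    return np.hstack([feature_matrix, np.asarray(topic_scores).reshape(-1, 1)])
```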

We present results for this set of experiments on different datasets in Tables 37, 38, 39 and 40. This topic enhanced setting is used in the comparative methods only.

Table 37 NDCG@5 and NDCG@10 values for different models in the LETOR OHSUMED dataset when the comparative models are enhanced with latent topic feature obtained from the LDA model
Table 38 NDCG@5 and NDCG@10 values for different models in the AQUAINT dataset when the comparative models are enhanced with latent topic feature obtained from the LDA model
Table 39 NDCG@5 and NDCG@10 values for different models in the WT2G dataset when the comparative models are enhanced with latent topic feature obtained from the LDA model
Table 40 NDCG@5 and NDCG@10 values for different models in the ClueWeb-2009 Category B English dataset when the comparative models are enhanced with latent topic feature obtained from the LDA model

Our results show that even when the externally computed latent topic feature is manually added, the comparative methods cannot outperform our proposed model. From the results in all datasets, we can conclude that in the majority of cases the results of the comparative methods improve when the latent topic similarity feature is added, but they still do not outperform our proposed document retrieval learning model. The reason lies in the inherent design of our model, which embeds the latent topic model within the maximum margin prediction. Even the closest learning-to-rank model, Ranking SVM, could not outperform our model.

The improvements that we obtain are statistically significant according to the Wilcoxon signed rank test (with 95 % confidence) against each of the comparative methods in all the datasets except NDCG@5 in the ClueWeb-2009 dataset. We can notice that the comparative methods have improved when the latent topic feature is added, and the performance gap between the comparative methods and our model has narrowed. In the LETOR OHSUMED dataset, the SVM-MAP and Coordinate Ascent models perform well. In the ClueWeb-2009 dataset, most of the models are able to narrow the performance gap, but our model still remains competitive.

Another interesting observation concerns the length of the query and the performance of our model. We have noticed that our model performs relatively better on longer queries than on shorter queries. The reason may be that word order conveys more information to our model for longer queries than for shorter ones.

7.3.2 Employing BTM

In this set of experiments, instead of using the LDA model, we use the BTM model, which considers word order. The procedure for adding latent topic information is similar to that described in Sect. 7.3.1, except that the retrieval formulation using the language modeling technique needs to be changed slightly in order to incorporate word order. We present the retrieval formulations below.

The query likelihood model scores each document \(d\) by calculating the likelihood of its model generating a query \(q\), written as \(P_{\text {LM}}(q|d)\). Under the bag-of-words assumption, we can write the following likelihood function:

$$ P_{\text {LM}}(q|d)=\prod _{i=1}^{N^q} P(q_i|d) $$
(29)

The above Eq. (29) is specified by a document model where we can consider Dirichlet smoothing (Zhai and Lafferty 2004). Therefore, Eq. (29) can be expressed as:

$$ P_{\text {LM}}(q|d)=\frac{N^d}{N^d+\mu } P_{\text {ML}}(q|d)+\left( 1-\frac{N^d}{N^d+\mu }\right) P_{\text {ML}}(q|{\mathbb {D}}) $$
(30)

where \(P_{\text {ML}}(q|d)\) is the maximum likelihood estimate for the query \(q\) generated from the document \(d\), and \(P_{\text {ML}}(q|{\mathbb {D}})\) is the maximum likelihood estimate for the query \(q\) generated from the entire collection \({\mathbb {D}}\). \(\mu =1000\) is the smoothing prior; this value has been adopted from the work of Zhai and Lafferty (2004).
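A minimal sketch (Python, hypothetical inputs) of the Dirichlet-smoothed query likelihood in Eq. (30), computed per query term and multiplied over the query; in practice log-probabilities would usually be summed instead, but the structure is the same.

```python
def smoothed_query_likelihood(query_terms, doc_tf, doc_len, coll_prob, mu=1000.0):
    """P_LM(q|d) with Dirichlet smoothing (Zhai and Lafferty 2004).

    doc_tf: dict term -> count in document d.
    doc_len: N^d, number of tokens in d.
    coll_prob: dict term -> maximum likelihood probability in the collection.
    """
    lam = doc_len / (doc_len + mu)
    score = 1.0
    for t in query_terms:
        p_ml_d = doc_tf.get(t, 0) / doc_len if doc_len > 0 else 0.0
        p_coll = coll_prob.get(t, 0.0)
        score *= lam * p_ml_d + (1.0 - lam) * p_coll
    return score
```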

In order to calculate the query likelihood for the BTM model using the language modeling framework, we need to sum over all the topic variables for each word. The posterior estimates can be used in the likelihood model. The query likelihood for the query \(q\) given the document \(d\) from BTM is written as \(P_{\text {BTM}}(q|d)\). Therefore, the likelihood function can be written as:

$$ P_{\text {BTM}}(q|d)=\prod _{i=1}^{N^q} P_{\text {BTM}}(q_i|q_{i-1},d) $$
(31)

where \(P_{\text {BTM}}(q_i|q_{i-1},d)\) can be expressed as:

$$ P_{\text {BTM}}(q_i|q_{i-1},d) = \sum _{k_i=1}^{K} P(q_i|\varvec{\varPhi }_{k_i},q_{i-1})P\left( k_i|\varvec{\theta }^d\right) $$
(32)

Similar to the framework described in Wei and Croft (2006), we can adopt the following:

$$ P(q|d)=\lambda P_{\text {LM}}(q|d)+(1-\lambda )P_{\text {BTM}}(q|d) $$
(33)

where \(\lambda \) is a weighting parameter. For consistency with the experiments performed using the LDA model in Sect. 7.3.1, we set the value of \(\lambda =0.7\).
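The following sketch (Python, hypothetical estimates) shows the interpolated score of Eq. (33), with the BTM term of Eqs. (31)-(32) summing over topics for each query bigram; it can reuse the smoothed language model sketch above and is illustrative only.

```python
import numpy as np

def btm_query_likelihood(query_word_ids, theta_d, phi_bigram):
    """P_BTM(q|d) = prod_i sum_k P(q_i | phi_k, q_{i-1}) P(k | theta^d).

    phi_bigram: K x V x V array with P(w | z, previous word v).
    The first query word is conditioned on a start symbol at index 0 here (an assumption).
    """
    score = 1.0
    prev = 0
    for w in query_word_ids:
        score *= float(np.dot(theta_d, phi_bigram[:, w, prev]))
        prev = w
    return score

def interpolated_score(p_lm, p_btm, lam=0.7):
    """Eq. (33): lambda * P_LM(q|d) + (1 - lambda) * P_BTM(q|d)."""
    return lam * p_lm + (1.0 - lam) * p_btm
```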

We present the results obtained by adding the topic information using BTM in Tables 41, 42, 43, and 44. In all our experiments, the improvement shown by our model is statistically significant according to Wilcoxon signed rank test (with 95 % confidence) against each of the comparative methods in all the datasets except NDCG@5 in ClueWeb-2009 dataset.

Table 41 NDCG@5 and NDCG@10 values for different models in the LETOR OHSUMED dataset when the comparative models are enhanced with latent topic feature obtained from the BTM model
Table 42 NDCG@5 and NDCG@10 values for different models in the AQUAINT dataset when the comparative models are enhanced with latent topic feature obtained from the BTM model
Table 43 NDCG@5 and NDCG@10 values for different models in the WT2G dataset when the comparative models are enhanced with latent topic feature obtained from the BTM model
Table 44 NDCG@5 and NDCG@10 values for different models in the ClueWeb-2009 Category B English dataset when the comparative models are enhanced with latent topic feature obtained from the BTM model

In the OHSUMED dataset, as depicted in Table 41, our model remains competitive compared with the other models. We achieve very good performance at NDCG@5, but the other models also do very well at NDCG@10. Compared to the results obtained using the LDA model, depicted in Table 37, the performance of the comparative models has indeed improved when word order is maintained in the topic model and that topic feature is used in the learning-to-rank models. Looking more closely, we notice that at NDCG@5 most of the comparative models show improved performance except LambdaMART, ListNet, and SVM-MAP; in fact, the performance of ListNet and LambdaMART has deteriorated to some extent, suggesting that latent topic information with word order did not give much help to these models. At NDCG@10, ListNet recovers from its poor performance, but SVM-MAP and LambdaMART do not. We also notice that at NDCG@10, in Table 41, the gap between our model and the comparative models has narrowed. In AQUAINT, as depicted in Table 42, our model performs better than the comparative models. At NDCG@5, the performance of three models, MART, Coordinate Ascent, and SVM-MAP, deteriorates compared to the LDA setting depicted in Table 38, although the change is not very significant. At NDCG@10 for AQUAINT, as depicted in Table 42, MART and SVM-MAP show an improvement over the LDA setting in Table 38, whereas the performance of LambdaRank deteriorates when latent topic information with word order is added. In WT2G, as depicted in Table 43, we notice a good improvement in the comparative models over the LDA setting in Table 39 at both NDCG@5 and NDCG@10, but their performance is still not as good as that of our model. LambdaRank, at NDCG@5, does not show an improvement when the latent topic feature from BTM is added to the list of features, and neither does RankNet. In the ClueWeb09 collection, as depicted in Table 44, many models in fact show lower NDCG@5 results, suggesting that spam and noisy text has some impact on the results. Models such as RankNet, AdaRank, and Coordinate Ascent deteriorate compared with the results listed in Table 40, while ListNet and SVM-MAP show no change in performance. At NDCG@10, RankBoost, Coordinate Ascent, and SVM-MAP show no performance improvement, and the performance of AdaRank in fact deteriorates.

In general, the above results reveal that incorporating latent topic information with word order into the comparative learning-to-rank methods does help improve performance. But since the approach is two-stage, the comparative models are not able to do better than our proposed model. We can conclude that word order has helped improve performance to some extent, but not consistently across all our results.

7.4 Topical words examples

We can see from Tables 45 and 46 that our model has generated words which appear more meaningful than those of the other models. From the list of top five words, it can be noted that our model describes “Egypt” and news related to the revolution during that time. We have only considered words from documents in order to present the results in this table. The AQUAINT collection does not have documents indexed into classes like those used in the classification experiments; therefore supervised topic models such as MedLDA might not generate interpretable words in topics, as they cannot use extra side-information while learning. For this comparison, we have therefore only considered unsupervised n-gram topic models. Our model uses the query-document relevance label (during learning) when generating words. We can see that words such as “president nasser” and “foreign minister” are more insightful than words such as “hk salem” and “today” generated by the NTSeg model. Much research has already been done on topic models with word order, where it has been shown empirically that n-gram models generate more interpretable latent topics than unigram models (Lindsey et al. 2012; Jameel and Lam 2013b, c; Wang et al. 2007; Griffiths et al. 2007). But what those n-gram models fail to consider is side-information, which can help generate even better latent topical representations. We have shown empirically that our model generates more meaningful latent topics than the comparative models.

Table 45 Top five probable words from a topic from AQUAINT collection
Table 46 Top five probable words from a topic from AQUAINT collection

8 Conclusions

We have presented supervised topic models which maintain word order in the document. We first proposed a bigram supervised topic model within a maximum margin framework, and compared its performance with comparative methods; the empirical analysis demonstrates that our model outperforms many of them. We then extended the supervised bigram topic model to handle the document retrieval learning task. This model takes query-document pairs as input, with the relevance assessments given manually by annotators as the response variables. The experimental analysis shows that our model outperforms many popular learning-to-rank models. By presenting a list of topical words in topics, we showed how our model generates better topical words than the comparative methods. The results clearly show that learning with side-information helps the model generate more interpretable topics with words that are insightful to a reader.