DOLDA: a regularized supervised topic model for high-dimensional multi-class regression

Generating user interpretable multi-class predictions in data-rich environments with many classes and explanatory covariates is a daunting task. We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle many classes as well as many covariates. To handle many classes we use the recently proposed Diagonal Orthant probit model (Johndrow et al., in: Proceedings of the sixteenth international conference on artificial intelligence and statistics, 2013) together with an efficient Horseshoe prior for variable selection/shrinkage (Carvalho et al. in Biometrika 97:465–480, 2010). We propose a computationally efficient parallel Gibbs sampler for the new model. An important advantage of DOLDA is that learned topics are directly connected to individual classes without the need for a reference class. We evaluate the model’s predictive accuracy and scalability, and demonstrate DOLDA’s advantage in interpreting the generated predictions.


INTRODUCTION
During the last decades more and more textual data have become available, creating a growing need to statistically analyze large amounts of textual data. The hugely popular Latent Dirichlet Allocation (LDA) model introduced by Blei et al. (2003) is a generative probability model where each document is summarized by a set of latent semantic themes, often called topics; formally, a topic is a probability distribution over the vocabulary. An estimated LDA model is therefore a compressed latent representation of the documents. LDA is a mixed membership model where each document is a mixture of topics and each word (token) in a document belongs to a single topic. The basic LDA model is unsupervised, i.e. the topics are learned solely from the words in the documents without access to document labels.
In many situations there is also other information we would like to incorporate when modeling a corpus of documents. A common example is labeled documents, such as ratings of movies together with a movie description, illness categories in medical journals, or the location of an identified bug together with bug reports. In these situations, one can use a so-called supervised topic model to find the semantic structure in the documents that is related to the class of interest. One of the first approaches to supervised topic models was proposed by Mcauliffe and Blei (2008). The authors propose a supervised topic model based on the generalized linear model framework, thereby making it possible to link binary, count and continuous response variables to topics that are inferred jointly with the regression/classification effects. In this approach the semantic content of a text, in the form of topics, predicts the response variable y. This approach is often referred to as a downstream supervised topic model, in contrast to an upstream supervised approach where the label y governs how the topics are formed; see e.g. Ramage et al. (2009).
Many downstream supervised topic models have been studied, mainly in the machine learning literature. Mcauliffe and Blei (2008) focus on downstream supervision using generalized linear regression models. Jiang et al. (2012) propose a supervised topic model using a max-margin approach to classification and Zhu et al. (2013) propose a logistic supervised topic model using data augmentation with Pólya-Gamma variates. Perotte et al. (2011) use a hierarchical binary probit model to model a hierarchical label structure in the form of a binary tree.
Most of the proposed supervised topic models have been motivated by trying to find good classification models, and the focus has naturally been on predictive performance. However, the predictive performance of most supervised topic models is just slightly better than that of a Support Vector Machine (SVM) with covariates based on word frequencies (Jameel et al., 2015). While predictive performance is certainly important, the real attraction of supervised topic models comes from their ability to learn semantically relevant topics and to use those topics to produce accurate, interpretable predictions for documents or other textual data. The interpretability of a model is an often neglected feature, but it is crucial in real world applications. As an example, Parnin and Orso (2011) show that bug fault localization systems are quickly disregarded when the users cannot understand how the system has reached its predictive conclusion. Compared to other text classification systems, topic models are very well suited for interpretable predictions since topics are abstract entities that humans can grasp. The problems of interpretability in multi-class supervised topic models can be divided into three main areas.
First, most supervised topic models use a logit or probit approach where the model is identified by the use of a reference category, to which the effect of any covariate is compared. This defeats one of the main purposes of supervised topic models, since it complicates the interpretation of the model.
Second, to handle multi-class categorization a topic should be able to affect multiple classes, and some topics may not influence any class at all. In most supervised topic modeling approaches (such as Jiang et al. 2012; Zhu et al. 2013; Jameel et al. 2015) the multi-class problem is solved using binary classifiers in a "one-vs-all" classification approach. This approach works well in the situation of evenly distributed classes, but may not work well for skewed class distributions (Rubin et al., 2012). A one-vs-all approach also makes it more difficult to interpret the model. Estimating one model per class makes it impossible to see which classes are affected by the same topic and which topics do not predict any label. In these situations we would like to have a single topic model to interpret. One-vs-all predictions are also costly from an estimation point of view, since we need to estimate one model per class (Zheng et al., 2015).
Third, there can be situations with hundreds of classes and hundreds of topics (see Jonsson et al. (2016) for an example). Without regularization or variable selection we would end up with a model with too many parameters to interpret and very uncertain parameter estimates. In a good predictive supervised topic model one would like to find a small set of topics that are strong determinants of a single document class label. This is especially relevant when the number of observations is skewed across classes, a problem common in real world situations (Rubin et al., 2012). For the rarer classes we would like to induce more shrinkage, while for classes with more data we would like to have less shrinkage in the model.
Multi-class regression is a non-trivial problem in Bayesian modeling. Historically, the multinomial probit model has been preferred due to the data augmentation approach proposed by Albert and Chib (1993). Augmenting the sampler with latent variables leads to straightforward Gibbs sampling with conditionally conjugate updates of the regression coefficients. The Albert-Chib sampler often tends to mix slowly, and the same holds for improved samplers such as the parameter expansion approach in Imai and van Dyk (2005). More recently, a similar data augmentation approach using Pólya-Gamma variables was proposed for the Bayesian logistic regression model by Polson et al. (2013). This approach preserves conditional conjugacy in the case of a Normal prior for the regression coefficients and has been the foundation for the supervised topic model in Zhu et al. (2013).
In this paper we explore a new approach to supervised topic models that produces accurate multi-class predictions from semantically interpretable topics using a fully Bayesian approach, hence solving all three of the above mentioned problems. The model combines LDA with the recently proposed Diagonal Orthant (DO) probit model (Johndrow et al., 2013) for multi-class classification and an efficient Horseshoe prior that achieves sparsity and interpretability by aggressive shrinkage (Carvalho et al., 2010). The new Diagonal Orthant Latent Dirichlet Allocation (DOLDA)¹ model is demonstrated to have competitive predictive performance while producing interpretable multi-class predictions from semantically relevant topics.

Handling the challenges for high-dimensional interpretable supervised topic models.
To solve the first and second challenges identified in the Introduction, reference classes and multi-class models, we propose to use the Diagonal Orthant (DO) probit model of Johndrow et al. (2013) as an alternative to the multinomial probit and logit models. Johndrow et al. (2013) propose a Gibbs sampler for the model and show that it mixes well. One of the benefits of the DO model is that all classes can be modeled independently using binary probit models when conditioning on the latent variable, thereby removing the need for a reference class. The parameters of the model can be interpreted as the effect of the covariate on the marginal probability of a specific class, which makes this model especially attractive when it comes to interpreting the inferred topics. This model also includes multiple classes in an efficient way that makes it possible to estimate a multi-class linear model in parallel over the classes.

Notation (from Table 1): Φ is the matrix with word-topic probabilities; its k-th row holds the word probabilities for topic k.

¹ DOLDA is Swedish for hidden or latent.
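As an illustration of why no reference class is needed, the following minimal sketch computes class probabilities under the diagonal orthant construction, assuming (as in Johndrow et al., 2013) that class l is observed exactly when its latent variable is positive and all other latent variables are negative. The function names and the example inputs are hypothetical and are not taken from the DOLDA code base.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def do_probit_probs(linear_predictors):
    """Class probabilities for a diagonal orthant probit model (sketch).

    linear_predictors[l] is the linear predictor of class l, e.g. the
    covariate vector times eta_l. With a_l ~ N(predictor_l, 1), the
    unnormalized weight of class l is
        P(a_l > 0) * prod_{j != l} P(a_j < 0),
    and the weights are normalized to sum to one.
    """
    pos = [std_normal_cdf(psi) for psi in linear_predictors]   # P(a_l > 0)
    neg = [std_normal_cdf(-psi) for psi in linear_predictors]  # P(a_l < 0)
    weights = []
    for l in range(len(pos)):
        w = pos[l]
        for j in range(len(neg)):
            if j != l:
                w *= neg[j]
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical linear predictors for three classes.
    print(do_probit_probs([2.0, -1.0, -1.0]))
```

Every class has its own coefficient vector in this construction, so each coefficient can be read directly as pushing the probability of its own class up or down.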
The third problem of modeling supervised topic models is that not all topics necessarily have an effect on our label of interest; one topic may have an effect on one or more classes, and some topics may just be noise that we do not want to use in the supervision. In the situation with many topics and many classes we will also have a very large number of parameters to analyze. The Horseshoe prior in Carvalho et al. (2010) was specifically designed to filter out signals from massive noise. This prior uses a local-global shrinkage approach to shrink some (or most) coefficients to zero while allowing for sparse signals to be estimated without any shrinkage. This approach has shown good performance in linear regression type situations (Castillo et al., 2015), something that makes it straightforward to incorporate other covariates into our model, which is rarely done in the area of supervised topic models. Different global shrinkage parameters are used for the different classes to handle the problem of unbalanced numbers of observations in different classes. This makes it possible to shrink more when there are less data for a given class and shrink less in classes with more observations.

Generative model.
The generative model is described below. See also the graphical description of the model in Figure 2.1. A summary of the notation is given in Table 1.
(1) For each topic k = 1, ..., K
    (a) Draw a distribution over words [...]
[...]
The intercept is estimated using a normal prior.
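The local-global shrinkage idea behind the Horseshoe prior can be illustrated with a small simulation. The sketch below assumes the standard Horseshoe form of Carvalho et al. (2010), where each coefficient is normal with standard deviation equal to a local half-Cauchy scale times a global scale, and it treats the class-specific global scales as given inputs. The function names and the chosen scales are hypothetical.

```python
import math
import random


def half_cauchy(rng: random.Random) -> float:
    """Draw from a standard half-Cauchy C+(0, 1) by inverse-CDF sampling."""
    return math.tan(0.5 * math.pi * rng.random())


def draw_horseshoe_coefficients(taus, n_coef, rng):
    """Draw one coefficient vector per class from a Horseshoe prior (sketch).

    taus[l] is the global scale of class l (assumed known here);
    every coefficient gets its own local scale lambda ~ C+(0, 1) and is
    then drawn as N(0, (lambda * tau_l)^2).
    """
    eta = []
    for tau in taus:
        row = []
        for _ in range(n_coef):
            lam = half_cauchy(rng)
            row.append(rng.gauss(0.0, lam * tau))
        eta.append(row)
    return eta


if __name__ == "__main__":
    rng = random.Random(0)
    # A small global scale (strong shrinkage) and a large one (weak shrinkage).
    eta = draw_horseshoe_coefficients([0.01, 1.0], 5, rng)
    for row in eta:
        print(row)
```

A class with a small global scale has most coefficients pulled close to zero, while the heavy-tailed local scales still allow an occasional large coefficient to escape the shrinkage.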

INFERENCE
The MCMC algorithm.
Markov Chain Monte Carlo (MCMC) is used to estimate the model parameters. We use different global shrinkage parameters $\tau_l$ for each class, motivated by the fact that the different classes can have different numbers of observations. This gives the following sampler for inference in DOLDA.
(1) Sample the latent variables $a_{d,l} \sim N_{+}((x\ \bar{z})_d^{T}\eta_l, 1)$ for $l = y_d$ and $a_{d,l} \sim N_{-}((x\ \bar{z})_d^{T}\eta_l, 1)$ for $l \neq y_d$, where $(x\ \bar{z})_d$ is the vector of covariates and topic proportions for document $d$, and $N_{+}$ and $N_{-}$ are the positive and negative truncated normal distributions, truncated at 0.
(2) Sample all the regression coefficients as in an ordinary Bayesian linear regression per class label $l$, where $\eta_l \sim \mathrm{MVN}\left(\mu_l, ((X\ \bar{z})^{T}(X\ \bar{z}) + \tau_l^{2}\Lambda_l)^{-1}\right)$, $\Lambda_l$ is a diagonal matrix with the local shrinkage parameters $\lambda_i$ per parameter in $\eta_l$, and $\mu_l = ((X\ \bar{z})^{T}(X\ \bar{z}) + \tau_l^{2}\Lambda_l)^{-1}(X\ \bar{z})^{T}a_l$.
(3) Sample the global shrinkage parameters $\tau_l$ at iteration $j$ using the following two-step slice sampling: [...] where $I$ indicates the truncation region for the truncated gamma.
(4) Sample each local shrinkage parameter $\lambda_{i,l}$ as [...]
(5) Sample the topic indicators $z$ [...] where $n^{(d)}$ is a $D \times K$ count matrix containing the sufficient statistics for $\Theta$.
(6) Sample the topic-vocabulary distributions $\Phi$ [...] where $n^{(w)}$ is a $K \times V$ count matrix containing the sufficient statistics for $\Phi$.
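Step (1) of the sampler can be sketched with the standard library alone. The example below is a minimal illustration, assuming unit-variance latent variables truncated at 0 and using inverse-CDF sampling; it is not the sampler used in the published code, and the function names are hypothetical. The clamping of the uniform draw makes the sketch unreliable for linear predictors far from zero (roughly beyond 7 in absolute value), where a tail-aware method would be needed.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf and inverse cdf


def sample_truncated_normal(mean, positive, rng):
    """Draw from N(mean, 1) restricted to (0, inf) or (-inf, 0).

    Writing a = mean + z with z ~ N(0, 1), a < 0 holds exactly when
    Phi(z) < Phi(-mean), so the uniform draw is restricted accordingly.
    Assumption: |mean| is moderate, see the note on clamping below.
    """
    p_neg = _STD.cdf(-mean)  # P(N(mean, 1) < 0)
    lo, hi = (p_neg, 1.0) if positive else (0.0, p_neg)
    u = lo + (hi - lo) * rng.random()
    # inv_cdf requires 0 < u < 1; clamping is only safe for moderate means.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + _STD.inv_cdf(u)


def sample_latent_row(linear_predictors, label, rng):
    """Step (1) for one document: positive for the label, negative otherwise."""
    return [
        sample_truncated_normal(m, positive=(l == label), rng=rng)
        for l, m in enumerate(linear_predictors)
    ]


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical linear predictors for three classes; observed label is 1.
    print(sample_latent_row([0.3, -1.2, 2.0], label=1, rng=rng))
```

Given the latent row, each class can then be treated as its own linear regression on the latent variable, which is what makes step (2) parallel over classes.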

Efficient parallel sampling of z.
To improve the speed of the sampler we cache the calculations done in the supervised part of the topic indicator sampler and parallelize the sampler. Some commonly used text corpora have several hundreds of millions of topic indicators, so efficient sampling of z is absolutely crucial in practical applications. The basic sampler for z can be slow due to the serial nature of the collapsed sampler and the fact that the supervised part of $p(z_{i,d})$ involves a dot product.
The supervised part of document d can be expressed as $\exp(g^{\neg i}_{d,k})$, where [...]. By realizing that sampling a topic indicator just means updating a small part of this equation, we can derive the relationship [...], where the expression $\sum_{l}^{L}\eta_{l,k}\eta_{l,z_i}$ can be calculated once per iteration in η and be stored in a two-dimensional array of size $K^2$. We can then use the above relationship to update the supervision after sampling each topic indicator by calculating $g^{\neg i}_{d,k}$ "on the fly" based on the previous supervised contribution g in the following way: [...]. Caching $g^{\neg i}_{d,k}$ leads to an order of magnitude speed up for a model with 100 topics.
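The cached term can be illustrated as follows. This minimal sketch only shows how the sum over classes of products of coefficients can be precomputed once per draw of η and stored in a K by K table; it does not reproduce the full update of the supervised contribution g. The layout `eta[l][k]` (class l, topic k) and the function name are assumptions made for the example.

```python
def build_eta_cache(eta):
    """Precompute cache[k][j] = sum over classes l of eta[l][k] * eta[l][j].

    eta is a list with one row per class, each row holding the K topic
    coefficients of that class. The K x K result only has to be rebuilt
    when eta is resampled, so each lookup during the sampling of a topic
    indicator replaces a dot product over all classes.
    """
    n_topics = len(eta[0])
    cache = [[0.0] * n_topics for _ in range(n_topics)]
    for row in eta:
        for k in range(n_topics):
            for j in range(n_topics):
                cache[k][j] += row[k] * row[j]
    return cache


if __name__ == "__main__":
    # Hypothetical coefficients for two classes and two topics.
    eta = [[1.0, 2.0], [3.0, -1.0]]
    cache = build_eta_cache(eta)
    print(cache[0][1])  # lookup for candidate topic 0 and current topic 1
```

Building the table costs on the order of L times K squared operations once per iteration, after which every lookup is constant time.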
To further improve the performance we parallelize the sampler and exploit the fact that documents are conditionally independent given Φ. By sampling Φ, instead of marginalizing it out, we reduce the efficiency of the MCMC somewhat, but we will converge to the true posterior and the gain from parallelization is usually far greater than the reduced efficiency (Magnusson et al., 2015).
In summary, we have the following sampler for $z_{i,d}$: [...] The topic indicators can be sampled in parallel over the documents, and the elements in Φ can be sampled in parallel over topics. The code is publicly available at https://github.com/lejon/DiagonalOrthantLDA. It is also straightforward to use the recently proposed cyclical Metropolis-Hastings proposals in Zheng et al. (2015) for inference in DOLDA. The additional sampling of $\lambda_{p,l}$ and $\tau_l$ in our model can be done in O(K + P) and hence does not affect the overall complexity of the sampler. But, as shown in Magnusson et al. (2015), it is not obvious that the reduction in sampling complexity will result in faster sampling when MCMC efficiency is taken into account.

Evaluation of convergence and prediction.
We evaluate the convergence of the MCMC algorithm by monitoring the log-likelihood over the iterations:

LDA marginal LL
To make predictions for a new document we first need to sample the topic indicators of the given document from [...], where $\bar{\phi}_{k,v}$ is the mean of the last part of the posterior draws of Φ. We use the posterior mean based on the last iterations instead of integrating out Φ to avoid potential problems with label switching. However, we have not seen any indications of label switching after convergence in our experiments, probably because the data sets used for document predictions are usually quite large. The topic indicators are sampled for the predicted document using the fast PC-LDA sampler in Magnusson et al. (2015). The mean of the sampled topic indicator vector for the predicted document, $\bar{z}_{new}$, is then used for class predictions: [...] This is a maximum a posteriori estimate, but it is straightforward to calculate the whole predictive distribution for the label.

We collected a dataset containing the 8648 highest rated movies at IMDb.com. We use both the textual description and information about producers and directors to classify a given movie to a genre. We also analyze the classical 20 Newsgroups dataset to compare the accuracy with state-of-the-art supervised models. Our companion paper (Jonsson et al., 2016) applies the DOLDA model developed here to bug localization in a large scale software engineering context using a dataset with 15 000 bug reports, each belonging to one of 118 classes. We evaluate the proposed topic model with regard to accuracy and distribution of regression coefficients. The experiments are performed on 2 sockets with 8-core Intel Xeon E5-2660 Sandy Bridge processors at 2.2 GHz and 32 GB DDR3 1600 memory at the National Supercomputer Centre (NSC) at Linköping University.

Data and priors.
The datasets are tokenized and a standard stop list of English words is removed, as well as the rarest word types, which together make up 1% of the total tokens; we only include genres with at least 10 movies.
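The vocabulary pruning described above can be sketched as follows. The example assumes one particular reading of the rare-word rule: word types are removed from rarest to most frequent for as long as their combined token count stays within 1% of the tokens that remain after stop word removal. The function name, the tie-breaking rule and the toy input are hypothetical.

```python
from collections import Counter


def prune_vocabulary(documents, stop_words, rare_fraction=0.01):
    """Remove stop words and the rarest word types (sketch).

    documents is a list of token lists. After stop words are dropped,
    word types are removed from rarest to most frequent while their
    combined count stays within rare_fraction of the remaining tokens.
    """
    docs = [[t for t in doc if t not in stop_words] for doc in documents]
    counts = Counter(t for doc in docs for t in doc)
    budget = rare_fraction * sum(counts.values())
    removed, used = set(), 0
    # Rarest first; ties are broken alphabetically to keep the result stable.
    for word, c in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        if used + c > budget:
            break
        removed.add(word)
        used += c
    return [[t for t in doc if t not in removed] for doc in docs]


if __name__ == "__main__":
    # Hypothetical toy corpus: one document with a single very rare word.
    doc = ["the"] + ["space"] * 60 + ["alien"] * 39 + ["zzz"]
    pruned = prune_vocabulary([doc], stop_words={"the"}, rare_fraction=0.02)
    print(len(pruned[0]))
```

Other readings of the rule are possible, for example a fixed minimum count per word type, so the threshold logic should be checked against the published code before reuse.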
In all experiments we used relatively vague priors, setting α = β = 0.01 for the LDA part of the model and c = 100 for the prior variance of the η coefficients in the normal model and for the intercept coefficient when using the Horseshoe prior. The accuracy experiment for IMDb was conducted using 5-fold cross validation, and the 20 Newsgroups corpus used the same train and test set as in Zhu et al. (2012) to enable direct comparisons of accuracy. In the analysis of the IMDb dataset no cross-validation was conducted; instead the whole data set was used for estimation.

Results.
20 Newsgroups. Figure 4.1 displays the accuracy on the hold-out test set for the 20 Newsgroups dataset for different numbers of topics. The accuracy of our model is slightly lower than MedLDA and SVM using only textual features, but higher than both the classical supervised multi-class LDA and the ordinary LDA together with an SVM approach.
We can also see from Figure 4.1 that the DOLDA model with the topics jointly estimated with the supervision part outperforms a two-step approach of first estimating LDA and then using the DO probit model with the pre-estimated mean topic indicators as covariates. This is true for both the Horseshoe prior and the normal prior, but the difference is just a few percent in accuracy.
The advantage of DOLDA is that it produces interpretable predictions with semantically relevant topics. It is therefore reassuring that DOLDA can compete in accuracy with other less interpretable models, even when the model is dramatically simplified by aggressive Horseshoe shrinkage for interpretational purposes. Our next data set illustrates the interpretational strength of DOLDA. See also our companion paper (Jonsson et al., 2016) in the software engineering literature for further demonstrations of DOLDA's ability to produce interpretable predictions in industrial applications without sacrificing prediction accuracy.

IMDb. Figure 4.2 displays the accuracy on the IMDb dataset as a function of the number of topics. The estimated DOLDA model also contains several other discrete covariates, such as the film's director and producer. The more aggressive Horseshoe prior gives somewhat higher accuracy than the normal prior for all numbers of topics. A supervised approach with topics and supervision inferred jointly again outperforms a two-step approach. Incorporating the Horseshoe prior also lets us handle many additional covariates, since the shrinkage prior acts as a type of variable selection.
To illustrate the interpretation of DOLDA we fit a new model using only topics as covariates. Note first in Figure 4.3 how the Horseshoe prior is able to distinguish between so-called signal topics and noise topics; the Horseshoe prior aggressively shrinks a large fraction of the regression coefficients toward zero. This is achieved without the need to set any hyper-parameters.
The Horseshoe shrinkage makes it easy to identify the topics that affect a given class. This is illustrated for the Romance genre in the IMDb dataset in Figure 4.4. This genre consists of relatively few observations (only 39 movies), and the Horseshoe prior therefore shrinks most coefficients to zero, keeping only one large signal topic that happens to have a negative effect on the Romance genre. The normal prior, on the other hand, gives a much denser and therefore much less interpretable solution.
Digging deeper into the interpretation of what triggers a Romance genre prediction, Table 3 shows the 10 top words for Topic 39. From this table it is clear that the signal topic identified using the Horseshoe prior is some sort of "crime" topic that is negatively correlated with the Romance genre, something that makes intuitive sense. The crime topic is clearly expected to be positively related to the Crime genre, and Figure 4.5 shows that this is indeed the case.
Topic 33    earth space planet alien human future years world time mission
Topic 39    police murder detective killer case investigation crime crimes solve murdered
TABLE 3. Top words in topics using the Horseshoe prior.

We can also see from Figure 4.5 that Topic 33 is negatively correlated with the Crime genre. In Table 3 we can see that Topic 33 seems to be some sort of Sci-Fi topic containing top words such as "space", "alien" and "future". This topic has the largest positive correlation with the Sci-Fi movie genre, which again makes intuitive sense.
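A top-word list like the one in Table 3 can be read off one row of the topic-word matrix Φ. The following minimal sketch assumes that a topic is given as a list of word probabilities aligned with a vocabulary list; the function name and the toy numbers are hypothetical and are not estimates from the IMDb model.

```python
def top_words(phi_row, vocabulary, n=10):
    """Return the n word types with the highest probability in one topic."""
    ranked = sorted(range(len(phi_row)), key=lambda v: phi_row[v], reverse=True)
    return [vocabulary[v] for v in ranked[:n]]


if __name__ == "__main__":
    # Hypothetical four-word vocabulary and one topic over it.
    vocabulary = ["time", "police", "years", "murder"]
    phi_row = [0.10, 0.50, 0.05, 0.35]
    print(top_words(phi_row, vocabulary, n=2))
```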

CONCLUSIONS
Several supervised topic models have recently been proposed with the purpose of identifying topics that can successfully be used to classify documents. We have here proposed DOLDA, a supervised topic model with special emphasis on generating semantically interpretable predictions. An important component of the model to ease interpretation is the DO-probit model without a reference class. By coupling the DO-probit model with an aggressive Horseshoe prior with shrinkage that is allowed to vary over the different classes, it is possible to create a highly interpretable classification model. At the same time the DOLDA model comes with very few hyper-parameters that need tuning, in contrast to many other supervised topic models such as those of Jiang et al. (2012), Zhu et al. (2012) and Li et al. (2015). Our experiments show that the gain in interpretation from using DOLDA comes with only a small reduction in prediction accuracy compared to the state-of-the-art supervised topic models; moreover, DOLDA outperforms other fully Bayesian models such as the original supervised LDA model. It is also clearly shown that learning the topics jointly with the classification part of the model gives more accurate predictions than a two-step approach where a topic model is first estimated and a classifier is then trained on the learned topics.
TABLE 1. Notation.
Symbol           Description
[...]            The prior for Θ: D × K
M                # of topic indicators in each document by topic: D × K
a                Matrix of latent Gaussian variables: D × L
η                Coefficient matrix for each label and covariate: (K + P) × L
η_0              Prior for η: L × (K + P)
z_{n,d}          Topic indicator for token n in document d
$\bar{z}$        Proportion of topic indicators by document: D × K
w_{n,d}          Token n in document d
w_d              Vector of tokens in document d: 1 × N_d
y_d              Label for document d
K                The number of topics, i.e. K = |$\mathcal{K}$|
L                The number of labels/categories
$\mathcal{L}$    The set of labels/categories
D                # of observations/documents, i.e. D = |$\mathcal{D}$|
$\mathcal{D}$    The set of observations/documents
P                The number of non-textual covariates/features
N                The total number of tokens
N_d              The number of tokens in document d
N                # obs topic-word type indicators: K × V

FIGURE 2.1. The Diagonal Orthant probit supervised topic model (DOLDA).

FIGURE 4.2.Accuracy for DOLDA on the IMDb data with normal and Horseshoe prior and using a two step approach with the Horseshoe prior.

FIGURE 4.3.Coefficients for the IMDb dataset with 80 topics using the normal prior (left) and the Horseshoe prior (right).

FIGURE 4.4.Coefficients for the genre Romance in the IMDb dataset with 80 topics using the Horseshoe prior (upper) and a normal prior (below).

FIGURE 4.5.Regression coefficients for the class Crime for the IMDb dataset with 80 topics using the Horseshoe prior (upper) and a normal prior (below).