Hierarchical Dirichlet scaling process
DOI: 10.1007/s10994-016-5621-5
- Cite this article as:
- Kim, D. & Oh, A. Mach Learn (2017) 106: 387. doi:10.1007/s10994-016-5621-5
Abstract
We present the hierarchical Dirichlet scaling process (HDSP), a Bayesian nonparametric mixed membership model. The HDSP generalizes the hierarchical Dirichlet process to model the correlation structure between metadata in the corpus and mixture components. We construct the HDSP based on the normalized gamma representation of the Dirichlet process, and this construction allows incorporating a scaling function that controls the membership probabilities of the mixture components. We develop two scaling methods to demonstrate that different modeling assumptions can be expressed in the HDSP. We also derive the corresponding approximate posterior inference algorithms using variational Bayes. Through experiments on datasets of newswire, medical journal articles, conference proceedings, and product reviews, we show that the HDSP results in a better predictive performance than labeled LDA, partially labeled LDA, and author topic model and a better negative review classification performance than the supervised topic model and SVM.
Keywords
Topic modeling, Dirichlet process, Hierarchical Dirichlet process
1 Introduction
The hierarchical Dirichlet process (HDP) is an important nonparametric Bayesian prior for mixed membership models, and the HDP topic model is useful for a wide variety of tasks involving unstructured text (Teh et al. 2006). To extend the HDP topic model, there has been active research in dependent random probability measures as priors for modeling the underlying association between the latent semantic structure and covariates, such as time stamps and spatial coordinates (Ahmed and Xing 2010; Ren et al. 2011).
A large body of this research is rooted in the dependent Dirichlet process (DDP) (MacEachern 1999) where the probabilistic random measure is defined as a function of covariates. Most DDP approaches rely on the generalization of Sethuraman’s stick breaking representation of DP (Sethuraman 1991), incorporating the time difference between two or more data points, the spatial difference among observed data, or the ordering of the data points into the predictor dependent stick breaking process (Duan et al. 2007; Dunson and Park 2008; Griffin and Steel 2006). Some of these priors can be integrated into the hierarchical construction of DP (Srebro and Roweis 2005), resulting in topic models where temporally- or spatially-proximate data are more likely to be clustered. These existing DP approaches, however, cannot be easily extended to model underlying topics of a document collection. One reason is that the extension requires developing a new tractable inference algorithm for models with intractable posterior distributions.
We suggest the hierarchical Dirichlet scaling process (HDSP) as a new way of modeling a corpus with various types of covariates such as categories, authors, and numerical ratings. The HDSP models the relationship between topics and covariates by generating dependent random measures in a hierarchy, where the first level is a Dirichlet process, and the second level is a Dirichlet scaling process (DSP). The first level DP is constructed in the traditional way of a stick breaking process, and the second level DSP with a normalized gamma process. With the normalized gamma process, each topic proportion of a document is independently drawn from a gamma distribution and then normalized. Unlike the stick breaking process, the normalized gamma process keeps the same order of the atoms as the first level measure, which allows the topic proportions in the random measure to be controlled. The DSP then uses that controllability to guide the topic proportions of a document by replacing the rate parameter of the gamma distribution with a scaling function that defines the correlation structure between topics and labels. The choice of the scaling function reflects the characteristics of the corpus. We show two scaling functions, the first one for a corpus with categorical labels, and the second for a corpus with both categorical and numerical labels.
The HDSP models the topic proportions of a document as a dependent variable of observable side information. This modeling approach differs from the traditional definition of a generative process where the observable variables are generated from a latent variable or parameter. For example, Zhu et al. (2009) and Mcauliffe and Blei (2007) propose generative processes where the observable labels are generated from a topic proportion of a document. However, a more natural model of the human writing process is to decide what to write about (e.g., categories) before writing the content of a document. This same approach is also successfully demonstrated in Mimno and McCallum (2012).
The outline of this paper is as follows. In Sect. 2, we describe related work and position our work within the topic modeling literature. In Sect. 3, we describe the gamma process construction of the HDP and how scale parameters are used to develop the HDSP with two different scaling functions. In Sect. 4, we derive a variational inference for the latent variables. In Sect. 5, we verify our approach on a synthetic dataset and demonstrate the improved predictive power on real world corpora. In Sect. 6, we discuss our conclusions and possible directions for future work.
2 Related work
For model construction, the model most closely related to HDSP is the discrete infinite logistic normal (DILN) model (Paisley et al. 2012) in which the correlations among topics are modeled through the normalized gamma construction. DILN allocates a latent location for each topic in the first level, and then draws the second level random measures from the normalized gamma construction of the DP. Those random measures are then scaled by an exponentiated Gaussian process defined on the latent locations. DILN is a nonparametric counterpart of the correlated topic model (Blei and Lafferty 2007) in which the logistic normal prior is used to model the correlations between topics. The HDSP is also constructed through the normalized gamma distribution with an informative scaling parameter, but our goal in HDSP is to model the correlations between topics and labels. The doubly correlated nonparametric topic model (DCNT) proposed by Kim and Sudderth (2011) also takes documents’ metadata into account to model the correlation among topics and metadata. Unlike the HDSP, the DCNT is constructed through a logistic stick-breaking process (Ren et al. 2011), which was originally proposed for modeling contiguous and spatially localized segments.
The Dirichlet-multinomial regression topic model (DMR-TM) (Mimno and McCallum 2012) also models the label dependent topic proportions of documents, but it is a parametric model. The DMR-TM places a log-linear prior on the parameter of the Dirichlet distribution to incorporate arbitrary types of observed labels. The DMR-TM takes the “upstream” approach in which the latent variable or latent topics are conditionally generated from the observed label information. The author-topic model (Rosen-Zvi et al. 2004) also takes the same approach, but it is a specialized model for authors of documents. Unlike the “downstream” generative approach used in the supervised topic model (Mcauliffe and Blei 2007), the maximum margin topic model (Zhu et al. 2009), and the relational topic model (Chang and Blei 2009), the upstream approach does not require specifying the probability distribution over all possible values of observed labels.
The HDSP is a new way of constructing a dependent random measure in a hierarchy. In the field of Bayesian nonparametrics, the introduction of DDP (MacEachern 1999) has led to increased attention in constructing dependent random measures. Most such approaches develop priors to allow covariate dependent variation in the atoms of the random measure (Gelfand et al. 2005; Rao and Teh 2009) or in the weights of atoms (Griffin and Steel 2006; Duan et al. 2007; Dunson and Park 2008). These priors replace the first level of the HDP to incorporate a document-specific covariate for generating a dependent topic proportion. The HDSP allows covariate dependent variation in the weights of atoms, where the variation is controlled by the scaling function that defines the correlation between atoms and labels. A proper definition of the scaling function gives the flexibility to model various types of labels.
Several topic models for labeled documents use the credit attribution approach where each observed word token is assigned to one of the observed labels. Labeled LDA (L-LDA) allocates one dimension of the topic simplex per label and generates words from only the topics that correspond to the labels in each document (Ramage et al. 2009). An extension of this model, partially labeled LDA (PLDA), adds more flexibility by allocating a pre-defined number of topics per label and including a background label to handle documents with no labels (Ramage et al. 2011). The Dirichlet process with mixed random measures (DP-MRM) is a nonparametric topic model which generates an unbounded number of topics per label but still excludes topics from labels that are not observed in the document (Kim et al. 2012).
3 Hierarchical Dirichlet scaling process
In this section, we describe the hierarchical Dirichlet scaling process (HDSP). First we review the HDP with an alternative construction using the normalized gamma process construction for the second level DP. We then present the HDSP where the second level DP is replaced by Dirichlet scaling process (DSP). Finally, we describe two scaling functions for the DSP to incorporate categorical and numerical labels.
3.1 The normalized gamma process construction of HDP
3.2 Hierarchical Dirichlet scaling process
3.3 Scaling functions
We now propose two scaling functions to express the correlation between topics and labels of documents. A scaling method is defined by two factors: (1) a proper prior over the scaling parameter \(w_k\), and (2) a plausible scaling function relating the topic-specific scaling parameter \(w_k\) to the observed labels \(r_m\) of a document.
The choice of scaling function reflects the modeler’s perspective with respect to the underlying relationship between topics and labels. The first scaling function scales each topic by the product of the scaling parameters of the observed labels. This reflects the modeler’s assumption that a document with a set of observed labels is likely to exhibit topics that have high correlation with all of the observed labels. With the second scaling function, the scaling weight changes exponentially as the value of label changes. This reflects the modeler’s assumption that two documents with the same set of observed labels but with different values are likely to exhibit different topics.
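The following is a minimal sketch of how a scaling function could enter the normalized gamma construction described in the text. The functional forms of `product_scaling` and `exponential_scaling`, the parameter distributions, and all names are assumptions made for illustration; the exact definitions and priors are not reproduced here. The sketch relies only on the standard fact that multiplying a unit-rate gamma draw by a weight s is equivalent to drawing from a gamma distribution with rate 1/s.

```python
import numpy as np

def scaled_topic_proportions(p, beta, weights, rng):
    """Draw document-level topic proportions via a normalized gamma construction.

    p       : (K,) truncated corpus-level topic weights, summing to 1
    beta    : concentration parameter of the second-level process
    weights : (K,) positive per-topic scaling weights for this document
    """
    # Unit-rate gamma draws; multiplying by a weight s gives a gamma
    # variable with rate 1/s, so the weight acts through the rate parameter.
    raw = rng.gamma(shape=beta * p, scale=1.0) * weights
    return raw / raw.sum()

def product_scaling(w, r):
    """Assumed form of the first scaling function: product of w_kj over observed labels.

    w : (K, J) positive topic-label scaling parameters
    r : (J,) binary label indicators for one document
    """
    return np.prod(w ** r, axis=1)

def exponential_scaling(w, r):
    """Assumed form of the second scaling function: exponential of a linear term in the label values."""
    return np.exp(w @ r)

rng = np.random.default_rng(0)
K, J = 5, 3
p = rng.dirichlet(np.ones(K))
w_pos = rng.gamma(2.0, 1.0, size=(K, J))   # hypothetical positive parameters
w_real = rng.normal(size=(K, J))           # hypothetical real-valued parameters
r_binary = np.array([1.0, 0.0, 1.0])       # categorical labels only
r_mixed = np.array([1.0, 0.0, 4.0])        # e.g. a category flag and a numerical rating

theta1 = scaled_topic_proportions(p, 10.0, product_scaling(w_pos, r_binary), rng)
theta2 = scaled_topic_proportions(p, 10.0, exponential_scaling(w_real, r_mixed), rng)
```

Under the assumed product form, a document with no observed labels gets a weight of 1 for every topic, so its proportions reduce to an unscaled normalized gamma draw.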
3.4 HDSP as a dependent Dirichlet process
Similarly, the HDSP constructs a dependent random measure with covariates. However, unlike the DDP-DP approach, \(G_0\) is no longer a function of covariates. The HDSP defines a single global random measure \(G_0\) and then scales \(G_0\) based on the covariates with the scaling function. With a proper, but relatively simple, scaling function that reflects the correlation between covariates and topics, the HDSP models any structures or types of covariates, whereas the DDP requires a complex dependent process for different types of covariates (Griffin and Steel 2006).
4 Variational inference for HDSP
Approximate posterior inference is essential for Bayesian nonparametric models because the exact posterior over an infinite dimensional space is intractable to compute. Approximation algorithms, such as marginalized MCMC (Escobar and West 1995; Teh et al. 2006) and variational inference (Blei and Jordan 2006; Teh et al. 2008), have been developed for the Bayesian nonparametric mixture models. We develop a mean field variational inference (Jordan et al. 1999; Wainwright and Jordan 2008) algorithm for approximate posterior inference of the HDSP topic model. The objective of variational inference is to minimize the KL divergence between a distribution over the hidden variables and the true posterior, which is equivalent to maximizing the lower bound of the marginal log likelihood of observed data.
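The equivalence between minimizing the KL divergence and maximizing the lower bound follows from a standard decomposition of the marginal log likelihood. As a brief sketch, write x for the observed data, z for all hidden variables, and q for the variational distribution (generic symbols chosen for illustration, not the notation of the model):

\[
\log p(x) = \underbrace{\mathbb{E}_{q}[\log p(x, z)] - \mathbb{E}_{q}[\log q(z)]}_{\text{lower bound}} + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
\]

Because \(\log p(x)\) does not depend on q and the KL term is nonnegative, any increase in the lower bound decreases the KL divergence by the same amount.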
In this section, we first derive the inference algorithm for the first scaling function with a fully factorized variational family. Variational inference algorithms can be easily modularized with the fully factorized variational family, and the variation in a model only affects the update rules for the modified parts of the model. Therefore, for the second scaling function, we only need to update the part of the inference algorithm related to the new scaling function.
4.1 Variational inference for the first scaling function
Corpus-level updates. At the corpus level, we update the variational distribution for the scaling parameter \(w_{kj}\), corpus level stick length \(V_k\) and word topic distribution \(\eta _{ki}\).
4.2 Variational inference for the second scaling function
Alternative scaling functions may be appropriate depending on the characteristics of the dataset. Introducing a new scaling function requires a new inference algorithm, and deriving it can be cumbersome. Recently, several approaches have been proposed to bypass the complex derivation of variational updates (Kingma and Welling 2014; Ranganath et al. 2014; Tran et al. 2016). Most of these approaches rely on re-parameterization tricks and stochastic updates with random samples from variational distributions. Although these methods give unbiased stochastic estimates for updating the variational parameters, they sometimes suffer from high variance of the samples, especially when they are applied to the whole ELBO (Ranganath et al. 2014). We suggest inferring the scaling-irrelevant parameters with the provided variational updates and the scaling-relevant parameters with these black-box techniques, which reduces the potentially high variance of these approaches.
5 Experiments
In this section, we describe how the HDSP performs with real and synthetic data. We fit the HDSP topic model with three different types of data and compare the results with several comparison models. First, we test the model with synthetic data to verify the approximate inference. Second, we train the model with categorical data whose label information is represented by binary values. Third, we train the model with mixed-type data whose label information has both numerical and categorical values.
5.1 Synthetic data
There is no naturally-occurring dataset with the observable weights between topics and labels, so we synthesize data based on the model assumptions to verify our model and the approximate inference. First, we check the difference between the original topics and the inferred topics via simple visualization. Then, we focus on the differences between the inferred and synthetic weights. For all experiments with synthetic data, the datasets are generated by following the model assumptions with the first scaling function, and the posterior inferences are done with the first scaling function. We set the truncation level T at twice the number of topics. We terminate the variational inference when the fractional change of the lower bound falls below \(10^{-3}\), and we average all results over 10 individual runs with different initializations.
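The stopping rule described above can be sketched as follows. The function name `run_until_converged`, the `step` callable (assumed to perform one round of variational updates and return the current lower bound), and the `max_iter` cap are hypothetical and introduced only for illustration.

```python
def run_until_converged(step, tol=1e-3, max_iter=1000):
    """Call `step()` repeatedly until the fractional change of the bound is below `tol`.

    `step` is assumed to run one pass of updates and return the lower bound.
    Returns the last bound and the number of calls made to `step`.
    """
    previous = step()
    for iteration in range(1, max_iter):
        current = step()
        # Fractional change relative to the previous bound; the small
        # constant guards against division by zero.
        change = abs(current - previous) / max(abs(previous), 1e-300)
        if change < tol:
            return current, iteration + 1
        previous = current
    return previous, max_iter

# Toy sequence of lower-bound values standing in for real variational updates.
bounds = iter([-1000.0, -900.0, -890.0, -889.5, -889.4])
final_bound, n_calls = run_until_converged(lambda: next(bounds))
```

In this toy run the fractional changes are 0.1, about 0.011, and about 0.0006, so the loop stops at the fourth value.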
Figure 2 shows the results of the HDP and the HDSP on the synthetic dataset. Figure 2b, c are the heat maps of topics inferred from each model. We match the inferred topics to the original topics using KL divergence between the two sets of topic distributions. There are no significant differences between the inferred topics of HDSP and HDP. In addition to the topics, HDSP infers the scaling parameters between topics and labels, which are shown in Fig. 2e. The results show that the relative differences between original scaling parameters are preserved in the inferred parameters through the variational inference.
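One simple way to match inferred topics to original topics with KL divergence is sketched below. The text does not specify the matching rule, so the nearest-neighbor choice here (each original topic is paired with the inferred topic of smallest divergence) and the direction of the divergence are assumptions, as are the function names and the toy distributions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, with clipping to avoid log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def match_topics(true_topics, inferred_topics):
    """For each original topic, return the index of the closest inferred topic."""
    cost = np.array([[kl_divergence(t, s) for s in inferred_topics]
                     for t in true_topics])
    return cost.argmin(axis=1), cost

# Toy word distributions over a three-word vocabulary.
true_topics = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.1, 0.8]])
inferred_topics = np.array([[0.10, 0.15, 0.75],
                            [0.65, 0.25, 0.10],
                            [1 / 3, 1 / 3, 1 / 3]])
matches, cost = match_topics(true_topics, inferred_topics)
```

With a truncation level larger than the true number of topics, some inferred topics stay unmatched, as the uniform third topic does here. A one-to-one assignment method could replace the row-wise minimum if duplicate matches are a concern.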
With the second experiment, we show that the inferred parameters preserve the relative differences between labels and topics in the dataset. For this experiment, we generate 1,000 documents with ten topics randomly drawn from Dirichlet(0.1) over a vocabulary of size 20. To generate the weights between topics and labels, we randomly place the topics and labels in three-dimensional Euclidean space and use the distance between a topic and a label to define the scaling parameter. Let \(\theta _k \in \mathbb {R}^3\) be the location of topic k and \(\theta _j \in \mathbb {R}^3\) be the location of label j. We use \(|\theta _k - \theta _j|_2\) as the inverse scaling parameter \(w_{kj}^{-1}\) between topic k and label j, so the scaling weight increases as the distance between a topic and a label decreases. The locations of topics and labels are drawn uniformly from a cube with side length x, so the total volume is \(x^3\), and we vary x from 1 to 20 across experiments.
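The construction of the synthetic topic-label weights can be sketched as below. The number of labels is not stated in this passage, so the value used here is a placeholder, and the function name and the choice of the cube \([0, x]^3\) are assumptions for illustration.

```python
import numpy as np

def synthetic_scaling_weights(n_topics, n_labels, side, rng):
    """Place topics and labels uniformly in a cube and derive weights from distances.

    Returns w (n_topics, n_labels), where 1 / w[k, j] is the Euclidean
    distance between topic k and label j, together with the distances.
    """
    topic_loc = rng.uniform(0.0, side, size=(n_topics, 3))
    label_loc = rng.uniform(0.0, side, size=(n_labels, 3))
    # Pairwise distances via broadcasting: result has shape (n_topics, n_labels).
    dist = np.linalg.norm(topic_loc[:, None, :] - label_loc[None, :, :], axis=2)
    return 1.0 / dist, dist

rng = np.random.default_rng(1)
# Ten topics as in the text; five labels is a hypothetical choice.
w, dist = synthetic_scaling_weights(n_topics=10, n_labels=5, side=4.0, rng=rng)
```

A larger side length spreads the locations out, which increases the typical distances and therefore decreases the typical weights.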
5.2 Categorical data
Datasets used for the experiments in 5.2
Dataset | Docs | Vocab | Labels | Labels/doc | Doc/labels |
---|---|---|---|---|---|
RCV | 23,149 | 9911 | 117 | 3.2 | 729.7 |
OHSUMED | 7505 | 7056 | 52 | 5.2 | 722.0 |
NIPS | 2484 | 14,036 | 2865 | 2.4 | 1.6 |
5.2.1 Experimental settings
5.2.2 Evaluation metric
The goal of our model is to construct the dependent random probability measure given multiple labels. Therefore, our interest is to see the increments of predictive performance when the label information is given.
5.2.3 Experimental results
Figure 5 shows the predictive performance of our model against the comparison models. For the OHSUMED and RCV corpora, both HDSP and wHDSP outperform all others. Among these models, L-LDA restricts the modeling flexibility the most; the PLDA relaxes that restriction by adding an additional latent label and allowing multiple topics per label. HDSP and wHDSP further increase the modeling flexibility by allowing all topics to be generated from each label. This is reflected in the results of predictive performance of the three models; L-LDA shows the worst performance, then PLDA, and HDSP and wHDSP show the lowest perplexity. For the NIPS data, we compare HDSP and wHDSP to ATM, and again, HDSP and wHDSP show the lowest perplexity.
5.2.4 Modeling data with missing labels
We also test our model with partially labeled data which have not been previously covered in topic modeling. Many real-world data fall into this category where some of the data are labeled, others are incompletely labeled, and the rest are unlabeled. For this experiment, we randomly remove existing labels from the RCV and OHSUMED corpora. To remove observed labels in the training corpus, we use Bernoulli trials with varying parameters to analyze how the proportion of observed labels affects the heldout predictive performance of the model.
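The label-removal procedure can be sketched as an independent Bernoulli trial per observed label. Whether the Bernoulli parameter denotes the probability of keeping or of removing a label is not stated, so the `keep_prob` convention below is an assumption, as are the function name and the toy label matrix.

```python
import numpy as np

def drop_labels(labels, keep_prob, rng):
    """Keep each observed label independently with probability `keep_prob`.

    labels : (M, J) binary matrix, labels[m, j] = 1 if document m has label j
    """
    keep = rng.random(labels.shape) < keep_prob
    # Entries that were 0 stay 0; entries that were 1 survive only if kept.
    return labels * keep

rng = np.random.default_rng(2)
labels = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0]])
partially_labeled = drop_labels(labels, keep_prob=0.5, rng=rng)
```

Documents whose labels are all removed become unlabeled, so a single parameter produces the mix of fully labeled, incompletely labeled, and unlabeled documents described above.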
5.2.5 Modeling data with single category
5.3 Mixed-type data
In this section, we present the performance of the second scaling function with a corpus of product reviews which has real-valued ratings and category information.
The number of reviews for each rating and category in the Amazon dataset
Rating | # reviews | Percentage |
---|---|---|
Total | 24,259 | 100 |
5-star | 12,382 | 52 |
4-star | 5040 | 20 |
3-star | 1905 | 8 |
2-star | 1723 | 7 |
1-star | 3209 | 13 |

Category | # reviews |
---|---|
Canister vacuum | 3535 |
Digital SLR | 4189 |
Laptop | 4252 |
MP3 | 3659 |
Air conditioner | 568 |
Space heater | 3859 |
Coffee machine | 4197 |
To evaluate the performance of wHDSP, we classify the ratings of the reviews based on a trained model. We use 90% of the corpus to train models and the remaining 10% of the corpus to test the models. To classify the rating of each review in the test set, we compute the perplexity of the given review with varying ratings from one to five, and choose the rating that shows the lowest perplexity. Generally, computing the perplexity of a held-out document requires complex approximation schemes (Wallach et al. 2009), but we compute the perplexity based on the expected topic distribution given category and rating information, which requires a finite number of computations.
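The classification rule described above can be sketched as follows. The `expected_theta` callable stands in for the trained model's expected topic distribution given a category and a rating; it, the toy topic-word matrix, and all names are hypothetical and only illustrate the argmin-over-ratings step.

```python
import numpy as np

def perplexity(word_counts, theta, topic_word):
    """Per-word perplexity of one document under fixed topic proportions."""
    word_probs = theta @ topic_word                     # (V,) word distribution
    log_lik = np.sum(word_counts * np.log(word_probs))
    return float(np.exp(-log_lik / word_counts.sum()))

def classify_rating(word_counts, category, expected_theta, topic_word,
                    ratings=(1, 2, 3, 4, 5)):
    """Return the rating whose expected topic distribution gives the lowest perplexity."""
    scores = {r: perplexity(word_counts, expected_theta(category, r), topic_word)
              for r in ratings}
    return min(scores, key=scores.get), scores

# Toy model: topic 0 favors word 0, topic 1 favors word 2.
topic_word = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.1, 0.8]])

def toy_expected_theta(category, rating):
    # Hypothetical stand-in: low ratings lean on topic 0, high ratings on topic 1.
    t = (rating - 1) / 4.0
    return np.array([0.9 - 0.8 * t, 0.1 + 0.8 * t])

low_review = np.array([9, 1, 0])    # mostly word 0
high_review = np.array([0, 1, 9])   # mostly word 2
predicted_low, _ = classify_rating(low_review, "laptop", toy_expected_theta, topic_word)
predicted_high, _ = classify_rating(high_review, "laptop", toy_expected_theta, topic_word)
```

Only five perplexity evaluations are needed per review, one per candidate rating, which is the finite computation the text refers to.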
We compare the wHDSP with the supervised LDA (SLDA), LDA + SVM, as well as classifiers Naive Bayes, SVM, and decision trees (CART). For the LDA + SVM approach, we first train the LDA model and then use the inferred topic proportion and categories as features of the SVM. For the SLDA model, the category information cannot be used because the model is designed to learn and predict the single response variable. For both models, we set the number of topics to 50, 100, and 200.
In many applications, classifying negative feedback of users is more important than classifying positive feedback. From the negative feedback, companies can identify possible problems of their products and services and use the information to design their next product or improve their services. In most online reviews, however, the proportion of negative feedback is smaller than the proportion of positive feedback. For example, in the Amazon data, about 51% of reviews are rated as five-star, and 72% rated as four or five. A classifier trained by such skewed data is likely to be biased toward the majority class.
F1 of wHDSP and the other models for the Amazon review corpus
Model | Rating 1 | Rating 2 | Rating 3 | Rating 4 | Rating 5 |
---|---|---|---|---|---|
wHDSP | 0.600 | 0.161 | 0.185 | 0.316 | 0.687 |
wHDSP-no-cate | 0.428 | 0.087 | 0.099 | 0.061 | 0.658 |
LDA50 + SVM | 0.392 | 0.036 | 0.038 | 0.134 | 0.684 |
LDA100 + SVM | 0.454 | 0.078 | 0.073 | 0.265 | 0.678 |
LDA200 + SVM | 0.508 | 0.032 | 0.100 | 0.284 | 0.681 |
SLDA50 | 0.603 | 0.000 | 0.021 | 0.140 | 0.741 |
SLDA100 | 0.606 | 0.000 | 0.021 | 0.067 | 0.740 |
SLDA200 | 0.580 | 0.015 | 0.011 | 0.140 | 0.727 |
SVM | 0.403 | 0.000 | 0.000 | 0.007 | 0.716 |
NaiveBayes | 0.634 | 0.028 | 0.085 | 0.469 | 0.652 |
DecisionTree | 0.457 | 0.088 | 0.154 | 0.355 | 0.628 |
Macro and micro F1 of the wHDSP and the other models
5-Ratings | MacroF1 | MicroF1 |
---|---|---|
wHDSP | 0.390 | 0.522 |
wHDSP* | 0.267 | 0.474 |
LDA50 + SVM | 0.257 | 0.518 |
LDA100 + SVM | 0.310 | 0.520 |
LDA200 + SVM | 0.321 | 0.527 |
LDA200 + SVM* | 0.309 | 0.533 |
SLDA50 | 0.301 | 0.584 |
SLDA100 | 0.287 | 0.588 |
SLDA200 | 0.294 | 0.577 |
SVM | 0.225 | 0.560 |
NaiveBayes | 0.374 | 0.545 |
DecisionTree | 0.336 | 0.477 |
We perform a rating prediction task with and without the category information of reviews to see the effect of using both the category and rating information on the wHDSP and LDA + SVM approaches. The results represented by wHDSP* in Table 4 and Fig. 10b show the performance of rating prediction with the wHDSP trained without category information. For wHDSP*, the model performs worse than wHDSP, which indicates the model, without category information, cannot distinguish the review ratings which depend on topical context. The LDA + SVM without categories achieves 0.309 macro F1 and 0.533 micro F1, which are comparable to the LDA + SVM with the category information. Unlike the wHDSP, the decision boundaries of SVM are not improved with the additional category information. The result supports that for learning decision boundaries between ratings over different categories, the approach of including category information to train topics is more effective than using topics and the category information independently.
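For readers comparing the two summary scores, the sketch below shows how macro and micro F1 are commonly computed from a confusion matrix. It uses the standard definitions rather than the authors' evaluation code, and the toy confusion matrix is invented for illustration. The last lines check that averaging the per-rating F1 values reported for wHDSP reproduces its macro F1.

```python
import numpy as np

def f1_scores(confusion):
    """Per-class, macro, and micro F1 from a confusion matrix.

    confusion[i, j] = number of items with true class i predicted as class j.
    """
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    # Classes with no true or predicted items get an F1 of 0.
    per_class = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    macro = float(per_class.mean())
    micro = float(2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum()))
    return per_class, macro, micro

toy_confusion = np.array([[2, 1],
                          [0, 3]])
per_class, macro, micro = f1_scores(toy_confusion)

# Macro F1 is the unweighted mean of the per-class F1 values.
whdsp_per_rating = [0.600, 0.161, 0.185, 0.316, 0.687]
whdsp_macro = round(float(np.mean(whdsp_per_rating)), 3)
```

Macro F1 weights every rating equally, so it rewards performance on the rare low ratings, whereas micro F1 is dominated by the majority five-star class.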
6 Discussions
We have presented the hierarchical Dirichlet scaling process (HDSP), a Bayesian nonparametric prior for a mixed membership model that lets us analyze underlying semantics and observable side information. The combination of the stick breaking process with the normalized gamma process in HDSP is a more controllable construction of the hierarchical Dirichlet process because each atom of the second level measure inherits from the first level measure in order. HDSP also allows more flexibility and the capability of modeling side information by the scaling functions that plug into the rate parameter of the gamma distribution. The choice of the scaling function is the most important part of the model in terms of establishing a link between topics and observed labels. We developed two scaling functions but the choice of scaling function depends on the modeler’s intention. For example, the well known linking functions from the generalized linear model can be used as scaling functions, or one can use several scaling functions together on purpose. We showed that the application of HDSP to topic modeling correctly recovers the topics and topic-label weights of synthetic data. Experiments with the real dataset show that the first scaling function is more suited for partially labeled data, and the second scaling function is more suited for a dataset with both numerical and categorical labels.
The hierarchical Dirichlet scaling process opens up a number of interesting research questions that should be addressed in future work. First, in the two scaling functions we proposed to model the correlation structure between topics and side information, we simply defined the relationship between topic k and label j through the scaling parameter \(w_{kj}\). However, this approach does not consider the correlation within topics and labels. Taking inspiration from previous work (Blei and Lafferty 2007; Mimno et al. 2007; Paisley et al. 2012) that showed correlations among topics, we can define a scaling function with a prior over the topics and labels to capture their complex relationships. Second, our posterior inference algorithm based on mean-field variational inference is tested with tens of thousands of documents. However, modern data analysis requires inference on massive and/or streaming data. For fast and efficient posterior inference, we can apply parallel or distributed algorithms based on a stochastic update (Hoffman et al. 2013; Ahn et al. 2014). Furthermore, we fix the number of labels before training, but we need to find a way to model an unbounded number of labels for streaming data.
In this paper, we limit our discussions of the HDP to the two level construction of the DP and refer to it simply as the HDP.
Acknowledgements
This work was supported by Institute for Information & communications Technology Promotion(IITP) grant funded by the Korea government (MSIP) (No.B0101-15-0307, Basic Software Research in Human-level Lifelong Machine Learning (Machine Learning Center)).
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.