Additive regularization of topic models
Abstract
Probabilistic topic modeling of text collections has recently been developed mainly within the framework of graphical models and Bayesian inference. In this paper we introduce an alternative semi-probabilistic approach, which we call additive regularization of topic models (ARTM). Instead of building a purely probabilistic generative model of text, we regularize an ill-posed problem of stochastic matrix factorization by maximizing a weighted sum of the log-likelihood and additional criteria. This approach enables us to combine probabilistic assumptions with linguistic and problem-specific requirements in a single multi-objective topic model. In the theoretical part of the work we derive the regularized EM-algorithm and provide a pool of regularizers, which can be applied together in any combination. We show that many models previously developed within the Bayesian framework can be inferred more easily within ARTM and in some cases generalized. In the experimental part we show that a combination of sparsing, smoothing, and decorrelation improves several quality measures at once with almost no loss of the likelihood.
Keywords
Probabilistic topic modeling; Regularization of ill-posed problems; Probabilistic latent semantic analysis; Latent Dirichlet allocation; EM-algorithm
1 Introduction
Topic modeling is a rapidly developing branch of statistical text analysis (Blei 2012). A probabilistic topic model of a text collection defines each topic by a multinomial distribution over words, and then describes each document with a multinomial distribution over topics. Such representation reveals a hidden thematic structure of the collection and promotes the usage of topic models in information retrieval, classification, categorization, summarization and segmentation of texts.
Latent Dirichlet allocation (LDA) (Blei et al. 2003) is the most popular probabilistic topic model. LDA is a two-level Bayesian generative model, in which topic distributions over words and document distributions over topics are generated from prior Dirichlet distributions. This assumption reduces model complexity and facilitates Bayesian inference due to the conjugacy of Dirichlet and multinomial distributions.
Hundreds of LDA extensions have been developed recently to model natural language phenomena and to incorporate additional information about authors, time, labels, categories, citations, links, etc. (Daud et al. 2010).
Nevertheless, building combined and multi-objective topic models remains a difficult problem in the Bayesian approach because inference becomes complicated when the prior is non-conjugate. This open issue is little discussed in the literature. An evolutionary approach has been proposed recently (Khalifa et al. 2013), but it seems to be computationally infeasible for large text collections.
Another difficulty is that the Dirichlet prior conflicts with natural assumptions of sparsity. A document usually contains a small number of topics, and a topic usually consists of a small number of domain-specific terms. Therefore, most words and topics must have zero probabilities. Sparsity helps to save memory and time in modeling large text collections. However, Bayesian approaches to sparsing (Shashanka et al. 2008; Wang and Blei 2009; Larsson and Ugander 2011; Eisenstein et al. 2011; Chien and Chang 2013) suffer from an internal contradiction with the Dirichlet prior, which cannot produce vectors with zero elements.
To address the above problems we introduce a non-Bayesian semi-probabilistic approach, Additive Regularization of Topic Models (ARTM). Learning a topic model from a document collection is an ill-posed problem of approximate stochastic matrix factorization, which has an infinite set of solutions. To choose a better solution, we add regularization penalty terms to the log-likelihood. Any problem-oriented regularizers or their linear combination may be used instead of the Dirichlet prior or together with it. The idea of ARTM is inspired by Tikhonov’s regularization of ill-posed inverse problems (Tikhonov and Arsenin 1977).
Additive regularization differs from the Bayesian approach in several aspects.
Firstly, we do not aim to build a fully generative probabilistic model of text. Many requirements for a topic model can be more naturally formalized in terms of optimization criteria rather than prior distributions. Regularizers may have no probabilistic interpretation at all. The structure of regularized models is so straightforward that their representation and explication in terms of graphical models is no longer needed. Thus, ARTM falls into the trend of avoiding excessive probabilistic assumptions in natural language processing.
Secondly, we use the regularized expectation–maximization (EM) algorithm instead of more complicated Bayesian inference. We do not use conjugate priors, integrations, and variational approximations. Despite these fundamental differences both approaches often result in the same or very similar learning algorithms, but in ARTM the inference is much shorter.
Thirdly, ARTM considerably simplifies both design and inference of multi-objective topic models. At the design stage we formalize each requirement for the model in the form of a regularizer, that is, a criterion to be maximized. At the inference stage we simply differentiate each regularizer with respect to the model parameters.
ARTM also differs from previous regularization techniques, each of which was designed for a particular regularizer such as KL-divergence, Dirichlet prior, \(L_1\) or \(L_2\) penalty terms (Si and Jin 2005; Chien and Wu 2008; Wang et al. 2011; Larsson and Ugander 2011). ARTM is not an incremental improvement of a particular topic model, but a new instrument that makes building and combining topic models much easier than in the state-of-the-art Bayesian approach.
The aim of the paper is to introduce a new regularization framework for topic modeling and to provide an initial pool of useful regularizers.
The rest of the paper is organized as follows.
In Sect. 2 we describe the probabilistic latent semantic analysis (PLSA) model, the historical predecessor of LDA. We introduce the EM-algorithm from an optimization point of view. Then we show experimentally on synthetic data that both PLSA and LDA give non-unique and unstable solutions. Further on we use PLSA as a more appropriate base for a stronger problem-oriented regularization.
In Sect. 3 we introduce the ARTM approach and prove general equations for the regularized EM-algorithm. This is the major theoretical contribution of the paper.
In Sect. 4 we work out a pool of regularizers by revising known topic models. We propose an alternative interpretation of LDA as a regularizer that minimizes Kullback–Leibler divergence with a fixed multinomial distribution. Then we consider regularizers for smoothing, sparsing, semi-supervised learning, topic correlation and decorrelation, topic coherence maximization, documents linking, and document classification. Most of them require tedious calculations within the Bayesian approach, whereas ARTM leads to similar results “in one line”.
In Sect. 5 we combine three regularizers from our pool to build a highly sparse and well interpretable topic model. We propose to monitor many quality measures during EM-iterations to choose the regularization path empirically for a multi-objective topic model. In our experiment we measure sparsity, kernel size, coherence, purity, and contrast of the topics. We show that ARTM improves all measures at once almost without any loss of the hold-out perplexity.
In Sect. 6 we discuss advantages and limitations of ARTM.
2 Topic models PLSA and LDA
Let \(D\) denote a set (collection) of texts and \(W\) denote a set (vocabulary) of all terms from these texts. Each term can represent a single word as well as a key phrase. Each document \({d\in D}\) is a sequence of \(n_d\) terms \((w_1,\ldots ,w_{n_d})\) from the vocabulary \(W\). Each term might appear multiple times in the same document.
Assume that each term occurrence in each document refers to some latent topic from a finite set of topics \(T\). Text collection is considered to be a sample of triples \((w_i,d_i,t_i),\, {i=1,\ldots ,n}\) drawn independently from a discrete distribution \(p(w,d,t)\) over a finite space \(W\times D \times T\). Term \(w\) and document \(d\) are observable variables, while topic \(t\) is a latent (hidden) variable. Following the “bag of words” model, we represent each document by a subset of terms \(d\subset W\) and the corresponding integers \(n_{dw}\), which count how many times the term \(w\) appears in the document \(d\).
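The bag-of-words representation described above can be illustrated with a minimal sketch. The toy documents, the whitespace tokenization, and the helper name `count_terms` are assumptions made for illustration only.

```python
from collections import Counter

def count_terms(documents):
    """Return one Counter per document; counts[d][w] plays the role of n_dw,
    the number of times term w appears in document d."""
    return [Counter(doc.split()) for doc in documents]

# Two toy documents (illustrative data, not from the paper).
docs = ["topic model of text", "text text collection"]
n_dw = count_terms(docs)

# Document lengths n_d are the sums of n_dw over all terms w.
n_d = [sum(counts.values()) for counts in n_dw]
```

Terms that do not occur in a document simply have a zero count, so each document is stored as the subset of the vocabulary it actually uses.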
Theorem 1
This statement follows from the Karush–Kuhn–Tucker (KKT) conditions. We will prove a more general theorem in the sequel. The system of Eqs. (6)–(8) can be solved by various numerical methods. In particular, the simple-iteration method is equivalent to the EM algorithm, which is typically used in practice.
The EM algorithm repeats two steps in a loop.
The maximization step or M-step (7), (8) can therefore be interpreted as frequency estimates for the conditional probabilities \(\phi _{wt}\) and \(\theta _{td}\).
Algorithm 2.1 reorganizes EM iterations by incorporating the E-step inside the M-step. Thus it avoids storage of a three-dimensional array \(p_{tdw}\). Each EM iteration is a run through the entire collection.
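The following is a minimal sketch of PLSA fitted by EM in the spirit of the description above: the E-step is computed per document inside the accumulation loop, so the full array \(p_{tdw}\) is never stored. It uses the standard PLSA updates, not a transcription of Algorithm 2.1; the function name `plsa_em`, the random initialization, and the dense count matrix are assumptions.

```python
import numpy as np

def plsa_em(n_dw, num_topics, num_iters=50, seed=0):
    """Fit PLSA by EM.

    n_dw: (D, W) array of term counts.
    Returns phi (W, T) with p(w|t) and theta (T, D) with p(t|d).
    """
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    phi = rng.random((W, num_topics))
    phi /= phi.sum(axis=0, keepdims=True)
    theta = rng.random((num_topics, D))
    theta /= theta.sum(axis=0, keepdims=True)

    for _ in range(num_iters):
        n_wt = np.zeros((W, num_topics))
        n_td = np.zeros((num_topics, D))
        for d in range(D):
            # E-step for one document: p_tdw is proportional to phi_wt * theta_td.
            joint = phi * theta[:, d]                      # shape (W, T)
            norm = joint.sum(axis=1, keepdims=True)
            p_tdw = joint / np.maximum(norm, 1e-300)
            # Accumulate expected counts n_dw * p_tdw.
            weighted = n_dw[d][:, None] * p_tdw
            n_wt += weighted
            n_td[:, d] = weighted.sum(axis=0)
        # M-step: frequency estimates of the conditional probabilities.
        phi = n_wt / n_wt.sum(axis=0, keepdims=True)
        theta = n_td / n_td.sum(axis=0, keepdims=True)
    return phi, theta

# Toy usage on a 3-document, 4-term count matrix (illustrative data).
counts = np.array([[3.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 2.0, 4.0],
                   [1.0, 1.0, 1.0, 1.0]])
phi, theta = plsa_em(counts, num_topics=2, num_iters=20)
```

Only the two count matrices `n_wt` and `n_td` are kept between documents, which is the memory saving the text refers to. The sketch assumes every document is non-empty.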
In contrast, the non-uniqueness, which causes the instability of the solution, is a serious problem. The likelihood (4) depends on the product \(\varPhi \varTheta \), which is defined up to a linear transformation: \(\varPhi \varTheta = (\varPhi S) (S^{-1}\varTheta )\), where \({\varPhi ' = \varPhi S}\) and \({\varTheta ' = S^{-1}\varTheta }\) are stochastic matrices. The transformation \(S\) is not controlled by EM-like algorithms and may depend on random initialization.
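A small numerical check can make the non-uniqueness concrete. The matrices below are hand-picked toy values (an assumption for illustration): they give two different stochastic factorizations with the same product, so the likelihood cannot distinguish them.

```python
import numpy as np

# Columns of phi are p(w|t); columns of theta are p(t|d). Toy values.
phi = np.array([[0.6, 0.1],
                [0.3, 0.2],
                [0.1, 0.7]])
theta = np.array([[0.5, 0.3, 0.8],
                  [0.5, 0.7, 0.2]])

# A mixing matrix whose columns sum to one, so that phi @ S stays stochastic.
S = np.array([[0.9, 0.1],
              [0.1, 0.9]])

phi2 = phi @ S
theta2 = np.linalg.inv(S) @ theta

def is_stochastic(m):
    """True if all entries are non-negative and every column sums to one."""
    return bool(np.all(m >= -1e-12) and np.allclose(m.sum(axis=0), 1.0))

same_product = bool(np.allclose(phi @ theta, phi2 @ theta2))
```

Here `theta2` stays non-negative only because the columns of `theta` are far enough from the boundary of the simplex. This hints at why sparse solutions leave less room for such transformations.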
These facts show that the Dirichlet distribution is too weak as a regularizer. More problem-oriented regularizers are needed to formalize additional restrictions on the matrices \(\varPhi ,\varTheta \) and to ensure uniqueness and stability of the solution. Therefore our starting point will be the PLSA model, free of regularizers, but not the LDA model, even though it is more popular in recent research works.
3 EM-algorithm with additive regularization
Document \(d\) is called regular if \({n_{td} + \theta _{td} \frac{\partial R}{\partial \theta _{td}} > 0}\) for at least one topic \({t\in T}\). If the reverse inequality holds for all \({t\in T}\) then document \(d\) is called overregularized.
Theorem 2
Note 1
If a topic \(t\) is overregularized then (15) gives \(\phi _t=0\). In this case we have to exclude the topic \(t\) from the model. Topic overregularization is a mechanism that can eliminate irrelevant topics and optimize the number of topics.
Note 2
If a document \(d\) is overregularized then Eq. (16) gives \(\theta _d=0\). In this case we have to exclude the document \(d\) from the model. For example, a document may be too short, or have no relation to the thematics of a given collection.
Proof
Equations for \(\theta _{td}\) are derived analogously thus finalizing the proof. \(\square \)
The system of Eqs. (14)–(16) defines a regularized EM-algorithm. It keeps the E-step (6) and redefines the M-step by the regularized Eqs. (15), (16). Thus, the EM-algorithm for learning regularized topic models can be implemented by an easy modification of any EM-like algorithm at hand. In particular, in Algorithm 2.1 we only need to modify steps 8 and 9 according to Eqs. (15), (16).
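The modification of the M-step can be sketched as follows. This is a reading of the regularity condition and of Notes 1 and 2 above, not a transcription of Eqs. (15), (16): it assumes the regularized M-step normalizes the positive part of \(n + x\,\partial R/\partial x\) column by column, and that an over-regularized column becomes zero. The function name and argument layout are hypothetical.

```python
import numpy as np

def regularized_m_step(n_counts, params, grad_r):
    """One regularized M-step update for phi or theta.

    n_counts: expected counts (n_wt or n_td); each column is one distribution.
    params:   current values of phi or theta, same shape.
    grad_r:   partial derivatives of the regularizer R w.r.t. params, same shape.
    Returns the column-normalized positive part of n + params * dR/dparams.
    A column with no positive entry (over-regularized) is returned as zeros.
    """
    raw = np.maximum(n_counts + params * grad_r, 0.0)
    col_sums = raw.sum(axis=0, keepdims=True)
    return np.divide(raw, col_sums, out=np.zeros_like(raw), where=col_sums > 0)

# Toy usage: the second column is over-regularized and collapses to zero.
n = np.array([[2.0, 1.0],
              [2.0, 1.0]])
p = np.array([[0.5, 0.5],
              [0.5, 0.5]])
g = np.array([[0.0, -4.0],
              [-8.0, -4.0]])
updated = regularized_m_step(n, p, g)
```

With `grad_r` equal to zero the update reduces to the plain frequency estimates of PLSA, so an existing EM implementation only needs this one function swapped in.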
4 Regularization criteria for topic models
In this section we collect a pool of regularizers that can be used in any combination or separately. We revise some well-known topic models that were originally developed within the Bayesian approach. We show that ARTM gives similar or more general results through a much simpler inference based on Theorem 2.
The non-Bayesian interpretation of the smoothing regularization in terms of KL-divergence is simple, natural, and avoids complicated inference.
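As a worked illustration of such a one-line inference, consider a smoothing regularizer written in cross-entropy form, which equals the negative KL-divergence to fixed distributions up to a constant. The specific form of \(R\) below and the symbols \(\beta _0,\beta _w,\alpha _0,\alpha _t\) are assumptions chosen to match the notation used elsewhere in the text; the M-step form follows the quantity \(n_{td} + \theta _{td}\,\partial R/\partial \theta _{td}\) from the regularity condition.

```latex
% Assumed smoothing regularizer (KL-divergence to fixed distributions, constants dropped):
R(\varPhi,\varTheta) = \beta_0 \sum_{t\in T}\sum_{w\in W} \beta_w \ln \phi_{wt}
                     + \alpha_0 \sum_{d\in D}\sum_{t\in T} \alpha_t \ln \theta_{td}

% Differentiating and multiplying by the parameter:
\phi_{wt}\,\frac{\partial R}{\partial \phi_{wt}} = \beta_0 \beta_w ,
\qquad
\theta_{td}\,\frac{\partial R}{\partial \theta_{td}} = \alpha_0 \alpha_t

% Resulting M-step (positive part, then normalization):
\phi_{wt} \propto \bigl( n_{wt} + \beta_0 \beta_w \bigr)_{+} ,
\qquad
\theta_{td} \propto \bigl( n_{td} + \alpha_0 \alpha_t \bigr)_{+}
```

Under this assumption, positive coefficients add pseudo-counts and smooth the distributions, while negative coefficients subtract them and drive small probabilities to zero, which is the sparsing case.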
Smoothing regularization for semi-supervised learning Consider a collection that is partially labeled by experts: each document \(d\) from a subset \({D_0\subseteq D}\) is associated with a subset of topics \({T_d \subset T}\), and each topic \(t\) from a subset \({T_0\subset T}\) is associated with a subset of terms \({W_t \subset W}\). It is usually expected that labeling information helps to improve the interpretability of topics.
Correlated topic model (CTM) was first introduced by Blei and Lafferty (2007) to find strong correlations between topics. For example, a document about geology is more likely to also be about archeology than genetics.
Coherence regularization A topic is called coherent if its most frequent words typically appear nearby in the documents—either in the training collection, or in some external corpus like Wikipedia. An average topic coherence is considered to be a good interpretability measure of a topic model (Newman et al. 2010b).
Let \({C_{wv} = \hat{p}(w\ {\vert }\ v)}\) denote an estimate of the co-occurrence of word pairs \({(w,v)\in W^2}\). Usually, \(C_{wv}\) is defined as a portion of the documents that contain both words \(v\) and \(w\) in a sliding window of ten words.
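One possible way to compute such an estimate is sketched below. The normalization is an assumption: the count of documents in which \(w\) and \(v\) fall inside one window is divided by the number of documents containing \(v\), so that the result can be read as \(\hat{p}(w\ {\vert }\ v)\). The function name and the toy documents are hypothetical.

```python
from collections import defaultdict

def cooccurrence(docs, window=10):
    """Estimate C[(w, v)] from tokenized documents.

    docs: list of token lists.
    C[(w, v)] = (number of documents where w and v occur within `window`
                 consecutive tokens) / (number of documents containing v).
    Pairs that never co-occur are absent from the result.
    """
    doc_count = defaultdict(int)
    pair_count = defaultdict(int)
    for tokens in docs:
        for v in set(tokens):
            doc_count[v] += 1
        pairs = set()
        for i, w in enumerate(tokens):
            # Tokens i .. i + window - 1 share one sliding window.
            for v in tokens[i + 1 : i + window]:
                if v != w:
                    pairs.add((w, v))
                    pairs.add((v, w))
        for pair in pairs:
            pair_count[pair] += 1
    return {(w, v): c / doc_count[v] for (w, v), c in pair_count.items()}

toy_docs = [["a", "b", "c"], ["a", "c"], ["b"]]
C_narrow = cooccurrence(toy_docs, window=2)
C_wide = cooccurrence(toy_docs, window=3)
```

Each document contributes at most once per pair, which matches the "portion of the documents" wording rather than a raw count of window positions.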
Thus we conclude that there is no commonly accepted approach to the coherence optimization in the literature. All approaches that we have found so far can be easily expressed in terms of ARTM without Dirichlet priors.
Document classification Let \(C\) be a finite set of classes. Suppose each document \(d\) is labeled by a subset of classes \(C_d \subset C\). The task is to infer a relationship between classes and topics, to improve a topic model by using labeling information, and to learn a decision rule that is able to classify new documents. Common discriminative approaches such as SVM or logistic regression usually give unsatisfactory results on large text collections with a large number of unbalanced and interdependent classes. A probabilistic topic model can be beneficial in this situation because it processes all classes simultaneously (Rubin et al. 2012).
There are many examples of document labeling in the literature. Classes may refer to text categories (Rubin et al. 2012; Zhou et al. 2009), authors (Rosen-Zvi et al. 2004), time periods (Cui et al. 2011; Varadarajan et al. 2010), cited documents (Dietz et al. 2007), cited authors (Kataria et al. 2011), users of documents (Wang and Blei 2011). More information about special models can be found in the survey (Daud et al. 2010). All these models fall into several groups and all of them can be easily expressed in terms of ARTM. Below we consider a close analogue of Dependency LDA (Rubin et al. 2012), one of the most general topic models for document classification.
Theorem 3
We omit the proof, which is analogous to the proof of Theorem 2.
Regularization term \(R(\varPhi ,\varPsi ,\varTheta )\) can include Dirichlet prior for \(\varPsi \), as in Dependency LDA, but sparsing seems to be a more natural choice.
Another useful example of \(R\) is label regularization.
5 Combining regularizers for sparsing and improving interpretability
Interpretability of a topic is a poorly formalized requirement. Essentially what it means is that, provided with the list of the most frequent terms and the most representative documents of a topic, a human can understand its meaning and give it an appropriate name. The interpretability is an important property for information retrieval, systematization and visualization of text collections.
Most of the existing approaches involve human assessment. Newman et al. (2009) ask experts to assess the usefulness of topics on a 3-point scale. Chang et al. (2009) prepare lists of the 10 most frequent words for each topic, intruding one random word into each list. A topic is considered to be interpretable if experts can correctly identify the intrusion word. A human-based approach is important at the research stage, but it prohibits a fully automatic construction of the topic model.
Coherence is the most popular automatic measure, which is known to correlate well with human estimates of the interpretability (Newman et al. 2010a, b; Mimno et al. 2011). Coherence measures how often the most probable words of the topic occur nearby in the documents from the underlying collection or from external polythematic collection such as Wikipedia.
Domain-specific topic \({t\in S}\) contains terms of a particular domain area. Domain-specific distributions \(p(w\ {\vert }\ t)\) are sparse and weakly correlated. Their corresponding distributions \(p(d\ {\vert }\ t)\) are also sparse, because each domain-specific topic occurs in a relatively small number of documents.
Background topic \({t\in B}\) contains common lexis words. Background distributions \(p(w\ {\vert }\ t)\) and \(p(d\ {\vert }\ t)\) are smooth, because background words occur in many documents. A topic model with background can be considered as a generalization of robust models, which use only one background distribution (Chemudugunta et al. 2007; Potapenko and Vorontsov 2013).
Then we define the regularization trajectory as a multidimensional vector \(\mathbf {\tau }\), which is a function of the iteration number and, possibly, of the model quality measures. In our experiments we choose the regularization trajectory by analyzing experimentally how the change of regularization coefficients affects quality measures of the model during iterations.
Quality measures Learning a topic model from a text collection can be considered as a constrained multi-criteria optimization problem. Therefore, the quality of a topic model should also be measured by a set of criteria. Below we describe a set of quality measures that we use in our experiments.
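Perplexity is one of the measures reported in the experiments. The sketch below uses the standard definition, the exponent of the negative average log-likelihood per term occurrence, and is an assumption about the exact formula: it does not cover how \(\theta _d\) is estimated for held-out documents. The function name is hypothetical.

```python
import numpy as np

def perplexity(n_dw, phi, theta):
    """Standard perplexity of a topic model on a count matrix.

    n_dw:  (D, W) term counts.
    phi:   (W, T) matrix of p(w|t).
    theta: (T, D) matrix of p(t|d).
    Returns exp(-(1/n) * sum_{d,w} n_dw * ln p(w|d)), where n is the total count.
    """
    p_wd = (phi @ theta).T                  # (D, W) matrix of p(w|d)
    mask = n_dw > 0                         # skip terms absent from a document
    log_lik = np.sum(n_dw[mask] * np.log(p_wd[mask]))
    return float(np.exp(-log_lik / n_dw.sum()))

# Toy check: a uniform model over 4 terms has perplexity 4.
toy_counts = np.array([[3.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 2.0, 4.0]])
uniform_phi = np.full((4, 2), 0.25)
uniform_theta = np.full((2, 2), 0.5)
toy_perplexity = perplexity(toy_counts, uniform_phi, uniform_theta)
```

If a sparse model assigns zero probability to a term that does occur in a document, this quantity becomes infinite, which is one reason to watch perplexity while sparsing.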
The sparsity of a model is measured by the ratio of zero elements in matrices \(\varPhi \) and \(\varTheta \) over domain-specific topics \(S\).
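This measure is straightforward to compute. A minimal sketch, assuming dense matrices with topics as columns of \(\varPhi \) and rows of \(\varTheta \), and a hypothetical index list `domain_topics` that marks the set \(S\):

```python
import numpy as np

def sparsity(phi, theta, domain_topics):
    """Share of exact zeros in phi and theta, restricted to domain-specific topics.

    phi:   (W, T) matrix of p(w|t).
    theta: (T, D) matrix of p(t|d).
    domain_topics: indices of the domain-specific topics.
    Returns (sparsity of phi, sparsity of theta).
    """
    s_phi = float(np.mean(phi[:, domain_topics] == 0))
    s_theta = float(np.mean(theta[domain_topics, :] == 0))
    return s_phi, s_theta

# Toy usage with three topics, the first two being domain-specific.
toy_phi = np.array([[0.5, 0.0, 0.3],
                    [0.5, 1.0, 0.7]])
toy_theta = np.array([[1.0, 0.0],
                      [0.0, 0.5],
                      [0.0, 0.5]])
s_phi, s_theta = sparsity(toy_phi, toy_theta, [0, 1])
```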

kernel size \({{{\mathrm {ker}}}_t = |W_t|}\), the reasonable values for it are about \(\frac{|W|}{|T|}\);

purity \({{\mathrm {pur}}}_t = \sum _{w \in W_t} p(w\ {\vert }\ t)\), the higher the better;

contrast \({{\mathrm {con}}}_t = \frac{1}{|W_t|} \sum _{w \in W_t} p(t\ {\vert }\ w)\), the higher the better.
Finally, we define the corresponding measures of kernel size, purity, contrast, and coherence for the topic model by averaging over domain-specific topics \({t\in S}\).
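The three kernel-based measures listed above can be sketched as follows. The definition of the kernel \(W_t\) is not given in the surrounding text, so the sketch assumes \(W_t\) is the set of terms with \(p(t\ {\vert }\ w)\) above a cut-off; the `threshold` parameter, its default value, and the function name are hypothetical. \(p(t\ {\vert }\ w)\) is obtained from \(p(w\ {\vert }\ t)\) and \(p(t)\) by Bayes' rule.

```python
import numpy as np

def kernel_measures(phi, p_t, threshold=0.25):
    """Kernel size, purity and contrast for every topic.

    phi: (W, T) float matrix of p(w|t).
    p_t: (T,) vector of topic probabilities p(t).
    threshold: assumed cut-off on p(t|w) that defines the kernel W_t.
    Returns (ker, pur, con), each of shape (T,).
    """
    joint = phi * p_t                                   # p(w, t)
    p_w = joint.sum(axis=1, keepdims=True)              # p(w)
    p_tw = np.divide(joint, p_w, out=np.zeros_like(joint), where=p_w > 0)
    kernel = p_tw > threshold                           # membership of w in W_t
    ker = kernel.sum(axis=0)                            # |W_t|
    pur = (phi * kernel).sum(axis=0)                    # sum of p(w|t) over W_t
    con = np.divide((p_tw * kernel).sum(axis=0), ker,
                    out=np.zeros(len(ker)), where=ker > 0)
    return ker, pur, con

# Toy usage: three terms, two equally probable topics.
toy_phi = np.array([[0.5, 0.0],
                    [0.5, 0.5],
                    [0.0, 0.5]])
toy_p_t = np.array([0.5, 0.5])
ker, pur, con = kernel_measures(toy_phi, toy_p_t)
ker_strict, pur_strict, con_strict = kernel_measures(toy_phi, toy_p_t, threshold=0.6)
```

Model-level values would then be obtained by averaging these vectors over the domain-specific topics only.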
Text collection In our experiments we used the NIPS dataset, which contains \({|D| = 1{,}566}\) English articles from the Neural Information Processing Systems conference. The length of the collection in words is \({n \approx 2.3 \times 10^6}\). The vocabulary size is \({|W| \approx 1.3 \times 10^4}\). We held out \({|D'|=174}\) documents for the testing set. In the preparation step we used the BOW toolkit (McCallum 1996) to perform lowercasing, punctuation elimination, and stop-word removal.
Experimental results In all experiments within this paragraph the number of iterations was set to 40, and the number of topics was set to \(|T|=100\) with \(|B|=10\) background topics.
In Table 1 we compare PLSA (first row), LDA (second row) and multiple regularized topic models. First three columns define a combination of regularizers. Other columns correspond to the quality measures described above.
We use a regularized EM-algorithm with smoothing (23) for the LDA model with symmetric Dirichlet prior and usually recommended parameters \({\alpha =0.5},\, {\beta =0.01}\).
We use a uniform smoothing for background topics with \({\alpha = 0.8},\, {\beta = 0.1}\).
We use a uniform distribution \({\beta _w = \frac{1}{|W|}}\) or background distribution \({\beta _w = \frac{n_w}{n}}\) for sparsing domain-specific topics.
Table 1 Topic models with various combinations of regularizers: smoothing (Sm), sparsing (Sp) with uniform (u) or background (b) distribution, and decorrelation (Dc)

| Sm | Sp | Dc | \({{\fancyscript{P}}}\) | \({{\fancyscript{B}}}\) | \({{\fancyscript{S}}}_\varPhi \) | \({{\fancyscript{S}}}_\varTheta \) | con | pur | ker | \({{\fancyscript{C}}}^{\text {ker}}\) | \({{\fancyscript{C}}}^{10}\) | \({{\fancyscript{C}}}^{100}\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| \(-\) | \(-\) | \(-\) | 1,923 | 0.00 | 0.000 | 0.000 | 0.43 | 0.14 | 100 | 0.84 | 0.25 | 0.17 |
| \(+\) | \(-\) | \(-\) | 1,902 | 0.00 | 0.000 | 0.000 | 0.42 | 0.12 | 82 | 0.93 | 0.26 | 0.17 |
| \(-\) | u | \(-\) | 2,114 | 0.24 | 0.957 | 0.867 | 0.53 | 0.20 | 71 | 0.91 | 0.25 | 0.18 |
| \(-\) | b | \(-\) | 2,507 | 0.51 | 0.957 | 0.867 | 0.46 | 0.56 | 151 | 0.71 | 0.60 | 0.58 |
| \(-\) | \(-\) | \(+\) | 2,025 | 0.57 | 0.561 | 0.000 | 0.46 | 0.38 | 109 | 0.82 | 0.94 | 0.56 |
| \(+\) | u | \(-\) | 1,961 | 0.25 | 0.957 | 0.867 | 0.51 | 0.20 | 64 | 0.97 | 0.26 | 0.18 |
| \(+\) | b | \(-\) | 2,025 | 0.49 | 0.957 | 0.867 | 0.45 | 0.52 | 128 | 0.77 | 0.55 | 0.55 |
| \(+\) | \(-\) | \(+\) | 1,985 | 0.59 | 0.582 | 0.000 | 0.46 | 0.39 | 97 | 0.87 | 0.93 | 0.57 |
| \(+\) | u | \(+\) | 2,010 | 0.73 | 0.980 | 0.867 | 0.56 | 0.73 | 78 | 0.94 | 0.94 | 0.62 |
| \(+\) | b | \(+\) | 2,026 | 0.80 | 0.979 | 0.867 | 0.52 | 0.89 | 111 | 0.81 | 0.96 | 0.83 |
In experiments we use convergence charts to compare different models and to choose regularization trajectories \({\mathbf {\tau }= (\alpha _0,\alpha _1,\beta _0,\beta _1,\gamma )}\). A convergence chart represents each quality measure of the topic model as a function of the iteration step. These charts give insight into the effects of each regularizer when it is used alone or in combination with others.
Figures 4, 5, and 6 show convergence charts for PLSA and two ARTM regularized models. Quality measures are shown in three charts for each model. The left chart represents a hold-out perplexity \({{\fancyscript{P}}}\) on the left-hand axis, sparsity \({{\fancyscript{S}}}_\varPhi ,{{\fancyscript{S}}}_\varTheta \) of matrices \(\varPhi ,\varTheta \) and background ratio \({{\fancyscript{B}}}\) on the right-hand axis. The middle chart represents kernel size (ker) on the left-hand axis, purity (pur) and contrast (con) on the right-hand axis. The right chart represents the coherence of top-10 words \({{\fancyscript{C}}}^{10}\), top-100 words \({{\fancyscript{C}}}^{100}\), and kernel words \({{\fancyscript{C}}}^{\text {ker}}\) on the left-hand axis.
Figure 4 shows that PLSA does not sparse matrices \(\varPhi ,\varTheta \) and gives too low topic purity. Also it does not determine background words.
Figure 5 shows the cumulative effect of sparsing domain-specific topics (with background distribution \(\beta _w\)) and smoothing background topics.
Figure 6 shows that decorrelation increases purity and coherence. It also helps to move common lexis words from the domain-specific topics to the background topics. As a result, the background ratio reaches almost 80%.
Again, note the important effect of regularization for the ill-posed problem: some of the quality measures may change significantly even after the likelihood converges, either with no change or with a slight increase of the perplexity.
It is better to switch on sparsing after the iterative process enters the convergence stage, when it becomes clear which elements of the matrices \(\varPhi , \varTheta \) are close to zero. An earlier or a more abrupt sparsing may lead to an increase of perplexity. We enabled sparsing at the 10th iteration and gradually adjusted the regularization coefficient to turn into zeros 8% of the non-zero elements in each vector \(\theta _d\) and 10% in each column \(\phi _t\) per iteration.
Smoothing of the background topics is best started straight from the first iteration, with constant regularization coefficients.
Decorrelation can also be activated from the first iteration, with a maximum regularization coefficient that does not yet significantly increase perplexity. For our collection we chose \({\gamma =2\times 10^5}\).
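The three recommendations above can be collected into a schedule that returns regularization coefficients for a given iteration. This is only a sketch: the dictionary keys, the linear ramp for sparsing, and the `sparsing_step` value are hypothetical, since the text describes an adaptive adjustment of the sparsing coefficient rather than a fixed formula. The smoothing values and \(\gamma \) are the ones quoted in the experiments.

```python
def regularization_trajectory(iteration, sparsing_start=10,
                              sparsing_step=0.1, gamma=2e5):
    """Return regularization coefficients for one EM iteration (1-based).

    Smoothing of background topics and decorrelation are constant from the
    first iteration. Sparsing of domain-specific topics is off before
    `sparsing_start` and then grows linearly by `sparsing_step` per
    iteration (an assumed ramp, not the adaptive rule from the text).
    """
    if iteration < sparsing_start:
        sparsing = 0.0
    else:
        sparsing = sparsing_step * (iteration - sparsing_start + 1)
    return {
        "smooth_theta_background": 0.8,   # alpha used for background topics
        "smooth_phi_background": 0.1,     # beta used for background topics
        "decorrelation": gamma,
        "sparse_domain": sparsing,
    }

# Coefficients for a 40-iteration run, one dictionary per iteration.
schedule = [regularization_trajectory(i) for i in range(1, 41)]
```

Keeping the schedule in one function makes it easy to compare alternative trajectories on the same convergence charts.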
6 Discussion and conclusions
Learning a topic model from a text collection is an ill-posed problem of stochastic matrix factorization. It generally has infinitely many solutions, which is why solutions computed algorithmically are usually unstable and depend on random initialization. Bayesian regularization in latent Dirichlet allocation does not cope with this problem, indicating that the Dirichlet prior is too weak as a regularizer. More problem-oriented regularizers are needed to restrict the set of solutions.
In this paper we propose a semi-probabilistic approach named additive regularization of topic models (ARTM). It is based on the maximization of the weighted sum of the log-likelihood and additional regularization criteria. Learning a topic model is considered as a multi-criteria optimization problem, which is then reduced to a single-criterion problem via scalarization. To solve the optimization problem we use a general regularized EM-algorithm. Compared to the dominant Bayesian approach, ARTM avoids excessive probabilistic assumptions, simplifies the inference of the topic model and allows any combination of regularizers to be used.
ARTM provides the theoretical background for developing a library of unified regularizers. With such a library, topic models for various applications could be built simply by choosing a suitable combination of regularizers from a pool.
In this paper we introduced a general framework of ARTM under the following constraints, which we intend to remove in further research work.
We confined ourselves to a bag-of-words representation of the text collection, and have not considered more sophisticated topic models such as hierarchical, multigram, multilingual, etc. Applying additive regularization to these models will probably require more effort.
We have worked out only one numerical method, the regularized EM-algorithm, which is suitable for a broad class of regularizers. Alternative optimization techniques as well as their convergence and stability have not yet been considered.
Our review of regularizers is far from being complete. Besides, in our experimental study we have investigated only three of them: sparsing, smoothing, and decorrelation. We argue that this combination improves the interpretability of topics and therefore it is useful for many topic modeling applications. Extensive experiments with combinations of a wider set of regularizers are left beyond the scope of this paper.
Finally, faced with the problem of optimizing the regularization trajectory, we confined ourselves to a very simple visual technique for monitoring the convergence process and comparing topic models empirically.
Acknowledgments
The work was supported by the Russian Foundation for Basic Research Grants 140700847, 140700908, 140731176, by Skolkovo Institute of Science and Technology (project 081R) and by the program of the Department of Mathematical Sciences of Russian Academy of Sciences “Algebraic and combinatoric methods of mathematical cybernetics and information systems of new generation”. We thank Alexander Frey and Maria Ryskina for their help and valuable discussions, and Vitaly Glushachenkov for his experimental work on synthetic data.
References
Asuncion, A., Welling, M., Smyth, P., & Teh, Y. W. (2009). On smoothing and inference for topic models. In Proceedings of the international conference on uncertainty in artificial intelligence, pp. 27–34.
Blei, D., & Lafferty, J. (2007). A correlated topic model of science. Annals of Applied Statistics, 1, 17–35.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Neural information processing systems (NIPS), pp. 288–296.
Chemudugunta, C., Smyth, P., & Steyvers, M. (2007). Modeling general and specific aspects of documents with a probabilistic topic model (Vol. 19). Cambridge: MIT Press.
Chien, J. T., & Chang, Y. L. (2013). Bayesian sparse topic model. Journal of Signal Processing Systems, 74(3), 375–389.
Chien, J. T., & Wu, M. S. (2008). Adaptive Bayesian latent semantic analysis. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 198–207.
Cui, W., Liu, S., Tan, L., Shi, C., Song, Y., Gao, Z., et al. (2011). TextFlow: Towards better understanding of evolving topics in text. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2412–2421.
Daud, A., Li, J., Zhou, L., & Muhammad, F. (2010). Knowledge discovery through directed probabilistic topic models: A survey. Frontiers of Computer Science in China, 4(2), 280–301.
Dietz, L., Bickel, S., & Scheffer, T. (2007). Unsupervised prediction of citation influences. In Proceedings of the 24th international conference on machine learning, ICML ’07. ACM, New York, NY, USA, pp. 233–240.
Eisenstein, J., Ahmed, A., & Xing, E. P. (2011). Sparse additive generative models of text. In ICML ’11, pp. 1041–1048.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, NY, USA, pp. 50–57.
Kataria, S., Mitra, P., Caragea, C., & Giles, C. L. (2011). Context sensitive topic models for author influence in document networks. In Proceedings of the twenty-second international joint conference on artificial intelligence, IJCAI’11, Vol. 3. AAAI Press, pp. 2274–2280.
Khalifa, O., Corne, D., Chantler, M., & Halley, F. (2013). Multi-objective topic modelling. In 7th International conference evolutionary multi-criterion optimization (EMO 2013). Springer LNCS, pp. 51–65.
Larsson, M. O., & Ugander, J. (2011). A concave regularization technique for sparse mixture models. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, & K. Weinberger (Eds.), Advances in Neural Information Processing Systems 24, pp. 1890–1898.
Lu, Y., Mei, Q., & Zhai, C. (2011). Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval, 14(2), 178–203.
Mann, G. S., & McCallum, A. (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th international conference on machine learning, ICML ’07. ACM, New York, NY, USA, pp. 593–600.
Masada, T., Kiyasu, S., & Miyahara, S. (2008). Comparing LDA with pLSI as a dimensionality reduction method in document clustering. In Proceedings of the 3rd international conference on large-scale knowledge resources: construction and application, LKR’08. Springer, Berlin, pp. 13–26.
McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/mccallum/bow
Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the conference on empirical methods in natural language processing, association for computational linguistics, EMNLP ’11, Stroudsburg, PA, USA, pp. 262–272.
Newman, D., Karimi, S., & Cavedon, L. (2009). External evaluation of topic models. In Australasian document computing symposium, pp. 11–18.
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010a). Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics, HLT ’10. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 100–108.
Newman, D., Noh, Y., Talley, E., Karimi, S., & Baldwin, T. (2010b). Evaluating topic models for digital libraries. In Proceedings of the 10th annual joint conference on digital libraries, JCDL ’10. ACM, New York, NY, USA, pp. 215–224.
Newman, D., Bonilla, E. V., & Buntine, W. L. (2011). Improving topic coherence with regularized topic models. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, & K. Weinberger (Eds.), Advances in neural information processing systems, Vol. 24, pp. 496–504.
Potapenko, A. A., & Vorontsov, K. V. (2013). Robust PLSA performs better than LDA. In 35th European conference on information retrieval, ECIR-2013, Moscow, Russia, 24–27 March 2013, Lecture notes in computer science (LNCS). Springer, Germany, pp. 784–787.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In Proceedings of the 20th conference on uncertainty in artificial intelligence, UAI ’04. AUAI Press, Arlington, VA, USA, pp. 487–494.
Rubin, T. N., Chambers, A., Smyth, P., & Steyvers, M. (2012). Statistical topic models for multi-label document classification. Machine Learning, 88(1–2), 157–208.
Shashanka, M., Raj, B., & Smaragdis, P. (2008). Sparse overcomplete latent variable decomposition of counts data. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, NIPS-2007 (pp. 1313–1320). Cambridge, MA: MIT Press.
Si, L., & Jin, R. (2005). Adjusting mixture weights of Gaussian mixture model via regularized probabilistic latent semantic analysis. In T. B. Ho, D. W. L. Cheung, & H. Liu (Eds.), Proceedings of the ninth Pacific-Asia conference on knowledge discovery and data mining (PAKDD), Lecture notes in computer science, Vol. 3518. Springer, Berlin, pp. 622–631.
Steyvers, M., & Griffiths, T. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1), 5228–5235.
Tan, Y., & Ou, Z. (2010). Topic-weak-correlated latent Dirichlet allocation. In 7th International symposium Chinese spoken language processing (ISCSLP), pp. 224–228.
Teh, Y. W., Newman, D., & Welling, M. (2006). A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, pp. 1353–1360.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solution of ill-posed problems. Washington, DC: W. H. Winston.
Varadarajan, J., Emonet, R., & Odobez, J. M. (2010). A sparsity constraint for topic models: Application to temporal activity mining. In NIPS-2010 workshop on practical applications of sparse modeling: Open issues and new directions.
Wallach, H., Mimno, D., & McCallum, A. (2009). Rethinking LDA: Why priors matter. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems 22: 23rd Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada, pp. 1973–1981.
Wang, C., & Blei, D. M. (2009). Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In NIPS. Curran Associates, Inc., pp. 1982–1989.
Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp. 448–456.
Wang, Q., Xu, J., Li, H., & Craswell, N. (2011). Regularized latent semantic indexing. In SIGIR, pp. 685–694.
Wu, Y., Ding, Y., Wang, X., & Xu, J. (2010). A comparative study of topic models for topic clustering of Chinese web news. In 2010 3rd IEEE international conference on computer science and information technology (ICCSIT), Vol. 5, pp. 236–240.
Zhou, S., Li, K., & Liu, Y. (2009). Text categorization based on topic model. International Journal of Computational Intelligence Systems, 2(4), 398–409.