Additive Regularization of Topic Models for Topic Selection and Sparse Factorization

  • Konstantin Vorontsov
  • Anna Potapenko
  • Alexander Plavin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9047)


Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis. Determining the optimal number of topics remains a challenging problem in topic modeling. We propose a simple entropy regularization for topic selection in terms of Additive Regularization of Topic Models (ARTM), a multicriteria approach for combining regularizers. The entropy regularization gradually eliminates insignificant and linearly dependent topics. This process converges to the correct number of topics on semi-real data. On real text collections it can be combined with sparsing, smoothing, and decorrelation regularizers to produce a sequence of models with different numbers of well-interpretable topics.
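As a rough illustration of the ARTM approach summarized above, the sketch below runs a toy EM loop in which the M-step replaces the expected counts by their regularized, nonnegatively truncated versions before normalizing. It uses a simple uniform sparsing regularizer as a stand-in, not the paper's entropy topic-selection regularizer; the data, dimensions, and coefficient τ are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

W, D, T = 6, 4, 3  # words, documents, topics (T deliberately overestimated)
# Toy term-document count matrix (hypothetical data).
ndw = rng.integers(0, 5, size=(W, D)).astype(float)

phi = rng.dirichlet(np.ones(W), size=T).T    # p(w|t), shape (W, T)
theta = rng.dirichlet(np.ones(T), size=D).T  # p(t|d), shape (T, D)

tau = 0.1  # regularization coefficient (illustrative value)

for _ in range(50):
    # E-step: posterior p(t|d,w) for every (word, document) pair.
    pwd = phi @ theta                        # p(w|d), shape (W, D)
    pwd = np.maximum(pwd, 1e-12)
    ptdw = phi[:, :, None] * theta[None, :, :] / pwd[:, None, :]  # (W, T, D)

    # Expected counts n_wt and n_td.
    nwt = (ndw[:, None, :] * ptdw).sum(axis=2)  # (W, T)
    ntd = (ndw[:, None, :] * ptdw).sum(axis=0)  # (T, D)

    # M-step with an additive regularizer: counts are shifted by the
    # regularizer's gradient term and truncated at zero before normalizing.
    # Uniform sparsing (subtracting tau) can drive whole topics to zero,
    # which is the mechanism topic-selection regularizers exploit.
    phi = np.maximum(nwt - tau, 0.0)
    phi /= np.maximum(phi.sum(axis=0, keepdims=True), 1e-12)
    theta = np.maximum(ntd - tau, 0.0)
    theta /= np.maximum(theta.sum(axis=0, keepdims=True), 1e-12)
```

After convergence, each column of `phi` either remains a probability distribution over words or collapses to all zeros, i.e. the topic is eliminated; this truncate-and-renormalize step is what lets a single framework combine sparsing, smoothing, and selection criteria additively.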


Probabilistic topic modeling Regularization Probabilistic latent semantic analysis Topic selection EM-algorithm





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Konstantin Vorontsov — 1
  • Anna Potapenko — 2
  • Alexander Plavin — 3
  1. Moscow Institute of Physics and Technology, Dorodnicyn Computing Centre of RAS, National Research University Higher School of Economics, Moscow, Russia
  2. National Research University Higher School of Economics, Moscow, Russia
  3. Moscow Institute of Physics and Technology, Moscow, Russia
