Skip to main content

Parallel Non-blocking Deterministic Algorithm for Online Topic Modeling

Part of the Communications in Computer and Information Science book series (CCIS,volume 661)

Abstract

In this paper we present a new asynchronous algorithm for learning additively regularized topic models and discuss the main architectural details of our implementation. The key property of the new algorithm is that it behaves in a fully deterministic fashion, which is typically hard to achieve in a non-blocking parallel implementation. The algorithm had been recently implemented in the BigARTM library (http://bigartm.org). Our new algorithm is compatible with all features previously introduced in BigARTM library, including multimodality, regularizers and scores calculation. While the existing BigARTM implementation compares favorably with alternative packages such as Vowpal Wabbit or Gensim, the new algorithm brings further improvements in CPU utilization, memory usage, and spends even less time to achieve the same perplexity.

Keywords

  • Probabilistic topic modeling
  • Additive regularization of topic models
  • Stochastic matrix factorization
  • EM-algorithm
  • Online learning
  • Asynchronous and parallel computing
  • BigARTM

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-319-52920-2_13
  • Chapter length: 13 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   69.99
Price excludes VAT (USA)
  • ISBN: 978-3-319-52920-2
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   89.99
Price excludes VAT (USA)
Fig. 1.
Fig. 2.
Fig. 3.
Fig. 4.
Fig. 5.
Fig. 6.
Fig. 7.
Fig. 8.

Notes

  1. 1.

    https://archive.ics.uci.edu/ml/datasets/Bag+of+Words.

References

  1. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)

    CrossRef  Google Scholar 

  2. Daud, A., Li, J., Zhou, L., Muhammad, F.: Knowledge discovery through directed probabilistic topic models: a survey. Front. Comput. Sci. Chin 4(2), 280–301 (2010)

    CrossRef  Google Scholar 

  3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

    MATH  Google Scholar 

  4. Yuan, J., Gao, F., Ho, Q., Dai, W., Wei, J., Zheng, X., Xing, E.P., Liu, T.Y., Ma, W.Y.: LightLDA: big topic models on modest computer clusters. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1351–1361 (2015)

    Google Scholar 

  5. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57 (1999)

    Google Scholar 

  6. Hoffman, M.D., Blei, D.M., Bach, F.R.: Online learning for latent Dirichlet allocation. In: NIPS, pp. 856–864. Curran Associates Inc. (2010)

    Google Scholar 

  7. Newman, D., Asuncion, A., Smyth, P., Welling, M.: Distributed algorithms for topic models. J. Mach. Learn. Res. 10, 1801–1828 (2009)

    MathSciNet  MATH  Google Scholar 

  8. Wang, Y., Bai, H., Stanton, M., Chen, W.-Y., Chang, E.Y.: PLDA: parallel latent Dirichlet allocation for large-scale applications. In: Goldberg, A.V., Zhou, Y. (eds.) AAIM 2009. LNCS, vol. 5564, pp. 301–314. Springer, Heidelberg (2009). doi:10.1007/978-3-642-02158-9_26

    CrossRef  Google Scholar 

  9. Liu, Z., Zhang, Y., Chang, E.Y., Sun, M.: PLDA+: parallel latent Dirichlet allocation with data placement and pipeline processing. ACM Trans. Intell. Syst. Technol. 2(3), 26:1–26:18 (2011)

    CrossRef  Google Scholar 

  10. Seide, F., Fu, H., Droppo, J., Li, G., Yu, D.: On parallelizability of stochastic gradient descent for speech DNNs. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 235–239. IEEE (2014)

    Google Scholar 

  11. Dean, J., Corrado, G.S., Monga, R., Chen, K., Devin, M., Le, Q.V., Mao, M.Z., Ranzato, M.-A., Senior, A., Tucker, P., Yang, K., Ng, A.Y.: Large scale distributed deep networks. In: NIPS, pp. 1223–1231 (2012)

    Google Scholar 

  12. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, Valletta, Malta, May 2010

    Google Scholar 

  13. Langford, J., Li, L., Strehl, A.: Vowpal wabbit open source project. Technical report. Yahoo! (2007)

    Google Scholar 

  14. McCallum, A.K.: A Machine Learning for Language Toolkit (2002). http://mallet.cs.umass.edu

  15. Smola, A., Narayanamurthy, S.: An architecture for parallel topic models. Proc. VLDB Endow. 3(1–2), 703–710 (2010)

    CrossRef  Google Scholar 

  16. Vorontsov, K.V.: Additive regularization for topic models of text collections. Dokl. Math. 89(3), 301–304 (2014)

    MathSciNet  CrossRef  MATH  Google Scholar 

  17. Vorontsov, K.V., Potapenko, A.A.: Additive regularization of topic models. Mach. Learn. Spec. Issue Data Anal. Intell. Optim. 101(1), 303–323 (2015)

    MathSciNet  MATH  Google Scholar 

  18. Vorontsov, K., Frei, O., Apishev, M., Romov, P., Dudarenko, M.: BigARTM: open source library for regularized multimodal topic modeling of large collections. In: Khachay, M.Y., Konstantinova, N., Panchenko, A., Ignatov, D.I., Labunets, V.G. (eds.) AIST 2015. CCIS, vol. 542, pp. 370–381. Springer, Heidelberg (2015). doi:10.1007/978-3-319-26123-2_36

    CrossRef  Google Scholar 

  19. Vorontsov, K., Frei, O., Apishev, M., Romov, P., Suvorova, M., Yanina, A.: Non-Bayesian additive regularization for multimodal topic modeling of large collections. In: Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications, pp. 29–37. ACM, New York (2015)

    Google Scholar 

Download references

Acknowledgements

The work was supported by Russian Science Foundation (grant 15-18-00091). Also we would like to thank Prof. K. V. Vorontsov for constant attention to our research and detailed feedback to this paper.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Oleksandr Frei or Murat Apishev .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Frei, O., Apishev, M. (2017). Parallel Non-blocking Deterministic Algorithm for Online Topic Modeling. In: , et al. Analysis of Images, Social Networks and Texts. AIST 2016. Communications in Computer and Information Science, vol 661. Springer, Cham. https://doi.org/10.1007/978-3-319-52920-2_13

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-52920-2_13

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-52919-6

  • Online ISBN: 978-3-319-52920-2

  • eBook Packages: Computer ScienceComputer Science (R0)