Larger Residuals, Less Work: Active Document Scheduling for Latent Dirichlet Allocation

  • Mirwaes Wahabzada
  • Kristian Kersting
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6913)

Abstract

Recently, there have been considerable advances in fast inference for latent Dirichlet allocation (LDA). In particular, stochastic optimization of the variational Bayes (VB) objective function with a natural gradient step has been proved to converge and shown to be able to process massive document collections. To reduce noise in the gradient estimate, it considers multiple documents chosen uniformly at random. While it is widely recognized that the scheduling of documents in stochastic optimization can have significant consequences, this issue remains largely unexplored. In this work, we address this issue. Specifically, we propose residual LDA, a novel, easy-to-implement LDA approach that schedules documents in an informed way. Intuitively, in each iteration, residual LDA actively selects documents that exert a disproportionately large influence on the current residual to compute the next update. On several real-world datasets, including 3M articles from Wikipedia, we demonstrate that residual LDA can handily analyze massive document collections and find topic models as good as or better than those found with batch VB and randomly scheduled VB, and do so significantly faster.
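To make the scheduling idea concrete, the following minimal Python/numpy sketch wraps a residual-proportional document sampler around the standard stochastic (natural-gradient) variational Bayes update for LDA. It is an illustrative simplification, not the authors' implementation: the per-document residual used here (the L2 error between a document's word counts and their reconstruction from the current topics), the batch size, and the step-size schedule (tau0 + t)^(-kappa) are assumptions made for the example; the paper's exact residual definition and schedule may differ.

    # A minimal sketch, not the paper's exact algorithm: residual-proportional document
    # scheduling around the online (stochastic) variational Bayes update for LDA.
    # Assumption: the per-document residual is the L2 error between observed word counts
    # and their reconstruction from the current topics; the paper may define it differently.
    import numpy as np
    from scipy.special import digamma

    def residual_scheduled_online_lda(counts, K=10, batch_size=64, n_iters=100,
                                      alpha=0.1, eta=0.01, tau0=1.0, kappa=0.7, seed=0):
        rng = np.random.default_rng(seed)
        D, W = counts.shape                                   # documents x vocabulary (dense counts)
        lam = rng.gamma(100.0, 0.01, (K, W))                  # variational topic-word parameters
        residuals = counts.sum(axis=1).astype(float) + 1e-12  # initialize residuals with document lengths

        for t in range(n_iters):
            # Active scheduling: pick documents with probability proportional to their residual.
            probs = residuals / residuals.sum()
            batch = rng.choice(D, size=batch_size, replace=False, p=probs)
            cts = counts[batch]

            # Local E-step: standard mean-field updates for the mini-batch.
            expElogbeta = np.exp(digamma(lam) - digamma(lam.sum(axis=1, keepdims=True)))  # K x W
            gamma = rng.gamma(100.0, 0.01, (batch_size, K))
            for _ in range(25):
                expElogtheta = np.exp(digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True)))
                phinorm = expElogtheta @ expElogbeta + 1e-100                              # batch x W
                gamma = alpha + expElogtheta * ((cts / phinorm) @ expElogbeta.T)

            # Global step: natural-gradient update of lambda with a decaying step size.
            rho = (tau0 + t) ** (-kappa)
            sstats = (expElogtheta.T @ (cts / phinorm)) * expElogbeta
            lam = (1.0 - rho) * lam + rho * (eta + (D / batch_size) * sstats)

            # Refresh residuals for the scheduled documents (assumed L2 reconstruction error).
            theta = gamma / gamma.sum(axis=1, keepdims=True)
            beta = lam / lam.sum(axis=1, keepdims=True)
            recon = (theta @ beta) * cts.sum(axis=1, keepdims=True)
            residuals[batch] = np.linalg.norm(cts - recon, axis=1) + 1e-12

        return lam, residuals

Given a dense document-term count matrix counts of shape D x W, calling residual_scheduled_online_lda(counts, K=50) returns the fitted topic-word parameters and the final residuals; documents with large residuals are those the scheduler would revisit first.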

Keywords

Markov Chain Monte Carlo, Document Collection, Topic Model, Latent Dirichlet Allocation, Batch Size


Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mirwaes Wahabzada (1)
  • Kristian Kersting (1)
  1. Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany
