Information Retrieval, Volume 14, Issue 2, pp 178–203

Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA

Abstract

Probabilistic topic models have recently attracted much attention because of their successful applications in many text mining tasks such as retrieval, summarization, categorization, and clustering. Although many existing studies have reported promising performance of these topic models, none has systematically investigated their task performance; as a result, several critical questions that affect all applications of topic models remain largely unanswered: how to choose between competing models, how multiple local maxima affect task performance, and how to set the parameters of topic models. In this paper, we address these questions through a systematic investigation of two representative probabilistic topic models, probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), on three representative text mining tasks: document clustering, text categorization, and ad hoc retrieval. The analysis of our experimental results provides a deeper understanding of topic models and many useful insights into how to optimize their performance for these typical tasks. The task-based evaluation framework generalizes to other topic models in the PLSA or LDA family.
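To make the local-maxima question raised above concrete: PLSA is fitted with the EM algorithm, which converges only to a local maximum of the likelihood, so different random initializations generally yield different models. The following minimal numpy sketch illustrates this; it is not the paper's implementation, and the synthetic Poisson data, the helper name plsa_em, and all parameter settings (K = 10, 100 EM iterations) are illustrative assumptions.

```python
import numpy as np

def plsa_em(X, K, n_iter=100, seed=0):
    """Fit PLSA on a document-term count matrix X (D x V) by EM.

    Returns theta = P(z|d) of shape (D, K), phi = P(w|z) of shape
    (K, V), and the final log-likelihood of the counts under the model.
    """
    rng = np.random.default_rng(seed)
    D, V = X.shape
    theta = rng.random((D, K))
    theta /= theta.sum(axis=1, keepdims=True)
    phi = rng.random((K, V))
    phi /= phi.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # The E-step responsibilities P(z|d,w) enter both M-step
        # updates only through the ratio X / P(w|d), so compute it once.
        ratio = X / (theta @ phi + 1e-12)       # (D, V)
        theta_new = theta * (ratio @ phi.T)     # unnormalized new P(z|d)
        phi_new = phi * (theta.T @ ratio)       # unnormalized new P(w|z)
        theta = theta_new / theta_new.sum(axis=1, keepdims=True)
        phi = phi_new / phi_new.sum(axis=1, keepdims=True)
    loglik = float((X * np.log(theta @ phi + 1e-12)).sum())
    return theta, phi, loglik

# EM finds only a local maximum: different random starts on the same
# (hypothetical) data reach solutions with different likelihoods.
X = np.random.default_rng(42).poisson(0.3, size=(50, 300)).astype(float)
print([round(plsa_em(X, K=10, seed=s)[2], 1) for s in range(3)])
```

Because the attained likelihood varies across restarts, any task-based comparison of PLSA and LDA has to account for initialization, for example by running several restarts and reporting the spread of task performance rather than a single run.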

Keywords

Evaluation · Topic models · LDA · PLSA · Experimentation · Performance

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA
  2. School of Information, University of Michigan, Ann Arbor, USA
