Constrained-hLDA for Topic Discovery in Chinese Microblogs

Wang, Wei; Xu, Hua; Yang, Weiwei; Huang, Xiaoqiu

doi:10.1007/978-3-319-06605-9_50

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8444)

Abstract

Since microblog services became web-scale information providers, research on microblogs has increasingly focused on content mining. Most research on microblog content is based on topic models such as Latent Dirichlet Allocation (LDA) and its variants. However, previous research faces some challenges. On the one hand, the number of topics must be fixed a priori, and in practice it is supplied by the user. On the other hand, these models ignore the hierarchical structure of topics and cannot grow structurally as more data are observed. In this paper, we propose a semi-supervised hierarchical topic model that aims to discover more reasonable topics in the data space by incorporating automatically extracted constraints into the modeling process. The new method is denoted constrained hierarchical Latent Dirichlet Allocation (constrained-hLDA). We conduct experiments on Sina microblog data and evaluate performance in terms of clustering and empirical likelihood. The experimental results show that constrained-hLDA significantly improves interpretability and that its predictive ability is also better than that of hLDA.
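
To make the phrase "grow structurally as more data are observed" concrete, the following is a minimal sketch of the nested Chinese restaurant process, the prior over tree paths that hLDA builds on. It illustrates only that prior: it includes neither the constraints proposed in this paper nor any inference over words. The names `sample_path` and `grow_tree` and the parameter values are assumptions chosen for illustration, not taken from the paper.

```python
import random
from collections import defaultdict


def sample_path(child_counts, depth, gamma, rng):
    """Draw one root-to-leaf path from a nested CRP prior and record it.

    child_counts maps a node (a tuple of child indices from the root) to a
    list holding, for each child, how many documents have passed through it.
    """
    node = ()  # the root is the empty tuple
    for _ in range(depth - 1):
        counts = child_counts[node]
        # An existing child is chosen in proportion to its count;
        # a new child is opened with weight gamma.
        r = rng.random() * (sum(counts) + gamma)
        choice = len(counts)  # default: open a new child
        acc = 0.0
        for i, c in enumerate(counts):
            acc += c
            if r < acc:
                choice = i
                break
        if choice == len(counts):
            counts.append(0)
        counts[choice] += 1
        node = node + (choice,)
    return node


def grow_tree(n_docs, depth, gamma, seed):
    """Assign n_docs documents to paths one at a time, growing the tree."""
    rng = random.Random(seed)
    child_counts = defaultdict(list)
    paths = [sample_path(child_counts, depth, gamma, rng) for _ in range(n_docs)]
    return child_counts, paths


if __name__ == "__main__":
    tree, paths = grow_tree(n_docs=200, depth=3, gamma=1.0, seed=0)
    print("children of the root:", len(tree[()]))
    print("distinct paths used:", len(set(paths)))
```

Because a new branch can be opened at every level for every document, the number of nodes is not fixed in advance and can only increase as documents are added. A larger `gamma` makes new branches more likely.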

Author information

Authors and Affiliations

State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Wei Wang, Hua Xu, Weiwei Yang & Xiaoqiu Huang

Editor information

Editors and Affiliations

National Cheng Kung University, Tainan, Taiwan, R.O.C.
Vincent S. Tseng & Hung-Yu Kao
Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Tu Bao Ho
Nanjing University, China
Zhi-Hua Zhou
National Chengchi University, Taipei, Taiwan, R.O.C.
Arbee L. P. Chen

About this paper

Cite this paper

Wang, W., Xu, H., Yang, W., Huang, X. (2014). Constrained-hLDA for Topic Discovery in Chinese Microblogs. In: Tseng, V.S., Ho, T.B., Zhou, ZH., Chen, A.L.P., Kao, HY. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8444. Springer, Cham. https://doi.org/10.1007/978-3-319-06605-9_50

DOI: https://doi.org/10.1007/978-3-319-06605-9_50
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06604-2
Online ISBN: 978-3-319-06605-9
eBook Packages: Computer Science, Computer Science (R0)
