Abstract
Over the past few years, microblogging websites such as Twitter have grown increasingly popular. Unlike traditional media, tweets are structured data that contain many noisy words. Topic modeling algorithms for traditional media have been studied well, but our understanding of Twitter remains limited, and few algorithms are designed specifically to mine Twitter data according to its own characteristics. Previous studies usually employ only one type of topic to analyze hot topics of the Twitter community and are strongly affected by the large number of noisy words in tweets. We have observed that users in the Twitter community tend to discuss two types of topics: one focuses mainly on their personal lives, and the other on hot issues in society. These two types of topics usually follow different distributions. In this paper, we introduce the Categorical Topic Model. The model uses features of Twitter data to divide topics into two semantic types and introduces a word distribution for background words to filter out noisy words. It can discover the different types of topics efficiently, indicate which topics a user is interested in, and find hot issues of the Twitter community. Using Gibbs sampling, we compare our model with Latent Dirichlet Allocation and the Author Topic Model on the TREC2011 data set, and we discuss examples of the discovered public and personal topics.
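The abstract does not spell out the model's generative process, so the following is only a toy sketch of the general idea it describes: each word is either a background (noise) word or a topical word, and topical words come from either a personal or a public topic. Every name and parameter here (`sample_tweet`, `p_background`, `p_public`, one topic per tweet, uniform word choice within a topic) is an illustrative assumption, not the authors' specification.

```python
import random

def sample_tweet(n_words, personal_theta, public_theta, topic_words,
                 background_words, p_background=0.3, p_public=0.5, rng=None):
    """Toy generative sketch (assumed, not from the paper).

    Returns a list of (word, source) pairs, where source is
    "background", "personal", or "public".
    """
    if rng is None:
        rng = random.Random(0)
    # Assumption: a tweet is about one topic, either public or personal.
    if rng.random() < p_public:
        kind, theta = "public", public_theta
    else:
        kind, theta = "personal", personal_theta
    topics = list(theta)
    topic = rng.choices(topics, weights=[theta[t] for t in topics])[0]
    words = []
    for _ in range(n_words):
        # A per-word switch routes noisy words to the background distribution.
        if rng.random() < p_background:
            words.append((rng.choice(background_words), "background"))
        else:
            words.append((rng.choice(topic_words[topic]), kind))
    return words

# Hypothetical vocabularies, for illustration only.
topic_words = {"family": ["dinner", "kids"], "election": ["vote", "poll"]}
tweet = sample_tweet(8, {"family": 1.0}, {"election": 1.0},
                     topic_words, ["lol", "rt"], rng=random.Random(1))
```

In a sketch like this, inference (for example by Gibbs sampling) would run the process in reverse, estimating the hidden switch and topic assignments from observed words.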
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zheng, L., Han, K. (2013). Extracting Categorical Topics from Tweets Using Topic Model. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_8
DOI: https://doi.org/10.1007/978-3-642-45068-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45067-9
Online ISBN: 978-3-642-45068-6
eBook Packages: Computer Science, Computer Science (R0)