Abstract
Topic detection in short text has become an important task in content analysis applications. Topic modeling is an effective way to discover topics by finding document-level word co-occurrence patterns. Most conventional topic models are based on a bag-of-words representation, in which the context of words is ignored. Moreover, when such models are applied directly to short text, they suffer from a lack of co-occurrence patterns because unigram representations are sparse. Existing work either expands the data using external knowledge resources or simply aggregates semantically related short texts. These methods generally produce low-quality topic representations or suffer from poor semantic correlation between different data resources. In this paper, we propose a different method that is both computationally efficient and effective. Our method applies frequent pattern mining to uncover statistically significant patterns that explicitly capture semantic associations and co-occurrences among corpus-level words. We use these frequent patterns as feature units to represent texts, referred to as pattern set-based text representation (PSTR). In addition, to represent text more precisely, we propose a new probabilistic topic model called LDA-PSTR, together with an improved Gibbs sampling algorithm for it. Experiments on different corpora show that this approach discovers more prominent and coherent topics and achieves significant performance improvements on several evaluation metrics.
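The idea of representing texts by frequent patterns rather than single words can be illustrated with a minimal sketch. This is not the authors' PSTR procedure or the LDA-PSTR model, whose details are not given in the abstract. It only assumes that a "pattern" is a set of words occurring together in at least `min_support` documents; the function names, the `min_support` and `max_size` parameters, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_patterns(docs, min_support=2, max_size=2):
    """Return word sets (as sorted tuples) found in at least min_support docs.

    Brute-force counting for illustration only; a real miner such as
    Apriori or FP-Growth would prune the search space.
    """
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc))  # each pattern counted once per document
        for size in range(1, max_size + 1):
            for pattern in combinations(words, size):
                counts[pattern] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def to_pattern_set(doc, patterns):
    """Represent one document by the frequent patterns it fully contains."""
    words = set(doc)
    return sorted(p for p in patterns if words.issuperset(p))


# Hypothetical toy corpus of tokenized short texts.
docs = [
    ["apple", "iphone", "launch"],
    ["apple", "iphone", "price"],
    ["football", "match", "goal"],
    ["football", "goal", "score"],
]

patterns = mine_frequent_patterns(docs, min_support=2)
representations = [to_pattern_set(d, patterns) for d in docs]

for doc, rep in zip(docs, representations):
    print(doc, "->", rep)
```

In this sketch, rare words such as "launch" drop out, while co-occurring pairs such as ("apple", "iphone") become explicit feature units, which is the kind of signal a bag of unigrams loses on sparse short texts.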
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhou, K., Yang, Q. (2018). LDA-PSTR: A Topic Modeling Method for Short Text. In: Gan, G., Li, B., Li, X., Wang, S. (eds) Advanced Data Mining and Applications. ADMA 2018. Lecture Notes in Computer Science, vol 11323. Springer, Cham. https://doi.org/10.1007/978-3-030-05090-0_29
DOI: https://doi.org/10.1007/978-3-030-05090-0_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05089-4
Online ISBN: 978-3-030-05090-0
eBook Packages: Computer Science, Computer Science (R0)