LDA-PSTR: A Topic Modeling Method for Short Text

Zhou, Kai; Yang, Qun

doi:10.1007/978-3-030-05090-0_29


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11323)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications


Abstract

Topic detection in short text has become an important task in content analysis applications. Topic modeling is an effective way to discover topics by finding document-level word co-occurrence patterns. Most conventional topic models are based on a bag-of-words representation, in which the context of words is ignored. Moreover, when such models are applied directly to short text, the sparseness of the unigram representation leaves too few co-occurrence patterns to learn from. Existing work either expands the data using external knowledge resources or simply aggregates semantically related short texts. These methods generally produce low-quality topic representations or suffer from poor semantic correlation between the different data resources. In this paper, we propose a different method that is computationally efficient and effective. Our method applies frequent pattern mining to uncover statistically significant patterns that explicitly capture semantic associations and co-occurrences among words at the corpus level. We use these frequent patterns as feature units to represent texts, which we refer to as pattern set-based text representation (PSTR). In addition, to represent text more precisely, we propose a new probabilistic topic model called LDA-PSTR and develop an improved Gibbs sampling algorithm for it. Experiments on different corpora show that this approach discovers more prominent and coherent topics and achieves significant performance improvements on several evaluation metrics.
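The idea of representing a short text by the frequent patterns it contains can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the mining algorithm, the support threshold, or the pattern size limit, so the function names (`mine_frequent_patterns`, `to_pattern_set`), the parameters `min_support` and `max_size`, and the toy corpus below are all assumptions made for illustration. The sketch enumerates word sets by brute force, which is only practical for very short texts; a real system would more likely use an algorithm such as Apriori or FP-growth.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_patterns(docs, min_support=2, max_size=2):
    """Count word sets of up to max_size words and keep those that
    occur in at least min_support documents.

    Illustrative brute-force enumeration (an assumption, not the
    paper's mining procedure).
    """
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc))  # each word counts once per document
        for size in range(1, max_size + 1):
            for combo in combinations(words, size):
                counts[frozenset(combo)] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def to_pattern_set(doc, patterns):
    """Represent a document by the frequent patterns it fully contains."""
    words = set(doc)
    return sorted(tuple(sorted(p)) for p in patterns if p <= words)


# Hypothetical toy corpus of tokenized short texts.
docs = [
    ["apple", "iphone", "launch"],
    ["apple", "iphone", "price"],
    ["football", "match", "goal"],
    ["football", "goal", "score"],
]

patterns = mine_frequent_patterns(docs, min_support=2, max_size=2)
represented = [to_pattern_set(doc, patterns) for doc in docs]
```

In this toy corpus the first text is represented by the patterns `('apple',)`, `('apple', 'iphone')` and `('iphone',)`. The rare words `launch` and `price` fall below the support threshold and drop out, while the co-occurring pair is kept as an explicit feature unit. How LDA-PSTR then models these pattern sets is described in the full paper, not in the abstract.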

Author information

Authors and Affiliations

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, Jiangsu, China
Kai Zhou & Qun Yang

Corresponding author

Correspondence to Qun Yang.

Editor information

Editors and Affiliations

University of Connecticut, Storrs, CT, USA
Guojun Gan
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Bohan Li
The University of Queensland, Brisbane, QLD, Australia
Xue Li
Beijing Institute of Technology, Beijing, China
Shuliang Wang

Cite this paper

Zhou, K., Yang, Q. (2018). LDA-PSTR: A Topic Modeling Method for Short Text. In: Gan, G., Li, B., Li, X., Wang, S. (eds) Advanced Data Mining and Applications. ADMA 2018. Lecture Notes in Computer Science, vol 11323. Springer, Cham. https://doi.org/10.1007/978-3-030-05090-0_29

DOI: https://doi.org/10.1007/978-3-030-05090-0_29
Published: 29 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05089-4
Online ISBN: 978-3-030-05090-0
eBook Packages: Computer Science, Computer Science (R0)
