The Latent Dirichlet Allocation (LDA) topic model is a popular research topic in the field of text mining. This paper proposes SKP-LDA, an LDA short text clustering algorithm based on sentiment word co-occurrence and knowledge pair feature extraction. First, a word bag based on sentiment word co-occurrence is defined. The co-occurrence of emotional words is computed across different short texts, and the short texts of a microblog are then endowed with emotional polarity. Next, knowledge pairs of topic special words and topic relation words are extracted and inserted into the LDA model for clustering, so that semantic information can be found more accurately. The n hidden topics and the Top30 special word set of each topic are then extracted from the knowledge pair set. Finally, the Top30 topic special word sets obtained by primary clustering with the LDA topic model are clustered again by K-means secondary clustering, with the cluster centers optimized iteratively. Compared with JST, LSM, LTM and ELDA, SKP-LDA performs better in terms of Accuracy, Precision, Recall and F-measure. The experimental results show that SKP-LDA offers better semantic analysis ability and a better emotional topic clustering effect. It can be applied to microblog data to improve the accuracy of network public opinion analysis.
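The abstract does not give the exact definition of the sentiment word co-occurrence word bag, so the following is only a minimal sketch of the general idea, not the SKP-LDA procedure itself. It assumes a toy sentiment lexicon (`LEXICON`), whitespace tokenization, and a sign-of-sum polarity rule; the names `sentiment_cooccurrence` and `polarity` are hypothetical and introduced for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy sentiment lexicon (an assumption, not from the paper):
# word -> polarity score, +1 for positive and -1 for negative.
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "awful": -1, "sad": -1}


def sentiment_cooccurrence(texts):
    """Count how often each unordered pair of sentiment words shares a short text."""
    pairs = Counter()
    for text in texts:
        # Keep each sentiment word once per text, sorted so (a, b) and (b, a) coincide.
        words = sorted({w for w in text.lower().split() if w in LEXICON})
        pairs.update(combinations(words, 2))
    return pairs


def polarity(text):
    """Label a short text by the sign of its summed lexicon scores (assumed rule)."""
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


texts = [
    "great phone and good battery",
    "bad screen awful support",
    "good price but bad delivery",
    "the parcel arrived today",
]
pairs = sentiment_cooccurrence(texts)
labels = [polarity(t) for t in texts]
```

In this sketch the co-occurrence counts play the role of word bag features and the labels stand in for the emotional polarity attached to each short text. A real pipeline would use a proper sentiment dictionary and a word segmenter suited to microblog text.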
Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 3, 993–1022.
Chang, P., & Ma, H. (2011). Efficient short texts keyword extraction method analysis[J]. Computer Engineering & Applications, 47(20), 126–128, 154.
Chen, Z., & Liu, B. Topic modeling using topics from many domains, lifelong learning and big data[C].
Hao, J., Xie, J., Su, J.Q., et al. (2016). An unsupervised approach for sentiment classification based on weighted latent Dirichlet allocation[J]. CAAI Transactions on Intelligent Systems, 11(4), 539–545.
He, Y. (2011). Latent sentiment model for weakly-supervised crosslingual sentiment classification[J]. Advances in Information Retrieval, 6611, 214–225.
Huang, F.L., Yu, G., Zhang, J.L., et al. (2017). Mining topic sentiment in micro-blogging based on micro-blogger social relation[J]. Journal of Software, 28(3), 694–707.
Kozlowski, M., & Rybinski, H. (2019). Clustering of semantically enriched short texts[J]. Journal of Intelligent Information Systems, 53(1), 69–92.
Lin, C., & He, Y. (2009). Joint sentiment topic model for sentiment analysis[C]. In Proceedings of the 18th ACM conference on information and knowledge management (pp. 375–384). New York: ACM Press.
Liu, B.Y., Wang, C.R., Wang, C., et al. (2017). Micro-blog community discovery algorithm based on dynamic topic model with multidimensional data fusion[J]. Journal of Software, 28(2), 246–261.
Liu, Z., Liu, C.Y., Xia, B., & Li, T. (2018). Multiple relational topic modeling for noisy short texts[J]. International Journal of Software Engineering and Knowledge Engineering, 28(11), 1559–1574.
Lu, L., Fuxi, Z., Rong, G., et al. (2018). Point of interest joint recommendation method based on user-content topic model[J]. Computer Engineering & Applications, 4, 154–159.
Peng, M., Huang, J.J., Zhu, J.H., et al. (2015). Mass of short texts clustering and topic extraction based on frequent itemsets[J]. Journal of Computer Research & Development, 52(9), 1941–1953.
Qi, J., Xun, L., Zhou, X., et al. (2018). Micro-blog user community discovery using generalized simrank edge weighting method[J]. PLoS ONE, 13(5).
Qu, J., Chen, Z., & Zheng, Y. (2018). Research on the text clustering method of science and technology reports based on the topic model[J]. Library & Information Service.
Shams, M., & Baraani-Dastjerdi, A. (2017). Enriched LDA (ELDA): combination of latent Dirichlet allocation with word co-occurrence analysis for aspect extraction[J]. Expert Systems with Applications, 80, 136–146.
Sun, Y., & Zhou, X.G. (2013). Unsupervised topic and sentiment unification model for sentiment analysis[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 49(1), 102–108.
Tago, K., & Jin, Q. (2018). Influence analysis of emotional behaviors and user relationships based on twitter data[J]. Tsinghua Science & Technology, 23(1), 104–113.
Wan, H.X., & Peng, Y. (2018). Topic words extraction of social media based on semantic constrained and time associated LDA[J]. Journal of Chinese Computer Systems, 39(4), 742–747.
Wang, X.W., & Zhang, K. (2012). Improved expansion algorithm based on co-occurrence relationship between short text feature[J]. Journal of Henan University of Urban Construction, 21(4), 48–50.
Xiong, S., Wang, K., Ji, D., et al. (2018). A short text sentiment-topic model for product reviews[J]. Neurocomputing, 297, 94–102.
Yong, M.C., Qing, C., School, B, et al. (2018). Chinese short text topic analysis by latent Dirichlet allocation model with co-word network analysis[J]. Journal of the China Society for Scientific and Technical Information, 37(3), 305–317.
This work is supported by the Research Projects of Science and Technology in Hebei Higher Education Institutions (No. ZD2018087, ZD2016017, QN2018109, YQ2014014), the Natural Science Foundation of Hebei Province (No. F2019402-428), the National Key R&D Program of China (No. 2018YFF0301004), and the National Natural Science Foundation of China (No. 61802107).
Wu, D., Yang, R. & Shen, C. Sentiment word co-occurrence and knowledge pair feature extraction based LDA short text clustering algorithm. J Intell Inf Syst 56, 1–23 (2021). https://doi.org/10.1007/s10844-020-00597-7