Sentiment word co-occurrence and knowledge pair feature extraction based LDA short text clustering algorithm


The Latent Dirichlet Allocation (LDA) topic model is a popular research topic in text mining. This paper proposes a Sentiment Word Co-occurrence and Knowledge Pair Feature Extraction based LDA Short Text Clustering Algorithm (SKP-LDA). First, a word bag based on sentiment word co-occurrence is defined; the co-occurrence of sentiment words is considered across different short texts, and the short texts of a microblog are assigned a sentiment polarity. Next, knowledge pairs of topic special words and topic relation words are extracted and inserted into the LDA model for clustering, so that semantic information can be found more accurately. The n hidden topics and the Top30 special word set of each topic are then extracted from the knowledge pair set. Finally, the Top30 topic special word sets obtained from the primary LDA clustering are clustered again by K-means secondary clustering, with the cluster centers optimized iteratively. Compared with JST, LSM, LTM and ELDA, SKP-LDA performs better in terms of Accuracy, Precision, Recall and F-measure. The experimental results show that SKP-LDA has better semantic analysis ability and a better sentiment topic clustering effect. It can be applied to microblog data to improve the accuracy of network public opinion analysis.
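The abstract does not give the formal definition of the sentiment word co-occurrence bag, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that co-occurrence means two distinct sentiment words appearing in the same short text, and that polarity is the sign of a summed lexicon score. The lexicon `SENTIMENT_LEXICON`, the function names, and the sample posts are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicon (an assumption, not from the paper):
# word -> polarity score, +1 for positive and -1 for negative.
SENTIMENT_LEXICON = {
    "great": 1, "love": 1, "happy": 1,
    "awful": -1, "hate": -1, "slow": -1,
}


def sentiment_words(text):
    """Return the sentiment words of a short text, in order of appearance."""
    return [w for w in text.lower().split() if w in SENTIMENT_LEXICON]


def cooccurrence_bag(texts):
    """Count unordered pairs of distinct sentiment words sharing a text."""
    bag = Counter()
    for text in texts:
        unique = sorted(set(sentiment_words(text)))
        bag.update(combinations(unique, 2))
    return bag


def polarity(text):
    """Label a text by the sign of its summed lexicon scores."""
    score = sum(SENTIMENT_LEXICON[w] for w in sentiment_words(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical microblog posts used only for illustration.
posts = [
    "love the new phone great camera",
    "great screen love it",
    "awful battery and slow updates hate it",
    "the phone arrived today",
]
bag = cooccurrence_bag(posts)
labels = [polarity(p) for p in posts]
```

In a fuller pipeline, counts like these could be used to weight or group sentiment words before topic modeling; how SKP-LDA actually combines them with knowledge pairs is described in the full paper rather than in the abstract.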


  1. Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 3, 993–1022.

  2. Chang, P., & Ma, H. (2011). Efficient short texts keyword extraction method analysis[J]. Computer Engineering & Applications, 47(20), 126–128, 154.

  3. Chen, Z., & Liu, B. Topic modeling using topics from many domains, lifelong learning and big data[C].

  4. Hao, J., Xie, J., Su, J.Q., et al. (2016). An unsupervised approach for sentiment classification based on weighted latent Dirichlet allocation[J]. CAAI Transactions on Intelligent Systems, 11(4), 539–545.

  5. He, Y. (2011). Latent sentiment model for weakly-supervised crosslingual sentiment classification[J]. Advances in Information Retrieval, 6611, 214–225.

  6. Huang, F.L., Yu, G., Zhang, J.L., et al. (2017). Mining topic sentiment in micro-blogging based on micro-blogger social relation[J]. Journal of Software, 28(3), 694–707.

  7. Kozlowski, M., & Rybinski, H. (2019). Clustering of semantically enriched short texts[J]. Journal of Intelligent Information Systems, 53(1), 69–92.

  8. Lin, C., & He, Y. (2009). Joint sentiment topic model for sentiment analysis[C]. In Proceedings of the 18th ACM conference on information and knowledge management (pp. 375–384). New York: ACM Press.

  9. Liu, B.Y., Wang, C.R., Wang, C., et al. (2017). Micro-blog community discovery algorithm based on dynamic topic model with multidimensional data fusion[J]. Journal of Software, 28(2), 246–261.

  10. Liu, Z., Liu, C.Y., Xia, B., & Li, T. (2018). Multiple relational topic modeling for noisy short texts[J]. International Journal of Software Engineering and Knowledge Engineering, 28(11), 1559–1574.

  11. Lu, L., Fuxi, Z., Rong, G., et al. (2018). Point of interest joint recommendation method based on user-content topic model[J]. Computer Engineering & Applications, 4, 154–159.

  12. Peng, M., Huang, J.J., Zhu, J.H., et al. (2015). Mass of short texts clustering and topic extraction based on frequent itemsets[J]. Journal of Computer Research & Development, 52(9), 1941–1953.

  13. Qi, J., Xun, L., Zhou, X., et al. (2018). Micro-blog user community discovery using generalized simrank edge weighting method[J]. PLoS ONE, 13(5).

  14. Qu, J., Chen, Z., & Zheng, Y. (2018). Research on the text clustering method of science and technology reports based on the topic model[J]. Library & Information Service.

  15. Shams, M., & Baraani-Dastjerdi, A. (2017). Enriched LDA (ELDA): combination of latent Dirichlet allocation with word co-occurrence analysis for aspect extraction[J]. Expert Systems with Applications, 80, 136–146.

  16. Sun, Y., & Zhou, X.G. (2013). Unsupervised topic and sentiment unification model for sentiment analysis[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 49(1), 102–108.

  17. Tago, K., & Jin, Q. (2018). Influence analysis of emotional behaviors and user relationships based on twitter data[J]. Tsinghua Science & Technology, 23(1), 104–113.

  18. Wan, H.X., & Peng, Y. (2018). Topic words extraction of social media based on semantic constrained and time associated LDA[J]. Journal of Chinese Computer Systems, 39(4), 742–747.

  19. Wang, X.W., & Zhang, K. (2012). Improved expansion algorithm based on co-occurrence relationship between short text feature[J]. Journal of Henan University of Urban Construction, 21(4), 48–50.

  20. Xiong, S., Wang, K., Ji, D., et al. (2018). A short text sentiment-topic model for product reviews[J]. Neurocomputing, 297, 94–102.

  21. Yong, M.C., Qing, C., School, B, et al. (2018). Chinese short text topic analysis by latent Dirichlet allocation model with co-word network analysis[J]. Journal of the China Society for Scientific and Technical Information, 37(3), 305–317.


This work is supported by the Research Projects of Science and Technology in Hebei Higher Education Institutions (No. ZD2018087, ZD2016017, QN2018109, YQ2014014), the Natural Science Foundation of Hebei Province (No. F2019402-428), the National Key R&D Program of China (No. 2018YFF0301004), and the National Natural Science Foundation of China (No. 61802107).

Author information



Corresponding author

Correspondence to Chao Shen.

Cite this article

Wu, D., Yang, R. & Shen, C. Sentiment word co-occurrence and knowledge pair feature extraction based LDA short text clustering algorithm. J Intell Inf Syst (2020).


  • LDA
  • Sentiment analysis
  • Word co-occurrence
  • Knowledge pairs
  • Feature extraction