
A term correlation based semi-supervised microblog clustering with dual constraints

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Microblog clustering is important in many web applications. However, microblogs are short and provide too few word occurrences, which prevents traditional text clustering approaches from reaching their full potential. To address this problem, we propose a novel semi-supervised learning scheme that exploits semantic information to compensate for the limited message length. The key idea is to use term correlation data, which captures semantic information for term weighting and provides greater context for microblogs. We then formulate microblog clustering as a semi-supervised non-negative matrix factorization co-clustering framework that takes advantage of two kinds of prior knowledge: domain knowledge of data points (microblogs) in the form of pair-wise constraints, and category knowledge of features (terms). Our approach not only reduces the labor-intensive labeling process but also exploits hidden information in the microblogs themselves. Extensive experiments on two real-world microblog datasets demonstrate the effectiveness of the proposed approach, which produces promising performance compared with state-of-the-art methods.
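To make the two ingredients named in the abstract more concrete, the following is a minimal sketch and not the authors' method. It assumes a cosine-based term correlation computed from binary co-occurrence, uses it to enrich a tiny document-term matrix, and then runs plain unsupervised NMF with the standard multiplicative updates (Lee and Seung). The pair-wise constraints and term category knowledge of the proposed framework are deliberately left out. All function names, the toy microblogs, and the parameter values are hypothetical and chosen only for illustration.

```python
import math
import random

def term_correlation(docs, vocab):
    """Binary doc-term matrix X and cosine similarity C between term columns.
    Assumption: cosine over co-occurrence stands in for the paper's correlation."""
    index = {t: j for j, t in enumerate(vocab)}
    n, m = len(docs), len(vocab)
    X = [[0.0] * m for _ in range(n)]
    for i, doc in enumerate(docs):
        for tok in doc.split():
            if tok in index:
                X[i][index[tok]] = 1.0
    norms = [math.sqrt(sum(X[i][j] ** 2 for i in range(n))) for j in range(m)]
    C = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            dot = sum(X[i][a] * X[i][b] for i in range(n))
            denom = norms[a] * norms[b]
            C[a][b] = dot / denom if denom > 0 else 0.0
    return X, C

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def frobenius_error(X, W, H):
    WH = matmul(W, H)
    return math.sqrt(sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry)))

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Plain NMF, X ~ W H, with multiplicative updates for the Frobenius loss."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return W, H

# Hypothetical toy microblogs covering two topics.
docs = ["nba finals game tonight", "nba game score",
        "election vote results", "vote election debate"]
vocab = ["nba", "finals", "game", "tonight", "score",
         "election", "vote", "results", "debate"]

X, C = term_correlation(docs, vocab)
X_enriched = matmul(X, C)           # each microblog gains weight on correlated terms
W0, H0 = nmf(X_enriched, 2, iters=0)  # random initialisation only
W, H = nmf(X_enriched, 2)
err_before = frobenius_error(X_enriched, W0, H0)
err_after = frobenius_error(X_enriched, W, H)
labels = [max(range(2), key=lambda a: row[a]) for row in W]  # cluster = largest factor
print(labels, round(err_before, 3), round(err_after, 3))
```

Multiplying the sparse matrix by the correlation matrix is one simple way to give a short message weight on terms it does not contain but that co-occur with its own terms. A semi-supervised variant would add penalty or reward terms for must-link and cannot-link pairs to the factorization objective, which this sketch does not attempt.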


Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61363058), Youth Science and Technology Support Program of Gansu Province (145RJYA259), Open Program of the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (IIP2014-4), the Natural Science Foundation of Gansu Province for Distinguished Young Scholars (1308RJDA007) and the National Natural Science Foundation of China (No. 61762078).

Author information


Correspondence to Huifang Ma.


Cite this article

Ma, H., Zhang, D., Jia, M. et al. A term correlation based semi-supervised microblog clustering with dual constraints. Int. J. Mach. Learn. & Cyber. 10, 679–692 (2019). https://doi.org/10.1007/s13042-017-0750-0


  • DOI: https://doi.org/10.1007/s13042-017-0750-0
