A term correlation based semi-supervised microblog clustering with dual constraints

  • Huifang Ma
  • Di Zhang
  • Meihuizi Jia
  • Xianghong Lin
Original Article

Abstract

Microblog clustering is important in many web applications. However, microblogs are short and contain too few word occurrences, which prevents traditional text clustering approaches from reaching their full potential. To address this problem, we propose a novel semi-supervised learning scheme that exploits semantic information to compensate for the limited message length. The key idea is to use term correlation data, which captures semantic information for term weighting and provides greater context for microblogs. We then formulate the microblog clustering problem as a semi-supervised non-negative matrix factorization co-clustering framework that takes advantage of prior domain knowledge about both data points (microblogs), in the form of pair-wise constraints, and features (terms), in the form of category knowledge. Our approach not only greatly reduces the labor-intensive labeling process but also exploits hidden information in the microblogs themselves. Extensive experiments are conducted on two real-world microblog datasets. The results demonstrate the effectiveness of the proposed approach, which produces promising performance compared with state-of-the-art methods.
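
The general idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the objective function or update rules, so the sketch makes several assumptions. Term correlation is approximated by cosine similarity between term rows, pair-wise must-link constraints on microblogs are encoded with a standard graph-regularized NMF penalty, and the category knowledge on terms is omitted entirely. The names `term_correlation`, `constrained_nmf`, `objective`, and the toy matrix are hypothetical.

```python
import numpy as np


def term_correlation(X):
    """Cosine similarity between the term rows of a term-by-document matrix.

    Assumption: cosine similarity stands in for the paper's term correlation.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty terms
    Xn = X / norms
    return Xn @ Xn.T                 # shape: (terms, terms)


def constrained_nmf(X, W, k, lam=1.0, iters=200, seed=0, eps=1e-9):
    """Graph-regularized NMF (a swapped-in technique, not the paper's model).

    Minimizes ||X - U V^T||_F^2 + lam * tr(V^T (D - W) V) with multiplicative
    updates, where W[i, j] = 1 marks a must-link between documents i and j.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps     # term factors
    V = rng.random((n, k)) + eps     # document factors
    D = np.diag(W.sum(axis=1))       # degree matrix of the constraint graph
    for _ in range(iters):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V


def objective(X, U, V, W, lam):
    """Value of the regularized reconstruction objective."""
    L = np.diag(W.sum(axis=1)) - W   # graph Laplacian of the constraints
    return np.linalg.norm(X - U @ V.T) ** 2 + lam * np.trace(V.T @ L @ V)


# Toy term-by-document counts: 6 terms (rows) x 4 short messages (columns).
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)

S = term_correlation(X)
X_enriched = S @ X                   # spread weight onto correlated terms

W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0              # one must-link: messages 0 and 1

U, V = constrained_nmf(X_enriched, W, k=2)
labels = V.argmax(axis=1)            # cluster = largest document factor
print(labels)
```

Multiplying by the correlation matrix gives each short message non-zero weight on terms it does not contain but that co-occur with its own terms, which is one simple way to provide the "greater context" the abstract describes. Cannot-link constraints and term-side supervision would need additional penalty terms that this sketch does not attempt.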

Keywords

Semi-supervised clustering; Microblogs; Dual constraints; Term correlation matrix; Nonnegative matrix factorization

Notes

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 61363058 and 61762078), the Youth Science and Technology Support Program of Gansu Province (145RJYA259), the Open Program of the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (IIP2014-4), and the Natural Science Foundation of Gansu Province for Distinguished Young Scholars (1308RJDA007).

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  • Huifang Ma (1, 2)
  • Di Zhang (1)
  • Meihuizi Jia (1)
  • Xianghong Lin (1)
  1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou, China
  2. The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
