A term correlation based semi-supervised microblog clustering with dual constraints

Ma, Huifang; Zhang, Di; Jia, Meihuizi; Lin, Xianghong

doi:10.1007/s13042-017-0750-0

Original Article
Published: 24 November 2017

Volume 10, pages 679–692, (2019)

International Journal of Machine Learning and Cybernetics

Huifang Ma¹,², Di Zhang¹, Meihuizi Jia¹ & Xianghong Lin¹

Abstract

Microblog clustering is important in many web applications. However, microblogs contain too few word occurrences, and their limited length prevents traditional text clustering approaches from reaching their full potential. To address this problem, we propose a novel semi-supervised learning scheme that exploits semantic information to compensate for the limited message length. The key idea is to use term correlation data, which captures semantic information for term weighting and provides greater context for microblogs. We then formulate the microblog clustering problem as a semi-supervised non-negative matrix factorization co-clustering framework that takes advantage of prior domain knowledge about both data points (microblogs), in the form of pair-wise constraints, and features (terms), in the form of category knowledge. Our approach not only greatly reduces the labor-intensive labeling process but also exploits hidden information in the microblogs themselves. Extensive experiments on two real-world microblog datasets demonstrate the effectiveness of the proposed approach, which compares favorably with state-of-the-art methods.
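The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes cosine similarity between term columns as the term correlation measure, and it swaps in a graph-regularized NMF with must-link constraints on microblogs in place of the paper's co-clustering framework with dual constraints. The function names, the toy matrix, and the parameter `alpha` are all hypothetical.

```python
import numpy as np

def term_correlation(X):
    """Cosine similarity between term columns of a doc-term matrix X (docs x terms).
    Assumption: the paper's actual correlation measure may differ."""
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for unused terms
    Xn = X / norms
    return Xn.T @ Xn

def constrained_nmf(X, k, must_link=(), alpha=1.0, n_iter=200, seed=0, eps=1e-9):
    """Graph-regularized NMF: minimize ||X - W H||_F^2 + alpha * tr(W^T (D - A) W),
    where A encodes must-link pairs between documents (illustrative stand-in)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    A = np.zeros((n, n))
    for i, j in must_link:
        A[i, j] = A[j, i] = 1.0
    D = np.diag(A.sum(axis=1))
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T + alpha * (A @ W)) / (W @ H @ H.T + alpha * (D @ W) + eps)
    return W, H

# Toy doc-term matrix: 4 short "microblogs" over 6 terms (made-up data).
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
], dtype=float)

C = term_correlation(X)              # 6 x 6 term-term correlation
X_enriched = X @ C                   # spread weight onto correlated terms
W, H = constrained_nmf(X_enriched, k=2, must_link=[(0, 1)])
labels = W.argmax(axis=1)            # cluster = strongest latent factor per doc
```

Multiplying by `C` gives each short message non-zero weight on terms it never contains but that co-occur with its own terms, which is one simple way to add context to sparse texts. Cannot-link pairs and term-side category knowledge are deliberately left out of this sketch.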

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61363058), Youth Science and Technology Support Program of Gansu Province (145RJYA259), Open Program of the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (IIP2014-4), the Natural Science Foundation of Gansu Province for Distinguished Young Scholars (1308RJDA007) and the National Natural Science Foundation of China (No. 61762078).

Author information

Authors and Affiliations

College of Computer Science and Engineering, Northwest Normal University, Lanzhou, 730070, Gansu, China
Huifang Ma, Di Zhang, Meihuizi Jia & Xianghong Lin
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100085, China
Huifang Ma

Corresponding author

Correspondence to Huifang Ma.

About this article

Cite this article

Ma, H., Zhang, D., Jia, M. et al. A term correlation based semi-supervised microblog clustering with dual constraints. Int. J. Mach. Learn. & Cyber. 10, 679–692 (2019). https://doi.org/10.1007/s13042-017-0750-0

Received: 10 September 2015
Accepted: 15 November 2017
Published: 24 November 2017
Issue Date: 02 April 2019
DOI: https://doi.org/10.1007/s13042-017-0750-0
