Abstract
Tracking document collections over time (or across domains) is useful in several applications, such as analysing the dynamics of terminology, identifying emerging and evolving trends, and detecting concept drift. We propose a novel ‘Cluster Association-aware’ Non-negative Matrix Factorization (NMF)-based method with graph-based visualization to identify the changing dynamics of text clusters over time or across domains. NMF is used to find similar clusters across the set of clustering solutions. Based on these similarities, four major lifecycle states of clusters, namely birth, split, merge and death, are tracked to discover cluster emergence, growth, persistence and decay. The novel concepts of ‘cluster associations’ and term frequency-based ‘cluster density’ are used to improve the quality of the evolution patterns. The cluster evolution is visualized using a k-partite graph. Empirical analysis on text data shows that the proposed method produces more accurate and efficient solutions than state-of-the-art methods.
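As an illustration of how lifecycle states can be read off a cluster similarity matrix, the following is a minimal sketch and not the method proposed in the paper. It assumes each cluster is summarised by a non-negative term-weight vector and swaps in plain cosine similarity for the NMF-derived similarity; cluster associations and cluster density are not modelled. The function name `lifecycle_events`, the `threshold` parameter and the toy data are hypothetical.

```python
import numpy as np


def lifecycle_events(prev, curr, threshold=0.5):
    """Label cluster transitions between two consecutive clusterings.

    prev, curr: arrays of shape (n_clusters, n_terms) holding non-negative
    term weights per cluster (assumed to have no all-zero rows).
    Returns a list of (event, source, target) tuples.
    """
    # Cosine similarity between every previous and every current cluster.
    # Assumption: this stands in for the NMF-based similarity in the paper.
    p = prev / np.linalg.norm(prev, axis=1, keepdims=True)
    c = curr / np.linalg.norm(curr, axis=1, keepdims=True)
    sim = p @ c.T
    linked = sim >= threshold  # linked[i, j]: prev cluster i continues into curr cluster j

    events = []
    # A previous cluster with no successor dies; with several, it splits.
    for i in range(prev.shape[0]):
        targets = np.flatnonzero(linked[i])
        if targets.size == 0:
            events.append(("death", i, None))
        elif targets.size > 1:
            events.append(("split", i, targets.tolist()))
    # A current cluster with no predecessor is born; with several, it is a merge.
    for j in range(curr.shape[0]):
        sources = np.flatnonzero(linked[:, j])
        if sources.size == 0:
            events.append(("birth", None, j))
        elif sources.size > 1:
            events.append(("merge", sources.tolist(), j))
    return events


# Toy data over six terms (hypothetical): four clusters at each time step.
prev = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
], dtype=float)
curr = np.array([
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
], dtype=float)

events = lifecycle_events(prev, curr)
print(events)
```

In this toy run, previous cluster 0 is linked to two current clusters (a split), previous clusters 1 and 2 both link to current cluster 2 (a merge), previous cluster 3 has no successor (a death) and current cluster 3 has no predecessor (a birth). The links in `linked` can also be read as the edges between two adjacent layers of a k-partite graph.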
Notes
This paper uses the terms topic, cluster and concept interchangeably.
About this article
Cite this article
Mohotti, W.A., Nayak, R. Discovering cluster evolution patterns with the Cluster Association-aware matrix factorization. Knowl Inf Syst 63, 1397–1428 (2021). https://doi.org/10.1007/s10115-021-01561-9
DOI: https://doi.org/10.1007/s10115-021-01561-9