Discovering cluster evolution patterns with the Cluster Association-aware matrix factorization

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Tracking document collections over time (or across domains) is useful in several applications, such as tracing the dynamics of terminology, identifying emerging and evolving trends, and detecting concept drift. We propose a novel ‘Cluster Association-aware’ Non-negative Matrix Factorization (NMF)-based method, with graph-based visualization, to identify the changing dynamics of text clusters over time or across domains. NMF is used to find similar clusters across the set of clustering solutions. Based on these similarities, four major lifecycle states of clusters, namely birth, split, merge and death, are tracked to discover their emergence, growth, persistence and decay. The novel concepts of ‘cluster associations’ and term frequency-based ‘cluster density’ are used to improve the quality of the evolution patterns. The cluster evolution is visualized using a k-partite graph. Empirical analysis on text data shows that the proposed method produces accurate and efficient solutions compared with state-of-the-art methods.
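
The lifecycle tracking described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: plain cosine similarity between cluster term vectors stands in for the NMF-based similarity, and the paper's cluster associations and cluster density are left out. The function names, the input layout (one non-negative term-weight vector per cluster) and the 0.5 threshold are all assumptions made for illustration. The returned edges form one layer of a k-partite graph (k = 2 here, for two consecutive clustering solutions).

```python
import math


def cosine(u, v):
    """Cosine similarity of two term-weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lifecycle_states(prev, curr, threshold=0.5):
    """Label cluster transitions between two consecutive clustering solutions.

    prev, curr: lists of term-weight vectors, one per cluster (hypothetical
    inputs). threshold is an assumed similarity cut-off, not a value from
    the paper.
    """
    # An edge (i, j) links previous cluster i to a sufficiently similar current cluster j.
    edges = [
        (i, j)
        for i, p in enumerate(prev)
        for j, c in enumerate(curr)
        if cosine(p, c) >= threshold
    ]
    out_deg = [sum(1 for i, _ in edges if i == k) for k in range(len(prev))]
    in_deg = [sum(1 for _, j in edges if j == k) for k in range(len(curr))]
    return {
        "edges": edges,
        "death": [k for k, d in enumerate(out_deg) if d == 0],  # no successor
        "split": [k for k, d in enumerate(out_deg) if d >= 2],  # several successors
        "birth": [k for k, d in enumerate(in_deg) if d == 0],   # no predecessor
        "merge": [k for k, d in enumerate(in_deg) if d >= 2],   # several predecessors
    }


# Toy data over four terms: previous cluster 0 splits into current clusters 0 and 1,
# previous cluster 1 dies, and current cluster 2 is born.
prev = [[1, 1, 0, 0], [0, 0, 1, 0]]
curr = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
result = lifecycle_states(prev, curr)
```

Applying the same function to each consecutive pair of clustering solutions and chaining the resulting edge lists gives the layers of a k-partite graph over all k solutions.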

Notes

  1. This paper uses the terms topic, cluster and concept interchangeably.

  2. https://drive.google.com/drive/u/1/folders/1gHoEm-R9S2OkiN9LRVNk3JLVeGpRdWXn

  3. https://www.openicpsr.org/openicpsr/project/100843/version/V2/view

  4. https://www.kaggle.com/madhab/jobposts

Author information

Corresponding author

Correspondence to Wathsala Anupama Mohotti.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mohotti, W.A., Nayak, R. Discovering cluster evolution patterns with the Cluster Association-aware matrix factorization. Knowl Inf Syst 63, 1397–1428 (2021). https://doi.org/10.1007/s10115-021-01561-9

  • DOI: https://doi.org/10.1007/s10115-021-01561-9
