Growing neural gas with random projection method for high-dimensional data stream clustering

  • Yingwen Zhu
  • Songcan Chen
Methodologies and Application


High-dimensional data streams emerge ubiquitously in many real-world applications such as network monitoring and forest cover type prediction. Clustering such streams differs from traditional clustering, where the given dataset is static and can be read and processed repeatedly; instead it must satisfy constraints such as bounded memory, single-pass processing, real-time response and concept drift detection. Many methods of this type have recently been proposed. However, when dealing with high-dimensional data, they often incur high computational cost and poor performance due to the curse of dimensionality. To address this problem, we present a new clustering algorithm for data streams, called RPGStream, which combines the random projection method with the growing neural gas (GNG) model, an incremental self-organizing approach belonging to the family of topological maps such as SOM and neural gas. To gain insight into the performance improvement achieved by our algorithm, we analyze and identify the major influence of random projection on GNG. Although our method is embarrassingly simple, merely incorporating the random projection into an exponential fading function of GNG, experimental results on a variety of benchmark datasets indicate that it achieves comparable or even better performance than the G-Stream algorithm, even when the raw dimension is compressed to 10% of the original (e.g., for the CoverType dataset, the dimension is reduced from 54 to 5).
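The dimensionality-reduction step underlying the approach can be sketched as follows. This is a minimal illustration of a sparse random projection in the style of Achlioptas, applied once to each arriving stream point before it is fed to the clustering model; the helper name `random_projection_matrix` and the exact matrix construction are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def random_projection_matrix(d, k, rng):
    """Achlioptas-style sparse random projection matrix of shape (d, k).

    Entries take values +1, 0, -1 with probabilities 1/6, 2/3, 1/6,
    scaled by sqrt(3/k) so pairwise distances are preserved in
    expectation (Johnson-Lindenstrauss style).
    """
    entries = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0 / k) * entries

rng = np.random.default_rng(0)
d, k = 54, 5  # e.g., CoverType: 54 raw features compressed to 5

# The matrix is drawn once; every stream point reuses it.
R = random_projection_matrix(d, k, rng)

# Each arriving point x in R^d is mapped to y = x R in R^k,
# and the low-dimensional y is what the GNG-style learner sees.
x = rng.standard_normal(d)
y = x @ R
```

Because the projection matrix is generated once up front and applying it is a single matrix-vector product, this step adds only O(dk) work per point, which fits the single-pass, bounded-memory setting of stream clustering.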


Data stream clustering · Growing neural gas · Random projection



This work is supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61672281 and 61472186, the Key Program of NSFC under Grant No. 61732006, and the Jiangsu Innovation Program for Graduate Education under Grant No. KYLX15_0322.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
  2. College of Computer Science and Engineering, Sanjiang University, Nanjing, China
