Abstract
The majority of clustering approaches focus on static data. However, a wide variety of recent applications and research problems in big data mining require dealing with continuous, possibly infinite streams of data that arrive at high velocity. Web traffic data, surveillance data, sensor measurements, and stock trading are only some examples of these steadily growing applications. Additionally, as the growth of data volumes is accompanied by a similar growth in dimensionality, clusters cannot be expected to appear completely when all attributes are considered together. Subspace clustering is a general approach that addresses this issue by automatically finding the hidden clusters within different subsets of the attributes rather than considering all attributes together. In this chapter, novel methods for efficient subspace clustering of high-dimensional big data streams are presented. Approaches that efficiently combine the anytime clustering concept with the stream subspace clustering paradigm are discussed. Additionally, efficient and adaptive density-based clustering algorithms are presented for high-dimensional data streams. A novel open-source assessment framework and evaluation measures for subspace stream clustering are also presented.
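The claim that clusters may be hidden when all attributes are considered together can be illustrated with a small toy sketch. It is not one of the chapter's algorithms. The four hand-made points, their labels, and the helper `nn_purity` are assumptions introduced purely for illustration: two groups are well separated in attribute 0, while attributes 1 and 2 carry unrelated, widely spread values.

```python
import math

# Hypothetical toy data: attribute 0 separates the two groups,
# attributes 1 and 2 are irrelevant and widely spread.
points = [
    (0.0, 0.0, 9.0),  # group A
    (0.1, 9.0, 0.0),  # group A
    (5.0, 0.5, 9.5),  # group B
    (5.1, 9.5, 0.5),  # group B
]
labels = ["A", "A", "B", "B"]


def distance(p, q, dims):
    """Euclidean distance restricted to the attribute indices in dims."""
    return math.sqrt(sum((p[d] - q[d]) ** 2 for d in dims))


def nn_purity(points, labels, dims):
    """Fraction of points whose nearest neighbor (within dims) shares their label."""
    hits = 0
    for i, p in enumerate(points):
        # Nearest other point under the distance restricted to dims.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: distance(p, points[j], dims),
        )
        if labels[nearest] == labels[i]:
            hits += 1
    return hits / len(points)


subspace_purity = nn_purity(points, labels, [0])         # only the relevant attribute
full_space_purity = nn_purity(points, labels, [0, 1, 2])  # all attributes together
print(subspace_purity, full_space_purity)
```

In this constructed example every point's nearest neighbor belongs to its own group when only attribute 0 is used, whereas in the full space the irrelevant attributes dominate the distance and every nearest neighbor belongs to the other group.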
Notes
- 1. Sources: domo.com and statisticbrain.com.
- 2.
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Hassani, M. (2019). Overview of Efficient Clustering Methods for High-Dimensional Big Data Streams. In: Nasraoui, O., Ben N'Cir, CE. (eds) Clustering Methods for Big Data Analytics. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-97864-2_2
DOI: https://doi.org/10.1007/978-3-319-97864-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97863-5
Online ISBN: 978-3-319-97864-2
eBook Packages: Engineering, Engineering (R0)