Subspace clustering of data streams: new algorithms and effective evaluation measures

Journal of Intelligent Information Systems

Abstract

Nowadays, most streaming data sources are becoming high dimensional. Accordingly, subspace stream clustering, which aims at finding evolving clusters within subgroups of dimensions, has gained significant importance. However, in spite of the rich literature on subspace and projected clustering algorithms for static data, only three stream projected algorithms are available. Additionally, existing subspace clustering evaluation measures are mainly designed for static data and cannot reflect the quality of clusterings over evolving data streams. On the other hand, available stream clustering evaluation measures consider only the errors of the full-space clustering, not the quality of the subspace clustering. In this article we present a method for designing new stream subspace and projected algorithms. We also propose, to the best of our knowledge, the first subspace clustering measure designed for streaming data, called SubCMM: Subspace Cluster Mapping Measure. SubCMM is an effective evaluation measure for stream subspace clustering that is able to handle errors caused by emerging, moving, or splitting subspace clusters. Additionally, we propose a novel method for applying available offline subspace clustering measures to data streams, using the suggested new algorithms within the Subspace MOA framework.

Acknowledgments

This work has been supported by the UMIC Research Centre, RWTH Aachen University, Germany.

Author information

Corresponding author

Correspondence to Marwan Hassani.

About this article

Cite this article

Hassani, M., Kim, Y., Choi, S. et al. Subspace clustering of data streams: new algorithms and effective evaluation measures. J Intell Inf Syst 45, 319–335 (2015). https://doi.org/10.1007/s10844-014-0319-2

  • DOI: https://doi.org/10.1007/s10844-014-0319-2
