Knowledge and Information Systems, Volume 29, Issue 2, pp 249–272

The ClusTree: indexing micro-clusters for anytime stream mining

  • Philipp Kranen
  • Ira Assent
  • Corinna Baldauf
  • Thomas Seidl
Regular Paper

Abstract

Clustering streaming data requires algorithms that are capable of updating clustering results for the incoming data. As data is constantly arriving, time for processing is limited. Clustering has to be performed in a single pass over the incoming data and within the possibly varying inter-arrival times of the stream. Likewise, memory is limited, making it impossible to store all data. For clustering, we are faced with the challenge of maintaining a current result that can be presented to the user at any given time. In this work, we propose a parameter-free algorithm that automatically adapts to the speed of the data stream. It makes best use of the time available under the current constraints to provide a clustering of the objects seen up to that point. Our approach incorporates the age of the objects to reflect the greater importance of more recent data. For efficient and effective handling, we introduce the ClusTree, a compact and self-adaptive index structure for maintaining stream summaries. Additionally, we present solutions to handle very fast streams through aggregation mechanisms and propose novel descent strategies that improve the clustering result on slower streams as long as time permits. Our experiments show that our approach is capable of handling a multitude of different stream characteristics for accurate and scalable anytime stream clustering.
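
To make the idea of an age-weighted stream summary concrete, the following is a minimal sketch of a single micro-cluster that keeps only a decayed weight, linear sum, and squared sum per dimension. It is an illustration under stated assumptions, not the authors' implementation: the class name `MicroCluster`, the `decay_rate` parameter, and the exponential fading with base 2 are hypothetical choices made here, and the tree index and descent strategies described in the abstract are not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class MicroCluster:
    """Hypothetical decayed summary of the points absorbed so far."""
    dim: int
    decay_rate: float = 0.1  # assumed fading parameter, not taken from the abstract
    weight: float = 0.0      # decayed count of absorbed points
    linear_sum: List[float] = field(default_factory=list)
    squared_sum: List[float] = field(default_factory=list)
    last_update: float = 0.0

    def __post_init__(self) -> None:
        if not self.linear_sum:
            self.linear_sum = [0.0] * self.dim
        if not self.squared_sum:
            self.squared_sum = [0.0] * self.dim

    def _fade(self, now: float) -> None:
        """Down-weight the stored sums according to the time since the last update."""
        elapsed = max(0.0, now - self.last_update)
        factor = 2.0 ** (-self.decay_rate * elapsed)
        self.weight *= factor
        self.linear_sum = [v * factor for v in self.linear_sum]
        self.squared_sum = [v * factor for v in self.squared_sum]
        self.last_update = now

    def insert(self, point: Sequence[float], now: float) -> None:
        """Fade the old content, then absorb one new point in constant memory."""
        if len(point) != self.dim:
            raise ValueError("point has the wrong dimensionality")
        self._fade(now)
        self.weight += 1.0
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def mean(self) -> List[float]:
        """Age-weighted centre of the summarised points."""
        if self.weight == 0.0:
            raise ValueError("empty micro-cluster")
        return [v / self.weight for v in self.linear_sum]
```

Because the summary stores only three aggregates per dimension, inserting a point takes a single pass and constant memory, and older points contribute less to `mean()` the longer ago they arrived.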

Keywords

Data mining, Clustering, Anytime, Stream mining

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  • Philipp Kranen (1)
  • Ira Assent (2)
  • Corinna Baldauf (1)
  • Thomas Seidl (1)

  1. RWTH Aachen University, Aachen, Germany
  2. Aarhus University, Aarhus, Denmark
