Knowledge and Information Systems, Volume 41, Issue 1, pp 127–152

Clustering data streams using grid-based synopsis

  • Vasudha Bhatnagar
  • Sharanjit Kaur
  • Sharma Chakravarthy
Regular Paper

Abstract

Continually advancing technology has made it feasible to capture data online and transmit it onward as a steady flow of newly generated data points, termed a data stream. The continuity and unboundedness of data streams make storing the data and scanning it multiple times impractical for knowledge discovery. The need to learn structures from data in a streaming environment has made clustering a popular technique for knowledge discovery from data streams. The continuous nature of streaming data makes it infeasible to look for point membership among the clusters discovered so far, necessitating a synopsis structure that consolidates incoming data points. This synopsis is then exploited to build a clustering scheme that meets subsequent user demands. The proposed Exclusive and Complete Clustering (ExCC) algorithm captures non-overlapping clusters in data streams with mixed attributes, such that each point either belongs to some cluster or is an outlier/noise. It deploys a fixed-granularity grid structure as the synopsis and performs clustering by coalescing dense regions in the grid. Speed-based pruning is applied to the synopsis prior to clustering to ensure the currency of the discovered clusters. Extensive experimentation demonstrates that the algorithm is robust, identifies succinct outliers on the fly, and adapts to changes in the data distribution. The ExCC algorithm is further evaluated for performance and compared with other contemporary algorithms.
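
The general idea of a grid synopsis with coalescing of dense regions can be illustrated with a minimal Python sketch. This is not the authors' ExCC implementation: it assumes purely numeric attributes, a single cell width shared by all dimensions, and a simple count threshold for density, and it omits speed-based pruning and categorical attributes altogether. The function names and the `width` and `min_points` parameters are hypothetical, chosen only for illustration.

```python
from collections import defaultdict, deque
from itertools import product


def cell_of(point, width):
    """Map a numeric point to the integer coordinates of its grid cell."""
    return tuple(int(x // width) for x in point)


def build_synopsis(stream, width):
    """Single pass over the stream: keep per-cell counts, not the points."""
    counts = defaultdict(int)
    for point in stream:
        counts[cell_of(point, width)] += 1
    return counts


def coalesce_dense_cells(counts, min_points):
    """Label connected groups of dense cells (assumed density rule:
    a cell is dense if it holds at least min_points points)."""
    dense = {cell for cell, n in counts.items() if n >= min_points}
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        # Breadth-first search over dense cells that touch each other.
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbour = tuple(c + o for c, o in zip(cell, offset))
                if neighbour in dense and neighbour not in labels:
                    labels[neighbour] = next_label
                    queue.append(neighbour)
        next_label += 1
    return labels


def assign(point, labels, width):
    """Return the cluster label of a point, or None if it is an outlier."""
    return labels.get(cell_of(point, width))


if __name__ == "__main__":
    stream = [(0.1, 0.2), (0.3, 0.4), (1.1, 0.5), (1.2, 0.6),
              (5.5, 5.5), (5.6, 5.7), (9.0, 1.0)]
    counts = build_synopsis(stream, width=1.0)
    labels = coalesce_dense_cells(counts, min_points=2)
    for p in stream:
        print(p, assign(p, labels, width=1.0))
```

Because every point falls in exactly one cell, and each cell is either part of one labelled group or not dense, each point in this sketch ends up in exactly one cluster or is reported as an outlier, which mirrors the exclusive and complete property described in the abstract.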

Keywords

Stream clustering, Synopsis, Micro-cluster, Grid structure, Exclusive clustering, Complete clustering

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Vasudha Bhatnagar (1)
  • Sharanjit Kaur (2, email author)
  • Sharma Chakravarthy (3)
  1. Department of Computer Science, University of Delhi, Delhi, India
  2. Department of Computer Science, Acharya Narendra Dev College, University of Delhi, Delhi, India
  3. Computer Science and Engineering Department, The University of Texas, Arlington, USA