Journal of Computer Science and Technology, Volume 21, Issue 2, pp 284–296

Efficient Computation of k-Medians over Data Streams Under Memory Constraints

  • Zhi-Hong Chong
  • Jeffrey Xu Yu
  • Zhen-Jie Zhang
  • Xue-Min Lin
  • Wei Wang
  • Ao-Ying Zhou
Database and Knowledge-Based Systems

Abstract

In this paper, we study the problem of efficiently computing k-medians over high-dimensional, high-speed data streams. Our focus is on minimizing the CPU time needed to handle high-speed data streams while still meeting the requirements of high accuracy and small memory. Our work is motivated by the following observation: existing algorithms exhibit similar approximation behavior in practice, even though they offer noticeably different worst-case theoretical guarantees. The underlying reason is that, to achieve a high approximation level with the smallest possible memory, they rely on rather complex techniques to maintain a sketch along the time dimension, using existing off-line clustering algorithms. Those clustering algorithms cannot guarantee an optimal clustering result over each data segment of a stream, and the errors accumulate across segments, which makes most algorithms behave the same in practice in terms of approximation level. We propose a new grid-based approach that divides the entire data set into cells (rather than along the time dimension). We achieve a high approximation level based on a novel concept called (1−ε)-dominant. We further extend the method to the data stream context by leveraging a density-based heuristic and frequent-item mining techniques over data streams. We need to apply an existing clustering algorithm only once, on demand, to compute the k-medians, which reduces CPU time significantly. We conducted extensive experimental studies, which show that our approaches outperform other well-known approaches.
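
The general idea of a grid-based summary can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not define the (1−ε)-dominant concept, the density-based heuristic, or the frequent-item mining step, so none of them is implemented here. The sketch only assumes that each point is mapped to a fixed-width cell, that a per-cell count is kept in one pass, and that a clustering routine runs once, on demand, over the weighted cell centers. The function names, the cell width parameter, and the greedy routine standing in for "an existing clustering algorithm" are all hypothetical.

```python
from collections import Counter
import math


def cell_of(point, width):
    """Map a point to the integer coordinates of its grid cell (assumed fixed width)."""
    return tuple(math.floor(x / width) for x in point)


def summarize(stream, width):
    """Single pass over the stream: count how many points fall in each cell."""
    counts = Counter()
    for point in stream:
        counts[cell_of(point, width)] += 1
    return counts


def cell_center(cell, width):
    """Geometric center of a cell, used as its representative point."""
    return tuple((c + 0.5) * width for c in cell)


def cost(weighted_points, medians):
    """Weighted k-medians cost: sum of weight * distance to the nearest median."""
    return sum(w * min(math.dist(p, m) for m in medians)
               for p, w in weighted_points)


def k_medians_on_cells(counts, width, k):
    """Greedy stand-in for an off-line clustering step, run once on demand.

    Repeatedly adds the cell center that lowers the weighted cost the most.
    This is an illustrative heuristic, not the method used in the paper.
    """
    points = [(cell_center(c, width), w) for c, w in counts.items()]
    medians = []
    for _ in range(min(k, len(points))):
        best = min((p for p, _ in points if p not in medians),
                   key=lambda cand: cost(points, medians + [cand]))
        medians.append(best)
    return medians
```

In this sketch the memory used depends on the number of non-empty cells rather than on the number of points seen. A bounded-memory variant would also need a rule for discarding sparse cells, which is where a density heuristic or frequent-item technique could plausibly fit; that part is left out because the abstract gives no details.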

Keywords

data streams, k-medians, cluster, data mining

Copyright information

© Springer Science + Business Media, Inc. 2006

Authors and Affiliations

  • Zhi-Hong Chong (1, 4, 5)
  • Jeffrey Xu Yu (2)
  • Zhen-Jie Zhang (3)
  • Xue-Min Lin (4)
  • Wei Wang (4)
  • Ao-Ying Zhou (1)

  1. Department of Computer Science and Engineering, Fudan University, Shanghai, P.R. China
  2. Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, P.R. China
  3. School of Computing, National University of Singapore, Singapore
  4. School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  5. Department of Computer Science and Engineering, Southeast University, Nanjing, P.R. China
