Clustering Data Streams

Data Stream Management

Part of the book series: Data-Centric Systems and Applications ((DCSA))

Abstract

Clustering is a useful and ubiquitous tool in data analysis. Broadly speaking, clustering is the problem of partitioning a data set into several groups such that, under some definition of “similarity,” similar items are in the same group and dissimilar items are in different groups. In this chapter we focus on clustering in a streaming scenario, where a small number of data items are presented at a time and we cannot store all the data points. Our algorithms are therefore restricted to a single pass over the data. The space restriction is typically sublinear, \(o(n)\), where \(n\) is the number of input points.
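To make the single-pass, bounded-memory setting concrete, here is a minimal Python sketch of a threshold-doubling heuristic for a \(k\)-center-style objective. It is an illustration only and is not taken from the chapter: the function name `stream_k_center`, the Euclidean distance, and the merge rule are all assumptions, and no approximation guarantee is claimed for it. The point it illustrates is the memory bound: each item is read once, and at most \(k+1\) centers are held at any time.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, ...]


def stream_k_center(stream: Iterable[Sequence[float]], k: int) -> Tuple[List[Point], float]:
    """Single-pass sketch (hypothetical helper): keep at most k centers.

    A point within distance 2*r of an existing center is treated as
    covered and discarded; otherwise it becomes a new center. When more
    than k centers accumulate, r grows and nearby centers are merged.
    """
    centers: List[Point] = []
    r = 0.0
    for raw in stream:
        p = tuple(raw)
        # Covered points are never stored, which keeps memory small.
        if any(math.dist(p, c) <= 2 * r for c in centers):
            continue
        centers.append(p)
        while len(centers) > k:
            if r == 0.0:
                # First overflow: start from the closest pair of centers.
                r = min(
                    math.dist(a, b)
                    for i, a in enumerate(centers)
                    for b in centers[i + 1:]
                ) / 2
            else:
                r *= 2
            # Greedy merge: keep only centers more than 2*r apart.
            kept: List[Point] = []
            for c in centers:
                if all(math.dist(c, q) > 2 * r for q in kept):
                    kept.append(c)
            centers = kept
    return centers, r


# Usage: a generator stands in for a stream that can be read only once.
points = ((x, y) for x, y in [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)])
centers, r = stream_k_center(points, k=2)
```

The working state here depends only on \(k\) and not on \(n\), which is the kind of sublinear-space behavior the abstract refers to.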

Author information

Corresponding author

Correspondence to Sudipto Guha.

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Guha, S., Mishra, N. (2016). Clustering Data Streams. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_8

  • DOI: https://doi.org/10.1007/978-3-540-28608-0_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28607-3

  • Online ISBN: 978-3-540-28608-0

  • eBook Packages: Computer Science, Computer Science (R0)
