Clustering Data Streams

Guha, Sudipto; Mishra, Nina

doi:10.1007/978-3-540-28608-0_8

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

Clustering is a useful and ubiquitous tool in data analysis. Broadly speaking, clustering is the problem of grouping a data set into several groups such that, under some definition of “similarity,” similar items are in the same group and dissimilar items are in different groups. In this chapter we focus on clustering in a streaming scenario where a small number of data items are presented at a time and we cannot store all the data points. Thus, our algorithms are restricted to a single pass. The space restriction is typically sublinear, \(o(n)\), where the number of input points is \(n\).
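To make the single-pass, sublinear-space setting concrete, here is a minimal Python sketch of a doubling-style streaming heuristic for a k-center-like objective. It is an illustration only and is not taken from the chapter: the function name `stream_k_centers`, the use of Euclidean distance, and the radius-doubling merge rule are all assumptions chosen for brevity. The sketch reads each point once and never stores more than k + 1 points.

```python
import math
from typing import Iterable, List, Tuple

Point = Tuple[float, ...]


def stream_k_centers(stream: Iterable[Point], k: int) -> Tuple[List[Point], float]:
    """One pass over `stream`, keeping at most k centers (hypothetical helper).

    Returns the retained centers and the current covering radius.
    Memory use is O(k) points regardless of the stream length n.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    centers: List[Point] = []
    radius = 0.0

    for p in stream:
        # A point within `radius` of a retained center is discarded.
        if any(math.dist(p, c) <= radius for c in centers):
            continue

        # Otherwise the point opens a new center.
        centers.append(p)

        # Too many centers: grow the radius and merge until at most k remain.
        while len(centers) > k:
            if radius == 0.0:
                # First overflow: start from the closest pair of centers.
                radius = min(
                    math.dist(a, b)
                    for i, a in enumerate(centers)
                    for b in centers[i + 1:]
                )
            else:
                radius *= 2.0

            # Greedily keep centers that are pairwise more than `radius` apart.
            kept: List[Point] = []
            for c in centers:
                if all(math.dist(c, q) > radius for q in kept):
                    kept.append(c)
            centers = kept

    return centers, radius


if __name__ == "__main__":
    points = [(0.0,), (0.1,), (10.0,), (10.2,), (20.0,), (0.05,)]
    centers, radius = stream_k_centers(iter(points), k=2)
    print(centers, radius)
```

In this sketch, every point seen so far stays within 2 * radius of some retained center, because each merge step moves a point's nearest center by at most the new radius. The sketch makes no claim about approximation quality relative to the optimal clustering.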

Author information

Authors and Affiliations

Department of Computer Information Sciences, University of Pennsylvania, Philadelphia, PA, 19104, USA
Sudipto Guha
Computer Science Department, Stanford University, Stanford, CA, 94305, USA
Nina Mishra

Corresponding author

Correspondence to Sudipto Guha.

Editor information

Editors and Affiliations

School of ECE, Technical University of Crete, University Campus - Kounoupidiana, Chania, Greece
Minos Garofalakis
Microsoft Corporation, Redmond, Washington, USA
Johannes Gehrke
Amazon India, Bangalore, India
Rajeev Rastogi

About this chapter

Cite this chapter

Guha, S., Mishra, N. (2016). Clustering Data Streams. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_8

DOI: https://doi.org/10.1007/978-3-540-28608-0_8
Published: 12 July 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28607-3
Online ISBN: 978-3-540-28608-0
eBook Packages: Computer Science, Computer Science (R0)
