Abstract
Clustering is a useful and ubiquitous tool in data analysis. Broadly speaking, clustering is the problem of partitioning a data set into several groups such that, under some definition of “similarity,” similar items are in the same group and dissimilar items are in different groups. In this chapter we focus on clustering in a streaming scenario, where only a small number of data items are presented at a time and we cannot store all the data points. Our algorithms are therefore restricted to a single pass over the data. The space available is typically sublinear, \(o(n)\), where \(n\) is the number of input points.
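To make the single-pass, sublinear-space constraint concrete, here is a minimal sketch of a doubling-style heuristic for a \(k\)-center-like objective. It is an illustration only, not necessarily an algorithm from the chapter: the function name `stream_k_center`, the Euclidean metric, and the merge rule are all assumptions made for this example, and no approximation guarantee is claimed. The point is that each item is examined once and at most \(k+1\) centers are held in memory, independent of \(n\).

```python
import math
from itertools import combinations
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, ...]


def stream_k_center(stream: Iterable[Sequence[float]], k: int) -> List[Point]:
    """One pass over `stream`, keeping at most k centers (hypothetical sketch).

    Assumes Euclidean distance and k >= 1. Memory use is O(k) points.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    centers: List[Point] = []
    r = 0.0  # current radius estimate; points within 2*r of a center are absorbed

    for raw in stream:
        p = tuple(raw)
        # A point already close to some stored center is discarded immediately.
        if any(math.dist(p, c) <= 2 * r for c in centers):
            continue
        centers.append(p)

        # Too many centers: enlarge the radius and merge nearby centers.
        while len(centers) > k:
            if r == 0.0:
                # First overflow: start from half the closest pair distance.
                r = min(math.dist(a, b) for a, b in combinations(centers, 2)) / 2
            else:
                r *= 2
            kept: List[Point] = []
            for c in centers:
                # Keep c only if it is far from every center kept so far.
                if all(math.dist(c, q) > 2 * r for q in kept):
                    kept.append(c)
            centers = kept

    return centers
```

For example, calling `stream_k_center(iter(points), 3)` on any iterable of coordinate tuples returns at most three centers without ever materializing the full input. Because discarded points are never revisited, the quality of the result depends on the order in which points arrive.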
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Guha, S., Mishra, N. (2016). Clustering Data Streams. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_8
DOI: https://doi.org/10.1007/978-3-540-28608-0_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28607-3
Online ISBN: 978-3-540-28608-0
eBook Packages: Computer Science, Computer Science (R0)