Advertisement

Abstract

Clustering is a common problem in the analysis of large data sets. Streaming algorithms, which make a single pass over the data set using small working memory and produce a clustering comparable in cost to the optimal offline solution, are especially useful. We develop the first streaming algorithms achieving a constant-factor approximation to the cluster radius for two variations of the k-center clustering problem. We give a streaming (4 + ε)-approximation algorithm using O(ε − 1 kz) memory for the problem with outliers, in which the clustering is allowed to drop up to z of the input points; previous work used a random sampling approach which yields only a bicriteria approximation. We also give a streaming (6 + ε)-approximation algorithm using O(ε − 1 ln (ε − 1) k + k 2) memory for a variation motivated by anonymity considerations in which each cluster must contain at least a certain number of input points.

Keywords

clustering k-center streaming outliers anonymity 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Agarwal, P., Har-Peled, S., Yu, H.: Robust shape fitting via peeling and grating coresets. In: Proc. of ACM Symp. on Discrete Algorithms (SODA), pp. 182–191 (2006)Google Scholar
  2. 2.
    Aggarwal, G., Feder, T., Kenthapadi, K., Khuller, S., Panigrahy, R., Thomas, D., Zhu, A.: Achieving anonymity via clustering. In: Proc. of ACM Principles of Database Systems (PODS), pp. 153–162 (2006)Google Scholar
  3. 3.
    Aldenderfer, M.S., Blashfield, R.K.: Cluster Analysis. Sage, Beverly Hills (1984)Google Scholar
  4. 4.
    Bădoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: Proc. of ACM Symp. on Theory of Computing (STOC), pp. 250–257 (2002)Google Scholar
  5. 5.
    Charikar, M., Chekuri, C., Feder, T., Motwani, R.: Incremental clustering and dynamic infomation retrieval. In: Proc. of ACM Symp. on Theory of Computing (STOC), pp. 626–635 (1997)Google Scholar
  6. 6.
    Charikar, M., O’Callaghan, L., Panigrahy, R.: Better streaming algorithms for clustering problems. In: Proc. of ACM Symp. on Theory of Computing (STOC), pp. 30–39 (2003)Google Scholar
  7. 7.
    Charikar, M., Khuller, S., Mount, D., Narasimhan, G.: Algorithms for facility location problems with outliers. In: Proc. of ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 642–651 (2001)Google Scholar
  8. 8.
    Chen, K.: A constant factor approximation algorithm for k-median clustering with outliers. In: Proc. of ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 826–835 (2008)Google Scholar
  9. 9.
    Everitt, B.: Cluster Analysis. Heinemann Educational, London (1974)Google Scholar
  10. 10.
    Gonzalez, T.: Clustering to minimize the maximum inter-cluster distance. Theoretical Computer Science 38, 293–306 (1985)zbMATHCrossRefMathSciNetGoogle Scholar
  11. 11.
    Guha, S.: The k-center karma of a data stream (unpublished manuscript) (2007)Google Scholar
  12. 12.
    Guha, S., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams. In: Proc. of IEEE Foundations of Computer Science (FOCS), pp. 359–366 (2000)Google Scholar
  13. 13.
    Hartigan, J.A.: Clustering Algorithms. Wiley, Chichester (1975)zbMATHGoogle Scholar
  14. 14.
    Hochbaum, D., Shmoys, D.B.: A best possible approximation algorithm for the k-center problem. Math. of Operations Research 10, 180–184 (1985)zbMATHMathSciNetCrossRefGoogle Scholar
  15. 15.
    Jain, A.K., Dubes, R.C.: Algorithms for clustering data. Prentice Hall, NJ (1988)zbMATHGoogle Scholar
  16. 16.
    Rasmussen, E.: Clustering algorithms. In: Frakes, W., Baeza-Yates, R. (eds.) Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs (1992)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Richard Matthew McCutchen
    • 1
  • Samir Khuller
    • 1
  1. 1.University of Maryland 

Personalised recommendations