Machine Learning, Volume 66, Issue 2–3, pp. 243–257

A framework for statistical clustering with constant time approximation algorithms for K-median and K-means clustering

Abstract

We consider a framework for sample-based clustering. In this setting, the input to a clustering algorithm is a sample generated i.i.d. by some unknown, arbitrary distribution. Based on such a sample, the algorithm must output a clustering of the full domain set, which is evaluated with respect to the underlying distribution. We provide general conditions on clustering problems that imply the existence of sample-based clustering algorithms that approximate the optimal clustering. We show that the K-median, K-means, and Vector Quantization problems all satisfy these conditions. Our results apply to the combinatorial optimization setting where, assuming that sampling uniformly over an input set can be done in constant time, we obtain a sample-based algorithm for the K-median and K-means clustering problems that finds an almost optimal set of centers in time depending only on the confidence and accuracy parameters of the approximation, independent of the input size. Furthermore, in the Euclidean input case, the running time of our algorithm depends only linearly on the Euclidean dimension. Our main technical tool is a uniform convergence result for center-based clustering that can be viewed as showing that the effective VC-dimension of K-center clustering equals K.
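To illustrate the sample-based scheme the abstract describes, here is a minimal Python sketch: draw a uniform i.i.d. sample, then optimize the K-median objective over the sample alone. This is an illustration only, not the paper's algorithm; the function names, the exhaustive search over k-subsets of the sample, and the choice of sample_size are all assumptions introduced here. In the paper, the uniform convergence result is what determines how large a sample suffices, as a function of the accuracy and confidence parameters, for the sample cost to track the true cost.

```python
import itertools
import math
import random

def kmedian_cost(centers, points):
    """Average distance from each point to its nearest center (the K-median objective)."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)

def sample_based_kmedian(data, k, sample_size, rng=random):
    """Pick near-optimal centers by examining only a fixed-size uniform sample.

    The exhaustive search below runs over all k-subsets of the sample, so its
    cost grows with sample_size and k but never with len(data), which is the
    constant-time flavour described in the abstract.
    """
    # Draw sample_size points i.i.d. uniformly from the input set.
    sample = [rng.choice(data) for _ in range(sample_size)]
    best_centers, best_cost = None, math.inf
    for candidate in itertools.combinations(sample, k):
        # Evaluate candidate centers on the sample only; uniform convergence
        # is what lets this estimate stand in for the true cost.
        cost = kmedian_cost(candidate, sample)
        if cost < best_cost:
            best_centers, best_cost = candidate, cost
    return list(best_centers)
```

For example, `sample_based_kmedian(data, k=3, sample_size=40)` inspects only the 40 sampled points, so it runs in the same time whether `data` holds a thousand points or a billion, matching the abstract's claim that the running time depends only on the approximation parameters and not on the input size.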

Keywords

k-means clustering · k-median clustering · sample-based clustering · approximation algorithms · description schemes

Copyright information

© Springer Science + Business Media, LLC 2007

Authors and Affiliations

  1. School of Computer Science, University of Waterloo, Waterloo, Canada