BICO: BIRCH Meets Coresets for k-Means Clustering

  • Hendrik Fichtenberger
  • Marc Gillé
  • Melanie Schmidt
  • Chris Schwiegelshohn
  • Christian Sohler
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8125)

Abstract

We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of BIRCH [27], an algorithm that won the SIGMOD Test of Time Award, with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space.
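To make the objective concrete, the following is a minimal Python sketch of the k-means cost of a center set C on a point set P, as defined above. The function names (`squared_distance`, `kmeans_cost`) and the toy data are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(p: Point, c: Point) -> float:
    """Squared Euclidean distance between a point and a center."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Toy example (hypothetical data): two well-separated pairs of points.
P = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(0.0, 1.0), (10.0, 1.0)]
print(kmeans_cost(P, C))  # 4.0: every point is at distance 1 from its center
```

The k-means problem asks for the C of size k that minimizes this value. The sketch only evaluates a given C; it does not search for the best one.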

BICO computes high-quality solutions with a running time that is short in practice. First, BICO computes a summary S of the data with a provable quality guarantee: for every center set C, S has the same cost as P up to a (1 + ε)-factor, i.e., S is a coreset. Then, it runs k-means++ [6] on S.
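The two steps can be illustrated with a small Python sketch. It is not BICO: it does not build the summary and assumes a weighted point set S is already given. It shows how the cost of a weighted summary is evaluated, one literal reading of the (1 + ε)-factor condition for a single center set, and the D²-sampling seeding step of k-means++ adapted to weighted points (the full k-means++ algorithm follows the seeding with Lloyd iterations, which are omitted here). All names and the toy data are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def weighted_cost(points: Sequence[Point], weights: Sequence[float],
                  centers: Sequence[Point]) -> float:
    """k-means cost where each point counts with its weight."""
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def within_factor(cost_p: float, cost_s: float, eps: float) -> bool:
    """Check cost_s against cost_p up to a (1 + eps)-factor (one reading)."""
    return cost_p / (1.0 + eps) <= cost_s <= (1.0 + eps) * cost_p


def kmeanspp_seed(points: Sequence[Point], weights: Sequence[float],
                  k: int, rng: random.Random) -> List[Point]:
    """Weighted D^2 sampling: pick centers with probability ~ weight * dist^2."""
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    # d2[i] is the squared distance of point i to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:
            break  # every point already coincides with a chosen center
        c = rng.choices(points, weights=probs, k=1)[0]
        centers.append(c)
        d2 = [min(d, sq_dist(p, c)) for p, d in zip(points, d2)]
    return centers


# Hypothetical data: P contains duplicates, S stores each distinct point once
# with its multiplicity as weight, so S reproduces the cost of P exactly.
P = [(0.0, 0.0)] * 3 + [(4.0, 0.0)] + [(10.0, 0.0)] * 2
S = [(0.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
W = [3.0, 1.0, 2.0]
C = [(0.0, 0.0), (10.0, 0.0)]

cost_p = weighted_cost(P, [1.0] * len(P), C)
cost_s = weighted_cost(S, W, C)
print(cost_p, cost_s, within_factor(cost_p, cost_s, eps=0.1))  # 16.0 16.0 True
print(kmeanspp_seed(S, W, k=2, rng=random.Random(0)))
```

A coreset must satisfy the cost condition for every center set C, not only for the single C checked here. The example uses an exact summary only to keep the arithmetic easy to verify.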

We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen [24]) and with approximation algorithms (Stream-KM++ [2], StreamLS [16,26]) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.

References

  1. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)
  2. Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: A clustering algorithm for data streams. ACM Journal of Experimental Algorithmics 17(1) (2012)
  3. Agarwal, P.K., Erickson, J.: Geometric range searching and its relatives. Contemporary Mathematics 223, 1–56 (1999)
  4. Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. Journal of the ACM 51(4), 606–635 (2004)
  5. Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proc. of the 22nd SoCG, pp. 144–153 (2006)
  6. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proc. of the 18th SODA, pp. 1027–1035 (2007)
  7. Bentley, J.L., Saxe, J.B.: Decomposable searching problems I: Static-to-dynamic transformation. J. Algorithms 1(4), 301–358 (1980)
  8. Chen, K.: On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing 39(3), 923–947 (2009)
  9. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
  10. Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proc. of the 43rd STOC, pp. 569–578 (2011)
  11. Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proc. 23rd SoCG, pp. 11–18 (2007)
  12. Feldman, D., Schmidt, M., Sohler, C.: Constant-size coresets for k-means, PCA and projective clustering. In: Proc. of the 24th SODA, pp. 1434–1453 (2012)
  13. Fink, G.A., Plötz, T.: Open source project ESMERALDA
  14. Fisher, D.H.: Knowledge acquisition via incremental conceptual clustering. Machine Learning 2(2), 139–172 (1987)
  15. Frahling, G., Sohler, C.: Coresets in dynamic geometric data streams. In: Proc. of the 37th STOC, pp. 209–217 (2005)
  16. Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: Theory and practice. IEEE TKDE 15(3), 515–528 (2003)
  17. Guha, S., Rastogi, R., Shim, K.: Rock: A robust clustering algorithm for categorical attributes. Inform. Systems 25(5), 345–366 (2000)
  18. Guha, S., Rastogi, R., Shim, K.: Cure: An efficient clustering algorithm for large databases. Inform. Systems 26(1), 35–58 (2001)
  19. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. Journal of Intelligent Inform. Systems 17(2-3), 107–145 (2001)
  20. Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry 37(1), 3–19 (2007)
  21. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proc. of the 36th STOC, pp. 291–300 (2004)
  22. Langberg, M., Schulman, L.J.: Universal epsilon-approximators for integrals. In: Proc. of the 21st SODA, pp. 598–607 (2010)
  23. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
  24. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Math. Stat. and Prob., pp. 281–297 (1967)
  25. Ng, R.T., Han, J.: Clarans: A method for clustering objects for spatial data mining. IEEE TKDE 14(5), 1003–1016 (2002)
  26. O’Callaghan, L., Meyerson, A., Motwani, R., Mishra, N., Guha, S.: Streaming-data algorithms for high-quality clustering. In: Proc. 18th ICDE, pp. 685–694 (2002)
  27. Zhang, T., Ramakrishnan, R., Livny, M.: Birch: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery 1(2), 141–182 (1997)

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Hendrik Fichtenberger (1)
  • Marc Gillé (1)
  • Melanie Schmidt (1)
  • Chris Schwiegelshohn (1)
  • Christian Sohler (1)

  1. Efficient Algorithms and Complexity Theory, TU Dortmund, Germany
