Solving k-means on High-Dimensional Big Data

  • Conference paper
  • Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9125)
  • Conference series: Experimental Algorithms (SEA 2015)

Abstract

In recent years, there have been major efforts to develop data stream algorithms that process inputs in one pass over the data with low memory requirements. For the k-means problem, this has led to the development of several \((1+\varepsilon )\)-approximations (under the assumption that k is a constant), but also to the design of algorithms that are extremely fast in practice and compute solutions of high accuracy. However, when not only the length of the stream but also the dimensionality of the input points is high, current methods reach their limits.
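
Here, the k-means objective is the standard sum of squared distances: given a finite point set \(P \subset \mathbb{R}^d\) and an integer \(k\), the task is to find a set \(C \subset \mathbb{R}^d\) of \(k\) centers minimizing

\[ \mathrm{cost}(P, C) \;=\; \sum_{p \in P} \min_{c \in C} \lVert p - c \rVert_2^2 . \]

A \((1+\varepsilon)\)-approximation returns centers \(C\) with \(\mathrm{cost}(P, C) \le (1+\varepsilon) \cdot \mathrm{cost}(P, C^{*})\), where \(C^{*}\) is an optimal set of \(k\) centers.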

We propose two algorithms, piecy and piecy-mr, that are based on the recently developed data stream algorithm BICO and that can process high-dimensional data in one pass and output a solution of high quality. While piecy is suited for high-dimensional data with a medium number of points, piecy-mr is meant for high-dimensional data that comes in a very long stream. We provide an extensive experimental study of piecy and piecy-mr that shows the strength of the new algorithms.
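
As a rough illustration of this kind of one-pass pipeline, the sketch below reads the stream chunk by chunk, reduces every point to a lower dimension, and maintains a small summary on which k-means can be solved afterwards. It is a minimal sketch only: the random projection and the plain reservoir sample are generic stand-ins with hypothetical names (random_projection_matrix, ReservoirSummary, summarize_stream), not the components that piecy and piecy-mr actually use, and BICO in particular is a far more careful summary structure than a uniform sample.

import numpy as np

def random_projection_matrix(d, m, seed=0):
    """Johnson-Lindenstrauss-style projection: a fixed random map from d to m dimensions."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((d, m)) / np.sqrt(m)

class ReservoirSummary:
    """Toy stand-in for a streaming summary such as BICO: keeps a uniform
    random sample of fixed size from everything seen so far."""

    def __init__(self, capacity, seed=1):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self.rng = np.random.default_rng(seed)

    def update(self, points):
        for p in points:
            self.seen += 1
            if len(self.sample) < self.capacity:
                self.sample.append(p)
            else:
                j = self.rng.integers(self.seen)  # classic reservoir sampling
                if j < self.capacity:
                    self.sample[j] = p

def summarize_stream(stream, d, m, capacity):
    """One pass over the stream: project each chunk to m dimensions, then summarize it."""
    proj = random_projection_matrix(d, m)
    summary = ReservoirSummary(capacity)
    for chunk in stream:                  # each chunk is an (n_i x d) array
        summary.update(chunk @ proj)      # every point now has m coordinates
    return np.asarray(summary.sample)     # small point set; run a k-means solver on it

The returned summary is small enough that any k-means solver, for example k-means++ seeding followed by Lloyd's iterations, can be run on it as the final step.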

References

  1. C++ library: Lapack++ v2.5.4 (2010). http://sourceforge.net/projects/lapackpp/ (accessed: February 8, 2015)

  2. Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: A clustering algorithm for data streams. ACM J. of Exp. Algorithmics 17, 1–30 (2012)

  3. Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. J. of the ACM 51(4), 606–635 (2004)

  4. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proc. of the 18th SODA, pp. 1027–1035 (2007)

  5. Bentley, J.L., Saxe, J.B.: Decomposable searching problems I: Static-to-dynamic transformation. J. of Algorithms 1(4), 301–358 (1980)

  6. Cohen, M.B., Elder, S., Musco, C., Musco, C., Persu, M.: Dimensionality reduction for k-means clustering and low rank approximation. In: Proc. of the 47th STOC (to appear, 2015)

  7. Drineas, P., Frieze, A.M., Kannan, R., Vempala, S., Vinay, V.: Clustering large graphs via the singular value decomposition. Machine Learning 56, 9–33 (2004)

  8. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: Workshop on Generative-Model Based Vision, CVPR. IEEE (2004)

  9. Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proc. of the 43rd STOC, pp. 569–578 (2011)

  10. Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Proc. of the 24th SODA, pp. 1434–1453 (2013)

  11. Fichtenberger, H., Gillé, M., Schmidt, M., Schwiegelshohn, C., Sohler, C.: BICO: BIRCH meets coresets for k-means clustering. In: Proc. of the 21st ESA, pp. 481–492 (2013)

  12. Halko, N., Martinsson, P.-G., Tropp, J.A.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53(2), 217–288 (2011)

  13. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proc. of the 36th STOC, pp. 291–300 (2004)

  14. Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31(8), 651–666 (2010)

  15. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall (1988)

  16. Jain, K., Vazirani, V.V.: Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and Lagrangian relaxation. J. of the ACM 48(2), 274–296 (2001)

  17. Kappmeier, J.-P.W., Schmidt, D.R., Schmidt, M.: Solving k-means on high-dimensional big data (2015). CoRR, abs/1502.04265

  18. Lloyd, S.P.: Least squares quantization in PCM. Bell Laboratories Technical Memorandum (1957)

  19. Mahoney, M.W.: Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning 3(2), 123–224 (2011)

  20. Okanohara, D.: C++ project: redsvd - RandomizED Singular Value Decomposition (2011). https://code.google.com/p/redsvd/ (accessed: February 2, 2015)

  21. Stallmann, J.: Benchmarkinstanzen für das \(k\)-means Problem (Benchmark instances for the \(k\)-means problem). Bachelor's thesis, TU Dortmund University (2014). In German

  22. Steinhaus, H.: Sur la division des corps matériels en parties. Bulletin de l'Académie Polonaise des Sciences IV(12), 801–804 (1956)

  23. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for \(k\)-means clustering. Comp. Geom. 28(2–3), 89–112 (2004)

  24. Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A.F.M., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., Steinberg, D.: Top 10 algorithms in data mining. Know. and Inf. Sys. 14(1), 1–37 (2008)

  25. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A New Data Clustering Algorithm and Its Applications. Data M. and Know. Disc. 1(2), 141–182 (1997)

Author information

Correspondence to Melanie Schmidt.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Kappmeier, J.-P.W., Schmidt, D.R., Schmidt, M. (2015). Solving k-means on High-Dimensional Big Data. In: Bampis, E. (ed.) Experimental Algorithms. SEA 2015. Lecture Notes in Computer Science, vol. 9125. Springer, Cham. https://doi.org/10.1007/978-3-319-20086-6_20

  • DOI: https://doi.org/10.1007/978-3-319-20086-6_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20085-9

  • Online ISBN: 978-3-319-20086-6
