Abstract
In recent years, there have been major efforts to develop data stream algorithms that process their input in one pass over the data with small memory requirements. For the k-means problem, this has led to the development of several \((1+\varepsilon )\)-approximations (under the assumption that k is a constant), but also to the design of algorithms that are extremely fast in practice and compute solutions of high accuracy. However, when not only the stream is long but the input points are also high-dimensional, current methods reach their limits.
We propose two algorithms, piecy and piecy-mr, that build on the recently developed data stream algorithm BICO and can process high-dimensional data in one pass while producing a solution of high quality. While piecy is suited for high-dimensional data with a medium number of points, piecy-mr is meant for high-dimensional data that arrives in a very long stream. We provide an extensive experimental study of piecy and piecy-mr that shows the strength of the new algorithms.
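To make the objective concrete: the algorithms above approximate the k-means cost, i.e., the sum of squared distances from each input point to its nearest of k centers. The following minimal sketch shows that objective together with Lloyd's classic local-search heuristic. It is an illustration of the underlying problem only, not an implementation of piecy, piecy-mr, or BICO; the function names `kmeans` and `cost` are ours.

```python
import random

def cost(points, centers):
    # k-means objective: sum of squared distances to the nearest center.
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )

def kmeans(points, k, iters=50, seed=0):
    # Lloyd's heuristic: alternate assignment and mean-update steps.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers
```

Lloyd's heuristic needs random access to all points and many passes, which is exactly what a long, high-dimensional stream rules out; coreset-based streaming algorithms such as BICO instead maintain a small weighted summary in one pass and run a k-means solver on that summary afterwards.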
References
C++ library: Lapack++ v2.5.4. (2010). http://sourceforge.net/projects/lapackpp/ (accessed: February 8, 2015)
Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: A clustering algorithm for data streams. ACM J. of Exp. Algorithmics 17, 1–30 (2012)
Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. J. of the ACM 51(4), 606–635 (2004)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proc. of the 18th SODA, pp. 1027–1035 (2007)
Bentley, J.L., Saxe, J.B.: Decomposable searching problems i: Static-to-dynamic transformation. J. of Algorithms 1(4), 301–358 (1980)
Cohen, M.B., Elder, S., Musco, C., Musco, C., Persu, M.: Dimensionality reduction for k-means clustering and low rank approximation. In: Proc. of the 47th STOC, (to appear 2015)
Drineas, P., Frieze, A.M., Kannan, R., Vempala, S., Vinay, V.: Clustering large graphs via the singular value decomposition. Machine Learning 56, 9–33 (2004)
Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In: Workshop on Generative-Model Based Vision, CVPR. IEEE (2004)
Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proc. of the 43rd STOC, pp. 569–578 (2011)
Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Proc. of the 24th SODA, pp. 1434–1453 (2013)
Fichtenberger, H., Gillé, M., Schmidt, M., Schwiegelshohn, C., Sohler, C.: BICO: BIRCH meets coresets for k-means clustering. In: Proc. 21st ESA, pp. 481–492 (2013)
Halko, N., Martinsson, P.-G., Tropp, J.A.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review (SIREV) 53(2), 217–288 (2011)
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proc. of the 36th STOC, pp. 291–300 (2004)
Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31(8), 651–666 (2010)
Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall (1988)
Jain, K., Vazirani, V.V.: Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and lagrangian relaxation. J. of the ACM 48(2), 274–296 (2001)
Kappmeier, J.-P.W., Schmidt, D.R., Schmidt, M.: Solving k-means on high-dimensional big data (2015). CoRR, abs/1502.04265
Lloyd, S.P.: Least squares quantization in PCM. Bell Laboratories Technical Memorandum (1957)
Mahoney, M.W.: Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning 3(2), 123–224 (2011)
Okanohara, D.: C++ project: redsvd - RandomizED Singular Value Decomposition (2011). https://code.google.com/p/redsvd/ (accessed: February 2, 2015)
Stallmann, J.: Benchmarkinstanzen für das \(k\)-means Problem (Benchmark instances for the k-means problem). Bachelor's thesis, TU Dortmund University (2014). In German
Steinhaus, H.: Sur la division des corps matériels en parties. Bulletin de l’Académie Polonaise des Sciences IV(12), 801–804 (1956)
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for \(k\)-means clustering. Comp. Geom. 28(2–3), 89–112 (2004)
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A.F.M., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., Steinberg, D.: Top 10 algorithms in data mining. Know. and Inf. Sys. 14(1), 1–37 (2008)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A New Data Clustering Algorithm and Its Applications. Data M. and Know. Disc. 1(2), 141–182 (1997)
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Kappmeier, J.-P.W., Schmidt, D.R., Schmidt, M. (2015). Solving k-means on High-Dimensional Big Data. In: Bampis, E. (ed.) Experimental Algorithms. SEA 2015. Lecture Notes in Computer Science, vol. 9125. Springer, Cham. https://doi.org/10.1007/978-3-319-20086-6_20
Print ISBN: 978-3-319-20085-9
Online ISBN: 978-3-319-20086-6