Abstract
In recent years, there have been major efforts to develop data stream algorithms that process their input in one pass over the data with small memory requirements. For the k-means problem, this has led to the development of several \((1+\varepsilon )\)-approximations (under the assumption that k is a constant), but also to the design of algorithms that are extremely fast in practice and compute solutions of high accuracy. However, when not only the stream is long but the input points are also high-dimensional, current methods reach their limits.
We propose two algorithms, piecy and piecy-mr, that build on the recently developed data stream algorithm BICO and can process high-dimensional data in one pass while producing a solution of high quality. While piecy is suited for high-dimensional data with a medium number of points, piecy-mr is meant for high-dimensional data that arrives in a very long stream. We provide an extensive experimental study of piecy and piecy-mr that shows the strength of the new algorithms.
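To make the objective concrete: the algorithms above approximate the k-means cost, i.e., the sum of squared distances from each input point to its nearest of k centers. The following minimal sketch shows that objective together with Lloyd's classic local-search heuristic. It is an illustration of the underlying problem only, not an implementation of piecy, piecy-mr, or BICO; the function names `kmeans` and `cost` are ours.

```python
import random

def cost(points, centers):
    # k-means objective: sum of squared distances to the nearest center.
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )

def kmeans(points, k, iters=50, seed=0):
    # Lloyd's heuristic: alternate assignment and mean-update steps.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers
```

Lloyd's heuristic needs random access to all points and many passes, which is exactly what a long, high-dimensional stream rules out; coreset-based streaming algorithms such as BICO instead maintain a small weighted summary in one pass and run a k-means solver on that summary afterwards.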
References
C++ library: Lapack++ v2.5.4. (2010). http://sourceforge.net/projects/lapackpp/ (accessed: February 8, 2015)
Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: A clustering algorithm for data streams. ACM J. of Exp. Algorithmics 17, 1–30 (2012)
Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. J. of the ACM 51(4), 606–635 (2004)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proc. of the 18th SODA, pp. 1027–1035 (2007)
Bentley, J.L., Saxe, J.B.: Decomposable searching problems i: Static-to-dynamic transformation. J. of Algorithms 1(4), 301–358 (1980)
Cohen, M.B., Elder, S., Musco, C., Musco, C., Persu, M.: Dimensionality reduction for k-means clustering and low rank approximation. In: Proc. of the 47th STOC, (to appear 2015)
Drineas, P., Frieze, A.M., Kannan, R., Vempala, S., Vinay, V.: Clustering large graphs via the singular value decomposition. Machine Learning 56, 9–33 (2004)
Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In: Workshop on Generative-Model Based Vision, CVPR. IEEE (2004)
Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proc. of the 43rd STOC, pp. 569–578 (2011)
Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Proc. of the 24th SODA, pp. 1434–1453 (2013)
Fichtenberger, H., Gillé, M., Schmidt, M., Schwiegelshohn, C., Sohler, C.: BICO: BIRCH meets coresets for k-means clustering. In: Proc. 21st ESA, pp. 481–492 (2013)
Halko, N., Martinsson, P.-G., Tropp, J.A.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review (SIREV) 53(2), 217–288 (2011)
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proc. of the 36th STOC, pp. 291–300 (2004)
Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31(8), 651–666 (2010)
Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall (1988)
Jain, K., Vazirani, V.V.: Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and lagrangian relaxation. J. of the ACM 48(2), 274–296 (2001)
Kappmeier, J.-P.W., Schmidt, D.R., Schmidt, M.: Solving k-means on high-dimensional big data (2015). CoRR, abs/1502.04265
Lloyd, S.P.: Least squares quantization in PCM. Bell Laboratories Technical Memorandum (1957)
Mahoney, M.W.: Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning 3(2), 123–224 (2011)
Okanohara, D.: C++ project: redsvd - RandomizED Singular Value Decomposition (2011). https://code.google.com/p/redsvd/ (accessed: February 2, 2015)
Stallmann, J.: Benchmarkinstanzen für das \(k\)-means Problem (Benchmark instances for the k-means problem). Bachelor's thesis, TU Dortmund University (2014). In German
Steinhaus, H.: Sur la division des corps matériels en parties. Bulletin de l’Académie Polonaise des Sciences IV(12), 801–804 (1956)
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for \(k\)-means clustering. Comp. Geom. 28(2–3), 89–112 (2004)
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A.F.M., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., Steinberg, D.: Top 10 algorithms in data mining. Know. and Inf. Sys. 14(1), 1–37 (2008)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A New Data Clustering Algorithm and Its Applications. Data M. and Know. Disc. 1(2), 141–182 (1997)
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Kappmeier, J.-P.W., Schmidt, D.R., Schmidt, M. (2015). Solving k-means on High-Dimensional Big Data. In: Bampis, E. (ed.) Experimental Algorithms. SEA 2015. Lecture Notes in Computer Science, vol. 9125. Springer, Cham. https://doi.org/10.1007/978-3-319-20086-6_20
Print ISBN: 978-3-319-20085-9
Online ISBN: 978-3-319-20086-6