
A vectorized k-means algorithm for compressed datasets: design and experimental analysis

Abstract

Clustering algorithms (e.g., Gaussian mixture models, k-means) tackle the problem of grouping a set of elements so that elements from the same group (or cluster) are more similar to each other than to elements in other clusters. This simple concept underlies complex algorithms in many application areas, including sequence analysis and genotyping in bioinformatics, medical imaging, antimicrobial activity, market research and social networking. However, as data volumes continue to increase, the performance of clustering algorithms is heavily influenced by the memory subsystem. In this paper, we propose a novel and efficient implementation of Lloyd’s k-means clustering algorithm that substantially reduces data movement along the memory hierarchy. Our contributions build on the fact that the vast majority of processors are equipped with powerful Single Instruction Multiple Data (SIMD) instructions that are, in most cases, underused. SIMD increases the CPU’s computational power and, used wisely, offers an opportunity to reduce application data transfers by compressing and decompressing the data, especially for memory-bound applications. Our contributions include a SIMD-friendly data layout, in-register implementations of key functions and SIMD-based compression. We demonstrate that, using our optimized SIMD-based compression method, it is possible to improve the performance and energy consumption of k-means by factors of 4.5x and 8.7x, respectively, on an Intel i7 Haswell machine, and by 22x and 22.2x on an Intel Xeon Phi Knights Landing (KNL), running a single thread.
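To make the vectorization idea concrete, below is a minimal sketch of the nearest-centroid assignment step at the heart of Lloyd’s k-means, written with AVX2/FMA intrinsics so that eight single-precision features are processed per instruction. This is an illustrative example under stated assumptions, not the authors’ implementation: the function name `nearest_centroid`, the row-major layout, and the requirement that the dimensionality be a multiple of 8 are ours.

```c
#include <immintrin.h>
#include <float.h>
#include <stddef.h>

/* Illustrative sketch (not the paper's code): assign one point to its
 * nearest centroid using AVX2/FMA. Assumes dim is a multiple of 8 and
 * that points and centroids are stored row-major as 32-bit floats. */
static size_t nearest_centroid(const float *point, const float *centroids,
                               size_t k, size_t dim)
{
    size_t best = 0;
    float best_dist = FLT_MAX;

    for (size_t c = 0; c < k; ++c) {
        __m256 acc = _mm256_setzero_ps();        /* 8 partial squared sums */
        for (size_t d = 0; d < dim; d += 8) {
            __m256 p = _mm256_loadu_ps(point + d);
            __m256 q = _mm256_loadu_ps(centroids + c * dim + d);
            __m256 diff = _mm256_sub_ps(p, q);
            acc = _mm256_fmadd_ps(diff, diff, acc);  /* acc += diff*diff */
        }
        /* Horizontal sum of the 8 lanes (see note 3 below). */
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                              _mm256_extractf128_ps(acc, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        float dist = _mm_cvtss_f32(s);

        if (dist < best_dist) {
            best_dist = dist;
            best = c;
        }
    }
    return best;
}
```

Compile with, e.g., `gcc -O2 -mavx2 -mfma`. In the setting the abstract describes, compressed blocks of the dataset would be decompressed into vector registers ahead of this distance loop, trading cheap SIMD compute for reduced memory traffic.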

Notes

  1. Positron emission tomography.

  2. Store instructions that skip the first level of the cache hierarchy (non-temporal stores); see the sketch after this list.

  3. The addition of all the data values within a vector register (a horizontal sum; also shown in the sketch below).

  4. Network on Chip.
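For concreteness, here is a minimal, hypothetical C sketch of the two SIMD idioms behind notes 2 and 3. The intrinsics are real AVX intrinsics, but the helper names (`stream_copy`, `horizontal_sum`) and the alignment/size assumptions are ours, not the authors’.

```c
#include <immintrin.h>
#include <stddef.h>

/* Note 2 (illustrative): non-temporal stores write a vector register to
 * memory while bypassing the cache, avoiding pollution by write-once data.
 * Assumes dst is 32-byte aligned and n is a multiple of 8. */
void stream_copy(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);
        _mm256_stream_ps(dst + i, v);   /* store bypasses the cache */
    }
    _mm_sfence();  /* order the weakly-ordered streaming stores */
}

/* Note 3 (illustrative): add all eight lanes of a vector register. */
float horizontal_sum(__m256 v)
{
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(v),
                          _mm256_extractf128_ps(v, 1));
    s = _mm_hadd_ps(s, s);  /* pairwise adds collapse 4 lanes to 2 */
    s = _mm_hadd_ps(s, s);  /* then 2 lanes to 1 */
    return _mm_cvtss_f32(s);
}
```

Note that `_mm256_stream_ps` requires a 32-byte-aligned destination, and the trailing `_mm_sfence()` makes the streaming stores visible before subsequent memory operations.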

Author information

Correspondence to Abdullah Al Hasib.

About this article

Cite this article

Al Hasib, A., Cebrian, J.M. & Natvig, L. A vectorized k-means algorithm for compressed datasets: design and experimental analysis. J Supercomput 74, 2705–2728 (2018). https://doi.org/10.1007/s11227-018-2310-0
