Volume 18, Issue 3, pp 157–169

Efficient and Scalable k‑Means on GPUs

  • Clemens Lutz
  • Sebastian Breß
  • Tilmann Rabl
  • Steffen Zeuch
  • Volker Markl


k-Means is a versatile clustering algorithm widely used in practice. To cluster large data sets, state-of-the-art implementations use GPUs to shorten the data-to-knowledge time. These implementations commonly assign points on a GPU and update centroids on a CPU.

We identify two main shortcomings of this approach. First, it requires expensive data exchange between processors when switching between the two processing steps, point assignment and centroid update. Second, even when both steps of k-means run on the same processor, the points must still be read twice per iteration, which uses memory bandwidth inefficiently.

In this paper, we present a novel approach for centroid update that allows us to efficiently process both phases of k-means on GPUs. We fuse point assignment and centroid update to execute one iteration with a single pass over the points. Our evaluation shows that our k-means approach scales to very large data sets. Overall, we achieve up to 20× higher throughput compared to the state-of-the-art approach.
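
The idea of fusing the two steps can be illustrated with a minimal, sequential Python sketch. It is not the GPU implementation presented in the paper: the function name `fused_kmeans_iteration`, the use of squared Euclidean distance, and the handling of empty clusters are assumptions made for illustration only. The sketch reads each point once, assigns it to its nearest centroid, and immediately adds it to that cluster's running sum and count, so no second pass over the points is needed to compute the new centroids.

```python
def fused_kmeans_iteration(points, centroids):
    """Run one k-means iteration with a single pass over the points.

    points:    sequence of equal-length coordinate sequences
    centroids: sequence of k coordinate sequences (current centroids)
    Returns a list of k new centroids as lists of floats.
    """
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]  # per-cluster coordinate sums
    counts = [0] * k                        # per-cluster point counts

    for p in points:
        # Point assignment: pick the nearest centroid
        # (squared Euclidean distance, an assumption of this sketch).
        best = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        # Fused centroid update: accumulate while the point is still at hand,
        # instead of storing the label and re-reading the point later.
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += p[d]

    new_centroids = []
    for c in range(k):
        if counts[c] == 0:
            # Assumption: a cluster with no points keeps its old centroid.
            new_centroids.append([float(x) for x in centroids[c]])
        else:
            new_centroids.append([s / counts[c] for s in sums[c]])
    return new_centroids


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(fused_kmeans_iteration(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

A parallel version would additionally have to combine the per-cluster sums and counts across threads, which this sequential sketch does not attempt to show.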



This work was funded by the EU projects SAGE (671500) and E2Data (780245), DFG Priority Program “Scalable Data Management for Future Hardware” (MA4662-5), and the German Ministry for Education and Research as BBDC (01IS14013A).



Copyright information

© Springer-Verlag GmbH Deutschland, ein Teil von Springer Nature 2018

Authors and Affiliations

  1. DFKI GmbH, Berlin, Germany
  2. TU Berlin, Berlin, Germany
