
The Journal of Supercomputing, Volume 69, Issue 2, pp 845–863

High performance parallel \(k\)-means clustering for disk-resident datasets on multi-core CPUs

  • Ali Hadian
  • Saeed Shahrivari
Article

Abstract

Nowadays, clustering of massive datasets is a crucial part of many data-analytic tasks. Most available clustering algorithms have two shortcomings when applied to big data: (1) a large group of clustering algorithms, e.g. \(k\)-means, must keep the data in memory and iterate over it many times, which is very costly for big datasets; (2) clustering algorithms that run within a limited memory budget, especially the family of stream-clustering algorithms, lack parallel implementations that can exploit modern multi-core processors, and the quality of their results is often poor. In this paper, we propose an algorithm that combines parallel clustering with single-pass, stream-clustering algorithms. The aim is a clustering algorithm that exploits the full capabilities of a regular multi-core PC to cluster the dataset as fast as possible while producing clusters of acceptable quality. The idea is to split the data into chunks and cluster each chunk in a separate thread; the clusters extracted from the chunks are then aggregated in a final re-clustering stage. The parameters of the algorithm can be adjusted to hardware limitations. Experimental results on a 12-core computer show that the proposed method is much faster than its batch-processing equivalents (e.g. \(k\)-means++) and than stream-based algorithms. The quality of its solutions is often equal to that of \(k\)-means++, while it significantly dominates stream-clustering algorithms. The solution also scales well with additional cores and hence provides an effective and fast way to cluster large datasets on multi-core and multi-processor systems.
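The two-stage scheme described above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' implementation: the function names are hypothetical, plain Lloyd's \(k\)-means with deterministic farthest-point seeding stands in for the per-chunk clustering step, and the final stage re-clusters the pooled per-chunk centroids unweighted.

```python
# Illustrative sketch of the chunk-and-recluster scheme from the abstract:
# split the data into chunks, cluster each chunk in its own thread, then
# aggregate by re-clustering the per-chunk centroids. All names here are
# hypothetical; Lloyd's k-means stands in for the paper's clustering step.
from concurrent.futures import ThreadPoolExecutor


def _seed_centroids(points, k):
    # Deterministic greedy farthest-point seeding (a simple stand-in for
    # the randomized k-means++ seeding).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                              for c in centroids)))
    return centroids


def kmeans(points, k, iters=20):
    """Lloyd's k-means on 2-D points given as (x, y) tuples."""
    centroids = _seed_centroids(points, k)
    for _ in range(iters):
        # Assignment step: put each point in its nearest centroid's bucket.
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                            + (p[1] - centroids[i][1]) ** 2)
            buckets[nearest].append(p)
        # Update step: move each centroid to the mean of its bucket.
        centroids = [
            (sum(p[0] for p in b) / len(b), sum(p[1] for p in b) / len(b))
            if b else centroids[i]
            for i, b in enumerate(buckets)]
    return centroids


def parallel_chunked_kmeans(points, k, n_chunks=4):
    # Stage 1: cluster each chunk in a separate thread.
    chunks = [points[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        per_chunk = list(pool.map(lambda c: kmeans(c, k), chunks))
    # Stage 2: re-cluster the pooled per-chunk centroids into k final centers.
    pooled = [c for cents in per_chunk for c in cents]
    return kmeans(pooled, k)
```

With pure-Python workers the threads mainly illustrate the structure (CPython's GIL serializes them); a real implementation would use genuinely parallel workers, and weighting each chunk centroid by its cluster size in the re-clustering stage would follow the abstract more closely.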

Keywords

Clustering · \(k\)-means · Parallel algorithms · Data mining · Big data

Acknowledgments

The authors would like to thank Ali Gholami Rudi for his constructive comments and discussions.

References

  1. Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) StreamKM++: a clustering algorithm for data streams. J Exp Algorithm 17(1):2–4
  2. Agarwal PK, Har-Peled S, Varadarajan KR (2004) Approximating extent measures of points. J ACM 51(4):606–635
  3. Ana LNF, Jain AK (2003) Robust data clustering. In: Proceedings of the 2003 IEEE Computer Society conference on computer vision and pattern recognition, vol 2. IEEE, pp II–128
  4. Apiletti D, Baralis E, Cerquitelli T, Chiusano S, Grimaudo L (2013) SeaRum: a cloud-based service for association rule mining. In: Trust, security and privacy in computing and communications (TrustCom), 12th IEEE international conference on. IEEE, pp 1283–1290
  5. Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 1027–1035
  6. Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable k-means++. Proc VLDB Endow 5(7):622–633
  7. Bekkerman R, Bilenko M, Langford J (2012) Scaling up machine learning: parallel and distributed approaches. Cambridge University Press, Cambridge
  8. Borg I, Groenen PJF (2005) Modern multidimensional scaling: theory and applications. Springer, Berlin
  9. Bottou L, Bengio Y (1995) Convergence properties of the k-means algorithms. In: Advances in neural information processing systems 7 (NIPS’94), pp 585–592
  10. Braverman V, Meyerson A, Ostrovsky R, Roytman A, Shindler M, Tagiku B (2011) Streaming k-means on well-clusterable data. In: Proceedings of the 22nd annual ACM-SIAM symposium on discrete algorithms. SIAM, pp 26–40
  11. Chandra A, Yao X (2006) Evolving hybrid ensembles of learning machines for better generalisation. Neurocomputing 69(7):686–700
  12. Chiang MMT, Mirkin B (2010) Intelligent choice of the number of clusters in K-means clustering: an experimental study with different cluster spreads. J Classif 27(1):3–40
  13. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
  14. Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Large-scale parallel data mining. Springer, Berlin, pp 245–260
  15. Domingos P, Hulten G (2001) A general method for scaling up machine learning algorithms and its application to clustering. In: ICML, pp 106–113
  16. Ene A, Im S, Moseley B (2011) Fast clustering using MapReduce. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 681–689
  17. Feldman D, Schmidt M, Sohler C (2013) Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the 24th ACM-SIAM symposium on discrete algorithms (SODA ’13)
  18. Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications, vol 20. Society for Industrial and Applied Mathematics
  19. Feng Y, Hamerly G (2007) PG-means: learning the number of clusters in data. In: Advances in neural information processing systems, vol 19. The MIT Press, Cambridge, pp 393–400
  20. Hawwash B, Nasraoui O (2012) Stream-dashboard: a framework for mining, tracking and validating clusters in a data stream. In: Proceedings of the 1st international workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications (BigMine ’12). ACM, New York, pp 109–117
  21. Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666
  22. Mahafzah BA, Al-Badarneh AF, Zakaria MZ (2009) A new sampling technique for association rule mining. J Inf Sci 35(3):358–376
  23. McLachlan GJ, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley-Interscience, New York
  24. O’Callaghan L (2003) Approximation algorithms for clustering streams and large data sets. PhD thesis, Stanford University
  25. Pelleg D, Moore A (2000) X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the seventeenth international conference on machine learning, vol 1. San Francisco, pp 727–734
  26. Rajaraman A, Ullman JD (2012) Mining of massive datasets. Cambridge University Press, Cambridge
  27. Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on World Wide Web. ACM, pp 1177–1178
  28. Shindler M, Wong A, Meyerson AW (2011) Fast and accurate k-means for large datasets. In: Advances in neural information processing systems, pp 2375–2383
  29. Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J Royal Stat Soc Series B (Stat Methodol) 63(2):411–423
  30. Wang L, Leckie C, Ramamohanarao K, Bezdek J (2009) Automatically determining the number of clusters in unlabeled data sets. IEEE Trans Knowl Data Eng 21(3):335–350
  31. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
  32. Xu R, Wunsch D (2008) Clustering, vol 10. Wiley-IEEE Press, New York
  33. Yan M, Ye K (2007) Determining the number of clusters using the weighted gap statistic. Biometrics 63(4):1031–1037
  34. Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on MapReduce. In: Cloud computing. Springer, Berlin, pp 674–679

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
  2. Department of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran
