Overview of Scalable Partitional Methods for Big Data Clustering

  • Mohamed Aymen Ben HajKacem
  • Chiheb-Eddine Ben N’Cir
  • Nadia Essoussi
Part of the Unsupervised and Semi-Supervised Learning book series (UNSESUL)


Big data clustering has become an important challenge in machine learning, since many applications require scalable clustering methods to organize such data into groups of similar objects. Several methods have been proposed during the last decade to address this challenge. In this chapter, we give an overview of the existing clustering methods, with a special emphasis on scalable partitional methods. We design a new categorization model based on the main properties that Big data partitional clustering methods rely on to ensure scalability when analyzing large amounts of data. Furthermore, we present a comparative experimental study of most of the existing methods on simulated and real large datasets. Based on the obtained results, we provide a guide for researchers and end users who need to choose the best method or framework for clustering large-scale data.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Mohamed Aymen Ben HajKacem (1)
  • Chiheb-Eddine Ben N’Cir (2)
  • Nadia Essoussi (1)
  1. LARODEC, Institut Supérieur de Gestion de Tunis, Université de Tunis, Tunis, Tunisia
  2. University of Jeddah, Jeddah, Saudi Arabia
