Abstract
Big data clustering has become an important challenge in machine learning, since many applications require scalable clustering methods to organize such data into groups of similar objects. Several methods have been proposed over the last decade to address this challenge. In this chapter, we give an overview of existing clustering methods, with special emphasis on scalable partitional methods. We design a new categorization model based on the main properties that Big data partitional clustering methods rely on to ensure scalability when analyzing large amounts of data. Furthermore, we present a comparative experimental study of most of the existing methods on simulated and real large datasets. Based on the results obtained, we provide a guide for researchers and end users who need to choose the best method or framework for clustering large-scale data.
Keywords
- Partitional Clustering Methods
- Hadoop Distributed File System (HDFS)
- MapReduce Job
- Large-scale Data Clustering
- MapReduce Framework
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
HajKacem, M.A.B., N’Cir, CE.B., Essoussi, N. (2019). Overview of Scalable Partitional Methods for Big Data Clustering. In: Nasraoui, O., Ben N'Cir, CE. (eds) Clustering Methods for Big Data Analytics. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-97864-2_1
DOI: https://doi.org/10.1007/978-3-319-97864-2_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97863-5
Online ISBN: 978-3-319-97864-2
eBook Packages: Engineering, Engineering (R0)