Abstract
Data clustering is one of the most studied data mining tasks. It aims, through various methods, to discover previously unknown groups within data sets. In recent years, considerable progress has been made in this field, leading to the development of innovative and promising clustering algorithms. However, traditional clustering algorithms suffer from serious limitations in speed-up, throughput, and scalability. They can therefore no longer be used directly in the context of Big Data, where data are mainly characterized by their volume, velocity, and variety. To overcome these limitations, research is turning to parallel computing, giving rise to so-called parallel clustering algorithms. This paper presents an overview of the latest parallel clustering algorithms, categorized according to the computing platforms used to handle Big Data, namely horizontal and vertical scaling platforms. The former category includes peer-to-peer networks, MapReduce, and Spark platforms, while the latter includes multi-core processors, Graphics Processing Units, and Field Programmable Gate Arrays. In addition, the paper compares the performance of the reviewed algorithms based on common criteria of clustering validation in the Big Data context. It thus provides the reader with an overall view of current parallel clustering techniques.
Cite this article
Dafir, Z., Lamari, Y. & Slaoui, S.C. A survey on parallel clustering algorithms for Big Data. Artif Intell Rev 54, 2411–2443 (2021). https://doi.org/10.1007/s10462-020-09918-2
DOI: https://doi.org/10.1007/s10462-020-09918-2