A High Performance Modified K-Means Algorithm for Dynamic Data Clustering in Multi-core CPUs Based Environments

Laccetti, Giuliano; Lapegna, Marco; Mele, Valeria; Romano, Diego

doi:10.1007/978-3-030-34914-1_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11874)

Included in the following conference series:

International Conference on Internet and Distributed Computing Systems

Abstract

The K-means algorithm is one of the most widely used methods in data mining and statistical data analysis for partitioning a set of objects into K distinct groups, called clusters, on the basis of their similarities. The main problem of this algorithm is that it requires the number of clusters as input, but in real applications it is very difficult to fix such a value in advance. In this work we propose a parallel modified K-means algorithm in which the number of clusters is increased at run time in an iterative procedure until a given cluster quality metric is satisfied. To improve the performance of the procedure, at each iteration two new clusters are created by splitting only the cluster with the worst value of the quality metric. Furthermore, experiments in a multi-core CPU-based environment are presented.
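
The iterative procedure described in the abstract can be illustrated with a minimal sequential sketch. This is not the authors' implementation: the abstract names neither the quality metric nor the splitting rule, so the sketch assumes the metric is the mean squared distance of a cluster's points from its centroid (`scatter`), and it splits the worst cluster with a plain two-centroid K-means run restricted to that cluster's points. The names `adaptive_kmeans`, `two_means`, `tol` and `k_max` are hypothetical, and the parallel multi-core aspects of the paper are not modelled.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def scatter(points):
    """ASSUMED quality metric: mean squared distance from the centroid."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points) / len(points)


def two_means(points, rng, iters=20):
    """Split one cluster in two with Lloyd iterations (K = 2).

    Requires at least two distinct points. Returns two non-empty lists
    that together contain every input point exactly once.
    """
    # Start from two distinct points so that neither side is empty.
    c0, c1 = rng.sample(sorted(set(points)), 2)
    left, right = [], []
    for _ in range(iters):
        new_left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        new_right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        # Stop on convergence, or keep the last valid split if a side empties.
        if not new_left or not new_right or new_left == left:
            break
        left, right = new_left, new_right
        c0, c1 = mean(left), mean(right)
    return left, right


def adaptive_kmeans(points, tol, k_max=16, seed=0):
    """Grow the number of clusters until every cluster has scatter <= tol.

    points: list of equal-length tuples of numbers.
    tol:    non-negative threshold on the assumed quality metric.
    k_max:  hypothetical safety cap on the number of clusters.
    """
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k_max:
        # Locate the cluster with the worst value of the quality metric.
        worst = max(range(len(clusters)), key=lambda i: scatter(clusters[i]))
        if scatter(clusters[worst]) <= tol:
            break  # every cluster satisfies the metric
        # scatter > tol >= 0 implies at least two distinct points here.
        left, right = two_means(clusters[worst], rng)
        clusters[worst:worst + 1] = [left, right]  # only this cluster changes
    return clusters
```

For example, `adaptive_kmeans([(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)], tol=1.0)` stops once both groups of nearby points form their own clusters. Because only the worst cluster is touched at each step, the remaining clusters never need to be recomputed; a fuller version might follow each split with a global reassignment pass, which this sketch omits.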

Author information

Authors and Affiliations

Department of Mathematics and Applications, University of Naples Federico II, Naples, Italy
Giuliano Laccetti, Marco Lapegna & Valeria Mele
Institute for High Performance Computing and Networking (ICAR), National Research Council (CNR), Naples, Italy
Diego Romano

Corresponding author

Correspondence to Marco Lapegna.

Editor information

Editors and Affiliations

Department of Science and Technology, Parthenope University of Naples, Napoli, Italy
Raffaele Montella
Parthenope University of Naples, Napoli, Italy
Angelo Ciaramella
University of Calabria, Rende, Italy
Giancarlo Fortino
ICAR, Consiglio Nazionale delle Ricerche, Rende, Cosenza, Italy
Antonio Guerrieri
Edinburgh Napier University, Edinburgh, UK
Antonio Liotta

About this paper

Cite this paper

Laccetti, G., Lapegna, M., Mele, V., Romano, D. (2019). A High Performance Modified K-Means Algorithm for Dynamic Data Clustering in Multi-core CPUs Based Environments. In: Montella, R., Ciaramella, A., Fortino, G., Guerrieri, A., Liotta, A. (eds) Internet and Distributed Computing Systems. IDCS 2019. Lecture Notes in Computer Science, vol 11874. Springer, Cham. https://doi.org/10.1007/978-3-030-34914-1_9

DOI: https://doi.org/10.1007/978-3-030-34914-1_9
Published: 10 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34913-4
Online ISBN: 978-3-030-34914-1
eBook Packages: Computer Science, Computer Science (R0)