A High Performance Modified K-Means Algorithm for Dynamic Data Clustering in Multi-core CPUs Based Environments

  • Giuliano Laccetti
  • Marco Lapegna
  • Valeria Mele
  • Diego Romano
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11874)

Abstract

The K-means algorithm is one of the most widely used methods in data mining and statistical data analysis for partitioning a set of objects into K distinct groups, called clusters, on the basis of their similarities. The main problem of this algorithm is that it requires the number of clusters as input, but in real applications it is very difficult to fix this value in advance. In this work we propose a parallel modified K-means algorithm where the number of clusters is increased at run time in an iterative procedure until a given cluster quality metric is satisfied. To improve the performance of the procedure, at each iteration two new clusters are created by splitting only the cluster with the worst value of the quality metric. Experiments in a multi-core CPU based environment are also presented.
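
The iterative procedure described in the abstract can be illustrated with a minimal sequential sketch. The abstract does not state which quality metric, stopping rule, or splitting procedure the authors use, so everything specific here is an assumption made for illustration: the metric is the root-mean-square distance of a cluster's points to its centroid, the stopping rule is a user-supplied threshold `tol` on the worst cluster, and the split is a small two-centroid K-means. The names `spread`, `two_means`, and `adaptive_kmeans` are hypothetical, and the multi-core parallelisation that the paper addresses is not shown.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def spread(points):
    """Assumed quality metric: RMS distance of the points to their centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))


def two_means(points, iters=20):
    """Split one cluster in two with a few Lloyd iterations.

    Assumes the cluster holds at least two distinct points. The two seeds
    are chosen deterministically: the point farthest from the centroid,
    then the point farthest from that seed.
    """
    c = centroid(points)
    c0 = max(points, key=lambda p: math.dist(p, c))
    c1 = max(points, key=lambda p: math.dist(p, c0))
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if math.dist(p, c0) <= math.dist(p, c1)]
        right = [p for p in points if math.dist(p, c0) > math.dist(p, c1)]
        new0, new1 = centroid(left), centroid(right)
        if (new0, new1) == (c0, c1):  # centroids stopped moving
            break
        c0, c1 = new0, new1
    return left, right


def adaptive_kmeans(points, tol, max_k=50):
    """Grow the number of clusters until the worst spread is at most tol.

    Only the cluster with the worst metric is split at each step; the
    other clusters are left untouched. tol is assumed non-negative, and
    max_k is a safety cap that is not taken from the paper.
    """
    clusters = [list(points)]
    while len(clusters) < max_k:
        worst = max(range(len(clusters)), key=lambda i: spread(clusters[i]))
        if spread(clusters[worst]) <= tol:
            break
        left, right = two_means(clusters.pop(worst))
        clusters += [left, right]
    return clusters


# Usage: three well-separated groups of 2-D points, threshold chosen by hand.
demo = [(0, 0), (0, 1), (1, 0), (1, 1),
        (8, 0), (8, 1), (9, 0), (9, 1),
        (30, 0), (30, 1), (31, 0), (31, 1)]
clusters = adaptive_kmeans(demo, tol=1.0)
print(len(clusters), [len(c) for c in clusters])
```

Because each step touches only the points of the worst cluster, the cost of an iteration depends on the size of that cluster rather than on the whole data set, which is the performance argument the abstract makes for splitting a single cluster per iteration.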

Keywords

K-Means clustering, Parallel adaptive algorithm, Unsupervised learning, Data mining

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Giuliano Laccetti (1)
  • Marco Lapegna (1), corresponding author
  • Valeria Mele (1)
  • Diego Romano (2)

  1. Department of Mathematics and Applications, University of Naples Federico II, Naples, Italy
  2. Institute for High Performance Computing and Networking (ICAR), National Research Council (CNR), Naples, Italy
