Spark-Based Design of Clustering Using Particle Swarm Optimization

  • Mariem Moslah
  • Mohamed Aymen Ben HajKacem
  • Nadia Essoussi
Part of the Unsupervised and Semi-Supervised Learning book series (UNSESUL)


The particle swarm optimization (PSO) algorithm is widely used in cluster analysis. PSO clustering has been adapted to the MapReduce model and has become an effective solution for big data. However, MapReduce is poorly suited to iterative algorithms because every iteration requires reading from and writing to disk. In addition, PSO converges slowly as it approaches the region of the global optimum. To address these issues, we propose in this chapter a new Spark-based PSO clustering method. We take advantage of Spark's in-memory operations to cluster large-scale data. Furthermore, we propose a new version of PSO that runs k-means once the search approaches the region of the global optimum, in order to accelerate convergence. Experiments conducted on real and simulated large data sets show that the proposed method is scalable and improves on the efficiency of existing PSO methods.
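
The hybrid idea in the abstract, PSO for the global search followed by k-means for the final refinement, can be illustrated with a minimal single-machine sketch. This is not the chapter's implementation: the Spark distribution is omitted entirely, and the function names, the parameter values (w, c1, c2), and the switching rule (stop PSO once the best fitness stalls) are all assumptions made for illustration, since the abstract does not specify them.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def kmeans_refine(points, centers, iters=10):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [list(c) for c in centers]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def pso_then_kmeans(points, k, n_particles=10, max_iter=50, tol=1e-3,
                    w=0.72, c1=1.49, c2=1.49, seed=0):
    """Hypothetical hybrid: PSO search, then k-means from the best particle."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Each particle is a candidate set of k centers, initialised from data points.
    pos = [[list(rng.choice(points)) for _ in range(k)]
           for _ in range(n_particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    pbest = [[list(c) for c in p] for p in pos]
    pbest_fit = [sse(points, p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = [list(c) for c in pbest[g]], pbest_fit[g]

    for _ in range(max_iter):
        prev = gbest_fit
        for i in range(n_particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Standard inertia-weight velocity update.
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = sse(points, pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = [list(c) for c in pos[i]], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = [list(c) for c in pos[i]], fit
        # Assumed switching rule: leave PSO once the best fitness stalls.
        if prev - gbest_fit < tol * max(prev, 1e-12):
            break
    # Hand the best centers found so far to k-means for fast local convergence.
    return kmeans_refine(points, gbest)
```

Calling `pso_then_kmeans(points, k=2)` on a list of numeric vectors returns k refined centers. In a Spark setting, the per-point work inside `sse` and `kmeans_refine` is the part that would naturally be distributed over a cached data set, but how the chapter partitions that work is not stated in the abstract.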



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Mariem Moslah (1)
  • Mohamed Aymen Ben HajKacem (1)
  • Nadia Essoussi (1)
  1. LARODEC, Institut Supérieur de Gestion de Tunis, Université de Tunis, Le Bardo, Tunisia
