Performance Analysis of Parallel K-Means with Optimization Algorithms for Clustering on Spark

  • V. Santhi
  • Rini Jose
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10722)


Clustering divides data into meaningful, useful groups, known as clusters, without any prior knowledge about the data. One drawback of K-Means clustering is that its performance depends on the estimate of the initial centroids. To overcome this issue, optimization algorithms such as Bat and Firefly are executed as a pre-processing step. These algorithms return optimal centroids, which are given as input to the K-Means algorithm. Because clustering is carried out on large data sets, Apache Spark, an open-source software framework, is used. The performance of the optimization algorithms is evaluated and the best algorithm is determined.
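The pipeline described in the abstract (an optimization step that proposes initial centroids, followed by K-Means) can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation: the function names `sse`, `firefly_centroids`, and `kmeans`, the parameter values (`alpha`, `beta0`, `gamma`, swarm size), and the synthetic data are all assumptions chosen for illustration. The firefly step follows the commonly stated attraction rule, in which a firefly moves toward a brighter one with strength `beta0 * exp(-gamma * r^2)` plus a small random perturbation.

```python
import math
import random

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sum((p - c) ** 2 for p, c in zip(pt, ct)) for ct in centroids)
               for pt in points)

def firefly_centroids(points, k, n_fireflies=8, n_iter=20,
                      alpha=0.2, beta0=1.0, gamma=0.01, seed=0):
    """Simplified firefly search; each firefly is one candidate set of k centroids.
    All default parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    # Initialise every firefly with k points sampled from the data.
    swarm = [[list(p) for p in rng.sample(points, k)] for _ in range(n_fireflies)]
    cost = [sse(points, f) for f in swarm]
    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if cost[j] < cost[i]:  # firefly j is brighter (lower SSE)
                    # Squared distance between the two candidate solutions.
                    r2 = sum((a - b) ** 2
                             for ci, cj in zip(swarm[i], swarm[j])
                             for a, b in zip(ci, cj))
                    beta = beta0 * math.exp(-gamma * r2)
                    # Move firefly i toward j, plus a small random step.
                    swarm[i] = [[a + beta * (b - a) + alpha * (rng.random() - 0.5)
                                 for a, b in zip(ci, cj)]
                                for ci, cj in zip(swarm[i], swarm[j])]
                    cost[i] = sse(points, swarm[i])
    best = min(range(n_fireflies), key=cost.__getitem__)
    return swarm[best]

def kmeans(points, centroids, n_iter=10):
    """Lloyd's iterations starting from the supplied centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for pt in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: sum((p - q) ** 2
                                            for p, q in zip(pt, centroids[c])))
            clusters[nearest].append(pt)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

# Synthetic 2-D data around three assumed centres.
rng = random.Random(1)
data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]

seeds = firefly_centroids(data, k=3)   # pre-processing step
final = kmeans(data, seeds)            # K-Means seeded with those centroids
```

Because each K-Means iteration cannot increase the sum of squared errors, `sse(data, final)` is never larger than `sse(data, seeds)`; the optimization step only changes where the iterations start. A distributed version would parallelize the distance computations, which this sketch does not attempt.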


Clustering, K-Means, Bat algorithm, Firefly algorithm, Big data, Spark



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. PSG College of Technology, Anna University, Coimbatore, India
