Comparative Study of Apache Spark MLlib Clustering Algorithms

  • Sasan Harifi
  • Ebrahim Byagowi
  • Madjid Khalilian
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10387)


Clustering of big data has received much attention recently. Analytics algorithms on big datasets require tremendous computational capabilities. Apache Spark is a popular open-source platform for large-scale data processing that is well suited to iterative machine learning tasks. This paper presents an overview of the clustering algorithms in the Apache Spark Machine Learning Library (Spark.MLlib). The clustering methods, namely the Gaussian Mixture Model (GMM), Power Iteration Clustering, Latent Dirichlet Allocation (LDA), and k-means, are described in detail. Three benchmark datasets, Forest Cover Type, KDD Cup 99, and Internet Advertisements, are used for the experiments. Algorithms that can be meaningfully compared with one another are compared. For a better understanding of the experimental results, the algorithms are presented with suitable tables and graphs.


Clustering, k-means, Bisecting k-means, Spark MLlib, Big data, KDD cup 99, Cover type, Train time, Cohesion
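The keywords above name train time and cohesion, which suggests these are the quantities used to compare the algorithms. The following is a minimal sketch of how such measurements could be taken for k-means. It is plain Python, not the Spark MLlib API, and it assumes cohesion means the within-cluster sum of squared distances to the nearest center; this page does not define the measure. All function names (`kmeans`, `cohesion`, `timed_kmeans`) are hypothetical and chosen only for illustration.

```python
import time


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, initial_centers, max_iterations=20):
    """Lloyd's k-means, started from caller-supplied centers."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        # Assignment step: group every point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers


def cohesion(points, centers):
    """Assumed definition: sum of squared distances to the nearest center."""
    return sum(squared_distance(p, centers[nearest(p, centers)])
               for p in points)


def timed_kmeans(points, initial_centers):
    """Return (centers, train time in seconds, cohesion)."""
    start = time.perf_counter()
    centers = kmeans(points, initial_centers)
    train_time = time.perf_counter() - start
    return centers, train_time, cohesion(points, centers)
```

Under this assumed definition, a lower cohesion value means tighter clusters, so two runs on the same data can be compared on both tightness and training cost.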



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Sasan Harifi (1)
  • Ebrahim Byagowi (1)
  • Madjid Khalilian (1)
  1. Department of Computer Engineering, Karaj Branch, Islamic Azad University, Karaj, Iran
