Skip to main content

Comparative Study of Apache Spark MLlib Clustering Algorithms

  • Conference paper
  • First Online:
Data Mining and Big Data (DMBD 2017)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10387))

Included in the following conference series:

Abstract

Clustering of big data has received much attention recently. Analytics algorithms on big datasets require tremendous computational capabilities. Apache Spark is a popular open- source platform for large-scale data processing that is well-suited for iterative machine learning tasks. This paper presents an overview of Apache Spark Machine Learning Library (Spark.MLlib) algorithms. The clustering methods consist of Gaussian Mixture Model (GMM), Power-Iteration Clustering method, Latent Dirichlet Allocation (LDA), and k-means are completely described. In this paper, three benchmark datasets include Forest Cover Type, KDD Cup 99 and Internet Advertisements used for experiments. The same algorithms that can be compared with each other, compared. For a better understanding of the results of the experiments, the algorithms are described with suitable tables and graphs.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    TT(S) in all Tables describes Training Time (Second).

References

  1. Chen, X.: A new clustering algorithm based on near neighbor influence. Expert Syst. Appl. 42, 7746–7758 (2015)

    Article  Google Scholar 

  2. Gómez, D., Zarrazola, E., Yáñez, J., Montero, J.: A Divide-and-Link algorithm for hierarchical clustering in networks. Inf. Sci. 316, 308–328 (2015)

    Article  Google Scholar 

  3. Pan, X., Papailiopoulos, D., Oymak, S., Recht, B., Ramchan-dran, K., I. Jordan, M.: Parallel correlation clustering on big graphs. In: Advances in Neural Information Processing Systems, pp. 82–90 (2015)

    Google Scholar 

  4. Khalilian, M., Mustapha, N., Sulaiman, N.: Data stream clustering by divide and conquer approach based on vector model. J. Big Data 3, 1 (2016)

    Article  Google Scholar 

  5. Khalilian, M., Mustapha, N., Sulaiman, N., Mamat, A.: Different aspects of data stream clustering. In: Elleithy, K., Sobh, T. (eds.) Innovations and Advances in Computer. Information, Systems Sciences, and Engineering, pp. 1181–1191. Springer, New York (2013). doi:10.1007/978-1-4614-3535-8_97

    Google Scholar 

  6. Wan, R., Yan, X., Su, X.: A weighted fuzzy clustering algorithm for data stream. In: 2008 ISECS International Colloquium on Computing, Communication, Control, and Management, pp. 360–364. IEEE (2008)

    Google Scholar 

  7. Wang, J., Wang, J., Ke, Q., Zeng, G., Li, S.: Fast approximate k-means via cluster closures. In: Multimedia Data Mining and Analytics, pp. 373–395. Springer International Publishing (2015)

    Google Scholar 

  8. Finazzi, F., Haggarty, R., Miller, C., Scott, M., Fassò, A.: A comparison of clustering approaches for the study of the temporal coherence of multiple time series. Stochast. Environ. Res. Risk Assess. 29, 463–475 (2014)

    Article  Google Scholar 

  9. Brust, M.R., Turgut, D.: VBCA: a virtual forces clustering algorithm for autonomous aerial drone systems. In: 2016 Annual IEEE Systems Conference (SysCon), pp. 1–6. IEEE (2016)

    Google Scholar 

  10. Ozturk, C., Hancer, E., Karaboga, D.: Dynamic clustering with improved binary artificial bee colony algorithm. Appl. Soft Comput. 28, 69–80 (2015)

    Article  Google Scholar 

  11. Ding, S., Wu, F., Qian, J., Jia, H., Jin, F.: Research on data stream clustering algorithms. Artif. Intell. Rev. 43, 593–600 (2015)

    Article  Google Scholar 

  12. Yan, Y., Ricci, E., Liu, G., Sebe, N.: Egocentric daily activity recognition via multitask clustering. IEEE Trans. Image Process. 24, 2984–2995 (2015)

    Article  MathSciNet  Google Scholar 

  13. Karau, H., Konwinski, A., Wendell, P., Zaharia, M.: Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, Inc., (2015)

    Google Scholar 

  14. Meng, X., Bradley, J., Yuvaz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J.: Mllib: machine learning in apache spark. JMLR 17(34), 1–7 (2016)

    MathSciNet  MATH  Google Scholar 

  15. Maugis, C., Celeux, G., Martin-Magniette, M.: Variable selection for clustering with gaussian mixture models. Biometrics 65, 701–709 (2009)

    Article  MathSciNet  MATH  Google Scholar 

  16. He, X., Cai, D., Shao, Y., Bao, H., Han, J.: Laplacian regularized gaussian mixture model for data clustering. IEEE Trans. Knowl. Data Eng. 23, 1406–1418 (2011)

    Article  Google Scholar 

  17. Clustering - RDD-based API - Spark 2.1.0 Documentation. http://spark.apache.org/docs/latest/mllib-clustering.html

  18. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

    MATH  Google Scholar 

  19. Krestel, R., Fankhauser, P., Nejdl, W.: Latent dirichlet allocation for tag recommendation. In: Proceedings of the Third ACM Conference on Recommender Systems, pp. 61–68. ACM (2009)

    Google Scholar 

  20. Davies, D., Bouldin, D.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1, 224–227 (1979)

    Article  Google Scholar 

  21. Lin, F., Cohen, W.: Power iteration clustering. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 655–662 (2010)

    Google Scholar 

  22. Yan, W., Brahmakshatriya, U., Xue, Y., Gilder, M., Wise, B.: p-PIC: parallel power iteration clustering for big data. J. Parallel Distrib. Comput. 73, 352–359 (2013)

    Article  Google Scholar 

  23. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means clustering with background knowledge. In: ICML, pp. 577–584 (2001)

    Google Scholar 

  24. Alsabti, K., Ranka, S., Singh, V.: An efficient k-means clustering algorithm. Electrical Engineering and Computer Science (1997)

    Google Scholar 

  25. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassil-vitskii, S.: Scalable k-means++. Proc. VLDB Endowment 5, 622–633 (2012)

    Article  Google Scholar 

  26. Meila, M., Shi, J.: A random walks view of spectral segmentation (2001)

    Google Scholar 

  27. Blackard, J., Dean, D.: Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput. Electron. Agric. 24, 131–151 (1999)

    Article  Google Scholar 

  28. Kumar, D., Bezdek, J., Palaniswami, M., Rajasegarar, S., Leckie, C., Havens, T.: A hybrid approach to clustering in big data. IEEE Trans. Cybern. 46, 2372–2385 (2016)

    Article  Google Scholar 

  29. Alvarez, S.A., Kawato, T., Ruiz, C.: Mining over loosely coupled data sources using neural experts. In: International Workshop on Multimedia Data Mining. In Conjunction with the Ninth ACM SIGKDD International Conference on Knowledge Dis-cover and Data Mining (2003)

    Google Scholar 

  30. Lichman, M.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA (2013). http://archive.ics.uci.edu/ml

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sasan Harifi .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Harifi, S., Byagowi, E., Khalilian, M. (2017). Comparative Study of Apache Spark MLlib Clustering Algorithms. In: Tan, Y., Takagi, H., Shi, Y. (eds) Data Mining and Big Data. DMBD 2017. Lecture Notes in Computer Science(), vol 10387. Springer, Cham. https://doi.org/10.1007/978-3-319-61845-6_7

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-61845-6_7

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61844-9

  • Online ISBN: 978-3-319-61845-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics