Comparative Study of Apache Spark MLlib Clustering Algorithms

Harifi, Sasan; Byagowi, Ebrahim; Khalilian, Madjid

doi:10.1007/978-3-319-61845-6_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10387)

Included in the following conference series:

International Conference on Data Mining and Big Data

Abstract

Clustering of big data has received much attention recently. Analytics algorithms on big datasets require tremendous computational capabilities. Apache Spark is a popular open-source platform for large-scale data processing that is well suited to iterative machine learning tasks. This paper presents an overview of the clustering algorithms in the Apache Spark Machine Learning Library (Spark.MLlib). Four clustering methods are described in full: Gaussian Mixture Model (GMM), Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA), and k-means. Three benchmark datasets, Forest Cover Type, KDD Cup 99, and Internet Advertisements, are used for the experiments. Algorithms that are comparable with one another are compared. For a better understanding of the experimental results, the algorithms are accompanied by suitable tables and graphs.
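As an illustration of the kind of iterative computation the abstract refers to, the following is a minimal single-machine sketch of Lloyd's iteration, the procedure on which k-means is based. It is not the Spark MLlib implementation and not the authors' experimental code; the function name `kmeans`, its parameters, and the toy data are assumptions made for this example only.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain-Python Lloyd's iteration (illustrative, not MLlib's distributed version)."""
    rng = random.Random(seed)
    # Initial centers: k input points chosen at random (an assumed, simple scheme).
    centers = rng.sample(points, k)

    def nearest(p, cs):
        # Index of the center with the smallest squared Euclidean distance to p.
        return min(range(len(cs)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, cs[j])))

    for _ in range(iterations):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers

    # Final labels are computed against the final centers.
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Each pass alternates an assignment step and an update step over the full dataset, which is why a platform that keeps data in memory across iterations suits this family of algorithms.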

Notes

1. TT(S) in all tables denotes training time in seconds.

Author information

Authors and Affiliations

Department of Computer Engineering, Karaj Branch, Islamic Azad University, Karaj, Iran
Sasan Harifi, Ebrahim Byagowi & Madjid Khalilian

Corresponding author

Correspondence to Sasan Harifi.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Ying Tan
Kyushu University, Fukuoka, Japan
Hideyuki Takagi
Southern University of Science and Technology, Shenzhen, China
Yuhui Shi

About this paper

Cite this paper

Harifi, S., Byagowi, E., Khalilian, M. (2017). Comparative Study of Apache Spark MLlib Clustering Algorithms. In: Tan, Y., Takagi, H., Shi, Y. (eds) Data Mining and Big Data. DMBD 2017. Lecture Notes in Computer Science, vol 10387. Springer, Cham. https://doi.org/10.1007/978-3-319-61845-6_7

DOI: https://doi.org/10.1007/978-3-319-61845-6_7
Published: 24 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61844-9
Online ISBN: 978-3-319-61845-6
eBook Packages: Computer Science (R0)
