Abstract
Efficient clustering of big data has become a necessity given the massive volumes of data generated in today's digitalized, connected world. Traditional clustering algorithms do not scale to massive, highly unstructured big data. Clustering such data efficiently therefore requires a new architecture and programming paradigm. In this work, a novel MapReduce-based Fuzzy C-Medoids clustering algorithm is designed and applied to cluster large repositories of document datasets. The performance of the proposed algorithm is evaluated experimentally on Hadoop clusters of different sizes and on document datasets of different sizes. The algorithm is found to be scalable and efficient in performing clustering jobs.
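As a rough illustration of how one Fuzzy C-Medoids iteration might be split into map and reduce steps, the sketch below runs on a toy in-memory dataset. It is not the implementation evaluated in this work. The function names (`map_phase`, `reduce_phase`, `fuzzy_c_medoids`), the Euclidean distance, the fuzzifier value `M = 2.0`, and the exhaustive search over all points as medoid candidates are assumptions made for illustration, and a plain dictionary stands in for the Hadoop shuffle.

```python
import math
from collections import defaultdict

M = 2.0  # fuzzifier; assumed value, not taken from the article


def dist(a, b):
    # Euclidean distance; an assumption, any dissimilarity could be used.
    return math.dist(a, b)


def memberships(point, medoids, m=M):
    """Fuzzy membership of one point in every cluster (sums to 1)."""
    d = [dist(point, v) for v in medoids]
    zeros = [i for i, di in enumerate(d) if di == 0.0]
    if zeros:
        # A point that coincides with a medoid belongs fully to it.
        return [1.0 if i == zeros[0] else 0.0 for i in range(len(medoids))]
    inv = [(1.0 / di) ** (1.0 / (m - 1.0)) for di in d]
    total = sum(inv)
    return [w / total for w in inv]


def map_phase(point, medoids, m=M):
    """Emit (cluster index, (point, membership ** m)) for every cluster."""
    for i, u in enumerate(memberships(point, medoids, m)):
        yield i, (point, u ** m)


def reduce_phase(values):
    """Return the point with the smallest weighted sum of distances."""
    return min(
        (p for p, _ in values),
        key=lambda cand: sum(w * dist(cand, x) for x, w in values),
    )


def fuzzy_c_medoids(points, medoids, iterations=10):
    for _ in range(iterations):
        groups = defaultdict(list)  # stands in for the shuffle step
        for p in points:
            for key, value in map_phase(p, medoids):
                groups[key].append(value)
        new_medoids = [reduce_phase(groups[i]) for i in range(len(medoids))]
        if new_medoids == medoids:  # medoids stopped changing
            break
        medoids = new_medoids
    return medoids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
result = fuzzy_c_medoids(points, medoids=[points[0], points[3]])
```

In this sketch every reducer sees all points, so the medoid update costs quadratic time per cluster. A distributed version would need to restrict the candidate set or pre-aggregate in the mappers, a design choice this toy example does not attempt to settle.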
Funding
No external sources of funding were used.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Sardar, T.H., Ansari, Z. Distributed Big Data Clustering using MapReduce-based Fuzzy C-Medoids. J. Inst. Eng. India Ser. B 103, 73–82 (2022). https://doi.org/10.1007/s40031-021-00647-w
DOI: https://doi.org/10.1007/s40031-021-00647-w