Distributed Big Data Clustering using MapReduce-based Fuzzy C-Medoids

Sardar, Tanvir H.; Ansari, Zahid

doi:10.1007/s40031-021-00647-w

Original Contribution
Published: 27 July 2021

Volume 103, pages 73–82, (2022)
Journal of The Institution of Engineers (India): Series B

Abstract

Efficient clustering of big data is a necessity given the massive volumes of data generated in today's digitalized, connected world. Traditional clustering algorithms do not scale to massive and highly unstructured big data. Efficient big data clustering therefore requires a new architecture and programming paradigm. In this work, a novel MapReduce-based Fuzzy C-Medoids clustering algorithm is designed and used to cluster a big data repository of document datasets. The performance of the proposed algorithm is evaluated experimentally on Hadoop clusters of different sizes and on document datasets of different sizes. The algorithm is found to be scalable and efficient in performing clustering jobs.
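
The abstract does not describe the algorithm's internals, so the following is only a minimal single-machine sketch of how one fuzzy c-medoids iteration might be split into a map step and a reduce step. It assumes the textbook fuzzy c-medoids update rules, Euclidean distance on numeric vectors, and a fuzzifier m = 2; the function names (map_memberships, reduce_medoid, fuzzy_c_medoids) are hypothetical and are not taken from the paper, and no Hadoop API is used.

```python
from collections import defaultdict


def distance(a, b):
    # Euclidean distance between two equal-length numeric vectors (an assumption;
    # document clustering would more likely use a text-oriented dissimilarity).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def map_memberships(point, medoids, m=2.0):
    """Map step: emit (cluster_id, (point, u**m)) pairs for one data point."""
    dists = [distance(point, v) for v in medoids]
    if 0.0 in dists:
        # The point coincides with a medoid: full membership in that cluster.
        hit = dists.index(0.0)
        u = [1.0 if i == hit else 0.0 for i in range(len(medoids))]
    else:
        inv = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in dists]
        total = sum(inv)
        u = [w / total for w in inv]
    return [(i, (point, u_i ** m)) for i, u_i in enumerate(u)]


def reduce_medoid(weighted_points):
    """Reduce step: choose the point with the smallest weighted distance sum."""
    def cost(candidate):
        return sum(w * distance(candidate, p) for p, w in weighted_points)
    return min((p for p, _ in weighted_points), key=cost)


def fuzzy_c_medoids(points, medoids, m=2.0, max_iter=20):
    """Repeat map, shuffle, and reduce until the medoids stop changing."""
    for _ in range(max_iter):
        grouped = defaultdict(list)  # stands in for the MapReduce shuffle phase
        for point in points:
            for cluster_id, value in map_memberships(point, medoids, m):
                grouped[cluster_id].append(value)
        new_medoids = [reduce_medoid(grouped[i]) for i in range(len(medoids))]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids
```

In this sketch each reducer examines every point as a candidate medoid, which is quadratic in the number of points; a distributed implementation would need some way to restrict or partition that candidate search, and how the paper handles it is not stated in the abstract.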

Funding

No external sources of funding were used.

Author information

Authors and Affiliations

School of Computer Science & Engineering, Jain University, Bangalore, India
Tanvir H. Sardar
Electrical Engineering Section, University Polytechnic, Aligarh Muslim University, Aligarh, India
Zahid Ansari

Corresponding author

Correspondence to Zahid Ansari.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sardar, T.H., Ansari, Z. Distributed Big Data Clustering using MapReduce-based Fuzzy C-Medoids. J. Inst. Eng. India Ser. B 103, 73–82 (2022). https://doi.org/10.1007/s40031-021-00647-w

Received: 23 October 2020
Accepted: 02 July 2021
Published: 27 July 2021
Issue Date: February 2022
DOI: https://doi.org/10.1007/s40031-021-00647-w
