Distributed Big Data Clustering using MapReduce-based Fuzzy C-Medoids

  • Original Contribution
  • Published in: Journal of The Institution of Engineers (India): Series B

Abstract

Efficient clustering of big data is a necessity given the massive volumes of data generated in today's digitally connected world. Traditional clustering algorithms do not scale to massive, highly unstructured datasets, so efficient big data clustering requires a new architecture and programming paradigm. In this work, a novel MapReduce-based Fuzzy C-Medoids clustering algorithm is designed and evaluated experimentally on a big data repository of document datasets. Its performance is measured on Hadoop clusters of different sizes and on document datasets of different sizes. The algorithm is found to be scalable and efficient in performing clustering jobs.
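
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how one textbook fuzzy c-medoids iteration might be split into a map step and a reduce step. It is not the authors' implementation and does not use Hadoop. The function names (`map_split`, `reduce_costs`, `fuzzy_c_medoids`), the squared Euclidean distance, the fuzzifier `m = 2`, and the choice to treat every data point as a medoid candidate are all assumptions made for illustration; a document-clustering setup would likely swap in a text dissimilarity over term vectors.

```python
from collections import defaultdict


def distance(a, b):
    # Assumed dissimilarity: squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(point, medoids, m=2.0):
    # Fuzzy membership of one point in every cluster (standard FCM-style rule).
    dists = [distance(point, v) for v in medoids]
    if 0.0 in dists:
        # A point that coincides with a medoid belongs fully to that cluster.
        hit = dists.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(medoids))]
    inv = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [w / total for w in inv]


def map_split(split, medoids, candidates, m=2.0):
    # Mapper (hypothetical): for one data split, emit
    # ((cluster, candidate_index), partial_cost) pairs, where partial_cost is
    # the membership-weighted distance from the candidate to the split's points.
    partial = defaultdict(float)
    for point in split:
        u = memberships(point, medoids, m)
        for i, u_i in enumerate(u):
            for k, cand in enumerate(candidates):
                partial[(i, k)] += (u_i ** m) * distance(cand, point)
    return list(partial.items())


def reduce_costs(emitted, n_clusters, candidates):
    # Reducer (hypothetical): sum partial costs per (cluster, candidate) and
    # pick, for each cluster, the candidate with the lowest total cost.
    total = defaultdict(float)
    for key, cost in emitted:
        total[key] += cost
    return [
        candidates[min(range(len(candidates)), key=lambda k: total[(i, k)])]
        for i in range(n_clusters)
    ]


def fuzzy_c_medoids(splits, medoids, m=2.0, max_iter=20):
    # Driver: repeat map and reduce until the medoids stop changing.
    candidates = [p for split in splits for p in split]
    for _ in range(max_iter):
        emitted = [kv for s in splits for kv in map_split(s, medoids, candidates, m)]
        updated = reduce_costs(emitted, len(medoids), candidates)
        if updated == medoids:
            break
        medoids = updated
    return medoids


if __name__ == "__main__":
    # Two toy splits standing in for data blocks on different nodes.
    splits = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
    print(fuzzy_c_medoids(splits, [(0, 1), (10, 11)]))
```

In this sketch each mapper only needs its own split plus the current medoids and the candidate list, and the reducer only sums numbers, which is what makes the update expressible as a MapReduce job. Scoring every point as a candidate is quadratic in the dataset size, so a practical version would presumably restrict the candidate set.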

Funding

No external sources of funding were used.

Author information

Corresponding author

Correspondence to Zahid Ansari.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sardar, T.H., Ansari, Z. Distributed Big Data Clustering using MapReduce-based Fuzzy C-Medoids. J. Inst. Eng. India Ser. B 103, 73–82 (2022). https://doi.org/10.1007/s40031-021-00647-w

  • DOI: https://doi.org/10.1007/s40031-021-00647-w
