Skip to main content
Log in

MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics Aims and scope Submit manuscript

Abstract

The management and analysis of big data has been identified as one of the most important emerging needs in recent years. This is because of the sheer volume and increasing complexity of data being created or collected. Current clustering algorithms can not handle big data, and therefore, scalable solutions are necessary. Since fuzzy clustering algorithms have shown to outperform hard clustering approaches in terms of accuracy, this paper investigates the parallelization and scalability of a common and effective fuzzy clustering algorithm named fuzzy c-means (FCM) algorithm. The algorithm is parallelized using the MapReduce paradigm outlining how the Map and Reduce primitives are implemented. A validity analysis is conducted in order to show that the implementation works correctly achieving competitive purity results compared to state-of-the art clustering algorithms. Furthermore, a scalability analysis is conducted to demonstrate the performance of the parallel FCM implementation with increasing number of computing nodes used.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Similar content being viewed by others

Notes

  1. https://portal.longhorn.tacc.utexas.edu/.

  2. https://www.tacc.utexas.edu/systems/rustler.

References

  1. Bell G, Hey AJG, Szalay A (2009) Beyond the data deluge. Science 323(AAAS):1297–1298. doi:10.112/science.1170411

  2. Ghosh A, Jain LC (2005) Evolutionary computation in data mining series: studies in fuzziness and soft computing, vol 163. Springer, New York

    Book  Google Scholar 

  3. Tan P, Steinbach M, Kumar V (2005) Introduction to data mining. Addison-Wesley, New York. ISBN: 0-321-32136-7

  4. Jabeen H, Baig AR (2010) Review of classification using genetic programming. Int J Eng Sci Technol 2(2):94–103

    Google Scholar 

  5. Han J (2005) Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco

    Google Scholar 

  6. Ludwig SA (2014) Clonal selection based fuzzy C-means algorithm for clustering. In: GECCO '14 Proceedings of the 2014 conference on genetic and evolutionary computation, pp 105–112

  7. Babuska R (2014) Fuzzy clustering lecture. http://homes.di.unimi.it/ valenti/SlideCorsi/Bioinformatica05/Fuzzy-Clustering-lecture-Babuska. Accessed Oct 2014

  8. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell

    Book  MATH  Google Scholar 

  9. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall Inc, Upper Saddle River

    MATH  Google Scholar 

  10. Ruspini EH (1970) Numerical methods for fuzzy clustering. Inf Sci 2:319350

    Article  Google Scholar 

  11. Yang MS (1993) A survey of fuzzy clustering. Math Comput Model 18:1–16

    Article  MATH  Google Scholar 

  12. Lee HS (1999) Automatic clustering of business process in business systems planning. Eur J Oper Res 114:354–362

    Article  MATH  Google Scholar 

  13. Klir GJ, Yuan B (1995) Fuzzy sets and fuzzy logic theory and application. Prentice Hall PTR, Upper Saddle River

    Google Scholar 

  14. Rosenfeld A (1975) Fuzzy graphs. In: Zadeh LA, Fu KS, Shimura M (eds) Fuzzy sets and their applications to cognitive and decision processes. Academic Press, New York

    Google Scholar 

  15. Matula DW (1970) Cluster analysis via graph theoretic techniques. In: Proceedings of the Louisiana conference on combinatorics, graph theory and computing, Winnipeg

  16. Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3:32–57

    Article  MathSciNet  MATH  Google Scholar 

  17. Guerrero-Bote VP, Lopez-Pujalte C, de Moya-Anegon F, Herrero-Solana V (2003) Comparison of neural models for document clustering. Int J Approx Reason 34:287–305

    Article  MATH  Google Scholar 

  18. Gath I, Geva AB (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 11(7):773–781

    Article  Google Scholar 

  19. Bezdek JC, Coray C, Gunderson R, Watson J (1981) Detection and characterization of cluster substructure—linear structure, fuzzy c-varieties and convex combinations thereof. SIAM J Appl Math 40(2):358–372

    Article  MathSciNet  MATH  Google Scholar 

  20. Yang Y, Huang S (2007) Image segmentation by fuzzy c-means clustering algorithm with a novel penalty term. Comput Inform 26:17–31

    MATH  Google Scholar 

  21. Cai W, Chen S, Zhang D (2007) Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recognit 40(3):825–838

    Article  MATH  Google Scholar 

  22. Sarma TH, Viswanath P, Reddy BE (2013) A hybrid approach to speed-up the k-means clustering method. Int J Mach Learn Cybern 4:107–113

    Article  Google Scholar 

  23. Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design and implementation (OSDI’04), vol 6, p 10

  24. He Y, Tan H, Luo W, Feng S, Fan J (2014) Mr-dbscan: a scalable mapreduce-based dbscan algorithm for heavily skewed data. Front Comput Sci 8(1):83–99

    Article  MathSciNet  Google Scholar 

  25. Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on mapreduce. In: Proceedings of the CloudCom’09. Springer, Berlin, pp 674–679

  26. Zhou P, Lei J, Ye W (2011) Large-scale data sets clustering based on mapreduce and hadoop. Comput Inf Syst 7(16):5956–5963

    Google Scholar 

  27. Li H-G, Wu G-Q, Hu X-G, Zhang J, Li L, Wu X (2011) K-means clustering with bagging and mapreduce. In: Proceedings of the 44th Hawaii international conference on system sciences. IEEE Computer Society, Washington, DC, pp 1–8

  28. Nair S, Mehta J (2011) Clustering with Apache Hadoop. In: Proceedings of the international conference, workshop on emerging trends in technology (ICWET’11), New York. ACM, New York, pp 505–509

  29. Papadimitriou S, Sun J (2008) Disco: distributed co-clustering with map-reduce: a case study towards petabyte-scale end-to-end mining. In: Proceedings of the IEEE ICDM’08, Washington, DC, pp 512–521

  30. Ene A, Im S, Moseley B (2011) Fast clustering using mapreduce. In: Proceedings of KDD’11. ACM, New York, pp 681–689

  31. Yang J, Li X (2013) Mapreduce based method for big data semantic clustering. In: Proceedings of the 2013 IEEE international conference on systems, man, and cybernetics (SMC’13). IEEE Computer Society, Washington, DC, pp 2814–2819

  32. Cordeiro F, Traina Jr C, Traina AJM, Lopez J, Kang U, Taloutsos C (2011) Clustering very large multi-dimensional datasets with mapreduce. In: Proceedings of KDD’11. ACM, New York, pp 690–698

  33. MPI (Message Passing Interface). http://www-unix.mcs.anl.gov/mpi/. Accessed 25 Apr 2015

  34. PVM (Parallel Virtual Machine). http://www.csm.ornl.gov/pvm/. Accessed 25 Apr 2015

  35. Modenesi MV, Costa MCA, Evsukoff AG, Ebecken NF (2007) Parallel fuzzy c-means cluster analysis. In: Lecture notes in computer science on high performance computing for computational science (VECPAR’06). Springer, New York

  36. Blackard JA (1998) Comparison of neural networks and discriminant analysis in predicting forest cover types. Ph.D. dissertation, Department of Forest Sciences, Colorado State University, Fort Collins, Colorado

  37. Apache Hadoop. http://hadoop.apache.org/. Accessed 25 Apr 2015

  38. Han J (2005) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco

    Google Scholar 

  39. Karypis G (2003) CLUTO: a clustering toolkit. University of Minnesota, Computer Science. Tech. Rep. 02-017

  40. Havens TC, Chitta R, Jain AK, Rong J (2011) Speedup of fuzzy and possibilistic kernel c-means for large-scale clustering. In: Proceedings of IEEE international conference on fuzzy systems (FUZZ), pp 463–470

  41. Hathaway R, Bezdek J (1995) Optimization of clustering criteria by reformulation. IEEE Trans Fuzzy Syst 3:241245

    Article  Google Scholar 

  42. Mahout Library. http://mahout.apache.org/. Accessed 25 Apr 2015

Download references

Acknowledgments

The author acknowledges the Texas Advanced Computing Center (TACC) at the University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Simone A. Ludwig.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ludwig, S.A. MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability. Int. J. Mach. Learn. & Cyber. 6, 923–934 (2015). https://doi.org/10.1007/s13042-015-0367-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s13042-015-0367-0

Keywords

Navigation