MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability

Ludwig, Simone A.

doi:10.1007/s13042-015-0367-0

MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability

Original Article
Published: 29 April 2015

Volume 6, pages 923–934, (2015)
Cite this article

International Journal of Machine Learning and Cybernetics Aims and scope Submit manuscript

Simone A. Ludwig¹

1285 Accesses
65 Citations
Explore all metrics

Abstract

The management and analysis of big data has been identified as one of the most important emerging needs in recent years. This is because of the sheer volume and increasing complexity of data being created or collected. Current clustering algorithms can not handle big data, and therefore, scalable solutions are necessary. Since fuzzy clustering algorithms have shown to outperform hard clustering approaches in terms of accuracy, this paper investigates the parallelization and scalability of a common and effective fuzzy clustering algorithm named fuzzy c-means (FCM) algorithm. The algorithm is parallelized using the MapReduce paradigm outlining how the Map and Reduce primitives are implemented. A validity analysis is conducted in order to show that the implementation works correctly achieving competitive purity results compared to state-of-the art clustering algorithms. Furthermore, a scalability analysis is conducted to demonstrate the performance of the parallel FCM implementation with increasing number of computing nodes used.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Big data preprocessing: methods and prospects

Article Open access 01 November 2016

Big data analytics: a survey

Article Open access 01 October 2015

MapReduce based parallel fuzzy-rough attribute reduction using discernibility matrix

Article 24 April 2021

Notes

References

Bell G, Hey AJG, Szalay A (2009) Beyond the data deluge. Science 323(AAAS):1297–1298. doi:10.112/science.1170411
Ghosh A, Jain LC (2005) Evolutionary computation in data mining series: studies in fuzziness and soft computing, vol 163. Springer, New York
Book Google Scholar
Tan P, Steinbach M, Kumar V (2005) Introduction to data mining. Addison-Wesley, New York. ISBN: 0-321-32136-7
Jabeen H, Baig AR (2010) Review of classification using genetic programming. Int J Eng Sci Technol 2(2):94–103
Google Scholar
Han J (2005) Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco
Google Scholar
Ludwig SA (2014) Clonal selection based fuzzy C-means algorithm for clustering. In: GECCO '14 Proceedings of the 2014 conference on genetic and evolutionary computation, pp 105–112
Babuska R (2014) Fuzzy clustering lecture. http://homes.di.unimi.it/ valenti/SlideCorsi/Bioinformatica05/Fuzzy-Clustering-lecture-Babuska. Accessed Oct 2014
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell
Book MATH Google Scholar
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall Inc, Upper Saddle River
MATH Google Scholar
Ruspini EH (1970) Numerical methods for fuzzy clustering. Inf Sci 2:319350
Article Google Scholar
Yang MS (1993) A survey of fuzzy clustering. Math Comput Model 18:1–16
Article MATH Google Scholar
Lee HS (1999) Automatic clustering of business process in business systems planning. Eur J Oper Res 114:354–362
Article MATH Google Scholar
Klir GJ, Yuan B (1995) Fuzzy sets and fuzzy logic theory and application. Prentice Hall PTR, Upper Saddle River
Google Scholar
Rosenfeld A (1975) Fuzzy graphs. In: Zadeh LA, Fu KS, Shimura M (eds) Fuzzy sets and their applications to cognitive and decision processes. Academic Press, New York
Google Scholar
Matula DW (1970) Cluster analysis via graph theoretic techniques. In: Proceedings of the Louisiana conference on combinatorics, graph theory and computing, Winnipeg
Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3:32–57
Article MathSciNet MATH Google Scholar
Guerrero-Bote VP, Lopez-Pujalte C, de Moya-Anegon F, Herrero-Solana V (2003) Comparison of neural models for document clustering. Int J Approx Reason 34:287–305
Article MATH Google Scholar
Gath I, Geva AB (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 11(7):773–781
Article Google Scholar
Bezdek JC, Coray C, Gunderson R, Watson J (1981) Detection and characterization of cluster substructure—linear structure, fuzzy c-varieties and convex combinations thereof. SIAM J Appl Math 40(2):358–372
Article MathSciNet MATH Google Scholar
Yang Y, Huang S (2007) Image segmentation by fuzzy c-means clustering algorithm with a novel penalty term. Comput Inform 26:17–31
MATH Google Scholar
Cai W, Chen S, Zhang D (2007) Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recognit 40(3):825–838
Article MATH Google Scholar
Sarma TH, Viswanath P, Reddy BE (2013) A hybrid approach to speed-up the k-means clustering method. Int J Mach Learn Cybern 4:107–113
Article Google Scholar
Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design and implementation (OSDI’04), vol 6, p 10
He Y, Tan H, Luo W, Feng S, Fan J (2014) Mr-dbscan: a scalable mapreduce-based dbscan algorithm for heavily skewed data. Front Comput Sci 8(1):83–99
Article MathSciNet Google Scholar
Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on mapreduce. In: Proceedings of the CloudCom’09. Springer, Berlin, pp 674–679
Zhou P, Lei J, Ye W (2011) Large-scale data sets clustering based on mapreduce and hadoop. Comput Inf Syst 7(16):5956–5963
Google Scholar
Li H-G, Wu G-Q, Hu X-G, Zhang J, Li L, Wu X (2011) K-means clustering with bagging and mapreduce. In: Proceedings of the 44th Hawaii international conference on system sciences. IEEE Computer Society, Washington, DC, pp 1–8
Nair S, Mehta J (2011) Clustering with Apache Hadoop. In: Proceedings of the international conference, workshop on emerging trends in technology (ICWET’11), New York. ACM, New York, pp 505–509
Papadimitriou S, Sun J (2008) Disco: distributed co-clustering with map-reduce: a case study towards petabyte-scale end-to-end mining. In: Proceedings of the IEEE ICDM’08, Washington, DC, pp 512–521
Ene A, Im S, Moseley B (2011) Fast clustering using mapreduce. In: Proceedings of KDD’11. ACM, New York, pp 681–689
Yang J, Li X (2013) Mapreduce based method for big data semantic clustering. In: Proceedings of the 2013 IEEE international conference on systems, man, and cybernetics (SMC’13). IEEE Computer Society, Washington, DC, pp 2814–2819
Cordeiro F, Traina Jr C, Traina AJM, Lopez J, Kang U, Taloutsos C (2011) Clustering very large multi-dimensional datasets with mapreduce. In: Proceedings of KDD’11. ACM, New York, pp 690–698
MPI (Message Passing Interface). http://www-unix.mcs.anl.gov/mpi/. Accessed 25 Apr 2015
PVM (Parallel Virtual Machine). http://www.csm.ornl.gov/pvm/. Accessed 25 Apr 2015
Modenesi MV, Costa MCA, Evsukoff AG, Ebecken NF (2007) Parallel fuzzy c-means cluster analysis. In: Lecture notes in computer science on high performance computing for computational science (VECPAR’06). Springer, New York
Blackard JA (1998) Comparison of neural networks and discriminant analysis in predicting forest cover types. Ph.D. dissertation, Department of Forest Sciences, Colorado State University, Fort Collins, Colorado
Apache Hadoop. http://hadoop.apache.org/. Accessed 25 Apr 2015
Han J (2005) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco
Google Scholar
Karypis G (2003) CLUTO: a clustering toolkit. University of Minnesota, Computer Science. Tech. Rep. 02-017
Havens TC, Chitta R, Jain AK, Rong J (2011) Speedup of fuzzy and possibilistic kernel c-means for large-scale clustering. In: Proceedings of IEEE international conference on fuzzy systems (FUZZ), pp 463–470
Hathaway R, Bezdek J (1995) Optimization of clustering criteria by reformulation. IEEE Trans Fuzzy Syst 3:241245
Article Google Scholar
Mahout Library. http://mahout.apache.org/. Accessed 25 Apr 2015

Download references

Acknowledgments

The author acknowledges the Texas Advanced Computing Center (TACC) at the University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.

Author information

Authors and Affiliations

Department of Computer Science, North Dakota State University, Fargo, ND, USA
Simone A. Ludwig

Authors

Simone A. Ludwig
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Simone A. Ludwig.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ludwig, S.A. MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability. Int. J. Mach. Learn. & Cyber. 6, 923–934 (2015). https://doi.org/10.1007/s13042-015-0367-0

Download citation

Received: 30 October 2014
Accepted: 20 April 2015
Published: 29 April 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s13042-015-0367-0

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability

Abstract

Access this article

Similar content being viewed by others

Big data preprocessing: methods and prospects

Big data analytics: a survey

MapReduce based parallel fuzzy-rough attribute reduction using discernibility matrix

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability

Abstract

Access this article

Similar content being viewed by others

Big data preprocessing: methods and prospects

Big data analytics: a survey

MapReduce based parallel fuzzy-rough attribute reduction using discernibility matrix

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation