A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

The Journal of Supercomputing

Abstract

Clustering a large volume of data in a distributed environment is challenging: the data stored across multiple machines are huge, and the solution space is large. Genetic algorithms (GAs) search large solution spaces effectively and can yield better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a GA using the Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, the GA is applied in parallel to data chunks located on different machines. The Mahalanobis distance is used as the fitness value in the GA; because it accounts for the covariance between data points, it provides a better representation of the initial data. In phase 2, k-means with k-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to handle distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. The results were compared with those of the MapReduce-based algorithms mrk-means, parallel k-means, and scaling GA.
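The two building blocks named in the abstract, the Mahalanobis distance and k-means++ seeding, can be illustrated with a minimal Python sketch. It is not the authors' Hadoop implementation: the abstract does not say how the GA turns the distance into a fitness value, so only the distance itself is shown, and the function names (`covariance`, `inverse_2x2`, `mahalanobis`, `kmeanspp_seeds`) as well as the toy 2-D data chunk are assumptions made for illustration.

```python
import math
import random


def covariance(points):
    """Sample covariance matrix (nested lists) of equal-length points."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
             for j in range(d)] for i in range(d)]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix; sufficient for the 2-D demo."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis(x, center, cov_inv):
    """sqrt((x - c)^T S^-1 (x - c)), given the inverse covariance S^-1."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    q = sum(diff[i] * cov_inv[i][j] * diff[j]
            for i in range(len(diff)) for j in range(len(diff)))
    return math.sqrt(max(q, 0.0))  # guard against tiny negative round-off


def squared_euclidean(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng):
    """k-means++ seeding: first seed uniform, the rest by D^2 sampling."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance from each point to its nearest chosen seed.
        d2 = [min(squared_euclidean(p, s) for s in seeds) for p in points]
        if sum(d2) == 0:  # every point coincides with a seed
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds


if __name__ == "__main__":
    # Hypothetical 2-D data chunk, standing in for one machine's share.
    chunk = [(0.0, 0.0), (2.0, 1.0), (4.0, 2.5), (6.0, 2.5), (8.0, 4.0)]
    cov_inv = inverse_2x2(covariance(chunk))
    print(mahalanobis((8.0, 4.0), (4.0, 2.0), cov_inv))
    print(kmeanspp_seeds(chunk, 2, random.Random(0)))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance; a non-diagonal covariance rescales and rotates the space so that correlated features are not counted twice, which is the property the abstract appeals to.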


Acknowledgements

We sincerely thank the Council of Scientific and Industrial Research (CSIR), New Delhi, India, for its financial support of this work (File No. 09\085(0111)2014-EMR-1).

Author information

Corresponding author

Correspondence to Ankita Sinha.

Cite this article

Sinha, A., Jana, P.K. A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets. J Supercomput 74, 1562–1579 (2018). https://doi.org/10.1007/s11227-017-2182-8
