Improved K-Means Clustering Algorithm for Big Data Mining under Hadoop Parallel Framework

Journal of Grid Computing

Abstract

This paper aims to improve both the accuracy and the efficiency of clustering algorithms for mining large data sets. The traditional clustering algorithm is first modified to improve accuracy, and the modified algorithm is then parallelized to improve efficiency. To improve accuracy, a density-based incremental K-means clustering algorithm is proposed on the basis of the K-means algorithm. First, the density of each data point is calculated, and each basic cluster is formed from a center point whose density is not less than a given threshold, together with the points within its density range. Next, basic clusters are merged according to the distance between their cluster centers. Finally, points that have not been assigned to any cluster are assigned to the cluster nearest to them. To improve the efficiency of the algorithm and reduce its time complexity, a distributed database is used to simulate a shared memory space, and the algorithm is parallelized on the Hadoop cloud computing platform. Simulation results show that the clustering accuracy of the proposed algorithm is more than 10% higher than that of the other two algorithms.
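
The four steps in the abstract (density calculation, basic cluster formation, merging by center distance, assignment of leftover points) can be illustrated with a small sequential sketch. It is not the paper's implementation: the abstract does not define density, the density range, or the merge criterion, so the sketch assumes that density is the number of other points within a radius `eps`, that a basic cluster is a dense point plus its unassigned neighbors, and that two clusters merge when their centroids are within `merge_dist`. The names `density_cluster_sketch`, `eps`, `min_density`, and `merge_dist` are hypothetical, and the Hadoop parallelization is not shown.

```python
import math


def density_cluster_sketch(points, eps, min_density, merge_dist):
    """Return clusters as lists of point indices (illustrative only)."""
    n = len(points)

    def centroid(members):
        dims = len(points[0])
        return tuple(
            sum(points[j][d] for j in members) / len(members)
            for d in range(dims)
        )

    # Step 1 (assumed definition): density = number of other points within eps.
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    density = [len(nb) for nb in neighbors]

    # Step 2: visit points from densest to sparsest; each unassigned point
    # whose density reaches the threshold becomes a center and claims its
    # still-unassigned neighbors as a basic cluster.
    assigned = set()
    clusters = []
    for i in sorted(range(n), key=lambda k: -density[k]):
        if i in assigned or density[i] < min_density:
            continue
        members = [i] + [j for j in neighbors[i] if j not in assigned]
        assigned.update(members)
        clusters.append(members)

    # Step 3: repeatedly merge any two clusters whose centroids are close.
    merged = True
    while merged:
        merged = False
        centers = [centroid(c) for c in clusters]
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if math.dist(centers[a], centers[b]) <= merge_dist:
                    clusters[a].extend(clusters[b])
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break

    # Step 4: attach every leftover point to the nearest cluster centroid.
    centers = [centroid(c) for c in clusters]
    for i in range(n):
        if i not in assigned and clusters:
            nearest = min(
                range(len(clusters)),
                key=lambda c: math.dist(points[i], centers[c]),
            )
            clusters[nearest].append(i)
    return clusters
```

Calling the function on two tight groups of four points plus one isolated point, with `eps=1.5`, `min_density=3`, and `merge_dist=3.0`, yields two clusters, with the isolated point attached to whichever group's centroid is closer. The pairwise neighbor search is quadratic in the number of points, which is the kind of cost that motivates distributing the work across a cluster.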

Acknowledgements

This work was supported by the Nantong Natural Science Foundation project (No. MS12017026-3).

Author information

Corresponding author

Correspondence to Weijia Lu.

Cite this article

Lu, W. Improved K-Means Clustering Algorithm for Big Data Mining under Hadoop Parallel Framework. J Grid Computing 18, 239–250 (2020). https://doi.org/10.1007/s10723-019-09503-0

  • DOI: https://doi.org/10.1007/s10723-019-09503-0
