Neural Computing and Applications, Volume 32, Issue 1, pp 93–99

Research on large data set clustering method based on MapReduce

  • Pengcheng Wei
  • Fangcheng He (email author)
  • Li Li
  • Chuanfu Shang
  • Jing Li
Brain-Inspired Computing and Machine Learning for Brain Health


This paper describes in detail the similarities and differences between the MapReduce implementations of the K-means algorithm and the Canopy algorithm, and analyzes the possibility of combining the two into an algorithm better suited to clustering analysis of large data sets. Unlike the improvements to the K-means algorithm proposed in the previous literature, it proposes new ideas for sampling and analyzes the selection of the relevant thresholds. It then introduces a MapReduce implementation framework for a K-means algorithm based on Canopy partitioning and filtering, and analyzes some of its pseudocode. Finally, it briefly analyzes the time complexity of the algorithm.
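The general idea of combining the two algorithms can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: it assumes that the Canopy pass is used only to pick initial centers with a single tight threshold (here called `t2`), and that each K-means iteration is split into a map step (assign a point to its nearest center) and a reduce step (average the points of one center). The function names, the threshold value, and the sample data are all hypothetical.

```python
import math
from collections import defaultdict


def canopy_centers(points, t2):
    """Canopy pass (assumed variant): greedily pick centers so that
    no remaining candidate lies within distance t2 of a chosen center."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        # Drop every point that is tightly bound to the new center.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point


def kmeans_reduce(assigned):
    """Reduce step: the new center is the mean of the assigned points."""
    n = len(assigned)
    return tuple(sum(coords) / n for coords in zip(*assigned))


def canopy_kmeans(points, t2, max_iter=20, tol=1e-9):
    """K-means seeded by Canopy centers, iterated as map/reduce rounds."""
    centers = canopy_centers(points, t2)
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:
            idx, q = kmeans_map(p, centers)
            groups[idx].append(q)
        # Keep the old center if no point was assigned to it.
        new_centers = [
            kmeans_reduce(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


# Hypothetical sample data: two well-separated groups of 2-D points.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = canopy_kmeans(sample, t2=3.0)
```

In this sketch the number of clusters is not fixed in advance; it follows from the threshold `t2`, which is one reason the choice of thresholds matters. In an actual MapReduce job the map and reduce functions would run on separate partitions of the data rather than in a single loop.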


MapReduce; Large data; Set clustering method



This work was supported by the Chongqing Big Data Engineering Laboratory for Children, the Chongqing Electronics Engineering Technology Research Center for Interactive Learning, the Science and Technology Research Project of Chongqing Municipal Education Commission of China (No. KJ1601401), the Science and Technology Research Project of Chongqing University of Education (No. KY201725C), the Basic Research and Frontier Exploration of Chongqing Science and Technology Commission (CSTC2014jcyjA40019), and the Project of Science and Technology Research Program of Chongqing Education Commission of China (No. KJZD-K201801601).



Copyright information

© The Natural Computing Applications Forum 2018

Authors and Affiliations

  • Pengcheng Wei (1)
  • Fangcheng He (2, email author)
  • Li Li (1)
  • Chuanfu Shang (1)
  • Jing Li (1)
  1. School of Mathematics and Information Engineering, Chongqing University of Education, Chongqing, China
  2. School of Foreign Languages and Literatures, Chongqing University of Education, Chongqing, China
