Multimedia Tools and Applications, Volume 77, Issue 8, pp 10017–10031

Research and implementation of user clustering based on MapReduce in multimedia big data

  • Tongke Fan


Massive datasets are hard to interpret and slow to cluster, a recurring problem in big data settings. To address it, a Canopy + K-means clustering algorithm is proposed and implemented on the MapReduce programming model so that it makes full use of the computing and storage capacity of a Hadoop cluster. A large population of buyers on Taobao serves as the application context for a case study carried out with Mahout, the data mining library of the Hadoop platform. A general procedure for mining with Mahout is also given. The MapReduce-based clustering algorithm shows good clustering quality and operation speed. The Canopy + K-means algorithm is compared with the K-means algorithm in terms of runtime, speed-up ratio and scalability. Both algorithms are tested on clusters with different numbers of nodes and on datasets of various scales. The experimental results show that the Canopy + K-means algorithm runs faster than the K-means algorithm. Both show a good speed-up ratio in the Hadoop environment, and the Canopy + K-means algorithm performs even better than the K-means algorithm in this respect.
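The abstract names the Canopy + K-means combination without spelling out its steps. In the common arrangement, a cheap Canopy pass over the data picks the number of clusters and their starting centers, and K-means then refines them. The following is a minimal single-machine sketch of that idea in Python. It does not use Hadoop, MapReduce or Mahout, and the function names, the thresholds `t1` and `t2`, and the sample points are illustrative assumptions rather than the paper's implementation.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def canopy_centers(points, t1, t2):
    """Canopy pass: return one center per canopy (assumes t1 > t2 >= 0)."""
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Every remaining point within the loose threshold t1 joins the canopy.
        canopy = [p for p in remaining if distance(seed, p) <= t1]
        centers.append(mean(canopy))
        # Points within the tight threshold t2 can no longer seed a canopy.
        remaining = [p for p in remaining if distance(seed, p) > t2]
    return centers


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))


def kmeans(points, centers, max_iter=100):
    """Plain K-means started from the given centers; returns the final centers."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # An empty cluster keeps its previous center.
        updated = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
initial = canopy_centers(points, t1=3.0, t2=2.0)
centers = kmeans(points, initial)
labels = [nearest(p, centers) for p in points]
print(len(centers), labels)
```

In this sketch the Canopy pass removes the need to choose k by hand, which is one plausible reason the combined algorithm could start closer to a good solution than K-means with arbitrary initial centers. The thresholds still have to be chosen for the data at hand.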


Multimedia big data, Cloud computing, Hadoop, MapReduce, Clustering algorithm



This work was supported by the 2015 Scientific Research Project of the Shaanxi Provincial Education Department (No. 15JK2113) and the Xi’an Social Science Research Project (special for Xi’an International University) (No. 161 N08). The author would like to thank the anonymous reviewers and the editor for their instructive suggestions, which greatly improved the quality of this paper.



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. School of Information and Network, Xi’an International University, Xi’an, China
