Neural Computing and Applications, Volume 31, Issue 12, pp 8239–8252

MapReduce-based adaptive random forest algorithm for multi-label classification

  • Qinghua Wu
  • Haihui Wang
  • Xuesong Yan (Email author)
  • Xiaobo Liu
Machine Learning - Applications & Techniques in Cyber Intelligence


Because real-world data often have complex characteristics, multi-label learning has been proposed in data mining to address the problem of extracting knowledge from information in the era of big data. The complexity of data structures in this era means that traditional single-label learning methods can no longer meet the needs of technological development, and the importance of multi-label learning is becoming increasingly evident. The random forest (RF) algorithm is regarded as one of the best classification algorithms. In this study, the traditional decision tree algorithm was improved, and the traditional RF method was converted into an adaptive RF (ARF) method for multi-label classification. Experiments verified the effectiveness of the proposed method. The RF method may not be able to classify massive data in a short time, but Hadoop, developed by Apache, is suitable for data-intensive tasks. On this basis, we modified the MapReduce programming model to make it suitable for the proposed ARF method. The method was implemented on a cloud platform, and the time efficiency of the parallel model was verified by experiments.
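As a rough illustration of how a forest for multi-label data can be split into map and reduce steps, the sketch below simulates the pattern locally in plain Python. It is not the authors' ARF method and does not use the Hadoop API: the abstract gives no algorithmic details, so everything here is an assumption. Randomized one-split stumps stand in for the improved decision tree, and the names `train_stump`, `map_phase`, `reduce_phase`, `run_job`, and `predict` are hypothetical.

```python
import random
from functools import reduce


def train_stump(rows, labels, rng):
    """Fit one randomized multi-label stump on a bootstrap sample (assumed stand-in for a tree)."""
    n = len(rows)
    sample = [rng.randrange(n) for _ in range(n)]        # bagging: draw n rows with replacement
    feat = rng.randrange(len(rows[0]))                    # random feature choice
    thr = sorted(rows[i][feat] for i in sample)[n // 2]   # median of the sampled feature values
    n_labels = len(labels[0])

    def majority(members):
        # Per-label majority vote; an empty side falls back to the whole sample.
        members = members or sample
        return [int(2 * sum(labels[i][k] for i in members) > len(members))
                for k in range(n_labels)]

    left = majority([i for i in sample if rows[i][feat] < thr])
    right = majority([i for i in sample if rows[i][feat] >= thr])
    return feat, thr, left, right


def map_phase(partition, seed, trees_per_mapper=5):
    """Mapper: train a few stumps on one data partition."""
    rows, labels = partition
    rng = random.Random(seed)
    return [train_stump(rows, labels, rng) for _ in range(trees_per_mapper)]


def reduce_phase(forest_a, forest_b):
    """Reducer: merge two partial forests into one."""
    return forest_a + forest_b


def run_job(partitions, trees_per_mapper=5):
    """Simulate the job locally: map over partitions, then reduce the partial forests."""
    partial = [map_phase(p, seed, trees_per_mapper) for seed, p in enumerate(partitions)]
    return reduce(reduce_phase, partial)


def predict(forest, x):
    """Predict a 0/1 label vector by per-label majority vote over all stumps."""
    n_labels = len(forest[0][2])
    votes = [0] * n_labels
    for feat, thr, left, right in forest:
        leaf = left if x[feat] < thr else right
        for k in range(n_labels):
            votes[k] += leaf[k]
    return [int(2 * v > len(forest)) for v in votes]
```

In this sketch each partition is a `(rows, labels)` pair, where `labels[i]` is a 0/1 vector with one entry per label. Because each mapper only needs its own partition, the training step parallelizes naturally; the reducer simply concatenates the partial forests, and the per-label vote happens at prediction time.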


Keywords: Multi-label classification; Random forest algorithm; Hadoop; MapReduce



This paper is supported by the Natural Science Foundation of China (No. 61673354), the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan), the State Key Lab of Digital Manufacturing Equipment & Technology (DMETKF2018020), and Huazhong University of Science & Technology.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  • Qinghua Wu (1)
  • Haihui Wang (1)
  • Xuesong Yan (2, 3), Email author
  • Xiaobo Liu (4)

  1. Faculty of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, China
  2. School of Computer Science, China University of Geosciences, Wuhan, China
  3. State Key Lab of Digital Manufacturing Equipment and Technology, Huazhong University of Science and Technology, Wuhan, China
  4. School of Automation, China University of Geosciences, Wuhan, China
