Cluster Computing, Volume 20, Issue 4, pp 2821–2831

A strategy for scheduling reduce task based on intermediate data locality of the MapReduce

  • Fengjun Shang
  • Xuanling Chen
  • Chenyun Yan
Article

Abstract

This paper studies task scheduling, from the perspective of resource allocation and management, as a way to improve the performance of a Hadoop system. To save network bandwidth in a Hadoop cluster and improve system performance, we improve a ReduceTask scheduling strategy based on data locality. In the MapReduce stage, two main data streams traverse the cluster network: slow-task migration and remote copies of data. When these two bursty transfers overlap, they can easily become a bottleneck of the cluster network. To reduce the amount of remotely copied data, we exploit data locality and establish a minimum network resource consumption model (MNRC), which calculates the network resource consumption of a ReduceTask. Based on this model, we design a delay priority scheduling policy for ReduceTasks that is driven by the cost of network resource consumption. Finally, MNRC is verified by simulation experiments. Evaluation results show that MNRC saves cluster network resources by an average of 7.5% in heterogeneous environments.
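
The abstract does not state the MNRC formula, so the following is only an illustrative sketch of the general idea it describes, not the authors' model. It assumes that the network cost of running a ReduceTask on a node equals the bytes of intermediate data held on other nodes, and that a scheduler may decline a costly offer a bounded number of times. The names `remote_copy_cost`, `pick_node`, `should_delay`, and the parameter `max_skips` are hypothetical.

```python
from typing import Dict, List


def remote_copy_cost(partition_bytes: Dict[str, int], candidate: str) -> int:
    """Bytes a reduce task would have to fetch remotely if run on `candidate`.

    `partition_bytes` maps a node name to the size of this reduce
    partition's intermediate data stored on that node (an assumed input).
    """
    return sum(size for node, size in partition_bytes.items() if node != candidate)


def pick_node(partition_bytes: Dict[str, int], free_nodes: List[str]) -> str:
    """Among the currently free nodes, pick the one with the lowest remote-copy cost."""
    return min(free_nodes, key=lambda n: remote_copy_cost(partition_bytes, n))


def should_delay(partition_bytes: Dict[str, int], offered: str,
                 all_nodes: List[str], skipped: int, max_skips: int) -> bool:
    """Decline the offered node while a cheaper node exists and the
    task has not yet been skipped `max_skips` times (hypothetical bound)."""
    best = min(remote_copy_cost(partition_bytes, n) for n in all_nodes)
    return remote_copy_cost(partition_bytes, offered) > best and skipped < max_skips


if __name__ == "__main__":
    data = {"n1": 600, "n2": 300, "n3": 100}
    print(remote_copy_cost(data, "n1"))                  # bytes held on n2 and n3
    print(pick_node(data, ["n2", "n3"]))                 # cheaper of the free nodes
    print(should_delay(data, "n2", list(data), 0, 3))    # n1 would be cheaper
```

Bounding the number of skips keeps a task from waiting indefinitely for its cheapest node, which is the usual trade-off in delay-based scheduling.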

Keywords

Hadoop, Task scheduling, Data locality, Bandwidth savings

Notes

Acknowledgements

The authors would like to thank the Chongqing Basic and Frontier Research Project under Grant No. cstc2016jcyjA0590. The work is partly funded by the National Natural Science Foundation of China (No. 61672004).

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Institute of Computer Network Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China
