HTD: heterogeneous throughput-driven task scheduling algorithm in MapReduce

Abstract

As one of the most popular parallel data processing models, MapReduce has been widely used in many fields. Task scheduling is the core module of a MapReduce system, and the quality of the scheduling algorithm directly affects the system's processing capacity. Because new nodes are continually added to a cluster to increase its processing capacity, the cluster inevitably becomes heterogeneous. Heterogeneous environments are common in practice, yet there has been little research on task scheduling in such environments. This paper therefore studies task scheduling in heterogeneous environments in depth and proposes a new task scheduling algorithm, HTD. First, we give a formal definition of the throughput-driven task scheduling problem in a heterogeneous environment. Second, we design the scheduling algorithm HTD, which quickly obtains the completion sequence of a job set and optimizes the details of task scheduling in a heterogeneous environment. Finally, a series of experiments demonstrates the efficiency and effectiveness of the algorithm.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61602076, 61702072, 62002039, 61976032), the China Postdoctoral Science Foundation funded projects (Grant Nos. 2017M611211, 2017M6211, 2019M661077), the Natural Science Foundation of Liaoning Province (Grant No. 20180540003), and the CERNET Innovation Project (Grant No. NGII20190902).

Author information

Corresponding author

Correspondence to Xite Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, X., Wang, C., Bai, M. et al. HTD: heterogeneous throughput-driven task scheduling algorithm in MapReduce. Distrib Parallel Databases 40, 135–163 (2022). https://doi.org/10.1007/s10619-021-07375-6

  • DOI: https://doi.org/10.1007/s10619-021-07375-6

Keywords

  • MapReduce
  • Scheduling
  • Heterogeneous
  • Throughput