Cost-Aware Scheduling and Data Skew Alleviation for Big Data Processing in Heterogeneous Cloud Environment

  • Research
Journal of Grid Computing

Abstract

For big data applications, allocating resources reasonably and scheduling tasks effectively are both important. Spark is one of the most popular big data processing frameworks, yet its default scheduling strategy still suffers from low resource utilization and high resource cost. This paper proposes a low-cost task scheduling algorithm for Spark in a heterogeneous cloud environment that minimizes cost while improving resource utilization. First, we construct a cost model for Spark based on the hierarchical relationship among applications, jobs, stages, and tasks. Then, based on this model, we propose a low-cost task scheduling algorithm that improves the utilization of computational resources by adjusting task parallelism and schedules tasks by priority according to the distribution of the data to be computed. We also propose a Reduce task load balancing partitioning algorithm (RTLBPA) that uses prior information about the data to disperse and aggregate different keys with the same hash value into different partitions. This algorithm achieves load balancing and shortens job execution time by reducing the duration of the whole Reduce phase. Finally, we performed extensive experiments on the proposed algorithm using HiBench workloads in a cloud environment. The results show that cost is reduced by at least 22.71% compared with the existing algorithm under different workloads. The proposed algorithm can therefore effectively improve the cost efficiency of a Spark cluster while improving resource utilization.
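To make the load-balancing idea concrete, the following is a minimal sketch of skew-aware partitioning under stated assumptions. It is not the paper's RTLBPA, whose details are not given in the abstract. It assumes that per-key record counts are known in advance (the "prior information") and swaps in a simple greedy heuristic: assign the heaviest key first to the currently lightest partition. The function names `hash_partition` and `balanced_partition` and the sample counts are hypothetical.

```python
import heapq


def hash_partition(key_counts, num_partitions):
    """Per-partition load when keys are placed by hash(key) % num_partitions."""
    loads = [0] * num_partitions
    for key, count in key_counts.items():
        loads[hash(key) % num_partitions] += count
    return loads


def balanced_partition(key_counts, num_partitions):
    """Greedy placement: heaviest key first, always into the lightest partition.

    Assumes key_counts (records per key) is known before the Reduce phase.
    Returns (key -> partition mapping, per-partition loads).
    """
    # Min-heap of (current load, partition id).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending count; break ties by key text for determinism.
    ordered = sorted(key_counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, count in ordered:
        load, p = heapq.heappop(heap)
        assignment[key] = p
        heapq.heappush(heap, (load + count, p))
    loads = [0] * num_partitions
    for key, p in assignment.items():
        loads[p] += key_counts[key]
    return assignment, loads


# Hypothetical skewed input: integer keys, so hash(key) == key in CPython.
counts = {0: 50, 2: 30, 4: 10, 1: 5, 3: 5}
skewed_loads = hash_partition(counts, 2)
assignment, balanced_loads = balanced_partition(counts, 2)
print("hash partitioning loads:", skewed_loads)
print("greedy balanced loads:  ", balanced_loads)
```

With these sample counts, plain hash partitioning sends all even keys to one partition, while the greedy placement splits the records evenly. Since a Reduce phase finishes only when its slowest task does, lowering the maximum partition load is what shortens the phase.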

Data Availability

The datasets generated during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This work was supported by the Chongqing Science and Technology Commission Project (Grant No. cstc2018jcyjAX0525) and the Key Research and Development Projects of Sichuan Science and Technology Department (Grant No. 2019YFG0107).

Funding

This work was supported by the Chongqing Science and Technology Commission Project (Grant No. cstc2018jcyjAX0525; Recipient: Hongjian Li) and the Key Research and Development Projects of Sichuan Science and Technology Department (Grant No. 2019YFG0107; Recipient: Hongjian Li).

Author information

Contributions

Hongjian Li: proposed the idea, conducted the experiments, and wrote the manuscript. Lisha Zhu: proposed the idea, conducted the experiments, and wrote the manuscript. Shuaicheng Wang: helped write several sections of the manuscript and proofread it. Lei Wang: helped write several sections of the manuscript and proofread it.

Corresponding author

Correspondence to Hongjian Li.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, H., Zhu, L., Wang, S. et al. Cost-Aware Scheduling and Data Skew Alleviation for Big Data Processing in Heterogeneous Cloud Environment. J Grid Computing 21, 33 (2023). https://doi.org/10.1007/s10723-023-09661-2

  • DOI: https://doi.org/10.1007/s10723-023-09661-2
