Cost-Aware Scheduling and Data Skew Alleviation for Big Data Processing in Heterogeneous Cloud Environment

Li, Hongjian; Zhu, Lisha; Wang, Shuaicheng; Wang, Lei

doi:10.1007/s10723-023-09661-2

Research
Published: 22 June 2023

Volume 21, article number 33, (2023)
Journal of Grid Computing

Abstract

For big data applications, it is important to allocate resources reasonably and schedule tasks effectively. Spark is one of the most popular big data processing frameworks, yet its default scheduling strategy still suffers from low resource utilization and high resource cost. In this paper, we propose a low-cost task scheduling algorithm for Spark in a heterogeneous cloud environment that minimizes cost while improving resource utilization. First, we construct a cost model for Spark based on the hierarchical relationship among applications, jobs, stages, and tasks. Then, based on this model, we propose a low-cost task scheduling algorithm that improves the utilization of computational resources by adjusting task parallelism and schedules tasks by priority according to the distribution of the data to be computed. We also propose a Reduce task load balancing partitioning algorithm (RTLBPA) that uses prior information about the data to disperse and aggregate different keys with the same hash value into different partitions. This algorithm achieves load balancing and shortens job execution time by shortening the whole Reduce phase. Finally, we performed extensive experiments on the proposed algorithm using HiBench workloads in a cloud environment. The results show that the cost can be reduced by at least 22.71% compared with the existing algorithm under different workloads. The proposed algorithm can therefore effectively improve the cost efficiency of a Spark cluster while improving resource utilization.
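
To make the load-balancing idea concrete, the following is a minimal sketch of partitioning keys with prior knowledge of their record counts. It is not the authors' RTLBPA, whose details are not given in this abstract. It swaps in a generic greedy heuristic (heaviest key first, always into the currently lightest partition) and compares it with plain hash partitioning. The function names, the sample key counts, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import heapq
import zlib


def hash_assign(key_counts, num_partitions):
    """Baseline: place each key by a deterministic hash modulo the partition count."""
    return {key: zlib.crc32(key.encode("utf-8")) % num_partitions
            for key in key_counts}


def greedy_assign(key_counts, num_partitions):
    """Assumed heuristic: heaviest key first, into the lightest partition so far."""
    # Min-heap of (current load, partition index).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending record count; break ties by key for determinism.
    for key, count in sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, p = heapq.heappop(heap)
        assignment[key] = p
        heapq.heappush(heap, (load + count, p))
    return assignment


def partition_loads(assignment, key_counts, num_partitions):
    """Total number of records that each partition (Reduce task) would process."""
    loads = [0] * num_partitions
    for key, p in assignment.items():
        loads[p] += key_counts[key]
    return loads


if __name__ == "__main__":
    # Hypothetical prior information: records per key, with two heavy keys.
    key_counts = {"a": 100, "b": 90, "c": 10, "d": 10, "e": 5, "f": 5}
    n = 2
    print("hash  :", partition_loads(hash_assign(key_counts, n), key_counts, n))
    print("greedy:", partition_loads(greedy_assign(key_counts, n), key_counts, n))
```

Because a Reduce phase finishes only when its slowest task finishes, the largest partition load is the quantity to watch when comparing the two assignments.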

Data Availability

The datasets generated during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This work was supported by the Chongqing Science and Technology Commission Project (Grant No: cstc2018jcyjAX0525) and the Key Research and Development Projects of Sichuan Science and Technology Department (Grant No: 2019YFG0107).

Funding

This work was supported by the Chongqing Science and Technology Commission Project (Grant No: cstc2018jcyjAX0525; Recipient: Hongjian Li) and the Key Research and Development Projects of Sichuan Science and Technology Department (Grant No: 2019YFG0107; Recipient: Hongjian Li).

Author information

Authors and Affiliations

Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
Hongjian Li, Lisha Zhu, Shuaicheng Wang & Lei Wang

Contributions

Hongjian Li: Proposed the idea, Experiments, Wrote the manuscript. Lisha Zhu: Proposed the idea, Experiments, Wrote the manuscript. Shuaicheng Wang: Helped write several sections of the manuscript, Proofreading. Lei Wang: Helped write several sections of the manuscript, Proofreading.

Corresponding author

Correspondence to Hongjian Li.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, H., Zhu, L., Wang, S. et al. Cost-Aware Scheduling and Data Skew Alleviation for Big Data Processing in Heterogeneous Cloud Environment. J Grid Computing 21, 33 (2023). https://doi.org/10.1007/s10723-023-09661-2

Received: 21 August 2022
Accepted: 07 April 2023
Published: 22 June 2023
DOI: https://doi.org/10.1007/s10723-023-09661-2
