Abstract
For big data applications, it is important to allocate resources reasonably and schedule tasks effectively. Spark is one of the most popular big data processing frameworks, yet its default scheduling strategy still suffers from low resource utilization and high resource cost. In this paper, we propose a low-cost task scheduling algorithm for Spark in heterogeneous cloud environments that minimizes cost while improving resource utilization. First, we construct a cost model for Spark based on the hierarchical relationship among applications, jobs, stages, and tasks. Then, based on this model, we propose a low-cost task scheduling algorithm that improves the utilization of computational resources by adjusting task parallelism and schedules tasks with priorities derived from the distribution of the data to be computed. We also propose a Reduce task load balancing partitioning algorithm (RTLBPA) that uses prior information about the data to disperse and aggregate different keys with the same hash value into different partitions. This algorithm achieves load balancing and shortens job execution time by shortening the whole Reduce phase. Finally, we perform extensive experiments on the proposed algorithm using HiBench workloads in a cloud environment. The results show that the cost is reduced by at least 22.71% compared with the existing algorithm under different workloads. The proposed algorithm therefore effectively improves the cost efficiency of a Spark cluster while improving resource utilization.
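To make the data skew problem concrete, the following is a minimal sketch of why keys that share a hash value overload a single Reduce partition, and how prior knowledge of key frequencies can spread them out. It is not the RTLBPA algorithm itself: it swaps in a simple greedy heuristic (heaviest key first, into the currently lightest partition), and the function names, the integer keys, and the sampled counts are all hypothetical.

```python
import heapq


def hash_partition_loads(key_counts, num_partitions):
    """Per-partition load when each key goes to key % num_partitions."""
    loads = [0] * num_partitions
    for key, count in key_counts.items():
        # Integer keys stand in for hash values in this sketch.
        loads[key % num_partitions] += count
    return loads


def greedy_balanced_assignment(key_counts, num_partitions):
    """Assign keys, heaviest first, to the currently lightest partition."""
    heap = [(0, p) for p in range(num_partitions)]  # (load, partition id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending count, then by key so that ties are deterministic.
    for key, count in sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, partition = heapq.heappop(heap)
        assignment[key] = partition
        heapq.heappush(heap, (load + count, partition))
    return assignment


def assignment_loads(key_counts, assignment, num_partitions):
    """Per-partition load under an explicit key-to-partition assignment."""
    loads = [0] * num_partitions
    for key, partition in assignment.items():
        loads[partition] += key_counts[key]
    return loads


# Hypothetical sampled key frequencies (the "prior information").
# Keys 0, 4 and 8 all map to partition 0 under "% 4".
sample_counts = {0: 50, 4: 40, 8: 30, 1: 5, 2: 5, 3: 5}
hashed = hash_partition_loads(sample_counts, 4)
balanced = assignment_loads(
    sample_counts, greedy_balanced_assignment(sample_counts, 4), 4
)
print("hash partitioning:", hashed)
print("greedy assignment:", balanced)
```

Because the Reduce phase finishes only when its slowest task does, the maximum partition load is the quantity to watch here. In this toy input the greedy assignment lowers it, while the total amount of data stays the same.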
Data Availability
The datasets generated during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported by the Chongqing Science and Technology Commission Project (Grant No: cstc2018jcyjAX0525) and the Key Research and Development Projects of Sichuan Science and Technology Department (Grant No: 2019YFG0107).
Funding
This work was supported by the Chongqing Science and Technology Commission Project (Grant No: cstc2018jcyjAX0525; Recipient: Hongjian Li) and the Key Research and Development Projects of Sichuan Science and Technology Department (Grant No: 2019YFG0107; Recipient: Hongjian Li).
Author information
Contributions
Hongjian Li: Proposed an idea, conducted experiments, wrote the manuscript. Lisha Zhu: Proposed an idea, conducted experiments, wrote the manuscript. Shuaicheng Wang: Helped write several sections of the manuscript, proofreading. Lei Wang: Helped write several sections of the manuscript, proofreading.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, H., Zhu, L., Wang, S. et al. Cost-Aware Scheduling and Data Skew Alleviation for Big Data Processing in Heterogeneous Cloud Environment. J Grid Computing 21, 33 (2023). https://doi.org/10.1007/s10723-023-09661-2
DOI: https://doi.org/10.1007/s10723-023-09661-2