Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment

Li, Chunlin; Cai, Qianqian; Luo, Youlong

doi:10.1007/s11227-021-04000-2


Published: 02 August 2021

Volume 78, pages 3561–3604, (2022)

The Journal of Supercomputing



Abstract

Data shuffling and cache recovery are both essential parts of the Spark system, and they directly affect Spark's parallel computing performance. Existing dynamic partitioning schemes for handling data skew in the shuffle phase suffer from poor dynamic adaptability and insufficiently fine granularity. To address these problems, this paper proposes a dynamic balanced partitioning method for the shuffle phase based on reservoir sampling. The method mitigates the impact of data skew on Spark performance by sampling and preprocessing the intermediate data, predicting the overall data skew, and producing the overall partitioning strategy that the application executes. In addition, an inappropriate failure recovery strategy increases recovery overhead and makes the data recovery mechanism inefficient. To address this issue, this paper proposes a checkpoint-based fast recovery strategy for the RDD cache. The strategy analyzes the task execution mechanism of the in-memory computing framework, uses semantic analysis of the code to obtain detailed information about each task, and combines the failure recovery model with weight information to form a new failure recovery strategy that improves the efficiency of the data recovery mechanism. The experimental results show that the proposed dynamic balanced partitioning approach effectively reduces the total completion time of the application and improves Spark parallel computing performance, and that the proposed fast cache recovery strategy improves both the speed of data recovery and the computational rate of Spark.
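The sampling-then-partitioning idea in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' algorithm, whose details are not given on this page. The sketch assumes the classic reservoir sampling scheme (Algorithm R) for drawing a fixed-size uniform sample of intermediate keys, and it swaps in a simple greedy largest-first assignment as a stand-in for the paper's partitioning strategy. The function names `reservoir_sample` and `plan_partitions` are hypothetical.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Algorithm R: keep a uniform random sample of at most k items from a stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def plan_partitions(sampled_keys, num_partitions):
    """Assumed stand-in strategy: assign keys, most frequent first,
    to whichever partition currently has the smallest estimated load."""
    counts = Counter(sampled_keys)
    loads = [0] * num_partitions
    assignment = {}
    # Sort by descending sampled frequency; break ties by key text for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, count in ordered:
        target = min(range(num_partitions), key=lambda p: loads[p])
        assignment[key] = target
        loads[target] += count
    return assignment, loads


# Hypothetical skewed key stream: a few frequent keys plus a long tail.
rng = random.Random(42)
keys = ["a"] * 4000 + ["b"] * 3000 + ["c"] * 2000 + [f"k{i}" for i in range(1000)]
rng.shuffle(keys)

sample = reservoir_sample(keys, 500, rng)
assignment, loads = plan_partitions(sample, 4)
```

On a small deterministic input such as `["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"]` with two partitions, the greedy step places `a` and `d` on one partition and `b` and `c` on the other, giving loads of 5 and 5. A plain hash partitioner offers no such guarantee, because it ignores key frequencies.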




Acknowledgements

The work was supported by the National Natural Science Foundation of China (NSFC) under grant No. 61873341, the Key Research and Development Plan of Hubei Province (No. 2020BAB102), and the Open Foundation of Chongqing Key Laboratory of Industrial and Information Technology of Electric Vehicle Safety Evaluation. Any opinions, findings, and conclusions are those of the authors and do not necessarily reflect the views of the above agencies.

Author information

Authors and Affiliations

Chongqing Key Laboratory of Industrial and Information Technology of Electric Vehicle Safety Evaluation, China Merchants Testing Certification Vehicle Technology Research Institute Co., Ltd, Chongqing, People’s Republic of China
Chunlin Li
Department of Computer Science, Wuhan University of Technology, Wuhan, 430063, People’s Republic of China
Chunlin Li, Qianqian Cai & Youlong Luo


Corresponding author

Correspondence to Chunlin Li.


About this article

Cite this article

Li, C., Cai, Q. & Luo, Y. Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment. J Supercomput 78, 3561–3604 (2022). https://doi.org/10.1007/s11227-021-04000-2


Accepted: 15 July 2021
Published: 02 August 2021
Issue Date: February 2022
DOI: https://doi.org/10.1007/s11227-021-04000-2
