Abstract
Caching is one of the most important techniques in Apache Spark, a popular distributed big-data processing framework. Because this parallel computing framework is designed to support a variety of applications through in-memory computing, it cannot cache every intermediate result within the available memory. The arbitrariness of cache application programming interface (API) usage, the diversity of application characteristics, and the variability of memory resources all make it challenging to achieve high execution performance. An inefficient cache replacement strategy can cause performance problems such as long application execution times, low memory utilization, frequent replacement, and even program failure due to out-of-memory errors. Spark currently adopts the least recently used (LRU) cache replacement strategy. Although LRU is a classical and widely used algorithm, it takes no account of the execution environment or workload characteristics and therefore performs poorly in many scenarios. In this paper, we propose a novel cache replacement algorithm, least partition weight (LPW). LPW comprehensively considers the factors that affect system performance, including partition size, computational cost, and reference count. We implemented LPW in Spark and compared it against LRU as well as other state-of-the-art mechanisms. Our detailed experiments show that LPW clearly outperforms its counterparts and can reduce execution time by up to 75% under typical workloads. Furthermore, the reduced eviction frequency indicates that LPW makes more reasonable eviction decisions.
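The abstract describes LPW as weighting each cached partition by factors such as size, computational cost, and reference count, and evicting the partition with the least weight. The exact LPW weight formula is not stated here; the following is a minimal hedged sketch that assumes a hypothetical weight of (computational cost × remaining reference count) ÷ partition size, so that large partitions which are cheap to recompute and rarely referenced are evicted first. The class name `WeightedCache` and its interface are illustrative, not the paper's implementation.

```python
class WeightedCache:
    """Illustrative weight-based eviction in the spirit of LPW (not the paper's code)."""

    def __init__(self, capacity):
        self.capacity = capacity   # total memory budget, e.g., in bytes
        self.used = 0
        self.partitions = {}       # partition id -> (size, cost, refs)

    def weight(self, size, cost, refs):
        # Assumed formula: costly-to-recompute, frequently referenced,
        # small partitions get high weight and are kept in memory.
        return cost * refs / size

    def put(self, pid, size, cost, refs):
        # Evict the lowest-weight partitions until the new one fits.
        while self.used + size > self.capacity and self.partitions:
            victim = min(self.partitions,
                         key=lambda p: self.weight(*self.partitions[p]))
            self.used -= self.partitions.pop(victim)[0]
        if self.used + size <= self.capacity:
            self.partitions[pid] = (size, cost, refs)
            self.used += size
```

For example, with a 100-unit budget, caching a 50-unit partition with weight 0.4 and a 40-unit partition with weight 2.5, then inserting a 30-unit partition, would evict only the lowest-weight (0.4) partition, whereas plain LRU would evict whichever was least recently touched regardless of recomputation cost.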
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. U20A6003, 61802377, 61872340) and the Youth Innovation Promotion Association of the Chinese Academy of Sciences.
Cite this article
Li, H., Ji, S., Zhong, H. et al. LPW: an efficient data-aware cache replacement strategy for Apache Spark. Sci. China Inf. Sci. 66, 112104 (2023). https://doi.org/10.1007/s11432-021-3406-5