Abstract
Caching is one of the most important techniques in Apache Spark, a popular distributed big-data processing framework. Because this parallel computing framework is designed to support a variety of applications through in-memory computing, it cannot cache every intermediate result within the available memory. The arbitrariness of cache application programming interface (API) usage, the diversity of application characteristics, and the variability of memory resources all make it challenging to achieve high execution performance. An inefficient cache replacement strategy can cause performance problems such as long application execution times, low memory utilization, frequent replacement, and even program failure due to out-of-memory errors. Spark currently adopts the least recently used (LRU) cache replacement strategy. Although LRU is a classical and widely used algorithm, it takes no account of the execution environment or workload characteristics and therefore performs poorly in many scenarios. In this paper, we propose a novel cache replacement algorithm, least partition weight (LPW). LPW comprehensively considers the factors that affect system performance, including partition size, computational cost, and reference count. We implemented LPW in Spark and compared it against LRU as well as other state-of-the-art mechanisms. Our detailed experiments show that LPW clearly outperforms its counterparts and can reduce execution time by up to 75% under typical workloads. Furthermore, the reduced eviction frequency indicates that LPW makes more reasonable eviction decisions.
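The abstract describes LPW as weighting each cached partition by factors such as size, computational cost, and reference count, and evicting the partition with the least weight. The exact LPW weight formula is not stated here; the following is a minimal hedged sketch that assumes a hypothetical weight of (computational cost × remaining reference count) ÷ partition size, so that large partitions which are cheap to recompute and rarely referenced are evicted first. The class name `WeightedCache` and its interface are illustrative, not the paper's implementation.

```python
class WeightedCache:
    """Illustrative weight-based eviction in the spirit of LPW (not the paper's code)."""

    def __init__(self, capacity):
        self.capacity = capacity   # total memory budget, e.g., in bytes
        self.used = 0
        self.partitions = {}       # partition id -> (size, cost, refs)

    def weight(self, size, cost, refs):
        # Assumed formula: costly-to-recompute, frequently referenced,
        # small partitions get high weight and are kept in memory.
        return cost * refs / size

    def put(self, pid, size, cost, refs):
        # Evict the lowest-weight partitions until the new one fits.
        while self.used + size > self.capacity and self.partitions:
            victim = min(self.partitions,
                         key=lambda p: self.weight(*self.partitions[p]))
            self.used -= self.partitions.pop(victim)[0]
        if self.used + size <= self.capacity:
            self.partitions[pid] = (size, cost, refs)
            self.used += size
```

For example, with a 100-unit budget, caching a 50-unit partition with weight 0.4 and a 40-unit partition with weight 2.5, then inserting a 30-unit partition, would evict only the lowest-weight (0.4) partition, whereas plain LRU would evict whichever was least recently touched regardless of recomputation cost.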
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. U20A6003, 61802377, 61872340) and the Youth Innovation Promotion Association of the Chinese Academy of Sciences.
Cite this article
Li, H., Ji, S., Zhong, H. et al. LPW: an efficient data-aware cache replacement strategy for Apache Spark. Sci. China Inf. Sci. 66, 112104 (2023). https://doi.org/10.1007/s11432-021-3406-5