
LPW: an efficient data-aware cache replacement strategy for Apache Spark

  • Research Paper
  • Published in: Science China Information Sciences

Abstract

Caching is one of the most important techniques in Spark, the popular distributed big-data processing framework. Because Spark is designed to support a wide range of applications through in-memory computing, it cannot cache every intermediate result within limited memory. The arbitrary use of the cache application programming interface (API), the diversity of application characteristics, and the variability of memory resources all make it challenging to achieve high execution performance. An inefficient cache replacement strategy can cause long application execution times, low memory utilization, frequent replacements, and even program failures from out-of-memory errors. Spark currently adopts the least recently used (LRU) replacement strategy. Although LRU is a classical and widely used algorithm, it ignores the execution environment and workload characteristics, and therefore performs poorly in many scenarios. In this paper, we propose a novel cache replacement algorithm, least partition weight (LPW). LPW comprehensively considers the factors affecting system performance, including partition size, computational cost, and reference count. We implemented LPW in Spark and compared it against LRU as well as other state-of-the-art mechanisms. Detailed experiments show that LPW clearly outperforms its counterparts, reducing execution time by up to 75% under typical workloads. Moreover, the lower eviction frequency indicates that LPW makes more accurate caching decisions.



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. U20A6003, 61802377, 61872340) and the Youth Innovation Promotion Association of the Chinese Academy of Sciences.

Author information

Correspondence to Hua Zhong or Wei Wang.


About this article


Cite this article

Li, H., Ji, S., Zhong, H. et al. LPW: an efficient data-aware cache replacement strategy for Apache Spark. Sci. China Inf. Sci. 66, 112104 (2023). https://doi.org/10.1007/s11432-021-3406-5

