A Dependency-Aware Storage Schema Selection Mechanism for In-Memory Big Data Computing Frameworks

  • Bo Wang
  • Jie Tang
  • Rui Zhang
  • Wei Ding
  • Deyu Qi


Artificial intelligence applications that depend heavily on deep learning and computer vision processing have become popular. Their strong demand for low-latency or real-time services makes Spark, an in-memory big data computing framework, the best choice to replace earlier disk-based big data computing. For an in-memory framework, a reasonable arrangement of data in storage is the key factor in performance. However, existing optimizations based on cache replacement strategies and storage selection mechanisms all rely on an imprecise model of available memory, which can lead to poor decisions. To address this issue, we propose an available memory model that captures accurate information about the memory space that is about to be freed by sensing the dependencies between data. We also propose a maximum memory requirement model for execution prediction that excludes the redundancy from inactive blocks. With these two models, we build DASS, a dependency-aware storage selection mechanism for Spark that makes dynamic and fine-grained storage decisions. Our experiments show that, compared with previous methods, DASS effectively reduces the cost of garbage collection and of re-computing RDD blocks, improving computing performance by 77.4%.
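
As a rough illustration of the idea behind a dependency-aware available memory model, the sketch below counts as available both the memory that is currently free and the memory held by cached blocks that no pending task still depends on. The `Block` class, the `pending_dependents` field, the block names, and all sizes are hypothetical and are not taken from the paper; this is a minimal sketch under those assumptions, not the actual DASS model.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A cached block (hypothetical representation, for illustration only)."""
    name: str
    size_mb: int
    pending_dependents: int  # assumed count of not-yet-run tasks that still read this block


def available_memory(free_mb, cached_blocks):
    """Return free memory plus the memory of blocks that nothing pending depends on."""
    reclaimable = sum(b.size_mb for b in cached_blocks if b.pending_dependents == 0)
    return free_mb + reclaimable


def choose_storage(block_size_mb, free_mb, cached_blocks):
    """Pick 'memory' if the new block fits in the dependency-aware available space, else 'disk'."""
    if block_size_mb <= available_memory(free_mb, cached_blocks):
        return "memory"
    return "disk"


# Hypothetical cache state: two blocks have no remaining dependents, one is still needed.
cache = [
    Block("rdd_1_0", 200, 0),
    Block("rdd_2_0", 300, 2),
    Block("rdd_3_0", 100, 0),
]

print(available_memory(150, cache))     # 150 free + 200 + 100 reclaimable
print(choose_storage(400, 150, cache))
print(choose_storage(500, 150, cache))
```

A model that looked only at currently free memory (150 MB in this made-up state) would send the 400 MB block to disk, whereas counting the space about to be freed lets it stay in memory.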


Keywords: Big data; In-memory computing frameworks; Storage schema; Performance optimization



Jie Tang is the corresponding author of this paper. This work is supported by South China University of Technology Start-up Grant No. D61600470, Guangzhou Technology Grant No. 201707010148, the Fundamental Research Funds for the Central Universities Grant No. 2017MS057, and National Science Foundation of China under Grant No. 61370062.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Bo Wang (1)
  • Jie Tang (2), corresponding author
  • Rui Zhang (3)
  • Wei Ding (4)
  • Deyu Qi (2)

  1. Beijing Institute of Technology, Beijing, China
  2. South China University of Technology, Guangzhou, China
  3. Yan’an University, Yan’an, China
  4. Henan University of Technology, Zhengzhou, China
