Abstract
In-memory cluster computing systems based on MapReduce, such as Spark, have had a great impact on addressing a wide range of big data problems. However, because these systems rely heavily on memory to avoid the latency of disk I/O, some of their design choices can cause resource inefficiency on traditional high performance computing (HPC) systems. Hash-based shuffle, particularly at large scale, can significantly degrade job performance through excessive file operations and unreasonable use of memory. Moreover, intermediate data may unnecessarily spill to disk when memory usage is unevenly distributed or when memory runs out. Thus, in this study, Write Handle Reusing is proposed to fully utilize memory during shuffle file writing and reading; a Load Balancing Optimizer is introduced to distribute data processing evenly across all worker nodes; and a Memory-Aware Task Scheduler that coordinates concurrency level with memory usage is developed to prevent memory spilling. Experimental results on representative workloads demonstrate that the proposed approaches decrease overall job execution time and improve memory efficiency.
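The idea of coordinating concurrency level with memory usage can be illustrated with a minimal sketch. This is not the paper's implementation: the class `MemoryAwareScheduler`, the `memory_budget` parameter, and the per-task memory estimates are hypothetical names introduced here purely for illustration. The sketch admits tasks only while their estimated footprint fits the remaining budget, so concurrency shrinks under memory pressure instead of forcing spills to disk.

```python
# Hypothetical sketch of a memory-aware task scheduler (illustration only,
# not the authors' implementation): tasks run only while their estimated
# memory footprint fits the executor's budget, so the effective concurrency
# level adapts to memory pressure rather than over-committing and spilling.
from collections import deque

class MemoryAwareScheduler:
    def __init__(self, memory_budget):
        self.memory_budget = memory_budget  # bytes available for task buffers
        self.in_use = 0                     # bytes currently reserved
        self.waiting = deque()              # FIFO of (task_id, est_memory)
        self.running = []                   # task ids currently executing

    def submit(self, task_id, est_memory):
        """Queue a task with its estimated memory footprint."""
        self.waiting.append((task_id, est_memory))
        self._dispatch()

    def finish(self, task_id, est_memory):
        """Release a finished task's memory and try to launch waiting tasks."""
        self.running.remove(task_id)
        self.in_use -= est_memory
        self._dispatch()

    def _dispatch(self):
        # Launch queued tasks in order while they fit; otherwise hold them
        # back so memory is never over-committed.
        while self.waiting and self.in_use + self.waiting[0][1] <= self.memory_budget:
            task_id, mem = self.waiting.popleft()
            self.in_use += mem
            self.running.append(task_id)
```

For example, with a 100-unit budget, two 60-unit tasks would run one after the other rather than concurrently, trading a lower concurrency level for the avoidance of a spill.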
Acknowledgments
This paper is partly supported by the NSFC under grants No. 61433019 and No. 61370104, the International Science & Technology Cooperation Program of China under grant No. 2015DFE12860, and the Chinese Universities Scientific Fund under grant No. 2014TS008.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Pei, C., Shi, X., Jin, H. (2015). Improving the Memory Efficiency of In-Memory MapReduce Based HPC Systems. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science(), vol 9528. Springer, Cham. https://doi.org/10.1007/978-3-319-27119-4_12
Print ISBN: 978-3-319-27118-7
Online ISBN: 978-3-319-27119-4