Improving the Memory Efficiency of In-Memory MapReduce Based HPC Systems

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9528)

Abstract

In-memory cluster computing systems based on MapReduce, such as Spark, have had a great impact on a wide range of big data problems. Because these systems rely heavily on memory speed to avoid the latency of disk I/O operations, some of their process designs can cause resource inefficiency on traditional high performance computing (HPC) systems. Hash-based shuffle, particularly at large scale, can significantly degrade job performance through excessive file operations and unreasonable use of memory. Some intermediate data unnecessarily spill to disk when memory usage is unevenly distributed or when memory runs out. This study therefore proposes Write Handle Reusing to fully utilize memory in shuffle file writing and reading. A Load Balancing Optimizer is introduced to distribute data processing evenly across all worker nodes, and a Memory-Aware Task Scheduler that coordinates concurrency level and memory usage is developed to prevent memory spilling. Experimental results on representative workloads show that the proposed approaches decrease overall job execution time and improve memory efficiency.
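The idea of coordinating concurrency level with memory usage can be illustrated with a minimal sketch. It is not the paper's scheduler: the names `WorkerState` and `pick_concurrency`, the per-task memory estimate, and the selection rule itself are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    """Hypothetical snapshot of one worker node (not from the paper)."""
    free_memory_mb: int  # memory currently available for task execution
    cores: int           # upper bound on tasks that can run at once


def pick_concurrency(worker: WorkerState, est_task_memory_mb: int) -> int:
    """Choose how many tasks to run concurrently on a worker.

    Assumed rule: run as many tasks as fit in free memory, capped by the
    core count, so that the combined estimated footprint does not force
    intermediate data to spill to disk.
    """
    if est_task_memory_mb <= 0:
        raise ValueError("estimated task memory must be positive")
    fits_in_memory = worker.free_memory_mb // est_task_memory_mb
    # Always allow at least one task so the job can make progress.
    return max(1, min(worker.cores, fits_in_memory))


if __name__ == "__main__":
    worker = WorkerState(free_memory_mb=4096, cores=8)
    print(pick_concurrency(worker, est_task_memory_mb=1024))
```

Under these assumptions, a worker with 8 cores but only 4096 MB free would run 4 tasks of roughly 1024 MB each instead of 8, trading some parallelism for the avoidance of spilling.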

Acknowledgments

This paper is partly supported by the NSFC under grant No. 61433019 and No. 61370104, International Science & Technology Cooperation Program of China under grant No. 2015DFE12860, and Chinese Universities Scientific Fund under grant No. 2014TS008.

Author information

Corresponding author

Correspondence to Xuanhua Shi.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Pei, C., Shi, X., Jin, H. (2015). Improving the Memory Efficiency of In-Memory MapReduce Based HPC Systems. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9528. Springer, Cham. https://doi.org/10.1007/978-3-319-27119-4_12

  • DOI: https://doi.org/10.1007/978-3-319-27119-4_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27118-7

  • Online ISBN: 978-3-319-27119-4

  • eBook Packages: Computer Science, Computer Science (R0)
