Camel: collective-aware message logging


Abstract

The continuous progress in the performance of supercomputers has enabled the understanding of many fundamental problems in science. Simulation, the third pillar of science, constantly demands more powerful machines to run algorithms that would otherwise be unviable. That demand will inevitably lead to the deployment of an exascale machine during the next decade. However, fault tolerance is a major challenge that must be overcome to make such a machine usable. With an unprecedented number of components, machines at extreme scale will have a short mean time between failures. The checkpoint/restart mechanism popular in today’s machines may not be effective at that scale. One promising way to revamp checkpoint/restart is to use message-logging techniques. By storing messages during execution and replaying them after a failure, message logging can shorten recovery time and save a substantial amount of energy. The downside of message logging is that its memory footprint may grow to unsustainable levels. This paper presents a technique that decreases the memory pressure of message-logging protocols by storing only the messages that collective-communication operations actually need. We introduce Camel, a protocol with low memory overhead for multicast and reduction operations. Our results show that Camel can reduce the memory footprint of a molecular dynamics benchmark by more than 95 % on 16,384 cores.
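
The central claim above, that most of a collective's traffic need not be logged because it can be regenerated during replay, can be illustrated with a small sketch. The C++ fragment below is a minimal, assumption-laden illustration and not Camel's actual implementation or interface: the names (CollectiveAwareLog, on_multicast, on_reduction_contribution) are hypothetical, and keeping a single payload per collective instance is only one plausible reading of the abstract's statement that only the necessary messages of multicast and reduction operations are stored.

    // Hypothetical sketch of a collective-aware, sender-based message log.
    // Point-to-point messages are logged in full; for collectives, one payload
    // per collective instance stands in for every message the collective would
    // generate, which is where the memory saving comes from.
    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct LoggedMessage {
        uint64_t id;                 // sequence number or collective identifier
        std::vector<char> payload;   // copy of the message body
    };

    class CollectiveAwareLog {
    public:
        // A point-to-point send is stored verbatim so it can be replayed
        // to a restarted receiver in the original order.
        void on_send(uint64_t seq, const void* buf, std::size_t len) {
            p2p_log_.push_back({seq, copy(buf, len)});
        }

        // A multicast is keyed by its collective id: one stored payload is
        // enough to regenerate every outgoing replica during replay.
        void on_multicast(uint64_t collective_id, const void* buf, std::size_t len) {
            collective_log_.try_emplace(collective_id,
                                        LoggedMessage{collective_id, copy(buf, len)});
        }

        // For a reduction, intermediate messages inside the spanning tree can be
        // recomputed from the leaves' contributions, so only the local
        // contribution is retained (an assumption consistent with the abstract,
        // not a statement of the paper's exact protocol).
        void on_reduction_contribution(uint64_t collective_id,
                                       const void* buf, std::size_t len) {
            collective_log_.try_emplace(collective_id,
                                        LoggedMessage{collective_id, copy(buf, len)});
        }

        // Total bytes held by the log: the quantity Camel aims to shrink.
        std::size_t bytes_logged() const {
            std::size_t total = 0;
            for (const auto& m : p2p_log_) total += m.payload.size();
            for (const auto& kv : collective_log_) total += kv.second.payload.size();
            return total;
        }

    private:
        static std::vector<char> copy(const void* buf, std::size_t len) {
            const char* p = static_cast<const char*>(buf);
            return std::vector<char>(p, p + len);
        }

        std::vector<LoggedMessage> p2p_log_;
        std::unordered_map<uint64_t, LoggedMessage> collective_log_;
    };

In this sketch, recovery would replay the point-to-point log message by message and regenerate each multicast or reduction from its single stored payload; the memory saving comes from never duplicating a collective's payload across the spanning tree of messages it produces.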

Acknowledgments

This research was supported in part by the US Department of Energy under Grant DOE DE-SC0001845 and by a machine allocation on TeraGrid under award ASC050039N. This work also used machine resources from the PARTS project and a Director's discretionary allocation on Intrepid at ANL, for which the authors thank the ALCF and ANL staff.

Author information

Corresponding author

Correspondence to Esteban Meneses.

About this article

Cite this article

Meneses, E., Kalé, L.V. Camel: collective-aware message logging. J Supercomput 71, 2516–2538 (2015). https://doi.org/10.1007/s11227-015-1402-3

Keywords

  • Fault tolerance
  • Resilience
  • Message logging
  • Collective-communication operations