The continuous progress in supercomputer performance has made it possible to understand many fundamental problems in science. Simulation, the third pillar of science, constantly demands more powerful machines to run algorithms that would otherwise be unviable. That demand will inevitably lead to the deployment of an exascale machine within the next decade. However, fault tolerance is a major challenge that has to be overcome to make such a machine usable. With an unprecedented number of components, machines at extreme scale will have a short mean time between failures. The checkpoint/restart mechanism popular in today's machines may not be effective at that scale. One promising way to revamp checkpoint/restart is to use message-logging techniques. By storing messages during execution and replaying them in case of a failure, message logging can shorten recovery time and save a substantial amount of energy. The downside of message logging is that its memory footprint may grow to unsustainable levels. This paper presents a technique that decreases the memory pressure of message-logging protocols by storing only the necessary messages in collective-communication operations. We introduce Camel, a protocol with low memory overhead for multicast and reduction operations. Our results show that Camel can reduce the memory footprint of a molecular dynamics benchmark by more than 95% on 16,384 cores.
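The memory saving the abstract describes can be illustrated with a small, hypothetical sketch (not the paper's implementation): in sender-based message logging, a naive sender logs one copy of a multicast payload per destination, whereas a collective-aware log stores the payload once together with the destination set, while still being able to replay every message a failed process must receive. The `Sender` class and its method names below are illustrative assumptions.

```python
class Sender:
    """Toy sender-based message log contrasting naive and
    collective-aware handling of a multicast (illustrative only)."""

    def __init__(self):
        self.log = []             # naive log: one entry per point-to-point send
        self.collective_log = []  # collective-aware log: one entry per multicast

    def multicast_naive(self, dests, payload):
        # Payload is duplicated once per destination -> memory grows with fan-out.
        for d in dests:
            self.log.append((d, payload))

    def multicast_aware(self, dests, payload):
        # Payload stored once, with the destination set -> constant memory per multicast.
        self.collective_log.append((tuple(dests), payload))

    def replay_for(self, failed_rank):
        # On failure, re-derive the messages the failed process must receive.
        msgs = [p for d, p in self.log if d == failed_rank]
        msgs += [p for dests, p in self.collective_log if failed_rank in dests]
        return msgs


payload = b"x" * 1024              # 1 KiB multicast payload
naive, aware = Sender(), Sender()
naive.multicast_naive(range(1, 9), payload)   # 8 logged copies
aware.multicast_aware(range(1, 9), payload)   # 1 logged copy
```

Both logs replay the same message for a failed rank, but the collective-aware log keeps a single copy regardless of the multicast fan-out, which is the source of the footprint reduction claimed for Camel.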
This research was supported in part by the US Department of Energy under Grant DOE DE-SC0001845 and by a machine allocation on the TeraGrid under award ASC050039N. This work also used machine resources from the PARTS project and a Director's discretionary allocation on Intrepid at ANL, for which the authors thank the ALCF and ANL staff.
Cite this article
Meneses, E., Kalé, L.V. Camel: collective-aware message logging. J Supercomput 71, 2516–2538 (2015). https://doi.org/10.1007/s11227-015-1402-3
- Fault tolerance
- Message logging
- Collective-communication operations