Cluster Computing

Volume 17, Issue 2, pp 303–313

Reverse computation for rollback-based fault tolerance in large parallel systems

Evaluating the potential gains and systems effects
  • Kalyan S. Perumalla
  • Alfred J. Park


Reverse computation is presented here as an important future direction in addressing the challenge of fault-tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper, or more energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle (ideal gas) simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing, with measurements such as cache miss ratios, TLB misses, and memory usage, indicates that reverse computation is hard to ignore as a future alternative to be pursued on emerging architectures.
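To make the core idea concrete, the following is a minimal, hypothetical Python sketch (not the paper's actual simulation code): if every state update is chosen to be exactly invertible, a rollback can be performed by running the computation backwards, with no stored checkpoint of the bulk state. Here the "ideal gas" is reduced to integer particle positions advancing on a periodic 1-D box; the box length and particle values are illustrative assumptions.

```python
# Sketch of rollback via reverse computation instead of checkpoint restore.
# The update (x + v) mod BOX is exactly invertible, so the prior state can
# be reconstructed bit-for-bit without keeping a saved copy.

BOX = 1_000  # periodic box length (arbitrary choice for illustration)

def forward(xs, vs):
    """Advance every particle one step around the periodic box."""
    return [(x + v) % BOX for x, v in zip(xs, vs)], vs

def reverse(xs, vs):
    """Exactly undo forward() by subtracting the same displacement."""
    return [(x - v) % BOX for x, v in zip(xs, vs)], vs

xs, vs = [1, 2, 3], [5, -7, 11]
start = list(xs)            # what a checkpoint would have to store: O(state)
for _ in range(1000):       # forward execution
    xs, vs = forward(xs, vs)
for _ in range(1000):       # rollback: recompute backwards, store nothing
    xs, vs = reverse(xs, vs)
assert xs == start          # prior state recovered exactly
```

In a real optimistic or fault-tolerant run, only whatever small amount of information a step irreversibly destroys needs to be logged; the bulk of the state is reconstructed by inverse operations, which is what lets reverse computation avoid the memory footprint and persistent-store congestion of full checkpoints.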


Keywords: Checkpointing · Rollback · Reverse computation · Performance evaluation · Parallel systems · Fault tolerance



This paper has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the U.S. Department of Energy (DOE). Accordingly, the United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory supported by the Office of Science of the DOE.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, USA
