Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging?

  • Raghunath Rajachandrasekar
  • Xiangyong Ouyang
  • Xavier Besseron
  • Vilobh Meshram
  • Dhabaleswar K. Panda
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7156)


Given the ever-increasing size of supercomputers, fault resilience has become a necessity rather than an option. Checkpoint/restart protocols have been widely adopted as a practical way to provide reliability. However, traditional checkpointing mechanisms suffer from a heavy I/O bottleneck when dumping process snapshots to a shared filesystem. In this context, we study the benefits of data staging using a proposed hierarchical and modular data staging framework, which reduces the burden of checkpointing on client nodes without penalizing their performance. During a checkpointing operation in this framework, the compute nodes transmit their process snapshots to a set of dedicated staging I/O servers through a high-throughput RDMA-based data pipeline. Conventional checkpointing mechanisms block an application until the checkpoint data has been written to a shared filesystem. In contrast, we allow the application to resume execution as soon as the snapshots have been pipelined to the staging I/O servers, while the data is simultaneously moved from these servers to a backend shared filesystem. This framework eases the bottleneck caused by simultaneous writes from multiple clients to the underlying storage subsystem. The staging framework considered in this study reduces the time penalty an application pays to save a checkpoint by a factor of 8.3.
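The difference between the blocking and the staged checkpoint paths described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `StagingServer`, `stage`, `flush`, and `blocking_checkpoint` are hypothetical, an in-memory queue stands in for the RDMA pipeline, a dictionary stands in for the shared filesystem, and a short sleep simulates the cost of a backend write.

```python
import queue
import threading
import time


class StagingServer:
    """Toy stand-in for a staging I/O server (hypothetical, not the paper's code)."""

    def __init__(self, backend, backend_delay=0.01):
        self.backend = backend              # dict standing in for the shared filesystem
        self.backend_delay = backend_delay  # simulated cost of one backend write
        self._pending = queue.Queue()       # staging buffer (stands in for the RDMA pipeline)
        self._drainer = threading.Thread(target=self._drain, daemon=True)
        self._drainer.start()

    def stage(self, name, snapshot):
        """Accept a snapshot into the staging buffer and return immediately."""
        self._pending.put((name, bytes(snapshot)))

    def _drain(self):
        """Background loop: move staged snapshots to the slow backend."""
        while True:
            name, data = self._pending.get()
            time.sleep(self.backend_delay)
            self.backend[name] = data
            self._pending.task_done()

    def flush(self):
        """Block until every staged snapshot has reached the backend."""
        self._pending.join()


def blocking_checkpoint(backend, name, snapshot, backend_delay=0.01):
    """Conventional path: the caller waits for the backend write to finish."""
    time.sleep(backend_delay)
    backend[name] = bytes(snapshot)


# Staged path: each "compute node" hands off its snapshot and resumes at once.
staged_backend = {}
server = StagingServer(staged_backend)
for rank in range(4):
    server.stage(f"rank{rank}.ckpt", b"state" * 10)
# ... the application would continue computing here ...
server.flush()  # only needed before relying on the backend copy

# Blocking path: the caller pays the backend cost for every snapshot.
blocking_backend = {}
for rank in range(4):
    blocking_checkpoint(blocking_backend, f"rank{rank}.ckpt", b"state" * 10)
```

In this sketch both paths leave identical data in the backend. The only difference is where the waiting happens: in the caller for the blocking path, and in the background drainer thread for the staged path.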


Keywords: checkpoint-restart, data staging, aggregation, RDMA





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Raghunath Rajachandrasekar (1)
  • Xiangyong Ouyang (1)
  • Xavier Besseron (1)
  • Vilobh Meshram (1)
  • Dhabaleswar K. Panda (1)

  1. Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University, USA
