International Journal of Parallel Programming, Volume 41, Issue 6, pp 782–805

Compiler-Assisted Checkpointing of Parallel Codes: The Cetus and LLVM Experience

  • Gabriel Rodríguez
  • María J. Martín
  • Patricia González
  • Juan Touriño
  • Ramón Doallo

Abstract

With the evolution of high-performance computing, parallel applications increasingly require fault tolerance, most commonly provided by checkpoint and restart techniques. Checkpointing tools are typically implemented at one of two abstraction levels: the system level or the application level. The latter has become an attractive alternative because of its flexibility and its ability to operate in different environments. However, application-level checkpointing tools often require the user to insert checkpoints manually to ensure that certain requirements are met (e.g., forcing checkpoints to be taken in user code and not inside kernel routines). This paper examines the transformations required to enable automatic checkpointing of parallel applications in the CPPC application-level checkpointing framework. These transformations have been implemented on two very different compiler infrastructures: Cetus and LLVM. Cetus is a Java-based compiler infrastructure that aims to provide a clean, easy-to-use IR and API for program transformation. LLVM is a low-level, SSA-based toolchain. The fundamental differences between the two approaches are analyzed from the structural, behavioral, and performance perspectives.
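
As an illustration of what manual, application-level checkpointing involves, the following is a minimal single-process sketch. It is not CPPC's API and does not reflect the transformations described in the paper: the names `save_checkpoint`, `load_checkpoint`, and `run`, the JSON file format, and the `fail_at` parameter used to simulate a crash are all assumptions made for this example, and no message passing is modeled.

```python
import json
import os


def save_checkpoint(path, state):
    """Write the application state to `path` via a temporary file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, n_iters, ckpt_every, fail_at=None):
    """Sum 0..n_iters-1, checkpointing every `ckpt_every` iterations.

    `fail_at` (hypothetical, for demonstration only) raises an error at the
    start of that iteration to simulate a failure.
    """
    # On restart, resume from the last checkpoint instead of from scratch.
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n_iters:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        # Checkpoint placed by hand in user code, at a point where the
        # state (loop index and running total) is consistent.
        if state["i"] % ckpt_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` once with `fail_at` set and then again without it resumes from the last saved iteration rather than from zero. Choosing where the `save_checkpoint` call may safely appear is the step that the abstract says users often have to perform by hand.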

Keywords

Fault tolerance, Checkpointing, Parallel programming, Message-passing, Compiler support, Cetus, LLVM

Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • Gabriel Rodríguez (1)
  • María J. Martín (1)
  • Patricia González (1)
  • Juan Touriño (1)
  • Ramón Doallo (1)

  1. Computer Architecture Group, Department of Electronics and Systems, University of A Coruña, A Coruña, Spain
