
Compiler-Assisted Checkpointing of Parallel Codes: The Cetus and LLVM Experience

International Journal of Parallel Programming

Abstract

With the evolution of high-performance computing, parallel applications increasingly require fault tolerance, most commonly provided by checkpoint and restart techniques. Checkpointing tools are typically implemented at one of two abstraction levels: the system level or the application level. The latter has become an attractive alternative because of its flexibility and its ability to operate in different environments. However, application-level checkpointing tools often require the user to insert checkpoints manually to ensure that certain requirements are met (e.g., forcing checkpoints to be taken in user code and not inside kernel routines). This paper examines the transformations required to enable automatic checkpointing of parallel applications in the CPPC application-level checkpointing framework. These transformations have been implemented on two very different compiler infrastructures: Cetus and LLVM. Cetus is a Java-based compiler infrastructure that aims to provide a clean, easy-to-use IR and API for program transformation. LLVM is a low-level, SSA-based toolchain. The fundamental differences between the two approaches are analyzed from the structural, behavioral, and performance perspectives.
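To make the idea of a manually inserted, application-level checkpoint concrete, the following is a minimal single-process sketch in Python. It is not the CPPC API and does not reflect the transformations implemented in Cetus or LLVM; the function `run`, its parameters, and the pickle-based state file are all hypothetical, chosen only to show a checkpoint placed in user code at a loop boundary, together with the matching restart path.

```python
import os
import pickle
import tempfile


def run(n_steps, ckpt_path, ckpt_every=10, fail_at=None):
    """Sum 0..n_steps-1, checkpointing at loop boundaries in user code.

    All names here are illustrative assumptions, not part of any real
    checkpointing framework.
    """
    # Restart path: recover the saved state if a checkpoint file exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"i": 0, "total": 0}

    while state["i"] < n_steps:
        # Simulated crash: in-memory progress since the last checkpoint is lost.
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")

        state["total"] += state["i"]
        state["i"] += 1

        # Manually placed checkpoint: taken between iterations, where the
        # application state is small and well defined.
        if state["i"] % ckpt_every == 0:
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            # Rename over the old file so a crash never leaves a torn checkpoint.
            os.replace(tmp_path, ckpt_path)

    return state["total"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "state.ckpt")
    try:
        run(50, path, fail_at=25)
    except RuntimeError:
        pass  # the first run dies partway through
    print(run(50, path))  # the second run resumes from the last checkpoint
```

The placement of the checkpoint call is the part a user would otherwise have to choose by hand, and it is this kind of decision that automatic insertion aims to take over.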





Acknowledgments

This research was supported by the Galician Government (Project 10PXIB105180PR and Consolidation of Competitive Research Groups, Xunta de Galicia ref. 2010/6) and by the Ministry of Science and Innovation of Spain (Project TIN2010-16735).

Author information


Corresponding author

Correspondence to Gabriel Rodríguez.


About this article

Cite this article

Rodríguez, G., Martín, M.J., González, P. et al. Compiler-Assisted Checkpointing of Parallel Codes: The Cetus and LLVM Experience. Int J Parallel Prog 41, 782–805 (2013). https://doi.org/10.1007/s10766-012-0231-8


  • DOI: https://doi.org/10.1007/s10766-012-0231-8
