Compiler-Assisted Type-Safe Checkpointing

  • Conference paper
High Performance Computing (ISC High Performance 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12321)

Abstract

TyCart is a tool for type-safe checkpoint/restart that extends the memory allocation sanitizer TypeART with type asserts. Type asserts let the developer specify type requirements on memory regions; in our example implementation, they are used to build a type-safe interface for the existing checkpoint libraries FTI and VeloC. We evaluate our approach on a set of mini-apps and an application from astrophysics. On the smaller benchmarks, runtime and memory overhead stay below 5%. In the astrophysics application, the runtime overhead reaches 30% and the memory overhead 70%.
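
The idea of a type assert guarding a checkpoint registration can be illustrated with a small toy model. The sketch below is not TyCart's, TypeART's, FTI's, or VeloC's API: the names `AllocationTracker`, `assert_type`, and `protect` are hypothetical, and Python's `ctypes` arrays merely stand in for tracked C allocations. It assumes a runtime that records the element type and count of each allocation and refuses to register a region whose recorded type disagrees with what the developer states.

```python
import ctypes


class TypeMismatch(Exception):
    """Raised when a memory region does not satisfy a type assert."""


class AllocationTracker:
    """Hypothetical stand-in for a runtime that records allocation types."""

    def __init__(self):
        # base address -> (element type, element count)
        self._records = {}

    def track(self, buf):
        # ctypes arrays expose their element type via _type_ and their length via len()
        self._records[ctypes.addressof(buf)] = (buf._type_, len(buf))
        return buf

    def assert_type(self, buf, expected_type, expected_count):
        record = self._records.get(ctypes.addressof(buf))
        if record is None:
            raise TypeMismatch("untracked memory region")
        actual_type, actual_count = record
        if actual_type is not expected_type:
            raise TypeMismatch(
                f"expected {expected_type.__name__}, found {actual_type.__name__}"
            )
        if actual_count != expected_count:
            raise TypeMismatch(
                f"expected {expected_count} elements, found {actual_count}"
            )


def protect(tracker, registry, var_id, buf, expected_type, count):
    """Register buf for checkpointing only if the type assert passes.

    `registry` is a plain dict playing the role of a checkpoint library's
    list of protected regions (hypothetical, for illustration only).
    """
    tracker.assert_type(buf, expected_type, count)
    registry[var_id] = (ctypes.addressof(buf), count * ctypes.sizeof(expected_type))
```

In this model, calling `protect` on a tracked array of eight `c_double` values with the matching type and count records the region and its size in bytes, while stating `c_int` for the same buffer raises `TypeMismatch` and leaves the registry unchanged. The check happens at registration time, so a mismatch surfaces before any checkpoint is written.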

Acknowledgments

We thank Christian Drischler for providing the eos-mbpt application and for the much-appreciated discussions. This work was funded by the Hessian LOEWE initiative within the Software-Factory 4.0 project. Calculations for this research were conducted on the Lichtenberg high-performance computer of the TU Darmstadt. Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 265191195 – SFB 1194.

Corresponding author

Correspondence to Jan-Patrick Lehr.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Lehr, JP., Hück, A., Fischer, M., Bischof, C. (2020). Compiler-Assisted Type-Safe Checkpointing. In: Jagode, H., Anzt, H., Juckeland, G., Ltaief, H. (eds) High Performance Computing. ISC High Performance 2020. Lecture Notes in Computer Science, vol 12321. Springer, Cham. https://doi.org/10.1007/978-3-030-59851-8_1

  • DOI: https://doi.org/10.1007/978-3-030-59851-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59850-1

  • Online ISBN: 978-3-030-59851-8

  • eBook Packages: Computer Science, Computer Science (R0)
