Abstract
TyCart is a tool for type-safe checkpoint/restart that extends the memory allocation sanitizer TypeART with type asserts. Type asserts let the developer specify type requirements on memory regions; in our example implementation, they are used to build a type-safe interface for the existing checkpoint libraries FTI and VeloC. We evaluate our approach on a set of mini-apps and an application from astrophysics. In the smaller benchmarks, runtime and memory overhead stay below 5%. In the astrophysics application, the runtime overhead reaches 30% and the memory overhead 70%.
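To illustrate the idea of a type assert, the following is a minimal C++ sketch and not TyCart's or TypeART's actual interface. It assumes a hypothetical AllocationRegistry that stands in for the allocation tracking a sanitizer would provide, and a hypothetical helper checked_region_bytes that a checkpoint wrapper might call before handing an untyped pointer and element count to a checkpoint library. All names are invented for illustration.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <typeindex>
#include <typeinfo>
#include <vector>

// Hypothetical record of what was allocated at a base address:
// the element type and the number of elements.
struct AllocationInfo {
  std::type_index type;
  std::size_t count;
};

// Hypothetical stand-in for sanitizer-provided allocation tracking.
class AllocationRegistry {
 public:
  // Remember that `count` elements of type T start at `base`.
  template <typename T>
  void record(const T* base, std::size_t count) {
    const void* key = static_cast<const void*>(base);
    table_.erase(key);
    table_.emplace(key, AllocationInfo{std::type_index(typeid(T)), count});
  }

  // Type assert: the caller claims the untyped region at `base` holds
  // exactly `count` elements of type T. Throws if the claim does not
  // match what was recorded at allocation time.
  template <typename T>
  void assert_type(const void* base, std::size_t count) const {
    auto it = table_.find(base);
    if (it == table_.end()) {
      throw std::runtime_error("type assert failed: unknown memory region");
    }
    if (it->second.type != std::type_index(typeid(T))) {
      throw std::runtime_error("type assert failed: element type mismatch");
    }
    if (it->second.count != count) {
      throw std::runtime_error("type assert failed: element count mismatch");
    }
  }

 private:
  std::map<const void*, AllocationInfo> table_;
};

// Hypothetical checkpoint-side helper: validate the claim first, then
// return the number of bytes that would be passed on to a checkpoint library.
template <typename T>
std::size_t checked_region_bytes(const AllocationRegistry& reg,
                                 const void* base, std::size_t count) {
  reg.assert_type<T>(base, count);
  return count * sizeof(T);
}
```

In this sketch, registering a buffer of eight double values and then claiming it holds float values, or sixteen elements, raises an error before any data would be written. That is the kind of mismatch a type-safe checkpoint interface is meant to catch.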
Acknowledgments
We thank Christian Drischler for providing the eos-mbpt application and for the helpful discussions. This work was funded by the Hessian LOEWE initiative within the Software-Factory 4.0 project. Calculations for this research were conducted on the Lichtenberg high-performance computer of the TU Darmstadt. Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 265191195 – SFB 1194.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Lehr, JP., Hück, A., Fischer, M., Bischof, C. (2020). Compiler-Assisted Type-Safe Checkpointing. In: Jagode, H., Anzt, H., Juckeland, G., Ltaief, H. (eds) High Performance Computing. ISC High Performance 2020. Lecture Notes in Computer Science, vol 12321. Springer, Cham. https://doi.org/10.1007/978-3-030-59851-8_1
DOI: https://doi.org/10.1007/978-3-030-59851-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59850-1
Online ISBN: 978-3-030-59851-8
eBook Packages: Computer Science, Computer Science (R0)