Abstract
Checkpoint/restart is the ability to save the state of a running application so that it can later resume execution from the point at which the checkpoint was taken. The technique has many potential applications, including building fault-tolerant environments, improving system resource utilization, and migrating processes. As hardware speeds and cluster sizes have grown, the average time between failures has fallen, so fault tolerance and the ability to checkpoint a process have become essential. Almost all platforms deployed for high-performance computing support process checkpoint/restart. Linux, although one of the most popular operating systems, does not provide a general-purpose implementation. Existing implementations are limited to a specific type of parallel programming library, confined to certain well-behaved applications, or reliant on specific kernel features that are often missing. Most implementations also require recompiling the whole kernel to apply the necessary patches. In this paper, we describe the design and implementation of a checkpoint/restart system for multithreaded processes on Linux that can be extended dynamically to increase compatibility and reduce system overhead. It does not require any special facility in the operating system, and it can checkpoint and restart an application transparently, regardless of the application's behavior. The entire system is implemented as a set of loadable kernel modules, which makes it easy to use and removes the burden of complex system administration.
Cite this article
Zarrabi, A., Samsudin, K. & Wan Adnan, W.A. Linux Support for Fast Transparent General Purpose Checkpoint/Restart of Multithreaded Processes in Loadable Kernel Module. J Grid Computing 11, 187–210 (2013). https://doi.org/10.1007/s10723-013-9248-5
DOI: https://doi.org/10.1007/s10723-013-9248-5