Linux Support for Fast Transparent General Purpose Checkpoint/Restart of Multithreaded Processes in Loadable Kernel Module

Journal of Grid Computing

Abstract

Checkpoint/restart is the ability to save the state of a running application so that it can later resume execution from the point of the checkpoint. The technique has many potential applications, including establishing a fault-tolerant environment, improving system resource utilization, and true process migration. As hardware speeds and cluster sizes have increased, the average time between failures has decreased. Fault tolerance and the ability to checkpoint a process have therefore become essential. Almost all platforms deployed for high-performance computing support process checkpoint/restart. Linux, one of the most popular operating systems, does not provide a general-purpose implementation. Existing implementations are limited to a specific type of parallel programming library, confined to particular well-behaved types of applications, or reliant on specific kernel features that are often missing. Most implementations also require recompiling the whole kernel to apply the required patches. In this paper, we describe the design and implementation of a multithreaded process checkpoint/restart system for Linux that can be extended dynamically to increase compatibility and reduce system overhead. It does not require any special facility in the operating system, and it can checkpoint and restart an application fully transparently, regardless of the application's behavior. The entire system is implemented as multiple loadable kernel modules, which makes it easy to use and eliminates the burden of complex system administration.



Author information


Corresponding author

Correspondence to Amirreza Zarrabi.


About this article

Cite this article

Zarrabi, A., Samsudin, K. & Wan Adnan, W.A. Linux Support for Fast Transparent General Purpose Checkpoint/Restart of Multithreaded Processes in Loadable Kernel Module. J Grid Computing 11, 187–210 (2013). https://doi.org/10.1007/s10723-013-9248-5


  • DOI: https://doi.org/10.1007/s10723-013-9248-5
