Recovery oriented programming: runtime monitoring of safety and liveness

  • Olga Brukman
  • Shlomi Dolev
Regular Paper


We introduce the recovery-oriented programming paradigm. Programs designed according to this paradigm include, as an integral part, the important safety and liveness properties that the program should respect, together with the recovery actions to be executed upon a violation of these properties. We design a pre-compiler that compiles the properties and recovery actions into a code snippet that monitors the properties and enforces the recovery actions upon a property violation. Assuming that the given program is restartable and that a self-stabilizing software platform exists, the compiled program is able to recover from safety and liveness violations. We provide a generic correctness proof scheme for recovery-oriented programs, proving that the code, as transformed by the pre-compiler, converges to a legal execution within a finite number of steps after experiencing an arbitrary failure.
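The following Python sketch is a rough illustration of the monitoring idea described in the abstract, not the paper's pre-compiler output or API. It wraps the step function of a restartable program with a monitor that checks a safety predicate in every observed state. Because a runtime monitor cannot directly observe "eventually", liveness is approximated here by requiring observable progress within a bounded number of steps. Either violation triggers the recovery action, which in this sketch is a restart to a legal initial state. The names `RecoveryMonitor`, `max_idle_steps`, and `run`, together with the toy counter program and its "stalled" flag, are all hypothetical and were chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical toy program state: (counter value, "stalled" flag).
State = Tuple[int, bool]
N = 4  # counter modulus (assumption for the example)


@dataclass
class RecoveryMonitor:
    """Illustrative safety/liveness monitor; not the paper's generated code."""
    safety: Callable[[State], bool]           # must hold in every observed state
    progress: Callable[[State, State], bool]  # did this step make progress?
    recover: Callable[[], State]              # recovery action: restart
    max_idle_steps: int                       # bound used to approximate liveness
    idle: int = 0
    recoveries: int = 0

    def _restart(self) -> State:
        self.idle = 0
        self.recoveries += 1
        return self.recover()

    def step(self, state: State, transition: Callable[[State], State]) -> State:
        # Safety: a violated invariant triggers recovery immediately.
        if not self.safety(state):
            return self._restart()
        nxt = transition(state)
        # Liveness (bounded approximation): progress must occur
        # within max_idle_steps consecutive steps.
        if self.progress(state, nxt):
            self.idle = 0
        else:
            self.idle += 1
            if self.idle > self.max_idle_steps:
                return self._restart()
        return nxt


def transition(s: State) -> State:
    # Toy program: count modulo N, but hang forever if the stalled flag is set.
    x, stalled = s
    return s if stalled else ((x + 1) % N, False)


def run(start: State, steps: int, max_idle_steps: int = 2) -> Tuple[List[State], int]:
    m = RecoveryMonitor(
        safety=lambda s: 0 <= s[0] < N,
        progress=lambda before, after: after[0] != before[0],
        recover=lambda: (0, False),
        max_idle_steps=max_idle_steps,
    )
    trace = [start]
    for _ in range(steps):
        trace.append(m.step(trace[-1], transition))
    return trace, m.recoveries
```

Starting from a corrupted state such as `(-7, False)` (a safety violation) or `(3, True)` (a hang, which is a liveness violation), `run` reaches a legal state within at most `max_idle_steps + 1` steps and then continues counting normally. This mirrors, in miniature, the convergence property stated in the abstract. Note that the sketch assumes the monitor's own counters start at zero. The paper instead relies on a self-stabilizing platform to cope with corruption of the monitoring code itself.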


Keywords: Automatic recovery · Safety · Liveness · Pre-compiler · Self-stabilization





Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  1. Ben-Gurion University of the Negev, Beer-Sheva, Israel
