Cluster Computing, Volume 11, Issue 3, pp 247–257

Self healing in System-S

  • Gabriela Jacques-Silva
  • Jim Challenger
  • Lou Degenaro
  • James Giles
  • Rohit Wagle


Faults in a cluster are inevitable. The larger the cluster, the more likely the occurrence of some failure in hardware, in software, or by human error. System-S software must detect and self-repair failures while carrying out its prime directive—enabling stream processing program fragments to be distributed and connected to form complex applications. Depending on the type of failure, System-S may be able to continue with little or no disruption to potentially tens of thousands of interdependent and heterogeneous program fragments running across thousands of nodes.

We extend the work we previously presented on the self healing nature of the job manager component in System-S by presenting how it can handle failures of other system components, applications and network infrastructure. We also evaluate the recoverability of the job management orchestrator component of System-S, considering crash failures with and without error propagation.


Fault-tolerance, Stream processing systems, Distributed recovery





Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Gabriela Jacques-Silva (1)
  • Jim Challenger (2)
  • Lou Degenaro (2)
  • James Giles (2)
  • Rohit Wagle (2)

  1. Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, Urbana, USA
  2. IBM T.J. Watson Research Center, IBM Research, Hawthorne, USA
