An Architectural Framework for Detecting Process Hangs/Crashes

  • Nithin Nakka
  • Giacinto Paolo Saggese
  • Zbigniew Kalbarczyk
  • Ravishankar K. Iyer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3463)

Abstract

This paper addresses the challenges faced in the practical implementation of heartbeat-based process crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques integrated into the main pipeline of a superscalar processor are presented: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs where the process exists but is not executing any instructions, (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops, and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: 1) operating-system-neutral detection techniques, 2) elimination of any instrumentation for detection of all application crashes and OS hangs, and 3) an automated and lightweight compile-time instrumentation methodology to detect all process hangs (including infinite loops), with the detection performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system/process crashes and hangs in distributed systems. Evaluation of the hang detection techniques shows a low 1.6% performance overhead and a 6% memory overhead for the instrumentation. The crash detection technique does not incur any performance overhead and has a latency of a few instructions.
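
As a rough illustration of the idea behind ICH, the sketch below models in software a monitor that flags a process when its retired-instruction count stops advancing. It is not the hardware design described in the paper: the class name `InstructionCountHeartbeat`, the `timeout_cycles` threshold, and the per-cycle `tick` interface are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class InstructionCountHeartbeat:
    """Toy software model of a counter-based liveness check (hypothetical)."""

    timeout_cycles: int   # assumed threshold: cycles allowed with no progress
    last_count: int = 0   # retired-instruction count seen at the last progress
    idle_cycles: int = 0  # cycles elapsed since the count last advanced

    def tick(self, retired_count: int) -> bool:
        """Model one clock cycle; return True if a crash/hang is flagged."""
        if retired_count != self.last_count:
            # The process retired at least one instruction: it is alive.
            self.last_count = retired_count
            self.idle_cycles = 0
            return False
        # No instruction retired during this cycle.
        self.idle_cycles += 1
        return self.idle_cycles >= self.timeout_cycles


if __name__ == "__main__":
    monitor = InstructionCountHeartbeat(timeout_cycles=3)
    # The count advances for two cycles, then stalls at 2.
    for count in [1, 2, 2, 2, 2]:
        flagged = monitor.tick(count)
    print("hang flagged:", flagged)
```

A check of this kind only covers the case where no instructions execute at all; a process spinning in a loop keeps retiring instructions, which is why the abstract lists separate detectors (ILHD and SCHD) for loop-related hangs.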

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Nithin Nakka (1)
  • Giacinto Paolo Saggese (1)
  • Zbigniew Kalbarczyk (1)
  • Ravishankar K. Iyer (1)
  1. Center for Reliable and High Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, USA
