A framework for adding real-time distributed software fault detection and isolation to SNMP-based systems management

Journal of Network and Systems Management

Abstract

We consider the problem of fault detection and isolation in systems that consist of real-time distributed cooperating processes. A framework for adding fault detection and isolation capabilities to SNMP-based distributed management systems is presented. The framework revolves around the use of a formal specification model of the cooperating processes which we refer to as “local directed graphs”. We describe the local directed graph model, and a fault monitoring and isolation architecture that implements the framework. In doing so, we address the problem of the size of our formal description and also show that this architecture is suited to the management of internetworks. Lastly, we present an example illustrating the operation of the architecture.


Author information

The work of the first author was done while he was at Fairleigh Dickinson University, Madison, New Jersey, and Polytechnic University, Brooklyn, New York.

Bellcore, Red Bank, New Jersey. The work of the second author was done while he was at Polytechnic University, Brooklyn, New York.

About this article

Cite this article

Gambhir, D., Post, M. & Frisch, I. A framework for adding real-time distributed software fault detection and isolation to SNMP-based systems management. J Netw Syst Manage 2, 257–282 (1994). https://doi.org/10.1007/BF02139365

  • DOI: https://doi.org/10.1007/BF02139365
