Abstract
Limiting the extent of error propagation when faults occur and localising the subsequent error recovery are crucial elements in the design of fault-tolerant parallel processing systems. Both activities are made easier if the designer associates fault-tolerance mechanisms with the underlying communications of the system. With this in mind, this paper investigates a design methodology for such systems that concentrates on the modelling and analysis of interprocess communications, providing a better match between the needs of the fault-tolerance mechanisms and the communication structures themselves.
Copyright information
© 1996 IFIP International Federation for Information Processing
Cite this chapter
Tyrrell, A.M. (1996). Communications are Everything: A Design Methodology for Fault-Tolerant Concurrent Systems. In: Jelly, I., Gorton, I., Croll, P. (eds) Software Engineering for Parallel and Distributed Systems. IFIP Advances in Information and Communication Technology. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-34984-8_4
DOI: https://doi.org/10.1007/978-0-387-34984-8_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-5041-2948-0
Online ISBN: 978-0-387-34984-8
eBook Packages: Springer Book Archive