Optimal Design of Checks for Error Detection and Location in Fault Tolerant Multiprocessor Systems

Sitaraman, Ramesh; Jha, Niraj K.

doi:10.1007/978-3-642-76930-6_33

Ramesh Sitaraman³ &
Niraj K. Jha⁴

Part of the book series: Informatik-Fachberichte (INFORMATIK, volume 283)

Abstract

Designing checks to detect or locate errors in data is a central problem in fault tolerance. Our checks are assumed to be of the simplest kind, i.e. a check can operate without any restriction on any non-empty subset of the set of data elements and can reliably detect up to one error in this subset. In this paper, we show how to design the data-check (DC) relationship. For the first time, we give a general procedure for designing checks to locate s errors, for any given value of s. We also consider the problem of designing checks to detect s errors in the data, and give the first optimal construction for this problem. The procedures for designing the checks are simple and novel. These constructions can also be modified to produce uniform checks, i.e. checks which are identical and check the same number of data elements. We give procedures for obtaining such checks as well.

Recently, the problem of designing the DC relationship has attracted a lot of attention because of the important role it plays in the design of algorithm-based fault tolerant (ABFT) systems. In this paper, we illustrate the above problem in this context. ABFT schemes have been shown to be a natural paradigm for concurrent error detection/location in multiprocessor systems and systolic array computations. Banerjee and Abraham have shown that an ABFT scheme can be modeled as a tripartite graph consisting of processors (P), data (D) and checks (C). Our constructions can be used along with any general technique for designing fault tolerant PDC graphs, e.g. for designing unit systems [NA89] or ud-systems [VJ91].
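
The check model described in the abstract can be made concrete with a small brute-force test. The sketch below is illustrative only and is not the construction from the paper: it represents a DC relationship as a list of data-element subsets (one per check) and assumes the worst case, namely that a check is guaranteed to fire only when exactly one of the data elements it covers is erroneous. The function names and the example check sets are hypothetical.

```python
from itertools import combinations


def detects(checks, errors):
    """Return True if some check covers exactly one erroneous data element.

    Assumption: a check covering two or more errors may fail to fire,
    so only checks with exactly one error in their subset are counted.
    """
    return any(len(check & errors) == 1 for check in checks)


def is_s_error_detecting(num_data, checks, s):
    """Brute-force test that every non-empty error set of size <= s is detected."""
    for size in range(1, s + 1):
        for errors in combinations(range(num_data), size):
            if not detects(checks, frozenset(errors)):
                return False
    return True


# Hypothetical DC relationships on 4 data elements.
singleton_checks = [frozenset({d}) for d in range(4)]  # one check per element
one_big_check = [frozenset(range(4))]                  # a single check over all elements

print(is_s_error_detecting(4, singleton_checks, 4))
print(is_s_error_detecting(4, one_big_check, 1))
print(is_s_error_detecting(4, one_big_check, 2))
```

Under this assumption, one check per data element detects any number of errors but uses as many checks as there are data elements, while a single check over all elements handles one error only. The design problem addressed in the paper lies between these extremes: using as few checks as possible for a given s.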

This work was supported by DARPA/ONR under Contract no. N00014-88-K-0459.

This work was supported in part by ONR under Contract no. N00014-91-J-1199 and in part by AFOSR under Contract no. AFOSR-90-0144.

References

J. A. Abraham et al., “Fault tolerance techniques for systolic arrays,” IEEE Computer, pp. 65–74, July 1987.
P. Banerjee et al., “An evaluation of system-level fault tolerance on the Intel hypercube multiprocessor,” in Proc. Int. Symp. Fault Tolerant Comput., Tokyo, pp. 362–367, June 1988.
P. Banerjee and J. A. Abraham, “Bounds on algorithm-based fault tolerance in multiple processor systems,” IEEE Trans. Comput., vol. C-35, pp. 296–306, Apr. 1986.
P. Banerjee and J. A. Abraham, “A probabilistic model of algorithm-based fault tolerance in array processors for real-time systems,” in Proc. Real-Time Systems Symp., pp. 72–78, 1986.
Y-H. Choi and M. Malek, “A fault tolerant FFT processor,” IEEE Trans. Comput., vol. 37, no. 5, pp. 617–621, May 1988.
Y-H. Choi and M. Malek, “A fault tolerant systolic sorter,” IEEE Trans. Comput., vol. 37, no. 5, pp. 621–624, May 1988.
D. Gu, D. J. Rosenkrantz, and S. S. Ravi, “Design and analysis of test schemes for algorithm-based fault tolerance,” in Proc. Int. Symp. Fault Tolerant Comput., pp. 106–113, Newcastle-upon-Tyne, U.K., June 1990.
K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for matrix operations,” IEEE Trans. Comput., vol. C-33, pp. 518–528, June 1984.
J.-Y. Jou and J. A. Abraham, “Fault tolerant matrix arithmetic and signal processing on highly concurrent computing structures,” Proc. IEEE, vol. 74, no. 5, pp. 732–741, May 1986.
J.-Y. Jou and J. A. Abraham, “Fault tolerant FFT networks,” IEEE Trans. Comput., vol. 37, no. 5, pp. 548–561, May 1988.
F. T. Luk and H. Park, “An analysis of algorithm-based fault tolerance techniques,” in Proc. SPIE Adv. Alg. & Arch. for Signal Proc., vol. 696, pp. 222–228, Aug. 1986.
V. S. S. Nair and J. A. Abraham, “A model for the analysis of fault tolerant signal processing architectures,” in Proc. 32nd Int. Tech. Symp. of SPIE, San Diego, pp. 246–257, Aug. 1988.
V. S. S. Nair and J. A. Abraham, “A model for the analysis, design and comparison of fault-tolerant WSI architectures,” in Proc. Workshop on Wafer Scale Integration, Como, Italy, June 1989.
V. S. S. Nair and J. A. Abraham, “Hierarchical design and analysis of fault-tolerant multiprocessor systems using concurrent error detection,” in Int. Symp. Fault Tolerant Comput., Newcastle-upon-Tyne, U.K., pp. 130–137, June 1990.
A. L. N. Reddy and P. Banerjee, “Algorithm-based fault detection for signal processing applications,” IEEE Trans. Comput., vol. 39, pp. 1304–1308, Oct. 1990.
D. J. Rosenkrantz and S. S. Ravi, “Improved upper bounds for algorithm-based fault tolerance,” in Proc. 26th Allerton Conf. Comm. Cont. & Comput., Allerton, IL, pp. 388–397, Sept. 1988.
B. Vinnakota and N. K. Jha, “Diagnosability and diagnosis of algorithm-based fault tolerant systems,” in Proc. 32nd Midwest Symp. Circuits & Systems, Urbana, IL, pp. 28–31, Aug. 1989.
B. Vinnakota and N. K. Jha, “A dependence graph-based approach to the design of algorithm-based fault tolerant systems,” in Proc. Int. Symp. Fault Tolerant Comput., pp. 122–129, Newcastle-upon-Tyne, U.K., June 1990.
B. Vinnakota and N. K. Jha, “Design of multiprocessor systems for concurrent error detection and fault diagnosis,” in Proc. Int. Symp. Fault Tolerant Comput., Montreal, June 1991.

Author information

Authors and Affiliations

Dept. of Computer Science, Princeton Univ., Princeton, NJ, 08544, USA
Ramesh Sitaraman
Dept. of Electrical Engg., Princeton Univ., Princeton, NJ, 08544, USA
Niraj K. Jha

Editor information

Editors and Affiliations

Institut für Mathematische Maschinen und Datenverarbeitung III (Rechnerstrukturen), Universität Erlangen-Nürnberg, Martensstr. 3, W-8520, Erlangen, Germany
Mario Dal Cin & Wolfgang Hohl

About this paper

Cite this paper

Sitaraman, R., Jha, N.K. (1991). Optimal Design of Checks for Error Detection and Location in Fault Tolerant Multiprocessor Systems. In: Dal Cin, M., Hohl, W. (eds) Fault-Tolerant Computing Systems. Informatik-Fachberichte, vol 283. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-76930-6_33

DOI: https://doi.org/10.1007/978-3-642-76930-6_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-54545-3
Online ISBN: 978-3-642-76930-6
eBook Packages: Springer Book Archive