Abstract
As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique that is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software, whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level and application-level fault tolerance works best. In many systems, application-level fault tolerance can bridge the gap when system-level fault tolerance alone does not provide the required reliability. We illustrate this with the RTHT target tracking benchmark and the ABF beamforming benchmark.
Cite this article
Haines, J., Lakamraju, V., Koren, I. et al. Application-Level Fault Tolerance as a Complement to System-Level Fault Tolerance. The Journal of Supercomputing 16, 53–68 (2000). https://doi.org/10.1023/A:1008181429693
DOI: https://doi.org/10.1023/A:1008181429693