Abstract
This paper presents how and why to unify load sharing and fault tolerance facilities. A realization of a fault tolerant load sharing facility, GatoStar, is presented and discussed. It integrates two applications developed on top of Unix: Gatos and Star. Gatos is a load sharing manager that automatically distributes parallel applications among heterogeneous hosts according to multicriteria allocation algorithms. Star is a software fault tolerance manager that automatically recovers the processes of faulty machines using checkpointing and message logging. The main advantage of this approach is to improve fault tolerance performance by applying the load sharing policies when allocating or recovering processes. This unification not only improves the efficiency of both facilities but also avoids many redundant mechanisms between them. Indeed, each facility needs to manage at least three common features: global knowledge of the running processors, a crash detection mechanism, and remote process management. The backbone of this unification is a logical communication ring used for host crash detection and for the transfer of host-related information. Thus, all necessary information is acquired at a relatively low message cost compared to the two systems taken independently.
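The logical ring described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not GatoStar's actual protocol: the `Host` class, the `circulate` and `pick_host` functions, the single scalar load index, and the least-loaded selection rule are all hypothetical, and a crash is modeled as a flag rather than a message timeout.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    load: float          # hypothetical load index, e.g. run-queue length
    alive: bool = True
    view: dict = field(default_factory=dict)  # host name -> last known load


def circulate(ring):
    """Pass one token around the ring, in ring order.

    The token collects the load of every live host it visits. A host that
    does not answer is recorded as crashed and skipped, so the ring closes
    around it. Returns (loads carried by the token, names of crashed hosts).
    """
    token = {}
    crashed = []
    for host in ring:
        if not host.alive:
            # Assumption: a real system would detect this with a timeout.
            crashed.append(host.name)
            continue
        token[host.name] = host.load
    # Second lap: every live host copies the completed token into its view.
    for host in ring:
        if host.alive:
            host.view = dict(token)
    return token, crashed


def pick_host(view):
    """Choose the least loaded known host, for allocation or for recovery."""
    return min(view, key=view.get)


ring = [Host("a", 0.8), Host("b", 0.1), Host("c", 0.5)]
loads, crashed = circulate(ring)
target = pick_host(ring[0].view)            # allocation target

ring[1].alive = False                       # host "b" crashes
loads_after, crashed_after = circulate(ring)
recovery_target = pick_host(ring[0].view)   # where to restart b's processes
```

In this sketch the same token lap serves both purposes: it detects the crashed host and refreshes the load information used to choose where processes are allocated or recovered, which is the kind of sharing between the two facilities that the abstract argues for.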
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Folliot, B., Sens, P. (1994). GatoStar: A fault tolerant load sharing facility for parallel applications. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_159
DOI: https://doi.org/10.1007/3-540-58426-9_159
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58426-1
Online ISBN: 978-3-540-48785-2
eBook Packages: Springer Book Archive