GatoStar: A fault tolerant load sharing facility for parallel applications

Folliot, Bertil; Sens, Pierre

doi:10.1007/3-540-58426-9_159

Bertil Folliot¹,² & Pierre Sens¹

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 852)

Included in the following conference series:

European Dependable Computing Conference

Abstract

This paper explains how and why to unify load sharing and fault tolerance facilities. It presents and discusses GatoStar, an implementation of a fault-tolerant load sharing facility. GatoStar integrates two applications developed on top of Unix: Gatos and Star. Gatos is a load sharing manager that automatically distributes parallel applications among heterogeneous hosts according to multicriteria allocation algorithms. Star is a software fault tolerance manager that automatically recovers the processes of faulty machines using checkpointing and message logging. The main advantage of this approach is improved fault tolerance performance, because the load sharing policies are applied when allocating or recovering processes. The unification not only improves the efficiency of both facilities but also removes many mechanisms that would otherwise be duplicated between them. Indeed, each facility needs to manage at least three common features: global knowledge of the running processors, a crash detection mechanism, and remote process management. The backbone of the unification is a logical communication ring used both for host crash detection and for transferring host-related information. All necessary information is thus acquired at a relatively low message cost compared with running the two systems independently.
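
The idea of one logical ring serving both crash detection and load information transfer can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the GatoStar implementation: the names `Host`, `Ring`, `circulate`, and `recover` are hypothetical, the load index is a single number, a crash is modeled as a flag rather than a missed message, and checkpointing and message logging are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    load: float                      # hypothetical load index (e.g. run-queue length)
    alive: bool = True               # a crash is modeled by setting this to False
    processes: list = field(default_factory=list)


class Ring:
    """Toy logical ring: one trip both detects crashed hosts and collects load."""

    def __init__(self, hosts):
        self.hosts = list(hosts)     # ring order; the last host's successor is the first
        self.load_table = {}         # host name -> load seen during the last trip

    def circulate(self):
        """Simulate one trip around the ring and return the hosts found crashed."""
        crashed = [h for h in self.hosts if not h.alive]
        # Ring reconfiguration: the survivors close the gap left by crashed hosts.
        self.hosts = [h for h in self.hosts if h.alive]
        # The same trip carries load information, so no extra messages are needed.
        self.load_table = {h.name: h.load for h in self.hosts}
        return crashed

    def least_loaded(self):
        """Pick the surviving host with the smallest known load."""
        return min(self.hosts, key=lambda h: self.load_table[h.name])

    def recover(self, crashed):
        """Place each process of a crashed host on the least-loaded survivor."""
        placement = {}
        for dead in crashed:
            for proc in dead.processes:
                target = self.least_loaded()
                target.processes.append(proc)
                target.load += 1.0   # assumed cost of hosting one more process
                self.load_table[target.name] = target.load
                placement[proc] = target.name
            dead.processes = []
        return placement
```

A caller would mark a host as crashed, call `circulate()` to obtain the crashed hosts and a fresh load table, then pass the result to `recover()`. The point of the sketch is that recovery reuses the allocation policy and the load data already gathered on the ring instead of maintaining a separate view of the system.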

Author information

Authors and Affiliations

MASI Lab/CNRS 818, IBP, University Paris VI, 4 place Jussieu, 75252, Paris Cedex 05, France
Bertil Folliot & Pierre Sens
UFR d'Informatique, University Paris VII, Paris, France
Bertil Folliot

Editor information

Klaus Echtle, Dieter Hammer, David Powell

Cite this paper

Folliot, B., Sens, P. (1994). GatoStar: A fault tolerant load sharing facility for parallel applications. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_159

DOI: https://doi.org/10.1007/3-540-58426-9_159
Published: 07 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58426-1
Online ISBN: 978-3-540-48785-2
eBook Packages: Springer Book Archive
