
GatoStar: A fault tolerant load sharing facility for parallel applications

  • Session 13: Distributed systems
  • Conference paper
Dependable Computing — EDCC-1 (EDCC 1994)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 852)

Abstract

This paper explains why and how to unify load sharing and fault tolerance facilities. It presents and discusses GatoStar, a realization of a fault tolerant load sharing facility. GatoStar integrates two applications developed on top of Unix: Gatos and Star. Gatos is a load sharing manager that automatically distributes parallel applications among heterogeneous hosts according to multicriteria allocation algorithms. Star is a software fault tolerance manager that automatically recovers the processes of faulty machines using checkpointing and message logging. The main advantage of this approach is improved fault tolerance performance, because the load sharing policies are applied when allocating or recovering processes. This unification not only improves the efficiency of both facilities but also avoids many redundant mechanisms between them. Indeed, each facility needs to manage at least three common features: global knowledge of the running processors, a crash detection mechanism, and remote process management. The backbone of this unification is a logical communication ring used for host crash detection and for transferring host-related information. Thus, all necessary information is acquired at a relatively low message cost compared with running the two systems independently.
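
The abstract does not specify the ring protocol, so the following Python sketch is only an illustration of the general idea: one message travels around a logical ring, collects host load information, and notes hosts that do not respond. The names `Host`, `circulate`, and `pick_target`, the single numeric `load` metric, and the least-loaded selection rule are all assumptions made for this example, not GatoStar's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    load: float                 # hypothetical scalar load metric
    alive: bool = True          # simulated host state
    view: dict = field(default_factory=dict)  # this host's global knowledge


def circulate(ring, start=0):
    """Simulate one message travelling once around the logical ring.

    Each live host adds its load to the message; a host that does not
    respond is recorded as crashed and skipped.
    Returns (loads, crashed).
    """
    n = len(ring)
    loads, crashed = {}, []
    for step in range(n):
        host = ring[(start + step) % n]
        if host.alive:
            loads[host.name] = host.load
        else:
            crashed.append(host.name)
    # After the round, every surviving host holds the same global view.
    for host in ring:
        if host.alive:
            host.view = dict(loads)
    return loads, crashed


def pick_target(loads):
    """Assumed policy: allocate or recover a process on the least loaded host."""
    return min(loads, key=loads.get)


ring = [Host("a", 0.9), Host("b", 0.2), Host("c", 0.5, alive=False), Host("d", 0.4)]
loads, crashed = circulate(ring)
target = pick_target(loads)
```

In this sketch the same round serves both purposes named in the abstract: crash detection and distribution of host information, so a recovery decision can reuse the load data without extra messages.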

Editor information

Klaus Echtle Dieter Hammer David Powell

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Folliot, B., Sens, P. (1994). GatoStar: A fault tolerant load sharing facility for parallel applications. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_159

  • DOI: https://doi.org/10.1007/3-540-58426-9_159

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-58426-1

  • Online ISBN: 978-3-540-48785-2

  • eBook Packages: Springer Book Archive
