Dynamic Resource Management in a Cluster for High-Availability
To execute high-performance applications on a cluster, it is highly desirable to provide distributed services that globally manage the physical resources spread across the cluster nodes. However, because a distributed service may use resources located on different nodes, it is sensitive to changes in the cluster configuration caused by node addition, reboot, or failure. In this paper, we propose a generic service that performs dynamic resource management in a cluster in order to provide distributed services with high availability. This service has been implemented in the Gobelins cluster operating system. It makes node addition and reboot nearly transparent to all distributed services of Gobelins and, as a consequence, fully transparent to applications. In the event of a node failure, applications using resources located on the failed node must be restarted from a previously saved checkpoint, but the availability of the cluster operating system is guaranteed, provided that its distributed services implement reconfiguration features.
Keywords: Adaptation Layer, Node Failure, Cluster Node, Node Addition, Directory Entry