Vigne: Towards a Self-healing Grid Operating System

  • Louis Rilling
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4128)


We consider building a Grid Operating System to relieve users and programmers of the burden of dealing with the highly distributed and volatile resources of computational grids. To tolerate the volatility of the nodes, the system should be self-healing, that is, it should continuously adapt to additions, removals, and failures of nodes. We present the self-healing architecture of the Vigne Grid Operating System through three of its services: system membership, application management, and volatile data management. The experimental results obtained show that our approach is feasible.


Application Manager, Overlay Network, Distributed Hash Table, Access Request, High Level Service
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Louis Rilling (1)
  1. Brittany site – PARIS research group, IRISA/Université de Rennes 1/ENS Cachan
