A decentralized fault tolerance model based on level of performance for grid environment


Computational grids have the potential to solve large-scale scientific problems using heterogeneous and geographically distributed resources. At this scale, computer resource and network failures are no longer exceptions but part of normal system behavior. Therefore, one of the most valuable characteristics of grid tools, apart from the performance they can achieve, is fault tolerance, which is a significant and complex issue in grid computing systems. In this paper, we propose a fault tolerance model for grid computing systems, named DCFT. This model is based on dynamic colored graphs and does not replicate computer resources. The proposed fault tolerance model consists of two stages. In the first stage, each node is described by a state vector, and we assign each attribute of the state vector one of three colors (green, blue or red) according to its level of performance. In the second stage, we classify the nodes of a grid into three categories: computer resources that are identical in terms of performance, the more efficient ones and the less efficient ones. We use the colors of the nodes to develop a new fault tolerance strategy based on the level of performance. The proposed model is simulated using the SimGrid simulator and Graphstream. Experimental results show that the proposed model performs very well in a large grid environment.
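The two stages described in the abstract can be illustrated with a minimal sketch. It is not the DCFT implementation: the abstract does not give the attribute names, the thresholds, the meaning of each color, or the rule used to compare nodes. The sketch assumes that green, blue and red stand for high, medium and low performance, that attribute scores are normalized to [0, 1], and that nodes are compared with a reference node through the mean of their state vector. All names (`color_attribute`, `Node`, `classify`, `HIGH`, `LOW`, `tol`) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Color(Enum):
    GREEN = "green"  # assumed: high level of performance
    BLUE = "blue"    # assumed: medium level of performance
    RED = "red"      # assumed: low level of performance


# Hypothetical thresholds on a normalized [0, 1] score; not taken from the paper.
HIGH = 0.66
LOW = 0.33


def color_attribute(score: float) -> Color:
    """Stage 1: map one normalized attribute score to a color."""
    if score >= HIGH:
        return Color.GREEN
    if score >= LOW:
        return Color.BLUE
    return Color.RED


@dataclass
class Node:
    name: str
    state: Dict[str, float]  # state vector; attribute names are illustrative

    def colored_state(self) -> Dict[str, Color]:
        """Return the color of each attribute of the state vector."""
        return {attr: color_attribute(v) for attr, v in self.state.items()}

    def level(self) -> float:
        """Assumed aggregate: mean of the attribute scores."""
        return sum(self.state.values()) / len(self.state)


def classify(reference: Node, others: List[Node], tol: float = 0.05) -> Dict[str, List[str]]:
    """Stage 2: group nodes as identical, more efficient or less efficient
    relative to a reference node (the comparison rule is an assumption)."""
    groups: Dict[str, List[str]] = {
        "identical": [],
        "more_efficient": [],
        "less_efficient": [],
    }
    ref_level = reference.level()
    for node in others:
        diff = node.level() - ref_level
        if abs(diff) <= tol:
            groups["identical"].append(node.name)
        elif diff > 0:
            groups["more_efficient"].append(node.name)
        else:
            groups["less_efficient"].append(node.name)
    return groups


if __name__ == "__main__":
    ref = Node("n0", {"cpu": 0.5, "memory": 0.5})
    grid = [
        Node("n1", {"cpu": 0.5, "memory": 0.52}),
        Node("n2", {"cpu": 0.9, "memory": 0.8}),
        Node("n3", {"cpu": 0.2, "memory": 0.1}),
    ]
    print(ref.colored_state())
    print(classify(ref, grid))
```

Under these assumptions, a fault tolerance strategy could pick a replacement for a failed node from the "identical" group first and fall back to the "more efficient" group; the actual selection policy is defined in the full paper, not in this sketch.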

Author information



Corresponding author

Correspondence to Mohammed Rebbah.

Cite this article

Rebbah, M., Slimani, Y., Benyettou, A. et al. A decentralized fault tolerance model based on level of performance for grid environment. Cluster Comput 19, 13–27 (2016). https://doi.org/10.1007/s10586-015-0497-x

Keywords

  • Grid computing
  • Fault tolerance
  • Dynamic colored graph
  • Performances