Cluster Computing, Volume 19, Issue 1, pp 13–27

A decentralized fault tolerance model based on level of performance for grid environment

  • Mohammed Rebbah (email author)
  • Yahya Slimani
  • Abdelkader Benyettou
  • Lionel Brunie


Computational grids have the potential to solve large-scale scientific problems using heterogeneous and geographically distributed resources. At this scale, failures of computer resources and networks are no longer exceptions but part of normal system behavior. Therefore, one of the most valuable characteristics of grid tools, apart from the performance they can achieve, is fault tolerance, which is a significant and complex issue in grid computing systems. In this paper, we propose a fault tolerance model for grid computing systems, named DCFT. This model is based on dynamic colored graphs and does not replicate computer resources. The proposed fault tolerance model consists of two stages. In the first stage, each node is described by a state vector, and we assign each attribute of the state vector one of three colors (green, blue or red) according to its level of performance. In the second stage, we classify the nodes of the grid into three categories: computer resources that are identical in terms of performance, those that are more efficient, and those that are less efficient. We use the colors of the nodes to develop a new fault tolerance strategy based on the level of performance. The proposed model is simulated using the SimGrid simulator and Graphstream. Experimental results show that the proposed model performs very well in a large grid environment.
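The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' DCFT implementation: the abstract does not say which color denotes which performance level, how the levels are delimited, or how attribute colors are combined when nodes are compared. The thresholds `HIGH` and `LOW`, the reading of green as best and red as worst, the attribute names, and the rank-sum comparison in `classify` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Color(Enum):
    GREEN = "green"
    BLUE = "blue"
    RED = "red"


# Assumed thresholds on a normalized [0, 1] performance score.
# The abstract does not define them, nor which color means which level.
HIGH, LOW = 0.66, 0.33

# Assumed ordering: green is the best level, red the worst.
RANK = {Color.RED: 0, Color.BLUE: 1, Color.GREEN: 2}


def color_attribute(score: float) -> Color:
    """Stage 1: color one attribute of the state vector by its level."""
    if score >= HIGH:
        return Color.GREEN
    if score >= LOW:
        return Color.BLUE
    return Color.RED


@dataclass
class Node:
    name: str
    state: Dict[str, float]  # state vector: attribute name -> normalized score

    def colors(self) -> Dict[str, Color]:
        return {attr: color_attribute(s) for attr, s in self.state.items()}

    def level(self) -> int:
        # Assumed aggregation: sum of the color ranks of all attributes.
        return sum(RANK[c] for c in self.colors().values())


def classify(reference: Node, candidate: Node) -> str:
    """Stage 2: place a candidate in one of three categories
    relative to a reference node (for example, a failed one)."""
    diff = candidate.level() - reference.level()
    if diff == 0:
        return "identical"
    return "more efficient" if diff > 0 else "less efficient"


failed = Node("n0", {"cpu": 0.5, "memory": 0.5, "bandwidth": 0.5})
n1 = Node("n1", {"cpu": 0.9, "memory": 0.7, "bandwidth": 0.5})
n2 = Node("n2", {"cpu": 0.4, "memory": 0.6, "bandwidth": 0.35})
n3 = Node("n3", {"cpu": 0.1, "memory": 0.5, "bandwidth": 0.5})

categories = {n.name: classify(failed, n) for n in (n1, n2, n3)}
```

With these assumed values, `n1` is classified as more efficient than the failed node, `n2` as identical, and `n3` as less efficient. A recovery strategy could then prefer nodes of the identical or more efficient category as substitutes, which is one possible reading of "fault tolerance based on the level of performance".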


Grid computing · Fault tolerance · Dynamic colored graph · Performances



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Mohammed Rebbah (1), email author
  • Yahya Slimani (2)
  • Abdelkader Benyettou (3)
  • Lionel Brunie (4)

  1. Faculty of Mathematics and Informatics, Computer Science Department, University of Sciences and Technology, Mohamed Boudiaf USTO-MB, Oran, Algeria
  2. Computer Science Department, ISAMM Institute of Manouba, Oued Ellil, Tunisia
  3. University of Sciences and Technology of Oran - Mohammed BOUDIAF, Oran, Algeria
  4. INSA Lyon, LIRIS Laboratory, UMR5205, Villeurbanne Cedex, France
