Modeling Resubmission in Unreliable Grids: The Bottom-Up Approach

  • Vandy Berten
  • Emmanuel Jeannot
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6043)


Failure is an ordinary characteristic of large-scale distributed environments. Resubmission is a general strategy employed to cope with failures in grids. Here, we analytically and experimentally study resubmission in the case of random brokering (each job is dispatched to a computing element with a probability proportional to that element's computing power). We compare two cases: jobs are resubmitted either to the broker or to the computing element. Results show that resubmitting to the broker is the better strategy. Our approach differs from most existing trace-based ones in that it is bottom-up: we start from a simple model of a grid and derive its characteristics.
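The dispatching rule and the two resubmission cases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model: the function names, the per-element failure probabilities (`fail_probs`), the assumption that each attempt fails independently, and the `max_attempts` cap are all hypothetical choices made for illustration only.

```python
import random


def pick_computing_element(powers, rng):
    """Return the index of a computing element, chosen with probability
    proportional to its computing power (random brokering)."""
    return rng.choices(range(len(powers)), weights=powers, k=1)[0]


def dispatch_counts(powers, n_jobs, seed=0):
    """Dispatch n_jobs through the random broker and count jobs per element."""
    rng = random.Random(seed)
    counts = [0] * len(powers)
    for _ in range(n_jobs):
        counts[pick_computing_element(powers, rng)] += 1
    return counts


def attempts_until_success(powers, fail_probs, strategy, rng, max_attempts=1000):
    """Count submissions until a job succeeds (capped at max_attempts).

    Assumption (not from the paper): every attempt on element i fails
    independently with probability fail_probs[i].
    strategy == "global": a failed job goes back to the broker.
    strategy == "local":  a failed job is retried on the same element.
    """
    ce = pick_computing_element(powers, rng)
    for attempt in range(1, max_attempts + 1):
        if rng.random() >= fail_probs[ce]:
            return attempt  # the job completed on this attempt
        if strategy == "global":
            ce = pick_computing_element(powers, rng)  # broker picks again
        # "local": keep the same computing element
    return max_attempts


if __name__ == "__main__":
    powers = [1.0, 2.0, 4.0]       # hypothetical computing powers
    fail_probs = [0.5, 0.2, 0.05]  # hypothetical failure probabilities
    print(dispatch_counts(powers, 7000))
    rng = random.Random(1)
    for strategy in ("global", "local"):
        runs = [attempts_until_success(powers, fail_probs, strategy, rng)
                for _ in range(10000)]
        print(strategy, sum(runs) / len(runs))
```

The sketch only counts submissions per job; it ignores queueing and job durations, so it should not be read as reproducing the paper's analytical or experimental results.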


Keywords (added by machine, not by the authors): Failure Probability; Global Strategy; Local Strategy; Error Threshold; Probability Factor


Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Vandy Berten (1)
  • Emmanuel Jeannot (2)
  1. Université Libre de Bruxelles
  2. INRIA, LORIA
