
Optimization of Jobs Submission on the EGEE Production Grid: Modeling Faults Using Workload

Journal of Grid Computing

Abstract

It is commonly observed that production Grids are inherently unreliable. The aim of this work is to improve Grid application performance by tuning the job submission system. A stochastic model capturing the behavior of a complex Grid workload management system is proposed. To instantiate the model, detailed statistics are extracted from dense Grid activity traces. The model is then used to optimize a simple job resubmission strategy. It provides quantitative inputs for improving job submission performance and makes it possible to quantify the impact of faults and outliers on Grid operations.
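To make the idea of optimizing a resubmission strategy concrete, the following is a minimal sketch, not the model proposed in the article. It assumes one simple strategy: a job is cancelled and resubmitted if it has not started within a fixed timeout, and successive submissions are independent. Under that assumption the expected total latency for a timeout t is E[min(L, t)] / P(L ≤ t), where L is the latency of a single submission. The function names and the sample latencies are hypothetical; faulty or outlier jobs are represented as infinite latencies.

```python
import math


def expected_latency(latencies, timeout):
    """Expected total latency of a cancel-and-resubmit strategy.

    latencies: observed latencies of single submissions, in seconds;
               math.inf marks a job that faulted or never started.
    timeout:   delay after which a pending job is cancelled and resubmitted.
    Assumes independent, identically distributed submissions.
    """
    done = [x for x in latencies if x <= timeout]
    if not done:
        # No observed job starts within the timeout: the strategy never succeeds.
        return math.inf
    timed_out = len(latencies) - len(done)
    # E[min(L, timeout)] / P(L <= timeout), using empirical frequencies.
    return (sum(done) + timed_out * timeout) / len(done)


def best_timeout(latencies):
    """Timeout minimizing the empirical expected latency.

    Between two consecutive observed latencies the expectation grows
    linearly with the timeout, so only observed finite values need testing.
    """
    candidates = sorted(x for x in latencies if math.isfinite(x))
    if not candidates:
        raise ValueError("no finite latency observed")
    return min(candidates, key=lambda t: expected_latency(latencies, t))


# Hypothetical trace: five completed jobs and one fault (infinite latency).
sample = [100, 120, 150, 200, 900, math.inf]
t_opt = best_timeout(sample)
print(t_opt, expected_latency(sample, t_opt))
```

In this toy trace, waiting for the slow 900 s job costs more on average than cancelling at 200 s and resubmitting, which is how faults and outliers shift the preferred timeout.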




Correspondence to Diane Lingrand.


Lingrand, D., Montagnat, J., Martyniak, J. et al. Optimization of Jobs Submission on the EGEE Production Grid: Modeling Faults Using Workload. J Grid Computing 8, 305–321 (2010). https://doi.org/10.1007/s10723-010-9151-2


  • DOI: https://doi.org/10.1007/s10723-010-9151-2
