Journal of Grid Computing, Volume 8, Issue 2, pp 305–321

Optimization of Jobs Submission on the EGEE Production Grid: Modeling Faults Using Workload

  • Diane Lingrand
  • Johan Montagnat
  • Janusz Martyniak
  • David Colling

Abstract

It is commonly observed that production Grids are inherently unreliable. The aim of this work is to improve Grid application performance by tuning the job submission system. A stochastic model capturing the behavior of a complex Grid workload management system is proposed. To instantiate the model, detailed statistics are extracted from dense Grid activity traces. The model is then used to optimize a simple job resubmission strategy. It provides quantitative inputs for improving job submission performance and makes it possible to quantify the impact of faults and outliers on Grid operations.

Keywords

Production Grid monitoring; Submission strategy optimization

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  • Diane Lingrand (1)
  • Johan Montagnat (1)
  • Janusz Martyniak (2)
  • David Colling (2)
  1. University of Nice-Sophia Antipolis/CNRS, Nice, France
  2. The Blackett Lab, Imperial College London, London, UK
