Optimization of Jobs Submission on the EGEE Production Grid: Modeling Faults Using Workload

Lingrand, Diane; Montagnat, Johan; Martyniak, Janusz; Colling, David

doi:10.1007/s10723-010-9151-2

Published: 23 March 2010

Volume 8, pages 305–321 (2010)

Journal of Grid Computing

Abstract

It is commonly observed that production Grids are inherently unreliable. This work aims to improve Grid application performance by tuning the job submission system. A stochastic model capturing the behavior of a complex Grid workload management system is proposed. To instantiate the model, detailed statistics are extracted from dense Grid activity traces. The model is then used to optimize a simple job resubmission strategy. It provides quantitative inputs for improving job submission performance and makes it possible to quantify the impact of faults and outliers on Grid operations.

Author information

Authors and Affiliations

University of Nice—Sophia Antipolis/CNRS, Nice, France
Diane Lingrand & Johan Montagnat
The Blackett Lab, Imperial College London, London, UK
Janusz Martyniak & David Colling

Corresponding author

Correspondence to Diane Lingrand.

About this article

Cite this article

Lingrand, D., Montagnat, J., Martyniak, J. et al. Optimization of Jobs Submission on the EGEE Production Grid: Modeling Faults Using Workload. J Grid Computing 8, 305–321 (2010). https://doi.org/10.1007/s10723-010-9151-2

Received: 24 August 2009
Accepted: 01 March 2010
Published: 23 March 2010
Issue Date: June 2010
DOI: https://doi.org/10.1007/s10723-010-9151-2
