The Importance of Complete Data Sets for Job Scheduling Simulations

  • Dalibor Klusáček
  • Hana Rudová
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6253)

Abstract

This paper was inspired by a study of the complex data set from the Czech National Grid MetaCentrum. Unlike other widely used workloads from the Parallel Workloads Archive or the Grid Workloads Archive, this data set includes additional information concerning machine failures, job requirements, and machine parameters, which allows more realistic simulations to be performed. We show that large differences in the performance of various scheduling algorithms appear when this additional information is used. Moreover, we studied other publicly available workloads and partially reconstructed information concerning their machine failures and job requirements using statistical and analytical models, to demonstrate that similar behavior can also be expected for other workloads. We suggest that additional information about both machines and jobs should be incorporated into the workload archives to allow proper and more realistic simulations.

Keywords

Grid, Cluster, Scheduling, MetaCentrum, Workload, Failures, Specific Job Requirements

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Dalibor Klusáček (1)
  • Hana Rudová (1)
  1. Faculty of Informatics, Masaryk University, Brno, Czech Republic
