Analysis of the Jobs Resource Utilization on a Production System

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8429)

Abstract

In the HPC community, the System Utilization metric is used to determine whether the resources of a cluster are used efficiently by the batch scheduler. This metric assumes that all allocated resources (memory, disk, processors, etc.) are utilized full-time. To optimize system performance, we have to consider the jobs' effective physical consumption relative to their resource allocations. This information gives insight into whether the cluster resources are used efficiently by the jobs. In this work we propose an analysis of production clusters based on the jobs' resource utilization. The principle is to simultaneously collect traces from the job scheduler (provided by its logs) and traces of the jobs' resource consumption. The latter is achieved with a job monitoring tool that we developed, whose impact on the system was measured to be lightweight (0.35% slowdown). The key point is to statistically analyze both traces in order to detect and explain underutilization of the resources. This could make it possible to detect abnormal behavior and bottlenecks in the cluster that lead to poor scalability, and to justify optimizations such as gang scheduling or best-effort scheduling. This method has been applied to two medium-sized production clusters over a period of eight months.
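The gap between allocation-based utilization and effective consumption described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' monitoring tool or analysis code: the `JobRecord` fields, the function names, and the 0.5 efficiency threshold are hypothetical, and CPU consumption is assumed to be spread uniformly over each job's runtime.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One job, joining scheduler log data with monitored consumption (hypothetical fields)."""
    job_id: int
    start: float             # start time in seconds (from the scheduler trace)
    end: float               # end time in seconds (from the scheduler trace)
    cores_allocated: int     # cores reserved by the scheduler
    cpu_seconds_used: float  # CPU time actually consumed, summed over all cores


def _overlap(job: JobRecord, t0: float, t1: float) -> float:
    """Seconds of the job's runtime that fall inside the window [t0, t1]."""
    return max(0.0, min(job.end, t1) - max(job.start, t0))


def system_utilization(jobs, total_cores: int, t0: float, t1: float) -> float:
    """Allocated core-seconds divided by available core-seconds.

    Treats every allocated core as busy for the whole allocation.
    """
    allocated = sum(_overlap(j, t0, t1) * j.cores_allocated for j in jobs)
    return allocated / (total_cores * (t1 - t0))


def effective_utilization(jobs, total_cores: int, t0: float, t1: float) -> float:
    """Consumed CPU-seconds divided by available core-seconds.

    Assumption: each job's CPU time is spread uniformly over its runtime.
    """
    used = 0.0
    for j in jobs:
        duration = j.end - j.start
        if duration > 0:
            used += j.cpu_seconds_used * _overlap(j, t0, t1) / duration
    return used / (total_cores * (t1 - t0))


def job_cpu_efficiency(job: JobRecord) -> float:
    """Fraction of the job's allocated core-seconds that was actually consumed."""
    allocated = (job.end - job.start) * job.cores_allocated
    return job.cpu_seconds_used / allocated if allocated > 0 else 0.0


def underutilizing_jobs(jobs, threshold: float = 0.5):
    """IDs of jobs whose CPU efficiency is below the (arbitrary) threshold."""
    return [j.job_id for j in jobs if job_cpu_efficiency(j) < threshold]
```

With two jobs that each hold 4 of 8 cores for the whole window, the allocation-based metric reports a fully used cluster even if one of the jobs consumes only a quarter of its allocated CPU time; the per-job efficiency is what exposes the difference.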

Notes

  1. SWF: http://www.cs.huji.ac.il/labs/parallel/workload/swf.html

  2. PWA: http://www.cs.huji.ac.il/labs/parallel/workload

  3. GWA: http://gwa.ewi.tudelft.nl/pmwiki/

  4. SLURM: https://computing.llnl.gov/linux/slurm/

  5. OAR: http://oar.imag.fr

  6. LoadLeveler: http://www-03.ibm.com/systems/software/loadleveler

  7. https://perf.wiki.kernel.org

  8. http://www.kernel.org/doc/Documentation/cgroups/cgroups.txt

  9. A library call tracer: http://linux.die.net/man/1/ltrace

  10. http://opentsdb.net

  11. http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html

  12. http://www.clusterresources.com/torquedocs21/3.5linuxcpusets.shtml

  13. https://computing.llnl.gov/linux/slurm/

  14. http://oar.imag.fr/sources/2.5/docs/documentation/OAR-DOCUMENTATION-ADMIN/#cpuset-feature

  15. http://www.r-project.org/

  16. CIMENT Project: https://ciment.ujf-grenoble.fr/

  17. CiGri Project: http://cigri.imag.fr/

  18. http://sourceforge.net/projects/ior-sio/

  19. Git clone https://forge.imag.fr/anonscm/git/evalys-tools/evalys-tools.git

Author information

Corresponding author

Correspondence to Joseph Emeras.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Emeras, J., Ruiz, C., Vincent, JM., Richard, O. (2014). Analysis of the Jobs Resource Utilization on a Production System. In: Desai, N., Cirne, W. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2013. Lecture Notes in Computer Science, vol 8429. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43779-7_1

  • DOI: https://doi.org/10.1007/978-3-662-43779-7_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-43778-0

  • Online ISBN: 978-3-662-43779-7

  • eBook Packages: Computer Science, Computer Science (R0)
