Analysis of the Jobs Resource Utilization on a Production System

Emeras, Joseph; Ruiz, Cristian; Vincent, Jean-Marc; Richard, Olivier

doi:10.1007/978-3-662-43779-7_1


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8429)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing


Abstract

In the HPC community, the System Utilization metric is used to determine whether the resources of a cluster are efficiently used by the batch scheduler. This metric assumes that all allocated resources (memory, disk, processors, etc.) are utilized full-time. To optimize system performance, we have to consider the effective physical consumption of jobs relative to their resource allocations. This information gives insight into whether the cluster resources are efficiently used by the jobs. In this work we propose an analysis of production clusters based on job resource utilization. The principle is to simultaneously collect traces from the job scheduler (provided by logs) and traces of job resource consumption. The latter was achieved by developing a job monitoring tool whose impact on the system was measured to be lightweight (0.35 % slowdown). The key point is to statistically analyze both traces to detect and explain underutilization of resources. This makes it possible to detect abnormal behavior and bottlenecks in the cluster that lead to poor scalability, and to justify optimizations such as gang scheduling or best-effort scheduling. The method has been applied to two medium-sized production clusters over a period of eight months.
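
The distinction the abstract draws between allocation-based and consumption-based utilization can be illustrated with a minimal sketch. It is not the authors' tool: the `Job` record, its field names, and the two functions are hypothetical, and the sketch assumes every job runs entirely inside the observation window and that consumption is measured as CPU seconds.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record joining a scheduler log entry with monitoring data.
    cores: int          # cores allocated by the batch scheduler
    start: float        # start time in seconds
    end: float          # end time in seconds
    cpu_seconds: float  # CPU time actually consumed, as measured by monitoring


def system_utilization(jobs, total_cores, t0, t1):
    """Allocated core-seconds divided by available core-seconds.

    Treats every allocated core as fully used for the job's lifetime.
    Assumes each job lies entirely within [t0, t1].
    """
    allocated = sum(j.cores * (j.end - j.start) for j in jobs)
    return allocated / (total_cores * (t1 - t0))


def effective_utilization(jobs, total_cores, t0, t1):
    """Consumed CPU seconds divided by available core-seconds."""
    consumed = sum(j.cpu_seconds for j in jobs)
    return consumed / (total_cores * (t1 - t0))


def job_efficiency(job):
    """Fraction of a job's own allocation that it actually consumed."""
    return job.cpu_seconds / (job.cores * (job.end - job.start))


# Two jobs on an 8-core cluster observed for 100 seconds.
jobs = [
    Job(cores=4, start=0, end=100, cpu_seconds=200),  # uses half its allocation
    Job(cores=4, start=0, end=50, cpu_seconds=200),   # uses all of its allocation
]
print(system_utilization(jobs, total_cores=8, t0=0, t1=100))
print(effective_utilization(jobs, total_cores=8, t0=0, t1=100))
print([job_efficiency(j) for j in jobs])
```

In this toy case the allocation-based figure is higher than the consumption-based one, and the per-job ratio shows which job accounts for the gap. That per-job comparison is the kind of underutilization signal the abstract refers to.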


Notes

1. SWF: http://www.cs.huji.ac.il/labs/parallel/workload/swf.html
2. PWA: http://www.cs.huji.ac.il/labs/parallel/workload
3. GWA: http://gwa.ewi.tudelft.nl/pmwiki/
4. SLURM: https://computing.llnl.gov/linux/slurm/
5. OAR: http://oar.imag.fr
6. LoadLeveler: http://www-03.ibm.com/systems/software/loadleveler
7. https://perf.wiki.kernel.org
8. http://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
9. A library call tracer: http://linux.die.net/man/1/ltrace
10. http://opentsdb.net
11. http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
12. http://www.clusterresources.com/torquedocs21/3.5linuxcpusets.shtml
13. https://computing.llnl.gov/linux/slurm/
14. http://oar.imag.fr/sources/2.5/docs/documentation/OAR-DOCUMENTATION-ADMIN/#cpuset-feature
15. http://www.r-project.org/
16. CIMENT Project: https://ciment.ujf-grenoble.fr/
17. CiGri Project: http://cigri.imag.fr/
18. http://sourceforge.net/projects/ior-sio/
19. Git clone https://forge.imag.fr/anonscm/git/evalys-tools/evalys-tools.git

Author information

Authors and Affiliations

LIG Laboratory, Grenoble, France
Joseph Emeras, Cristian Ruiz, Jean-Marc Vincent & Olivier Richard

Corresponding author

Correspondence to Joseph Emeras.

Editor information

Editors and Affiliations

Mathematics & Computer Science Division, Argonne National Laboratory, Argonne, Illinois, USA
Narayan Desai
Google, Mountain View, California, USA
Walfredo Cirne

About this paper

Cite this paper

Emeras, J., Ruiz, C., Vincent, J.-M., Richard, O. (2014). Analysis of the Jobs Resource Utilization on a Production System. In: Desai, N., Cirne, W. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2013. Lecture Notes in Computer Science, vol 8429. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43779-7_1

DOI: https://doi.org/10.1007/978-3-662-43779-7_1
Published: 11 June 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-43778-0
Online ISBN: 978-3-662-43779-7
eBook Packages: Computer Science, Computer Science (R0)
