Analysis of the Jobs Resource Utilization on a Production System

  • Joseph Emeras
  • Cristian Ruiz
  • Jean-Marc Vincent
  • Olivier Richard
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8429)

Abstract

In the HPC community, the System Utilization metric is used to determine whether the resources of a cluster are efficiently used by the batch scheduler. This metric assumes that all allocated resources (memory, disk, processors, etc.) are utilized full-time. To optimize system performance, we must instead consider the effective physical consumption by jobs relative to their resource allocations. This information gives an insight into whether the cluster resources are efficiently used by the jobs. In this work we propose an analysis of production clusters based on job resource utilization. The principle is to collect, simultaneously, traces from the job scheduler (provided by its logs) and the jobs' resource consumption. The latter is obtained with a job monitoring tool we developed, whose impact on the system has been measured as lightweight (a 0.35 % slowdown). The key point is to statistically analyze both traces to detect and explain underutilization of the resources. This makes it possible to detect abnormal behavior and bottlenecks in the cluster that lead to poor scalability, and to justify optimizations such as gang scheduling or best-effort scheduling. This method has been applied to two medium-sized production clusters over a period of eight months.
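
To make the distinction concrete, the following is a minimal sketch (not the authors' monitoring tool) contrasting the classical allocation-based System Utilization with a consumption-based view computed from per-job monitoring samples. The trace fields, cluster size, and sample values are illustrative assumptions, not data from the paper.

    # Illustrative sketch: allocation-based vs. consumption-based utilization.
    # All record fields and numbers below are assumptions for demonstration.

    # Scheduler trace: one record per job (allocated cores and wall-clock runtime).
    scheduler_trace = [
        {"job_id": 1, "alloc_cores": 16, "runtime_s": 3600},
        {"job_id": 2, "alloc_cores": 32, "runtime_s": 7200},
    ]

    # Monitoring trace: periodic samples of cores effectively busy for each job.
    monitoring_trace = {
        1: [12.0, 10.5, 11.0],
        2: [8.0, 9.5, 7.5],
    }

    CLUSTER_CORES = 64   # total cores in the cluster (assumed)
    MAKESPAN_S = 8000    # length of the observation window, in seconds (assumed)

    # Allocation-based utilization: allocated core-seconds over available core-seconds.
    allocated_core_s = sum(j["alloc_cores"] * j["runtime_s"] for j in scheduler_trace)
    system_utilization = allocated_core_s / (CLUSTER_CORES * MAKESPAN_S)
    print(f"system utilization (allocation-based): {system_utilization:.1%}")

    # Consumption-based view: mean cores actually used vs. cores allocated, per job;
    # low ratios flag jobs that under-use their reservation.
    for job in scheduler_trace:
        samples = monitoring_trace[job["job_id"]]
        mean_used = sum(samples) / len(samples)
        ratio = mean_used / job["alloc_cores"]
        print(f"job {job['job_id']}: {mean_used:.1f}/{job['alloc_cores']} cores used "
              f"({ratio:.1%} of allocation)")

Joining the two traces per job in this way is what allows underutilization to be detected and explained statistically, as described above.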

Keywords

Workload traces · Monitoring · Performance evaluation · Optimization · High performance computing

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Joseph Emeras (1)
  • Cristian Ruiz (1)
  • Jean-Marc Vincent (1)
  • Olivier Richard (1)

  1. LIG Laboratory, Grenoble, France
