Distributed Monitoring and Management of Exascale Systems in the Argo Project

  • Swann Perarnau
  • Rajeev Thakur
  • Kamil Iskra
  • Ken Raffenetti
  • Franck Cappello
  • Rinku Gupta
  • Pete Beckman
  • Marc Snir
  • Henry Hoffmann
  • Martin Schulz
  • Barry Rountree
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9038)

Abstract

New computing technologies are expected to change the high-performance computing landscape dramatically. Future exascale systems will comprise hundreds of thousands of compute nodes linked by complex networks. These resources need to be actively monitored and controlled, at a scale that is difficult to manage from a central point as was done in previous systems.

In this context, we describe ongoing work in the Argo exascale software stack project to develop a distributed collection of services that work together to track scientific applications across nodes, control the power budget of the system, and respond to failures when they occur. Our solution leverages the idea of enclaves: a hierarchy of logical partitions of the system, each representing a group of nodes that share a common configuration. Enclaves are created to encapsulate user jobs, and users can also create them inside their own jobs. These enclaves provide a second (and greater) level of control over portions of the system, can be tuned to manage specific scenarios, and have dedicated resources to do so.
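The enclave hierarchy described above can be illustrated with a minimal sketch. This is not the Argo implementation or its API: the `Enclave` class, its fields, and the `create_child` method are hypothetical names chosen for illustration. The rules it enforces (sibling enclaves do not share nodes, and the power budgets of child enclaves may not exceed their parent's budget) are assumptions about how a hierarchy of logical partitions with a power budget might plausibly behave.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Enclave:
    """Hypothetical model of an enclave: a named group of nodes with a power budget."""
    name: str
    nodes: Set[str]
    power_budget_watts: float
    # The parent link is excluded from repr/comparison to avoid walking cycles.
    parent: Optional["Enclave"] = field(default=None, repr=False, compare=False)
    children: List["Enclave"] = field(default_factory=list)

    def allocated_watts(self) -> float:
        """Power already handed out to direct child enclaves."""
        return sum(child.power_budget_watts for child in self.children)

    def create_child(self, name: str, nodes: Set[str],
                     power_budget_watts: float) -> "Enclave":
        """Carve a sub-enclave out of this enclave (assumed rules, see lead-in)."""
        if not nodes <= self.nodes:
            raise ValueError("child nodes must belong to the parent enclave")
        for sibling in self.children:
            if nodes & sibling.nodes:
                raise ValueError("sibling enclaves must not share nodes")
        if self.allocated_watts() + power_budget_watts > self.power_budget_watts:
            raise ValueError("child budgets would exceed the parent budget")
        child = Enclave(name, set(nodes), power_budget_watts, parent=self)
        self.children.append(child)
        return child

    def depth(self) -> int:
        """Distance from the root enclave (the whole system has depth 0)."""
        return 0 if self.parent is None else self.parent.depth() + 1


# The root enclave covers the whole system; a job gets its own enclave,
# and the user creates a further enclave inside that job.
system = Enclave("system", {"n0", "n1", "n2", "n3"}, 4000.0)
job = system.create_child("job-a", {"n0", "n1"}, 2000.0)
analysis = job.create_child("job-a/analysis", {"n1"}, 500.0)
```

Keeping the budget check local to each parent means every enclave can validate its own children without consulting a central point, which is the property the abstract emphasizes.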

Keywords

Power Budget; Lawrence Livermore National Laboratory; Logical Partition; Argo System; Message Broker
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© IFIP International Federation for Information Processing 2015

Authors and Affiliations

  • Swann Perarnau (1)
  • Rajeev Thakur (1)
  • Kamil Iskra (1)
  • Ken Raffenetti (1)
  • Franck Cappello (1)
  • Rinku Gupta (1)
  • Pete Beckman (1)
  • Marc Snir (1)
  • Henry Hoffmann (2)
  • Martin Schulz (3)
  • Barry Rountree (3)

  1. Argonne National Laboratory, Lemont, USA
  2. University of Chicago, Chicago, USA
  3. Lawrence Livermore National Laboratory, Livermore, USA
