AggMon: Scalable Hierarchical Cluster Monitoring

  • Conference paper
Sustained Simulation Performance 2012

Abstract

Monitoring and supervising the large number of compute nodes in a typical HPC cluster is an expensive task: it consumes network bandwidth and CPU power that would be better spent on applications. In this paper, we describe a monitoring framework that supervises thousands of compute nodes in an HPC cluster efficiently. Within this framework, the compute nodes are organized in groups. Groups can contain other groups and form a tree-like hierarchical graph. Communication paths run strictly along the edges of the graph. To decouple the components in the network, a publish/subscribe messaging system based on AMQP has been chosen. Monitoring data is stored in a distributed time-series database located on dedicated nodes in the tree. For database queries and other administrative tasks, a synchronous RPC channel that is completely independent of the hierarchy has been implemented. A browser-based front end to present the data to the user is currently in development.
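The group hierarchy and the edge-restricted message flow described in the abstract can be illustrated with a small sketch. This is not AggMon's code: the names Group, Broker, and attach_collector are hypothetical, the broker is an in-process stand-in for an AMQP topic exchange, and forwarding every metric unchanged to the parent group is an assumption made only to show messages travelling along tree edges.

```python
from collections import defaultdict


class Group:
    """A node in the monitoring tree (hypothetical name, not from the paper)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

    def add_child(self, name):
        child = Group(name, parent=self)
        self.children.append(child)
        return child

    def path_to_root(self):
        """Return the group names from this group up to the root."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path


class Broker:
    """In-process stand-in for a publish/subscribe broker (assumption)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


def attach_collector(broker, group, store):
    """Record metrics for a group and pass them on to its parent only."""

    def on_metric(message):
        store[group.name].append(message)
        if group.parent is not None:
            # Messages move one edge at a time, never skipping a level.
            broker.publish(group.parent.name, message)

    broker.subscribe(group.name, on_metric)


# Build a three-level tree: cluster -> rack1 -> node001.
root = Group("cluster")
rack = root.add_child("rack1")
node = rack.add_child("node001")

broker = Broker()
store = defaultdict(list)
for group in (root, rack, node):
    attach_collector(broker, group, store)

# A metric published at the leaf travels up through every ancestor.
broker.publish("node001", {"host": "node001", "metric": "load1", "value": 0.42})
```

In a real deployment the publisher and subscribers would be separate processes connected through a message broker, and intermediate groups would more plausibly aggregate or filter data than forward every sample.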

Notes

  1. The development and investigation described in this paper were done as part of the TIMACS project and were sponsored by NEC and the German Bundesministerium für Bildung und Forschung.

Author information

Correspondence to Erich Focht.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Focht, E., Jeutter, A. (2013). AggMon: Scalable Hierarchical Cluster Monitoring. In: Resch, M., Wang, X., Bez, W., Focht, E., Kobayashi, H. (eds) Sustained Simulation Performance 2012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32454-3_5
