An Improved Ganglia-Like Clusters Monitoring System

  • Wenguo Wei
  • Shoubin Dong
  • Ling Zhang
  • Zhengyou Liang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3033)


Ganglia [1] is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. We propose an improved Ganglia-like cluster monitoring system that is more reliable when a federation node or its associated links fail; restricts access to some monitoring data by permission; adds control functions such as restarting or shutting down misbehaving processes; sends an email or pager message to the cluster administrator when an important event occurs; and optionally forwards only selected data to the federation node, based on user policy, to speed up WAN access. We have implemented a prototype system.
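
The policy-based selection of data for the federation node could look roughly like the sketch below. It is only an illustration under stated assumptions, not the authors' implementation: the policy format (a list of shell-style name patterns), the function name `select_for_federation`, and the metric names are all hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical user policy: name patterns of metrics that a node is
# allowed to forward to the federation node over the WAN.
FORWARD_POLICY = ["load_*", "cpu_*", "mem_free"]


def select_for_federation(metrics, policy=FORWARD_POLICY):
    """Return only the metrics whose names match a pattern in the policy."""
    return {
        name: value
        for name, value in metrics.items()
        if any(fnmatch(name, pattern) for pattern in policy)
    }


# Illustrative local monitoring data (names and values are made up).
local_metrics = {
    "load_one": 0.42,
    "cpu_user": 12.5,
    "mem_free": 2048,
    "disk_total": 500.0,
    "proc_run": 3,
}

forwarded = select_for_federation(local_metrics)
print(sorted(forwarded))
```

Filtering before transmission keeps the full data set available locally while reducing what crosses the wide-area link, which is the stated motivation for the policy.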


Monitoring Data; Leaf Node; Parent Node; Configuration File; Monitor System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. Massie, M.L., Chun, B.N., Culler, D.E.: The Ganglia Distributed Monitoring System: Design, Implementation, and Experience (February 2003) (submitted for publication)
  2. The TeraGrid Project. Teragrid project web page (2001),
  3. Foster, I., Kesselman, C.: Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications 11(2), 115–128 (1997)
  4. Sottile, M., Minnich, R.: Supermon: A high speed cluster monitoring system. In: Proceedings of Cluster (September 2002)
  5. Anderson, E., Patterson, D.: Extensible, scalable monitoring for clusters of computers. In: Proceedings of the 11th Systems Administration Conference (October 1997)
  6. Amir, E., McCanne, S., Katz, R.H.: An active service framework and its application to realtime multimedia transcoding. In: Proceedings of the ACM SIGCOMM 1998 Conference on Communications Architectures and Protocols, pp. 178–189 (1998)
  7. Chun, B.N., Culler, D.E.: Rexec: A decentralized, secure remote execution environment for clusters. In: Proceedings of the 4th Workshop on Communication, Architecture and Applications for Network based Parallel Computing (January 2000)
  8. Harary, F.: Graph Theory. Addison-Wesley, Reading (1969)
  9. Peterson, L., Culler, D., Anderson, T., Roscoe, T.: A blueprint for introducing disruptive technology into the internet. In: Proceedings of the 1st Workshop on Hot Topics in Networks, HotNets-I (October 2002)

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Wenguo Wei
    • 1
    • 2
  • Shoubin Dong
    • 1
  • Ling Zhang
    • 1
  • Zhengyou Liang
    • 1
  1. Guangdong Key Laboratory of Computer Network, South China University of Technology, Guangzhou, P.R. China
  2. Department of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, P.R. China
