Abstract
We present a framework for the co-ordinated, autonomic management of multiple clusters in a compute center and their integration into a Grid environment. Site autonomy and the automation of administrative tasks are the central concerns of this framework. System behavior is continuously monitored in a steering cycle, and appropriate actions are taken to resolve any problems that are detected.
All presented components were implemented in the course of the EU DataGrid project: the Lemon monitoring components, the FT fault-tolerance mechanism, the quattor system for software installation and configuration, the RMS job and resource management system, and the Gridification scheme that integrates clusters into the Grid.
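As a rough illustration of the steering cycle mentioned above (monitor, detect a problem, take a corrective action), the following minimal sketch may help. It is not the Lemon or FT implementation: the `Rule` type, the metric names, the thresholds, and the recovery actions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only; they do not mirror the Lemon or FT APIs.
Metrics = Dict[str, float]


@dataclass
class Rule:
    name: str
    is_violated: Callable[[Metrics], bool]  # detects a problem in one metric sample
    recover: Callable[[str], str]           # returns a description of the action taken


def steering_step(node: str, sample: Metrics, rules: List[Rule]) -> List[str]:
    """Run one monitor -> evaluate -> act pass for a single node."""
    actions = []
    for rule in rules:
        if rule.is_violated(sample):
            actions.append(rule.recover(node))
    return actions


# Assumed example rules; real thresholds and actions would be site-specific.
RULES = [
    Rule("disk_full",
         lambda m: m.get("disk_used_pct", 0.0) > 95.0,
         lambda node: f"clean /tmp on {node}"),
    Rule("daemon_down",
         lambda m: m.get("batch_daemon_up", 1.0) < 1.0,
         lambda node: f"restart batch daemon on {node}"),
]

if __name__ == "__main__":
    sample = {"disk_used_pct": 97.5, "batch_daemon_up": 1.0}
    for action in steering_step("node042", sample, RULES):
        print(action)
```

In a running system, a step like this would be repeated periodically for every node, with the samples supplied by the monitoring layer and the recovery actions carried out by the fault-tolerance layer rather than returned as strings.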
Additional information
This work from the EU DataGrid project was funded by the European Commission grant IST-2000-25182.
Cite this article
Röblitz, T., Schintke, F., Reinefeld, A. et al. Autonomic Management of Large Clusters and Their Integration into the Grid. J Grid Computing 2, 247–260 (2004). https://doi.org/10.1007/s10723-004-7647-3
DOI: https://doi.org/10.1007/s10723-004-7647-3