A Scalable Process-Management Environment for Parallel Programs
We present a process management system for parallel programs such as those written using MPI. A primary goal of the system, which we call MPD (for multipurpose daemon), is to be scalable. By this we mean that startup of interactive parallel jobs comprising a thousand processes is quick, that signals can be quickly delivered to processes, and that stdin, stdout, and stderr are managed intuitively. Our primary target is parallel machines made up of clusters of SMPs, but the system is also useful in more tightly integrated environments. We describe how MPD enables much faster startup and better runtime management of MPICH jobs. We show how close control of stdio can support the easy implementation of a number of convenient system utilities, even a parallel debugger. MPD is implemented and freely distributed with MPICH.