Abstract
This paper presents a novel networking architecture designed for communication-intensive parallel applications running on clusters of workstations (COWs) connected by high-speed networks. The architecture addresses what is considered one of the most important problems of cluster-based parallel computing: the performance of communication software does not scale with the performance of the host CPU. The Virtual Communication Machine (VCM), resident on the network coprocessor, offers a scalable software solution by providing configurable communication functionality that is directly accessible at user level. The VCM architecture is configurable in that selected communication-related functionality, traditionally part of the application and/or the host kernel, can be transferred to the VCM. Such transfers are beneficial when a significant reduction of the host CPU's load translates into only a small increase in the coprocessor's load. The functionality implemented by the coprocessor is available at the application level as VCM instructions. Host CPU(s) and coprocessor interact through shared memory regions, thereby avoiding expensive CPU context switches. The host kernel is not involved in this interaction; it simply “connects” the application to the VCM during the initialization phase and is called infrequently to handle exceptional conditions. Protection is enforced by the VCM based on information supplied by the kernel. The VCM-based communication architecture admits low-cost, open implementations, as demonstrated by its current ATM-based implementation, which is built from off-the-shelf hardware components and uses standard AAL5 packets. The architecture makes it easy to implement communication software that imposes negligible overhead on the host CPU(s) and offers latencies and bandwidths close to the hardware limits of the underlying network.
These characteristics are due to the VCM's support for zero-copy messaging with gather/scatter capabilities and the VCM's direct access to any data structure in an application's address space. This paper describes two versions of an ATM-based VCM implementation, which differ in the way they use the memory on the network adapter. Their performance under heavy load is compared in the context of a synthetic client/server application. The same application is used to evaluate the scalability of the architecture to multiple VCM-based network interfaces per host. Parallel implementations of the Traveling Salesman Problem and of Georgia Tech Time Warp, an engine for discrete-event simulation, are used to demonstrate VCM functionality and the high performance of its implementation. The distributed- and shared-memory versions of these two applications exhibit comparable performance, despite the significant cost-performance advantage of the distributed-memory platform.
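The host/coprocessor interaction summarized in the abstract can be illustrated with a small model. The sketch below is not the authors' implementation: the names `ToyVCM`, `SendInstruction`, `Segment`, `kernel_connect`, `post`, and `poll` are hypothetical, a Python `deque` stands in for the shared memory region, and a list stands in for the network. It only shows the pattern described above: the kernel registers an application once, the application then posts instructions without kernel involvement, and the coprocessor checks each gather list against the kernel-supplied bounds before reading the application's memory in place.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Segment:
    offset: int  # start of the segment in the application's memory
    length: int  # number of bytes to send from that offset


@dataclass
class SendInstruction:
    app_id: int     # which connected application posted the instruction
    segments: list  # gather list: segments sent as one message


class ToyVCM:
    """Hypothetical model of the coprocessor side; all names are assumptions."""

    def __init__(self):
        self.regions = {}     # app_id -> (view of app memory, low bound, high bound)
        self.queue = deque()  # stands in for the shared-memory instruction region
        self.wire = []        # stands in for packets handed to the network

    def kernel_connect(self, app_id, memory, lo, hi):
        # One-time setup: the kernel tells the VCM which range the app may use.
        self.regions[app_id] = (memoryview(memory), lo, hi)

    def post(self, instr):
        # Host side: a plain write to the shared queue, no system call.
        self.queue.append(instr)

    def poll(self):
        # Coprocessor side: drain the queue and enforce protection per segment.
        results = []
        while self.queue:
            instr = self.queue.popleft()
            region = self.regions.get(instr.app_id)
            if region is None:
                results.append("fault")  # exceptional condition, left to the kernel
                continue
            view, lo, hi = region
            allowed = all(
                s.length >= 0 and s.offset >= lo and s.offset + s.length <= hi
                for s in instr.segments
            )
            if not allowed:
                results.append("fault")
                continue
            # Gather: slices of the memoryview reference the app's buffer in place;
            # the join models the single transfer onto the network.
            parts = [view[s.offset:s.offset + s.length] for s in instr.segments]
            self.wire.append(b"".join(parts))
            results.append("ok")
        return results
```

In this model, an application whose buffer holds a header and a payload separated by unrelated bytes would post one `SendInstruction` with two `Segment` entries, and `poll` would emit a single message containing the header followed by the payload. A segment that extends past the registered bounds is rejected rather than sent.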
Cite this article
Roşu, M., Schwan, K. & Fujimoto, R. Supporting parallel applications on clusters of workstations: The Virtual Communication Machine‐based architecture. Cluster Computing 1, 51–67 (1998). https://doi.org/10.1023/A:1019064911399
DOI: https://doi.org/10.1023/A:1019064911399