
Supporting parallel applications on clusters of workstations: The Virtual Communication Machine‐based architecture

Cluster Computing

Abstract

This paper presents a novel networking architecture designed for communication-intensive parallel applications running on clusters of workstations (COWs) connected by high-speed networks. The architecture addresses what is considered one of the most important problems of cluster-based parallel computing: the inherent inability to scale the performance of communication software along with host CPU performance. The Virtual Communication Machine (VCM), resident on the network coprocessor, presents a scalable software solution by providing configurable communication functionality directly accessible at user level. The VCM architecture is configurable in that it enables the transfer to the VCM of selected communication-related functionality that is traditionally part of the application and/or the host kernel. Such transfers are beneficial when a significant reduction of the host CPU's load translates into a small increase in the coprocessor's load. The functionality implemented by the coprocessor is available at the application level as VCM instructions. Host CPU(s) and coprocessor interact through shared memory regions, thereby avoiding expensive CPU context switches. The host kernel is not involved in this interaction; it simply “connects” the application to the VCM during the initialization phase and is called infrequently to handle exceptional conditions. Protection is enforced by the VCM based on information supplied by the kernel. The VCM-based communication architecture admits low-cost, open implementations, as demonstrated by its current ATM-based implementation, which is built from off-the-shelf hardware components and uses standard AAL5 packets. The architecture makes it easy to implement communication software that exhibits negligible overheads on the host CPU(s) and offers latencies and bandwidths close to the hardware limits of the underlying network. These characteristics are due to the VCM's support for zero-copy messaging with gather/scatter capabilities and the VCM's direct access to any data structure in an application's address space.

This paper describes two versions of an ATM-based VCM implementation, which differ in the way they use the memory on the network adapter. Their performance under heavy load is compared in the context of a synthetic client/server application. The same application is used to evaluate the scalability of the architecture to multiple VCM-based network interfaces per host. Parallel implementations of the Traveling Salesman Problem and of Georgia Tech Time Warp, an engine for discrete-event simulation, are used to demonstrate VCM functionality and the high performance of its implementation. The distributed- and shared-memory versions of these two applications exhibit comparable performance, despite the significant cost-performance advantage of the distributed-memory platform.
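
The interaction pattern described in the abstract can be illustrated with a small simulation. The sketch below is not the VCM interface: every name in it (`SendInstruction`, `CommandRing`, `coprocessor_step`, the `allowed` set) is hypothetical and chosen only for illustration. It assumes a single-producer, single-consumer ring as the shared memory region, a gather list of (offset, length) pairs as the message description, and a set of kernel-approved connection identifiers as the protection information.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class SendInstruction:
    """Hypothetical instruction: send the bytes named by a gather list."""
    connection: int
    segments: List[Tuple[int, int]]  # (offset, length) pairs into app memory


class CommandRing:
    """Single-producer/single-consumer ring standing in for a shared region.

    The host only advances `head` and the coprocessor only advances `tail`,
    so posting an instruction needs no system call or context switch.
    """

    def __init__(self, slots: int) -> None:
        self.slots: List[Optional[SendInstruction]] = [None] * slots
        self.head = 0  # next slot the host writes
        self.tail = 0  # next slot the coprocessor reads

    def post(self, instr: SendInstruction) -> bool:
        """Host side: enqueue an instruction; return False if the ring is full."""
        nxt = (self.head + 1) % len(self.slots)
        if nxt == self.tail:
            return False
        self.slots[self.head] = instr
        self.head = nxt
        return True

    def poll(self) -> Optional[SendInstruction]:
        """Coprocessor side: dequeue the next instruction, if any."""
        if self.tail == self.head:
            return None
        instr = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        return instr


def coprocessor_step(ring: CommandRing, app_memory: memoryview,
                     allowed: Set[int]) -> Optional[bytes]:
    """Process one instruction and return the assembled outgoing frame.

    `allowed` models protection information installed at setup time
    (an assumption of this sketch); the check runs on the coprocessor side.
    """
    instr = ring.poll()
    if instr is None:
        return None
    if instr.connection not in allowed:
        raise PermissionError(f"connection {instr.connection} not permitted")
    # Gather: read each segment in place from application memory, with no
    # intermediate host-side staging buffer.
    return b"".join(bytes(app_memory[off:off + length])
                    for off, length in instr.segments)


if __name__ == "__main__":
    memory = memoryview(bytearray(b"HEADERxxxxPAYLOAD"))
    ring = CommandRing(slots=4)
    ring.post(SendInstruction(connection=7, segments=[(0, 6), (10, 7)]))
    print(coprocessor_step(ring, memory, allowed={7}))
```

In this model the host's only work per message is writing one descriptor into the ring, which is the sense in which the load shifts from the host CPU to the coprocessor. A real implementation would also need memory barriers, address translation, and pinned pages, none of which are modelled here.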



Cite this article

Roşu, M., Schwan, K. & Fujimoto, R. Supporting parallel applications on clusters of workstations: The Virtual Communication Machine‐based architecture. Cluster Computing 1, 51–67 (1998). https://doi.org/10.1023/A:1019064911399

  • DOI: https://doi.org/10.1023/A:1019064911399
