A design of performance-optimized control-based synchronization
A fundamental issue that any control-based synchronization scheme must address is how to minimize both the overhead of the synchronization itself and the processor idling caused by variation in the arrival times of the synchronizing processors. This paper proposes two techniques to alleviate these two problems in a large-scale shared-memory multiprocessor. First, the notion of delayed global-materialization is introduced, which tries to minimize the time that synchronizing processors spend globally materializing previously issued shared write references, a step that is required before the processors participate in the actual synchronization. The scheme is based on a compile-time analysis of parallel programs that identifies the write references to shared memory locations that will be accessed in the subsequent computational unit. The global-materialization for these write references is performed immediately, while that for other shared write references is done as lazily as possible. Second, a novel prefetching technique is proposed that allows prefetching across different computational units separated by a synchronization operation, so as to keep the otherwise idling processors busy during synchronization. This scheme also requires a compile-time analysis to determine whether the prefetch request for a given shared read reference can be safely made across the synchronization. The hardware support required for the two schemes is identified, and the issues that arise when the two techniques are used together are addressed.