Abstract
A prime consideration for high scalar performance in supercomputers is low memory latency. As the disparity between main memory speeds and CPU clock speeds grows, an intermediate memory in the hierarchy becomes necessary. In this paper, we present an intermediate memory structure called a programmable cache, which exploits structural locality to decrease the average memory access time. We evaluate the concept by using the vector registers in the CRAY X-MP and Y-MP supercomputers as a programmable cache. Our results indicate that a programmable cache can profitably reduce memory latency if the pattern of references to a data structure can be determined at compile time.
Additional information
The work of the first author was supported in part by NSF Grant CCR-8706722.
Cite this article
Sohi, G.S., Hsu, WC. The use of intermediate memories for low-latency memory access in supercomputer scalar units. J Supercomput 4, 5–21 (1990). https://doi.org/10.1007/BF00162340
DOI: https://doi.org/10.1007/BF00162340