The Performance Impact of Address Relation Caching

  • Peter A. Dinda
  • David R. O’Hallaron

Abstract

An important portion of end-to-end latency in data transfer is spent in address computation: determining the relation between sender and receiver addresses. In deposit model communication, this computation happens only on the sender, and some of its results are embedded in the message. Conventionally, address computation takes place on-line, as the message is assembled. If the amount of address computation is significant and the communication is repeated, it may make sense to remove address computation from the critical path by caching its results. However, assembling a message from the cache consumes additional memory bandwidth.
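To make the trade-off concrete, the following minimal sketch (not the paper's implementation; the address function and all names are illustrative assumptions) contrasts on-line address computation during message assembly with reuse of a cached address relation:

```python
# Illustrative sketch only: two ways a sender can assemble a message
# under the deposit model. The toy address relation and all names are
# assumptions for exposition, not the paper's code.

def dest_address(i, stride=4, base=1000):
    """Toy address relation: receiver address for data item i.
    Real relations (e.g. for HPF array redistribution) can be
    far more expensive to evaluate."""
    return base + i * stride

def assemble_online(data):
    """On-line: compute each destination address while packing,
    keeping address computation on the critical path."""
    return [(dest_address(i), x) for i, x in enumerate(data)]

def build_cache(n):
    """One-time pass that records the address relation's results."""
    return [dest_address(i) for i in range(n)]

def assemble_cached(data, cache):
    """Cached: reuse precomputed addresses; trades recomputation
    for extra memory reads of the cache."""
    return list(zip(cache, data))

data = [10, 20, 30]
cache = build_cache(len(data))
assert assemble_online(data) == assemble_cached(data, cache)
```

Both paths produce the same message; they differ only in whether the per-item address cost is paid on every transfer or once up front.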

We present a fine grain analytic model for simple address relation caching in deposit model communication. The model predicts how many times a communication must be repeated for the average end-to-end latency of a caching implementation to break even with that of a non-caching implementation. The model also predicts speedup and identifies regimes in which a caching implementation never breaks even. It shows that the effectiveness of caching depends on CPU speed, memory bandwidth, and the complexity of the address computation. We verify the model on the iWarp and the Paragon and find that, on both machines, caching can improve performance even when the address computation is quite simple (one instruction per data word on the iWarp and 16 instructions per data word on the Paragon).
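Under simplified assumptions (uniform per-word costs, the cache filled during the first repetition), a break-even count of this kind has a closed form: caching pays off once (r − 1)(t_addr − t_read) ≥ t_fill. The cost symbols and code below are an illustrative sketch, not the paper's model or notation:

```python
import math

def break_even(t_addr, t_read, t_fill):
    """Smallest repetition count r at which a caching implementation
    matches the average latency of one that recomputes addresses
    every time. Per-word costs (illustrative assumptions):
      t_addr -- computing a destination address on-line
      t_read -- reading a precomputed address from the cache
      t_fill -- writing the cache during the first repetition
    Returns None when caching never breaks even (cache reads cost
    at least as much as recomputation)."""
    if t_read >= t_addr:
        return None
    # Without cache, total per-word cost after r reps: r * t_addr.
    # With cache: (t_addr + t_fill) + (r - 1) * t_read.
    # They are equal when (r - 1) * (t_addr - t_read) == t_fill.
    return 1 + math.ceil(t_fill / (t_addr - t_read))

# Cheap address computation, costly cache fill: many repetitions
# are needed before caching wins.
print(break_even(2, 1, 10))   # 11
# Expensive address computation: caching wins almost immediately.
print(break_even(16, 1, 5))   # 2
```

The guard clause captures the "never breaks even" regime the model predicts: when reading the cache is no cheaper than recomputing the address, repetition cannot amortize the cache-fill cost.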

To show the practical benefit of address relation caching, we examine the performance of an HPF distributed array communication library that can be configured to use caching. In some cases, caching can double the performance of the library. Finally, we discuss other benefits of caching and several open issues.

Keywords

Data Item · Critical Path · Message Passing Interface · Memory Bandwidth


References

  1. S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An integrated solution to high-speed parallel computing. In Supercomputing ’88, pages 330–339, November 1988.
  2. High Performance Fortran Forum. High Performance Fortran language specification version 1.0 draft, January 1993.
  3. T. Gross, D. O’Hallaron, and J. Subhlok. Task parallelism in a High Performance Fortran framework. IEEE Parallel & Distributed Technology, 2(3):16–26, 1994.
  4. i860 Microprocessor Family Programmer’s Reference Manual. Intel Corporation, 1992.
  5. Intel Corp. Paragon X/PS Product Overview, March 1991.
  6. P. Pierce and G. Regnier. The Paragon implementation of the NX message passing interface. In Proc. Scalable High Performance Computing Conference, pages 184–190, Knoxville, TN, May 1994. IEEE Computer Society Press.
  7. J. Saltz, S. Petiton, H. Berryman, and A. Rifkin. Performance effects of irregular communication patterns on massively parallel multiprocessors. Journal of Parallel and Distributed Computing, 13:202–212, 1991.
  8. P. Steenkiste, B. Zill, H. Kung, S. Schlick, J. Hughes, B. Kowalski, and J. Mullaney. A host interface architecture for high-speed networks. In Proceedings of the 4th IFIP Conference on High Performance Networks, pages A3 1–16, Liege, Belgium, December 1992. IFIP, Elsevier.
  9. J. Stichnoth. Efficient compilation of array statements for private memory multicomputers. Technical Report CMU-CS-93-109, School of Computer Science, Carnegie Mellon University, February 1993.
  10. J. Stichnoth, D. O’Hallaron, and T. Gross. Generating communication for array statements: Design, implementation, and evaluation. Journal of Parallel and Distributed Computing, 21(1):150–159, April 1994.
  11. T. Stricker and T. Gross. Optimizing memory system performance for communication in parallel computers. In Proc. 22nd Intl. Symp. on Computer Architecture, Portofino, Italy, June 1995. ACM/IEEE. To appear.
  12. T. Stricker, J. Stichnoth, D. O’Hallaron, S. Hinrichs, and T. Gross. Decoupling synchronization and data transfer in message passing systems of parallel computers. In Proc. Intl. Conf. on Supercomputing, Barcelona, July 1995. ACM. To appear.
  13. V. S. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 2(4):315–339, December 1990.
  14. D. Walker. The design of a standard message passing interface for distributed memory concurrent computers. Parallel Computing, 20(4):657–673, April 1994.

Copyright information

© Springer Science+Business Media New York 1996

Authors and Affiliations

  • Peter A. Dinda (1)
  • David R. O’Hallaron (1)
  1. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
