Cluster Computing, Volume 11, Issue 1, pp. 75–90

A framework for characterizing overlap of communication and computation in parallel applications

  • Aniruddha G. Shet
  • P. Sadayappan
  • David E. Bernholdt
  • Jarek Nieplocha
  • Vinod Tipparaju

Abstract

Effective overlap of computation and communication is a well-understood technique for latency hiding and can yield significant performance gains for applications on high-end computers. In this paper, we propose an instrumentation framework for message-passing systems that characterizes the degree of overlap of communication with computation in the execution of parallel applications. A significant obstacle is the inability to obtain precise time-stamps for the pertinent communication events; the framework addresses it by generating minimum and maximum bounds on the achieved overlap. These overlap measures can aid application developers and system designers in investigating scalability issues. The approach has been used to instrument two MPI implementations as well as the ARMCI system. Because the instrumentation resides entirely within the communication library, it integrates well with existing approaches that operate outside the library. The utility of the framework is demonstrated by analyzing communication-computation overlap for micro-benchmarks and the NAS benchmarks; the insights obtained are then used to modify the NAS SP benchmark, resulting in improved overlap.
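As a rough illustration of the measurement problem the paper tackles, the sketch below shows how a micro-benchmark might bound achieved overlap from the application side, in contrast to the paper's in-library instrumentation: it times a nonblocking MPI exchange once with an immediate wait and once with independent computation inserted between the post and the wait. This is a minimal sketch, not the authors' framework; the buffer size, pairwise exchange pattern, and compute kernel are all illustrative choices.

```c
/* Minimal sketch (not the paper's instrumentation): estimating
 * communication-computation overlap for a nonblocking MPI exchange.
 * The potential overlap window is the time between posting MPI_Isend/
 * MPI_Irecv and calling MPI_Waitall; comparing the residual wait time
 * against a no-computation baseline bounds the achieved overlap. */
#include <mpi.h>
#include <stdio.h>

#define N (1 << 20)                            /* illustrative message size */

static double compute(const double *a, int n)  /* stand-in independent work */
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * 1.000001;
    return s;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size % 2) {                            /* sketch assumes even ranks */
        if (rank == 0) fprintf(stderr, "run with an even number of ranks\n");
        MPI_Finalize();
        return 1;
    }

    static double sbuf[N], rbuf[N], work[N];   /* zero-initialized statics */
    int peer = rank ^ 1;                       /* pairwise exchange partner */
    MPI_Request req[2];

    /* Baseline: nonblocking exchange with an immediate wait. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Irecv(rbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    double t_comm = MPI_Wtime() - t0;

    /* Overlapped run: independent computation between post and wait. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Irecv(rbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);
    double chk = compute(work, N);
    double tw = MPI_Wtime();
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    double t_wait = MPI_Wtime() - tw;

    /* If communication fully overlapped the computation, t_wait ~ 0;
     * the hidden fraction is roughly (t_comm - t_wait) / t_comm. */
    if (rank == 0)
        printf("comm %.6f s, residual wait %.6f s, overlap <= %.1f%% (chk %g)\n",
               t_comm, t_wait, 100.0 * (t_comm - t_wait) / t_comm, chk);

    MPI_Finalize();
    return 0;
}
```

Such application-side timing can only bracket the true overlap, since the library's internal progress events are invisible to it; the paper's framework instead instruments inside the communication library, which is what allows it to generate the minimum and maximum bounds described above.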

Keywords

Communication-computation overlap · Latency hiding · Performance instrumentation and monitoring · Parallel applications


Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Aniruddha G. Shet (1)
  • P. Sadayappan (1)
  • David E. Bernholdt (2)
  • Jarek Nieplocha (3)
  • Vinod Tipparaju (3)

  1. The Ohio State University, Columbus, USA
  2. Oak Ridge National Laboratory, Oak Ridge, USA
  3. Pacific Northwest National Laboratory, Richland, USA