Improving the Performance of Collective Operations in MPICH

  • Rajeev Thakur
  • William D. Gropp
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2840)

Abstract

We report on our work on improving the performance of collective operations in MPICH on clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth usage for long messages. Although we have implemented new algorithms for all MPI collective operations, because of limited space we describe only the algorithms for allgather, broadcast, reduce-scatter, and reduce. We present performance results using the SKaMPI benchmark on a Myrinet-connected Linux cluster and an IBM SP. In all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM’s MPI on the SP.
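One of the allgather algorithms the paper describes for short messages is recursive doubling: in step k, each process exchanges everything it has gathered so far with the partner whose rank differs in bit k, so all P blocks reach all processes in log2(P) steps. As an illustrative sketch only (a Python simulation of the communication pattern, not MPICH's actual C implementation; the function name and list-of-dicts "process" model are invented for this example):

```python
def recursive_doubling_allgather(blocks):
    """Simulate recursive-doubling allgather.

    blocks[p] is the data initially held by process p; the process
    count is assumed to be a power of two, as in the simplest form
    of the algorithm.
    """
    p_count = len(blocks)
    assert p_count & (p_count - 1) == 0, "power-of-two process count assumed"
    # buf[p] maps block index -> data currently held by "process" p
    buf = [{p: blocks[p]} for p in range(p_count)]
    dist = 1
    while dist < p_count:
        # On real hardware all pairwise exchanges in a step proceed
        # concurrently; snapshotting the buffers models that here.
        snapshot = [dict(b) for b in buf]
        for p in range(p_count):
            partner = p ^ dist          # partner differs in one bit
            buf[p].update(snapshot[partner])
        dist *= 2
    # Every process now holds all P blocks, in rank order.
    return [[buf[p][i] for i in range(p_count)] for p in range(p_count)]
```

Each process sends and receives log2(P) messages of doubling size, which is why this pattern favors latency-bound (short-message) allgathers, while bandwidth-oriented algorithms such as ring exchange are preferred for long messages.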


References

  1. Barnett, M., Gupta, S., Payne, D., Shuler, L., van de Geijn, R., Watts, J.: Interprocessor collective communication library (InterCom). In: Proceedings of Supercomputing 1994 (November 1994)
  2. Barnett, M., Littlefield, R., Payne, D., van de Geijn, R.: Global combine on mesh architectures with wormhole routing. In: Proceedings of the 7th International Parallel Processing Symposium (April 1993)
  3. Bokhari, S.: Complete exchange on the iPSC/860. Technical Report 91-4, ICASE, NASA Langley Research Center (1991)
  4. Bokhari, S., Berryman, H.: Complete exchange on a circuit switched mesh. In: Proceedings of the Scalable High Performance Computing Conference, pp. 300–306 (1992)
  5. Hensgen, D., Finkel, R., Manber, U.: Two algorithms for barrier synchronization. International Journal of Parallel Programming 17(1), 1–17 (1988)
  6. Kale, L.V., Kumar, S., Vardarajan, K.: A framework for collective personalized communication. In: Proceedings of the 17th International Parallel and Distributed Processing Symposium, IPDPS 2003 (2003)
  7. Karonis, N., de Supinski, B., Foster, I., Gropp, W., Lusk, E., Bresnahan, J.: Exploiting hierarchy in parallel computer networks to optimize collective operation performance. In: Proceedings of the Fourteenth International Parallel and Distributed Processing Symposium (IPDPS 2000), pp. 377–384 (2000)
  8. Kielmann, T., Hofman, R.F.H., Bal, H.E., Plaat, A., Bhoedjang, R.A.F.: MagPIe: MPI's collective communication operations for clustered wide area systems. In: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 1999), May 1999, pp. 131–140. ACM Press, New York (1999)
  9. Mitra, P., Payne, D., Shuler, L., van de Geijn, R., Watts, J.: Fast collective communication libraries, please. In: Proceedings of the Intel Supercomputing Users' Group Meeting (June 1995)
  10. Rabenseifner, R.: Effective bandwidth (b_eff) benchmark, http://www.hlrs.de/mpi/beff
  11. Rabenseifner, R.: New optimized MPI reduce algorithm, http://www.hlrs.de/organization/par/services/models/mpi/myreduce.html
  12. Sanders, P., Träff, J.L.: The hierarchical factor algorithm for all-to-all communication. In: Monien, B., Feldmann, R.L. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 799–803. Springer, Heidelberg (2002)
  13. Scott, D.: Efficient all-to-all communication patterns in hypercube and mesh topologies. In: Proceedings of the 6th Distributed Memory Computing Conference, pp. 398–403 (1991)
  14. Shroff, M., van de Geijn, R.A.: CollMark: MPI collective communication benchmark. Technical report, Dept. of Computer Sciences, University of Texas at Austin (December 1999)
  15. Sistare, S., vandeVaart, R., Loh, E.: Optimization of MPI collectives on clusters of large-scale SMPs. In: Proceedings of SC 1999: High Performance Networking and Computing (November 1999)
  16. Tipparaju, V., Nieplocha, J., Panda, D.K.: Fast collective operations using shared and remote memory access protocols on clusters. In: Proceedings of the 17th International Parallel and Distributed Processing Symposium, IPDPS 2003 (2003)
  17. Träff, J.L.: Improved MPI all-to-all communication on a Giganet SMP cluster. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J., Volkert, J. (eds.) PVM/MPI 2002. LNCS, vol. 2474, pp. 392–400. Springer, Heidelberg (2002)
  18. Vadhiyar, S.S., Fagg, G.E., Dongarra, J.: Automatically tuned collective communications. In: Proceedings of SC 1999: High Performance Networking and Computing (November 1999)
  19. Worsch, T., Reussner, R., Augustin, W.: On benchmarking collective MPI operations. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J., Volkert, J. (eds.) PVM/MPI 2002. LNCS, vol. 2474, pp. 271–279. Springer, Heidelberg (2002)

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Rajeev Thakur (1)
  • William D. Gropp (1)
  1. Argonne National Laboratory, Mathematics and Computer Science Division, Argonne, USA