A Case Study of Communication Optimizations on 3D Mesh Interconnects

  • Abhinav Bhatelé
  • Eric Bohm
  • Laxmikant V. Kalé
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5704)


Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with the number of hops traveled. Yet we and others have recently shown that, in the presence of contention, message latencies can grow substantially. Hence, task mapping strategies should take the topology of the machine into account on large machines. In this paper, we present topology-aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance.

Our methodology is facilitated by the idea of object-based decomposition used in Charm++, which separates the process of decomposition from the mapping of computation to processors and allows a more flexible mapping based on communication patterns between objects. Exploiting this and the topology of the allocated job partition, we present mapping strategies for a production code, OpenAtom, to improve overall performance and scaling. OpenAtom presents complex communication scenarios involving interactions among multiple groups of objects, which makes the mapping task a challenge. Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P and 2,048 processors of Cray XT3.
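To make the idea of topology-aware mapping concrete, the following is a minimal sketch, not the mapping scheme used for OpenAtom. It assumes a hypothetical 4 x 4 x 4 mesh, a hypothetical ring of eight communicating objects, and uses hop-bytes (message size times hops traveled, summed over all messages) as a rough proxy for contention; the abstract does not name this metric, so treat it as an assumption. All function and variable names are illustrative.

```python
def hops(a, b, dims, torus=False):
    """Hop count between coordinates a and b on a 3D mesh.

    If torus is True, wraparound links are assumed in every dimension.
    """
    total = 0
    for x, y, n in zip(a, b, dims):
        d = abs(x - y)
        total += min(d, n - d) if torus else d
    return total


def hop_bytes(comm, placement, dims, torus=False):
    """Sum of (bytes sent) * (hops traveled) over all communicating pairs."""
    return sum(
        nbytes * hops(placement[src], placement[dst], dims, torus)
        for (src, dst), nbytes in comm.items()
    )


def rank_to_coords(rank, dims):
    """Convert a linear rank to (x, y, z) with x varying fastest."""
    x = rank % dims[0]
    y = (rank // dims[0]) % dims[1]
    z = rank // (dims[0] * dims[1])
    return (x, y, z)


dims = (4, 4, 4)  # hypothetical partition shape
n_obj = 8

# Hypothetical workload: object i sends 1024 bytes to object (i + 1) mod 8.
comm = {(i, (i + 1) % n_obj): 1024 for i in range(n_obj)}

# Topology-oblivious placement: objects land on ranks 0, 8, 16, ..., 56.
oblivious = {i: rank_to_coords(i * 8, dims) for i in range(n_obj)}

# Topology-aware placement: walk a 2 x 2 x 2 corner of the mesh in
# Gray-code order so that ring neighbors are always one hop apart.
gray_cycle = [
    (0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
    (1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 0),
]
aware = dict(enumerate(gray_cycle))

oblivious_cost = hop_bytes(comm, oblivious, dims)
aware_cost = hop_bytes(comm, aware, dims)
print("oblivious hop-bytes:", oblivious_cost)
print("aware hop-bytes:    ", aware_cost)
```

In this toy setting the aware placement keeps every message on a single link, whereas the oblivious placement sends each message across several links that other messages may also need. Because decomposition into objects is separate from their placement, only the placement dictionary changes between the two runs, which mirrors the flexibility the abstract attributes to object-based decomposition.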


Keywords: Idle Time, Task Mapping, Topology Information, Communication Optimization, Large Machine





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Abhinav Bhatelé (1)
  • Eric Bohm (1)
  • Laxmikant V. Kalé (1)
  1. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA
