The Journal of Supercomputing, Volume 73, Issue 4, pp 1691–1714

Topology mapping of irregular parallel applications on torus-connected supercomputers

  • Jingjin Wu
  • Xuanxing Xiong
  • Eduardo Berrocal
  • Jia Wang
  • Zhiling Lan


Supercomputers with ever-increasing computing power are being built for scientific applications. As the system size scales up, so does the size of the interconnect network. As a result, communication in supercomputers becomes increasingly expensive due to long distances between nodes and network contention. Topology mapping, which maps parallel application processes onto compute nodes by considering the network topology and the application communication pattern, is an essential technique for communication optimization. In this paper, we study the topology mapping problem for torus-connected supercomputers and present an analytical topology mapping algorithm for parallel applications with irregular communication patterns. We formulate the problem as a discrete optimization problem in the geometric domain of a torus topology and design an analytical mapping algorithm that uses numerical solvers to compute the mapping. Experimental results show that our algorithm provides high-quality mappings on 3-dimensional tori, reducing communication time by up to 72%.
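A mapping's quality is commonly scored by a distance-weighted communication metric such as hop-bytes: the sum, over all communicating process pairs, of message volume times the torus hop distance between their assigned nodes. As a minimal illustration (the function names and the toy communication graph below are our own, not taken from the paper), evaluating two candidate mappings of a 4-process ring on a 3-dimensional torus might look like:

```python
def torus_distance(a, b, dims):
    """Shortest hop count between nodes a and b on a torus with
    per-dimension sizes dims, accounting for wraparound links."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

def hop_bytes(mapping, comm, dims):
    """Total hop-bytes of a process-to-node mapping.

    mapping: process id -> (x, y, z) node coordinates
    comm:    {(p, q): bytes} communication volumes between processes
    dims:    torus dimensions, e.g. (4, 4, 4)
    """
    return sum(vol * torus_distance(mapping[p], mapping[q], dims)
               for (p, q), vol in comm.items())

# Toy workload: four processes in a ring, 100 bytes per edge.
comm = {(0, 1): 100, (1, 2): 100, (2, 3): 100, (3, 0): 100}
dims = (4, 4, 4)
# Compact placement: neighbors in the ring sit on adjacent nodes.
good = {0: (0, 0, 0), 1: (1, 0, 0), 2: (1, 1, 0), 3: (0, 1, 0)}
# Scattered placement: every ring edge spans two hops.
bad = {0: (0, 0, 0), 1: (2, 0, 0), 2: (2, 2, 0), 3: (0, 2, 0)}
print(hop_bytes(good, comm, dims))  # 400
print(hop_bytes(bad, comm, dims))   # 800
```

A topology mapping algorithm such as the one described here searches for an assignment minimizing this kind of objective; the wraparound term in `torus_distance` is what distinguishes a torus from a plain mesh.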


High-performance computing · Topology mapping · Communication optimization · Torus network · Analytical algorithm



This work is supported in part by US National Science Foundation Grants OCI-0904670 and CNS-1320125. This work is also supported in part by the National Natural Science Foundation of China Grant 61402083. The authors thank the Argonne Leadership Computing Facility for the use of their supercomputers.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
  2. Design Group, Synopsys, Inc., Mountain View, USA
  3. Department of Computer Science, Illinois Institute of Technology, Chicago, USA
  4. Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, USA
