TAMM: A New Topology-Aware Mapping Method for Parallel Applications on the Tianhe-2A Supercomputer

  • Xinhai Chen
  • Jie Liu
  • Shengguo Li
  • Peizhen Xie
  • Lihua Chi
  • Qinglin Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11334)


As high performance computing systems grow in size, communication overhead between processors has become a key performance bottleneck. Default process-to-processor mapping strategies, however, do not take the topology of the interconnection network into account, so messages may travel long distances across the network. To improve communication locality, we propose a new topology-aware mapping method called TAMM. Using an accurate description of the communication pattern and the network topology, TAMM employs a two-step optimization strategy to obtain an efficient mapping for a variety of parallel applications. The strategy first extracts an appropriate subset of the idle computing resources on the underlying system and then constructs an optimized one-to-one mapping with a refined iterative algorithm. Experimental results demonstrate that TAMM can effectively improve communication performance on the Tianhe-2A supercomputer.
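
The abstract does not spell out the algorithms behind the two steps, so the following is only a minimal Python sketch of the general idea, not TAMM itself. It assumes a symmetric communication matrix `comm` (traffic volume between processes), a distance matrix `dist` (hops between nodes), a hop-bytes cost function, a greedy compact-subset heuristic for step one, and a pairwise-swap local search for step two. All function names and heuristics here are hypothetical stand-ins chosen for illustration.

```python
from itertools import combinations


def hop_bytes(comm, dist, mapping):
    """Sum of traffic volume times network distance over all process pairs."""
    n = len(mapping)
    return sum(comm[i][j] * dist[mapping[i]][mapping[j]]
               for i in range(n) for j in range(i + 1, n))


def select_nodes(dist, idle, k):
    """Step 1 (assumed heuristic): greedily grow a compact set of k idle nodes."""
    # Seed with the idle node closest, in total distance, to all idle nodes.
    chosen = [min(idle, key=lambda u: sum(dist[u][v] for v in idle))]
    while len(chosen) < k:
        rest = [u for u in idle if u not in chosen]
        # Add the remaining node with the smallest total distance to the set.
        chosen.append(min(rest, key=lambda u: sum(dist[u][v] for v in chosen)))
    return chosen


def refine(comm, dist, mapping):
    """Step 2 (assumed heuristic): swap process pairs while the cost drops."""
    mapping = list(mapping)
    best = hop_bytes(comm, dist, mapping)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(mapping)), 2):
            mapping[i], mapping[j] = mapping[j], mapping[i]
            cost = hop_bytes(comm, dist, mapping)
            if cost < best:
                best, improved = cost, True
            else:
                # Undo a swap that did not lower the cost.
                mapping[i], mapping[j] = mapping[j], mapping[i]
    return mapping, best


# Toy example: six nodes on a line, node 1 is busy, four processes to place.
dist = [[abs(u - v) for v in range(6)] for u in range(6)]
comm = [[0, 1, 10, 0],
        [1, 0, 0, 10],
        [10, 0, 0, 1],
        [0, 10, 1, 0]]
nodes = select_nodes(dist, idle=[0, 2, 3, 4, 5], k=4)
mapping, cost = refine(comm, dist, nodes)
print("nodes:", nodes, "mapping:", mapping, "hop-bytes:", cost)
```

Here `mapping[i]` is the node assigned to process `i`. The swap search only accepts strict improvements, so it terminates, but it can stop in a local minimum. A production mapper would likely need a stronger search and a topology model that reflects the real interconnect.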


High performance computing systems; Topology-aware mapping; Communication pattern; Network topology



This research was supported in part by the National Key Research and Development Program of China (2017YFB0202104), the National Natural Science Foundation of China under Grant Nos. 91530324 and 91430218, the China Postdoctoral Science Foundation (CPSF) under Grant No. 2014M562570, and a Special Financial Grant from CPSF under Grant No. 2015T81127.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, China
  2. Institute of Advanced Science and Technology, Hunan Institute of Traffic Engineering, Hengyang, China
