An In-Depth Performance Analysis of Many-Integrated Core for Communication Efficient Heterogeneous Computing

  • Jie Zhang
  • Myoungsoo Jung
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10578)


Many-integrated core (MIC) architecture combines dozens of reduced x86 cores onto a single chip to offer a high degree of parallelism. Parallel user applications that execute across the many cores of one or more MICs require a series of data-sharing and synchronization operations with the host. In this work, we build a real CPU+MIC heterogeneous cluster and analyze its performance behaviors by examining different communication methods, such as message passing and remote direct memory access. Our evaluation results and in-depth studies reveal that (i) aggregating small messages can improve network bandwidth without violating latency restrictions, (ii) although MICs comprise hundreds of hardware cores, the highest network throughput is achieved when only 4 to 6 point-to-point connections are established for data communication, and (iii) data communication over multiple point-to-point connections between the host and MICs introduces severe load imbalance, which needs to be optimized for future heterogeneous computing.
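Finding (i) above, aggregating small messages to improve bandwidth, can be illustrated with a minimal, hypothetical sketch. The paper's experiments use SCIF/MPI on a real CPU+MIC cluster; the toy below only demonstrates the batching idea itself, using an in-process socket pair and length-prefixed framing, so that many small messages cost a single send instead of one send each:

```python
# Illustrative sketch (not the paper's code): aggregate many small
# messages into one buffer so a single send() amortizes the
# per-message network overhead.
import socket
import struct

def pack_messages(messages):
    """Length-prefix each small message and concatenate into one buffer."""
    return b"".join(struct.pack("!I", len(m)) + m for m in messages)

def unpack_messages(buf):
    """Split an aggregated buffer back into the original messages."""
    out, off = [], 0
    while off < len(buf):
        (n,) = struct.unpack_from("!I", buf, off)
        off += 4
        out.append(buf[off:off + n])
        off += n
    return out

# Toy demonstration over a local socket pair: one send call replaces
# 64 individual sends.
sender, receiver = socket.socketpair()
small_msgs = [b"msg%d" % i for i in range(64)]
payload = pack_messages(small_msgs)
sender.sendall(payload)

data = b""
while len(data) < len(payload):        # drain until the whole batch arrives
    data += receiver.recv(len(payload) - len(data))

assert unpack_messages(data) == small_msgs
sender.close()
receiver.close()
```

The latency caveat in finding (i) corresponds to how long a real implementation would wait to fill a batch before flushing it; this sketch sends immediately.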


Manycore · Communication · Accelerator · Parallel programming · High performance computing · Heterogeneous computing



This research is mainly supported by NRF 2016R1C1B2015312. This work is also supported in part by IITP-2017-2017-0-01015, NRF-2015M3C4A7065645, DOE DE-AC02-05CH11231, and a MemRay grant (2015-11-1731). The corresponding author is M. Jung.



Copyright information

© IFIP International Federation for Information Processing 2017

Authors and Affiliations

  1. Computer Architecture and Memory Systems Laboratory, School of Integrated Technology, Yonsei University, Seoul, South Korea
