1 Introduction

In the past few years, the many integrated core (MIC) architecture has been employed to accelerate heterogeneous computing [5]. Typically, a MIC architecture chip has dozens of reduced x86 cores, which can provide massive parallelism. Parallel user applications, which are executed across different cores of one or more MICs, can share data or collaborate on computation by using user-level communication systems, such as the Message Passing Interface (MPI) [2] or the Open Computing Language (OpenCL) [6]. In modern MIC designs, these user-level communication systems are in practice built upon a symmetric communication interface, referred to as SCIF, which is a kernel component of the manycore platform software stack. Even though many prior studies perform user-level optimizations for efficient manycore communication [4], little attention has been paid to the low-level communication methods for MICs.

In this work, we build a real MIC-accelerated heterogeneous cluster that has eight main processor cores and 244 physical MIC cores (61 cores per MIC device), and we characterize the performance behavior of its low-level communication methods. Specifically, we evaluate the latency and bandwidth of the cluster using two different strategies: (i) message passing and (ii) remote memory access (RMA). While message passing establishes a pair of message queues for exchanging short, latency-sensitive messages, RMA enables one process to remotely access the memory of the target process to which it is connected. In this paper, we explore the full design space of MICs with these two communication methods by taking into account a wide spectrum of system parameters, including different datapath configurations, various data page and message sizes, and different numbers of threads, ports, and connection channels.

Fig. 1. Communication APIs of SCIF interface.

2 Background

In this section, we briefly explain how MIC devices communicate with the host through the user-level APIs supported by the software stack. In a heterogeneous cluster, each separate multiprocessor (i.e., the host and each MIC) is regarded as a node. The software stack of MIC architectures provides a common transport interface, referred to as the symmetric communication interface (SCIF) [1], which establishes a point-to-point communication link between a pair of processes, either on different nodes or on the same node. The APIs supported by the SCIF user-mode library can be categorized into connection APIs, messaging APIs, and remote memory access (RMA) APIs. Figure 1 demonstrates how two processes can connect with each other over the SCIF APIs. As shown in Fig. 1a, the connection APIs provide a socket-like handshaking procedure (e.g., scif_open(), scif_bind(), scif_listen(), scif_connect(), and scif_accept()) to set up connections [7]. The messaging APIs support two-sided communication between connected processes by implementing message queues (cf. Fig. 1b). Messages are sent and received via scif_send() and scif_recv(). The RMA APIs, on the other hand, are responsible for transferring large bulks of data. As shown in Fig. 1c, a process first maps a specific memory region of the local process into the address space of the target process through the memory registration API, scif_register(). It then uses the read/write APIs, scif_readfrom() and scif_writeto(), to access remote data.
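To make the contrast between two-sided messaging and one-sided RMA concrete, the following is a toy, in-process Python model. It does not call SCIF and is not how the library is implemented; the `Endpoint` class, `connect()`, and all method names are hypothetical stand-ins that only mirror the roles of scif_send()/scif_recv() and scif_register()/scif_writeto()/scif_readfrom() described above.

```python
import queue


class Endpoint:
    """Toy stand-in for one end of a connected channel (not a real SCIF endpoint)."""

    def __init__(self):
        self.inbox = queue.Queue()  # messages waiting for this side to receive
        self.peer = None            # set by connect()
        self.window = None          # buffer exposed to the peer, if registered

    # Two-sided messaging: both sides take part in every transfer.
    def send(self, payload: bytes) -> None:
        self.peer.inbox.put(bytes(payload))

    def recv(self) -> bytes:
        # The data reaches the application only when the receiver asks for it.
        return self.inbox.get(timeout=1.0)

    # One-sided RMA: the target registers memory once, then stays passive.
    def register(self, buf: bytearray) -> None:
        self.window = buf

    def _check(self, offset: int, length: int) -> bytearray:
        window = self.peer.window
        if window is None:
            raise RuntimeError("peer has not registered any memory")
        if offset < 0 or offset + length > len(window):
            raise ValueError("access outside the registered window")
        return window

    def write_to(self, data: bytes, remote_offset: int) -> None:
        window = self._check(remote_offset, len(data))
        window[remote_offset:remote_offset + len(data)] = data

    def read_from(self, length: int, remote_offset: int) -> bytes:
        window = self._check(remote_offset, length)
        return bytes(window[remote_offset:remote_offset + length])


def connect(a: Endpoint, b: Endpoint) -> None:
    """Pair two endpoints; stands in for the whole connection handshake."""
    a.peer, b.peer = b, a


host, mic = Endpoint(), Endpoint()
connect(host, mic)

host.send(b"ping")              # two-sided: mic must call recv()
reply = mic.recv()

mic.register(bytearray(4096))   # one-sided: mic exposes a buffer once
host.write_to(b"bulk", 0)       # host writes without mic's involvement
```

In this model the receiver must call `recv()` for each message, whereas after `register()` the target takes no further action while the initiator reads or writes its buffer. That difference is the reason the text associates messaging with short messages and RMA with bulk data.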

Fig. 2. Performance improvement of message passing with multiple connection channels.

Fig. 3. Performance improvements of RMA with connection channels.

Fig. 4. Max, min, and average communication latencies of each thread under various connection channel and DMA length configurations.

3 Empirical Evaluation and Analysis

Message-based Communication Method. Figure 2 illustrates the performance of the MIC message passing mechanism when employing different numbers of producer/consumer threads within a single connection channel or when establishing multiple connection channels. Specifically, "XPYC" indicates that X producer threads and Y consumer threads are employed for a single connection channel, while "XConn" means that X connection channels are established with a single producer thread and a single consumer thread per connection channel. From the results, we conclude that aggregating multiple small messages into 512 B messages improves communication throughput without introducing an extra latency penalty (Finding 1). Multiple producer and consumer threads are unable to improve the communication throughput of a single point-to-point connection channel (Finding 2). In contrast, properly establishing multiple point-to-point connection channels can significantly improve the performance (Finding 3).
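Finding 1 suggests batching small messages before sending them. The following Python sketch shows one possible way to do so; the 2-byte length-prefix framing and the function names `aggregate()` and `split()` are assumptions made for illustration and are not part of SCIF or of the evaluated system. Only the 512 B target comes from the text.

```python
import struct
from typing import Iterable, Iterator, List

TARGET_BYTES = 512             # aggregation size taken from Finding 1
HEADER = struct.Struct("<H")   # hypothetical framing: 2-byte length prefix


def aggregate(messages: Iterable[bytes],
              target: int = TARGET_BYTES) -> Iterator[bytes]:
    """Pack length-prefixed messages into batches of at most `target` bytes.

    A single message whose frame exceeds `target` is emitted as its own
    oversized batch. Messages must be shorter than 65536 bytes because of
    the 2-byte prefix.
    """
    batch = bytearray()
    for msg in messages:
        frame = HEADER.pack(len(msg)) + msg
        # Flush the current batch if this frame would overflow it.
        if batch and len(batch) + len(frame) > target:
            yield bytes(batch)
            batch = bytearray()
        batch += frame
    if batch:
        yield bytes(batch)


def split(batch: bytes) -> List[bytes]:
    """Recover the original messages from one batch on the receiving side."""
    out = []
    pos = 0
    while pos < len(batch):
        (length,) = HEADER.unpack_from(batch, pos)
        pos += HEADER.size
        out.append(batch[pos:pos + length])
        pos += length
    return out


# Forty 30-byte messages become 32-byte frames, i.e. 16 frames per batch.
small = [bytes([i]) * 30 for i in range(40)]
batches = list(aggregate(small))
restored = [m for b in batches for m in split(b)]
```

Each batch would then be handed to a single send call instead of one call per small message. The framing overhead (two bytes per message here) is the price of letting the receiver split the batch again.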

RMA-based Communication Method. Figure 3 shows the performance of RMA when employing multiple producers and consumers or setting up multiple connection channels. From this figure, we conclude that it is better to use the RMA-based approach than the message-based approach to transfer data larger than 512 B, even though the minimum data access granularity of the RMA-based approach is 4 KB (Finding 4). As shown in Fig. 3, more producer/consumer threads unfortunately do not improve the performance (Finding 5). In addition, more connection channels can improve the performance, but four connection channels are sufficient to achieve the best performance (Finding 6). To determine the reason behind the poor performance observed when establishing many connection channels, we analyze the busy time of each thread that performs data transfers over an individual connection channel. Figure 4 shows the maximum, minimum, and average communication latency of each thread with different numbers of connection channels and different transfer data sizes. Based on the results, we conclude that communication over SCIF can introduce long tail latency, which degrades the performance (Finding 7).
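As a rough illustration of Findings 4 and 7, the Python sketch below (i) picks a communication method from the transfer size and (ii) summarizes per-thread latencies in the style of Fig. 4. The function names, the return formats, and the max-to-mean "tail ratio" are assumptions introduced here for illustration; only the 512 B threshold and the 4 KB granularity come from the text, and the sample latencies are made-up numbers, not measurements.

```python
from statistics import mean
from typing import Dict, List, Tuple

MSG_THRESHOLD = 512      # bytes; above this, Finding 4 favors RMA
RMA_GRANULARITY = 4096   # bytes; minimum RMA access granularity in the text


def choose_method(size: int) -> Tuple[str, int]:
    """Return (method, transfer_length) for a payload of `size` bytes.

    For RMA, the length is rounded up to the next multiple of the
    4 KB granularity; for messaging it is the payload size itself.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if size > MSG_THRESHOLD:
        pages = -(-size // RMA_GRANULARITY)   # ceiling division
        return ("rma", pages * RMA_GRANULARITY)
    return ("message", size)


def summarize(latencies: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per-thread min, mean, and max latency plus max/mean as a tail indicator."""
    summary = {}
    for thread, samples in latencies.items():
        avg = mean(samples)
        summary[thread] = {
            "min": min(samples),
            "mean": avg,
            "max": max(samples),
            "tail_ratio": max(samples) / avg,  # illustrative metric, not from the paper
        }
    return summary


# Made-up latencies (arbitrary units) for two threads on separate channels.
stats = summarize({"t0": [1.0, 1.0, 4.0], "t1": [2.0, 2.0, 2.0]})
```

A thread whose maximum latency is far above its mean (a large `tail_ratio` in this sketch) is the kind of outlier that Finding 7 identifies as limiting performance when many connection channels are used.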

4 Conclusion

In this work, we evaluated and analyzed the performance of inter-node communication across CPU cores and multiple MICs. Our evaluation results reveal that the performance of current inter-node communication methods is suboptimal owing to the low throughput of small requests and the long tail latency. We also provided seven system-level findings with an in-depth performance analysis.