Compiler-directed scratchpad memory data transfer optimization for multithreaded applications on a heterogeneous many-core architecture

The heterogeneous many-core architecture plays an important role in the fields of high-performance computing and scientific computing. It uses accelerator cores with on-chip memories to improve performance and reduce energy consumption. Scratchpad memory (SPM) is a kind of fast on-chip memory with lower energy consumption compared with a hardware cache. However, data transfer between SPM and off-chip memory can be managed only by a programmer or compiler. In this paper, we propose a compiler-directed multithreaded SPM data transfer model (MSDTM) to optimize the process of data transfer in a heterogeneous many-core architecture. We use compile-time analysis to classify data accesses, check dependences and determine the allocation of data transfer operations. We further present the data transfer performance model to derive the optimal granularity of data transfer and select the most profitable data transfer strategy. We implement the proposed MSDTM on the GCC compiler and evaluate it on Sunway TaihuLight with selected test cases from benchmarks and scientific computing applications. The experimental results show that the proposed MSDTM improves the application execution time by 5.49× and achieves an energy saving of 5.16× on average.


Introduction
The heterogeneous many-core architecture is widely used in the fields of high-performance computing and scientific computing [13,32]. Because it requires a large number of cores [5], which can provide a high degree of parallelism and a high computing speed, reducing the energy consumption of many-core architectures has become a major challenge. To reduce the energy consumption, accelerator cores with local memories are provided in this architecture [13]. However, this means that two different kinds of memories are utilized in this architecture, i.e., off-chip memory and on-chip memory, which results in a more complex storage system. A profound challenge of such a system is determining how to handle the data transfer between off-chip memory and on-chip memory. For scientific computing applications, when using heterogeneous many-core processors to accelerate computations, the efficient use of storage systems is one of the most critical factors in improving performance and reducing energy consumption.
Scratchpad memory (SPM) [3] is a kind of fast on-chip memory managed by software (a programmer or a compiler), whereas a cache has to query hardware-managed flag bits to check for cache misses or hits. Compared with the hardware cache, SPM does not need to perform flag bit judgment and other tasks, and has the advantages of low power consumption and fast access. SPM was initially used in embedded systems to meet their real-time and time-predictability requirements [17]. SPM is also extensively used in FPGAs and employed as application-specific caches [33,34]. Current heterogeneous many-core processors, such as Adapteva Epiphany, Sunway TaihuLight, and IBM Cell, also use SPM to achieve better performance and lower energy consumption. The characteristic of this type of architecture is that each accelerator core has its own SPM that can be accessed at high speed but has limited space.
The SPM is connected to the off-chip memory through a bus [26]. The accelerator cores can access the data of off-chip memory directly only by global load/store instructions or direct memory access (DMA) [11,29]. The accelerator cores can communicate with each other by a network-on-chip (NoC). However, programmers need to explicitly manage the data transfer between the SPM and the off-chip memory in the application, which hinders program development.
In single-threaded applications, we can use compiler-directed data buffering through DMA to optimize the data transfer between SPM and off-chip memory [4,23]. This process requires precise analysis of the access patterns and careful management of the data size. With data buffering, global load/store operations to off-chip memory can be replaced with direct accesses to local buffers in SPM without redundant look-up operations. However, multithreaded applications have more synchronization than single-threaded applications. Keeping the coherence of multithreaded applications makes the optimization more complex.
Here, we propose a multithreaded SPM data transfer model (MSDTM) to optimize the data transfer between SPM and off-chip memory on heterogeneous many-core architecture. It first analyzes the application to classify data accesses and determine the allocation of data transfer with the data transfer allocation (DTA) algorithm. Next, it uses the data transfer performance (DTP) model to derive the optimal granularity of data transfer and select the most profitable data transfer strategy. Then, the code is transformed by the MSDTM with loop distribution and strip-mining. We implement the proposed MSDTM on the GCC compiler.
Optimizing data transfer operations by the MSDTM can effectively improve the performance of multithreaded applications and reduce the energy consumption. Since the MSDTM is used in the compilation process, it can also effectively reduce the programming difficulty.
The major contributions of this paper are as follows:
• We propose an algorithm to determine the allocation of data transfer for multithreaded applications with an analysis of data accesses and dependence checking.
• We formulate the data transfer strategy selection problem for multithreaded applications on an SPM-based heterogeneous many-core architecture and design a performance model to derive the optimal granularity of data transfer and select the most profitable strategy.
• We implement and evaluate our proposed model on Sunway TaihuLight with the kernel of scientific computing programs and applications from general benchmarks.
The remainder of the paper is organized as follows. In Sect. 2, we mention some related work, while in Sect. 3, we use a simple example to illustrate our motivation. Section 4 presents the MSDTM. We evaluate the proposed MSDTM in Sect. 5. Section 6 concludes this paper.

SPM-based heterogeneous many-core architectures
For energy consumption and scalability considerations, some heterogeneous many-core processors choose SPM as a fast on-chip memory. For example, the IBM Cell [6] processor includes a 64-bit PowerPC general-purpose processor core, the Power Processor Element, and 8 coprocessors, the synergistic processor elements (SPEs). Each SPE contains 256 KB of local storage space for storing code and data executed on the SPE; the storage address space is private, and threads on different SPEs can communicate only through the main memory. Adapteva Epiphany [15] is an SPM-based many-core architecture that is energy efficient and suitable for embedded systems. Each processor in Adapteva Epiphany consists of a RISC core, a DMA engine, a network interface, and a 32 KB SPM, connected using a 2D mesh grid. This architecture provides a shared address space that allows threads to communicate by accessing nonlocal SPM. Sunway TaihuLight [12] has a heterogeneous many-core architecture. Each core group includes one management processing element and 8 × 8 computing processing elements, and each computing processing element contains a 64 KB local data memory.

SPM data management
A number of works have optimized SPM data management on heterogeneous many-core architectures. [37] identifies continuous code blocks in memory, such as functions and basic blocks that are executed, and maps them to the SPM area. To improve the performance of embedded systems and minimize energy consumption, [8] performs static analysis on an MPSoC (multicore processor system-on-chip) shared distributed SPM to obtain the optimal memory allocation method. [24] optimizes the compiler on the basis of OpenMP and distributes the array data of the parallel part of applications executed on MPSoC to distributed SPMs. [36] formulates the SPM data allocation problem for multithreaded applications and proposes the NoC contention and latency aware compile-time framework to automatically determine the location of data variables, the replication degree of shared data, and on-chip placement. [20] maps the SPM management problem for data aggregates into the well-understood register allocation problem for scalars to automatically assign static data aggregates in a program to an SPM.

Data transfer optimization
To optimize SPM data transfer, the direct blocking data buffer (DBDB) [6] is designed and implemented to optimize the use of local memory while providing a simple shared memory programming model for the Cell-BE architecture. [7] develops a model to automatically infer the optimal buffering scheme and size for static buffering, taking into account the DMA latency and transfer rates and the amount of computation in the application loop being targeted. [31] presents optimized buffering techniques and evaluates them for two multicore architectures: quad-core Opteron and the Cell-BE. [30] derives optimal and near-optimal values for the number of blocks that should be clustered in a single DMA command based on the computation time and size of the elementary data items as well as the DMA characteristics.
On heterogeneous many-core architectures, [38] presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications, in order to exploit spatial and temporal sharing of the heterogeneous processing units. [28] presents a runtime system that automatically optimizes data management on SPM to achieve performance similar to that on the fast memory-only system with a much smaller capacity of fast memory.
To the best of our knowledge, ours is the first work to propose an SPM DTP model for multithreaded applications on SPM-based heterogeneous many-core architectures to reduce the overall application execution time, with evaluation on a real platform.

Motivating example
We use a simple motivating example to illustrate the efficiency of optimizing SPM data transfer for multithreaded applications. For illustration purposes, we use the system parameters of Sunway TaihuLight as stated in Tables 1 and 2 in this example [9]. In addition, the start-up overhead of DMA transfers is 300 cycles.
According to Tables 1 and 2, we show the relationship between the time spent on memory access by NoC and the cost of memory access by the DMA with different granularities in Fig. 1.
From the trend of the curve in Fig. 1, we can predict that when the granularity is increased, DMA will result in better transfer efficiency compared to using NoC for data transfer.
The execution time of a multithreaded program consists of the computation time and the time spent on memory access. To simplify our illustration, we assume that the computing performance of each thread is the same; thus, the time spent on memory access is the only component that can be optimized to reduce the program execution time.
Since we execute multithreaded applications on heterogeneous many-core processors, the time spent due to memory access can be divided into (1) the latency of data access, depending on where the variable is located, i.e., SPM or off-chip memory (AccessLat), (2) the latency spent in data communication via the DMA operation (DMALat), and (3) the latency spent in data communication via NoC (NoCLat). The DMA latency and NoC latency can be divided further into the initialization time, transfer time and delay due to contention among memory requests. The efficiency of DMA transfers on 64 threads is lower than the efficiency on a single thread, which is caused by the contention among memory requests. This observation means that the delay due to the contention among DMA requests is already included in the transfer time of DMA. Because the NoC latency can be obtained by multiplying the total number of Hops by HopLat, we do not need to divide it into the initialization time, transfer time and delay due to the contention among memory requests.
As shown in Fig. 2, we choose a multithreaded kernel of a scientific computing application that is executed on 64 threads.
Arrays A, B and C are global variables, which means that they need to be accessed from off-chip memory. The variables id, j and coeff are all private variables; thus, they need to be accessed only from the SPM of each core. Therefore, all we need to consider is how to optimize the time spent due to memory access for the A, B and C arrays. From lines 6 and 7, we can see that there is a true dependence because the value of B written in line 6 is used as a source operand in line 7. Unlike a common true dependence, however, the result of line 6 is used by another thread. We call this kind of true dependence a thread-carried true dependence.
In this example, variables A, B and C are allocated in off-chip memory by default. Unlike A and C, variable B is accessed twice in this piece of code. Without any optimization, we need to access the variables from the off-chip memory directly. Each thread issues 4 × 64 = 256 accesses to variables A, B and C. The access latency is 256 × (off-chip AccessLat) = 256 × 278 = 71168 cycles, which is also the total execution time due to memory access.
We use a data buffer to optimize the data transfer with DMA operations directly. After the transformation procedure, the intermediate code is as presented in Fig. 3a.
Variable A is explicitly brought from off-chip memory to SPM via DMA. The off-chip memory access latency becomes the sum of the SPM memory access latency and the DMA latency. Because of the thread-carried true dependence, we need to insert DMA operations between the source and the sink to maintain the dependence. The granularity of the two DMA operations is 8 B. The DMA cost can be obtained from Table 2. In addition to DMA operations, we also need to read/write data from the SPM via load/store instructions. The SPM access latency is (Ld/stLat) × 4 × 64 = 4 × 4 × 64 = 1024 cycles. Therefore, the total execution time due to memory access is AccessLat + DMALat = 1024 + (327 × 2 + 312 × 2 × 64) = 41614 cycles. Compared to the default strategy, the data buffer lowers the execution time by a factor of 71168/41614 ≈ 1.71×.
Fig. 2 Code without data transfer optimization
We can see that due to the existence of thread-carried dependences, two DMA operations need to be inserted in each loop iteration. However, the initialization time of each DMA operation is relatively long, while the amount of data transferred is small. This situation leads to no profit being gained from using DMA to optimize SPM data transfer. Therefore, as shown in Fig. 3b, we can use loop distribution and strip-mining to transform the code to change the granularity of the data transfer.
With the code transformation, the granularity of the DMA transfer in the loop becomes 256 B. The total execution time due to memory access is AccessLat + DMALat = 1024 + (327 × 2 + 313 × 2 × 2) = 2930 cycles. Relative to the unoptimized code, the larger granularity yields a 71168/2930 ≈ 24.29× acceleration.
Next, we attempt to replace the DMA operation in the loop body with NoC for two different granularities in the data transfer process. The process of data transfer using NoC is shown in Fig. 4. Each box represents the SPM of a thread. Since NoC can transfer data only by using XY routing in a row or a column, only one thread can send data at a time during the data transfer process. The code using NoC to optimize the data transfers is shown in Fig. 5. Because of the structure of NoC, all threads need 8 Hops to complete the data transfer in this piece of code. Therefore, when the granularity is 8 B, NoCLat = 8 × HopLat = 8 × 10 = 80 cycles. The total execution time due to memory access is AccessLat + DMALat + NoCLat = 1024 + 327 × 3 + 80 × 64 = 7125 cycles. When the granularity is 256 B, NoCLat = 8 × 80 = 640 cycles. The total execution time due to memory access is AccessLat + DMALat + NoCLat = 1024 + 327 × 3 + 640 × 2 = 3285 cycles. Relative to the unoptimized code, the acceleration is 71168/3285 ≈ 21.66×, while for a transfer granularity of 8 B, it is 71168/7125 ≈ 9.99×. Figure 6 shows that using DMA and NoC to optimize data transfer can effectively improve the execution efficiency of the multithreaded SPM application. However, the optimization effect of DMA and NoC differs with the granularity of the data transfer.
Therefore, we propose the MSDTM to achieve more efficient SPM data transfer optimization in multithreaded applications.
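The cycle counts quoted in the motivating example can be reproduced with a few lines of arithmetic. The following is only a bookkeeping sketch: the constant and function names are illustrative, and the grouping of terms (two or three DMA operations outside the loop, plus one pair of DMA operations or one NoC exchange per transfer of B) follows the expressions given in the text rather than any code of the MSDTM implementation.

```python
# Cycle counts quoted in the motivating example.
OFF_CHIP_LAT = 278            # one direct off-chip access
SPM_LAT = 4                   # one SPM load/store
DMA_BULK = 327                # DMA cost used for the transfers outside the loop
DMA_8B, DMA_256B = 312, 313   # DMA cost of one 8 B / 256 B transfer
HOP_8B, HOP_256B = 10, 80     # per-hop NoC cost at 8 B / 256 B, as used in the text
HOPS = 8                      # hops needed by all threads in this kernel
ACCESSES = 4 * 64             # accesses to A, B and C issued by each thread


def direct():
    """All accesses go to off-chip memory."""
    return ACCESSES * OFF_CHIP_LAT


def with_dma(dma_cost, transfers):
    """SPM accesses + 2 DMA operations outside the loop + 2 DMA operations per transfer of B."""
    return ACCESSES * SPM_LAT + 2 * DMA_BULK + 2 * dma_cost * transfers


def with_noc(hop_cost, transfers):
    """SPM accesses + 3 DMA operations outside the loop + one 8-hop NoC exchange per transfer."""
    return ACCESSES * SPM_LAT + 3 * DMA_BULK + HOPS * hop_cost * transfers


costs = {
    "direct": direct(),
    "DMA, 8 B": with_dma(DMA_8B, 64),
    "DMA, 256 B": with_dma(DMA_256B, 2),
    "NoC, 8 B": with_noc(HOP_8B, 64),
    "NoC, 256 B": with_noc(HOP_256B, 2),
}

for name, cycles in costs.items():
    speedup = costs["direct"] / cycles
    print(f"{name:>10}: {cycles:6d} cycles, {speedup:.2f}x vs. direct access")
```

At 8 B the NoC variant is cheaper than the DMA variant, while at 256 B the order is reversed, which is the observation that motivates selecting the strategy per granularity.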

Multithreaded SPM data transfer model
In this section, we describe in detail the design and implementation of the MSDTM. Figure 7 presents a high-level overview of the MSDTM framework. The input to the framework is a multithreaded application source code with marked kernel regions. Before we input the source code to the MSDTM, we need to port it to the heterogeneous many-core architecture. To simplify the description of the MSDTM, we focus only on the perfect loop nest wherein all content is in the innermost loop. We perform the loop transformation on the innermost loop.
As shown in Fig. 7, the MSDTM framework consists of three components: application analysis, the DTP model and a code transformation.

Application analysis
In this stage, we analyze the multithreaded application to obtain the per-thread kernel region memory access profile as the input to the DTP model.
Fig. 6 The time spent due to memory access for the above three data transfer strategies under two different granularities

Data access classification
We traverse the whole marked kernel region to obtain the memory access profile of the global variables. We identify the access types of the global variables: read-only, write-only and read-write. For read-only access, we only bring the data from off-chip memory to the SPM; for write-only and read-write access, we also need to transfer the data from the SPM back to off-chip memory.
Moreover, we also classify the global variables as either regular or irregular [8]. Because of the predictable inefficiency, irregular access is ignored by the MSDTM. Furthermore, regular access can be classified as either contiguous or noncontiguous. We then aggregate the access of a variable to one single buffer and insert strided DMA operations for noncontiguous access.
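The classification described above can be sketched as follows. This is a minimal illustration, not the analysis pass of the MSDTM: the input format (one tuple per access, with a stride in elements, or `None` for a data-dependent subscript) and the function name are assumptions made for the example.

```python
from collections import defaultdict


def classify_accesses(accesses):
    """Classify global variables by access type and access pattern.

    accesses: iterable of (variable, mode, stride), where mode is "r" or "w"
    and stride is the subscript stride in elements, or None when the
    subscript is irregular (hypothetical input format).
    """
    modes = defaultdict(set)
    strides = defaultdict(set)
    for var, mode, stride in accesses:
        modes[var].add(mode)
        strides[var].add(stride)

    profile = {}
    for var in modes:
        if modes[var] == {"r"}:
            access = "read-only"
        elif modes[var] == {"w"}:
            access = "write-only"
        else:
            access = "read-write"

        if None in strides[var]:
            pattern = "irregular"        # ignored by the model
        elif strides[var] == {1}:
            pattern = "contiguous"
        else:
            pattern = "noncontiguous"    # candidate for strided DMA
        profile[var] = (access, pattern)
    return profile


example = [
    ("A", "r", 1),
    ("B", "w", 1), ("B", "r", 1),
    ("C", "w", 2),
    ("D", "r", None),
]
profile = classify_accesses(example)
```

In this sketch, every variable whose access type is not read-only would also need a transfer from the SPM back to off-chip memory.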

Array partitioning and loop tiling
In most heterogeneous many-core architectures, the SPMs have restricted space. However, the kernels in an application may access large variables, and most arrays cannot be accommodated in the SPM. Array partitioning and loop tiling can separate a large array into smaller ones to accommodate them in SPM [14,21,26]. Many mainstream compilers allow programmers to use the polyhedral model to perform automatic array partitioning and loop tiling [22]. The polyhedral model is an abstract representation of a loop program as a computation graph in which questions such as program equivalence or the possibility of parallel execution can be answered [10].
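As a rough illustration of the capacity constraint, the sketch below splits an iteration range into tiles whose buffers for all arrays fit in the SPM together. The function name, the reserved-space parameter and the example sizes are assumptions for illustration; the sketch does not reproduce the polyhedral tiling performed by a compiler.

```python
def tile_bounds(n, spm_bytes, num_arrays, elem_bytes, reserved_bytes=0):
    """Return (start, end) index pairs covering range(n).

    Each tile is small enough that one buffer per array, each holding
    (end - start) elements, fits in the SPM at the same time.
    """
    per_array = (spm_bytes - reserved_bytes) // (num_arrays * elem_bytes)
    if per_array <= 0:
        raise ValueError("SPM too small for one element per array")
    return [(start, min(start + per_array, n)) for start in range(0, n, per_array)]


# Three arrays of 8-byte elements in a 64 KB SPM (illustrative sizes).
tiles = tile_bounds(10000, 64 * 1024, num_arrays=3, elem_bytes=8)
```

A real implementation would also have to leave room for private variables and any double buffers, which is what the hypothetical `reserved_bytes` parameter stands for.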

Dependence check
Before we perform the dependence check, we introduce an additional kind of dependence, called input dependence [19], in which both the source and the sink read the same location. As Fig. 8 shows, the input dependence from S1 to S2 clearly indicates the opportunity to eliminate a load at the second reference.
We traverse all the memory accesses for dependence checking. The dependences are divided into thread-independent dependences and thread-carried dependences. Thread-independent dependences are used to check whether code transformations are legal, while thread-carried dependences are used to guide transfer operation insertions.
True dependence. The data from the source will be used by the sink; thus, transfer operations will be inserted before the sink to update the data.
Anti-dependence. Nothing needs to be done to preserve an anti-dependence because the data used by the source are already brought to the SPM before the sink updates the data at the same location in the off-chip memory.
Output dependence. If output dependence is the only dependence that exists, only the last thread that updates the data at the same location in the off-chip memory needs to use DMA operations to bring the data from the SPM to the off-chip memory.
Input dependence. The threads with input dependence use the data from the same location in the off-chip memory; thus, inserting transfer operations before the sink may result in better efficiency.
We construct a thread-carried dependence graph (TDG) as the result of the dependence check. We take the code shown in Fig. 2 as an example. There are only two dependences in this piece of code. One is a thread-independent input dependence due to the read operations of scalar coeff, and the other is a thread-carried true dependence due to the write and read operations of variable B. So, in the TDG of the piece of code shown in Fig. 2, there is only one edge from the write of B (S1) to the read of B (S2). Because a thread-carried dependence cannot be backward, no cycle of dependences will occur in the TDG.
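The dependence check can be sketched as below. This is a simplified illustration under stated assumptions: each access is annotated with a symbolic label naming the thread whose data it touches (for example "id" or "id-1"), and two accesses to the same variable are treated as thread-carried when these labels differ. The labels, the tuple layout and the function name are hypothetical and do not come from the MSDTM implementation.

```python
from itertools import combinations

# Dependence kind as a function of (source mode, sink mode).
KIND = {
    ("w", "r"): "true",
    ("r", "w"): "anti",
    ("w", "w"): "output",
    ("r", "r"): "input",
}


def build_tdg(accesses):
    """Return the thread-carried dependence edges (source, sink, variable, kind).

    accesses: list of (statement, variable, mode, owner) in program order,
    where owner is a symbolic label of the thread whose data is touched.
    """
    edges = []
    # combinations() keeps program order, so every edge points forward.
    for (s1, v1, m1, o1), (s2, v2, m2, o2) in combinations(accesses, 2):
        if v1 != v2:
            continue
        if o1 == o2:
            continue  # thread-independent: only used for legality checks
        edges.append((s1, s2, v1, KIND[(m1, m2)]))
    return edges


# Accesses modeled after the description of the example kernel.
kernel = [
    ("S1", "coeff", "r", "id"),
    ("S1", "B", "w", "id"),
    ("S2", "coeff", "r", "id"),
    ("S2", "B", "r", "id-1"),
]
tdg = build_tdg(kernel)
```

For this input the graph contains a single true-dependence edge on B from S1 to S2, and the two reads of coeff are left out because they are thread-independent.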

Data transfer allocation (DTA) algorithm
To determine the allocation of data transfer operations in multithreaded applications, we propose the DTA algorithm (Algorithm 1). After data access analysis and a dependence check, the loop that needs to be transformed, together with its data access profile and the TDG, is supplied to the DTA algorithm as the input. The main idea of Algorithm 1 is to insert data transfer operations according to the TDG and the kinds of dependences in it. To reduce the number of data transfer operations, we insert only one DMA_in or DMA_out operation for each input or output dependence, at the beginning or the end of the loop, respectively.
We first divide the TDG into several subgraphs according to the data access profile (lines 6-13). Each subgraph contains all the dependences corresponding to the same data access. If any anti-dependence exists in a subgraph, we further divide the subgraph into two at the anti-dependence (lines 14-21). Now, we have several subgraphs without anti-dependences or cycles of dependences. The dependences are then further classified. With true dependences, we insert a pair of data transfer operations (copy_in and copy_out) between the source and the sink of each dependence (lines 23, 24). Without true dependences, we insert DMA_in operations corresponding to the input dependences at the beginning of the loop, and DMA_out operations corresponding to the output dependences at the end of the loop (lines 25-29). The strategies of the data transfer operations with true dependences are determined by the DTP model described in Sect. 4.2. Since we need to traverse $G_i$ three times and $a_i$ once, the worst-case time complexity of Algorithm 1 is O(n).
We use the code in Fig. 2 as an example to illustrate the process of Algorithm 1. As we mentioned in Sect. 4.1.3, the TDG of the piece of code in Fig. 2 has only one edge from S1 to S2, which indicates that there is only one thread-carried true dependence from S1 to S2. According to Algorithm 1, since there is no anti-dependence in the TDG, we do not need any division of the graph. The only thing we need to do is to insert a pair of data transfer operations between the source (S1) and the sink (S2).
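The steps described for Algorithm 1 can be sketched as follows. The listing of Algorithm 1 itself is not reproduced here, so this is a reading of the prose description under assumptions: the edge format, the placement labels ("after S1", "before S2") and the function name are hypothetical, and a subgraph is simply cut at every anti-dependence edge.

```python
from collections import defaultdict


def allocate_transfers(tdg_edges):
    """Return a list of (operation, variable, placement) for a loop.

    tdg_edges: list of (source, sink, variable, kind) in program order,
    with kind in {"true", "anti", "output", "input"}.
    """
    # Step 1: one subgraph per variable.
    by_var = defaultdict(list)
    for edge in tdg_edges:
        by_var[edge[2]].append(edge)

    # Step 2: cut each subgraph at its anti-dependences.
    subgraphs = []
    for var, edges in by_var.items():
        current = []
        for edge in edges:
            if edge[3] == "anti":
                if current:
                    subgraphs.append((var, current))
                current = []
            else:
                current.append(edge)
        if current:
            subgraphs.append((var, current))

    # Step 3: place the transfer operations per subgraph.
    plan = []
    for var, edges in subgraphs:
        true_edges = [e for e in edges if e[3] == "true"]
        if true_edges:
            for source, sink, _, _ in true_edges:
                plan.append(("copy_out", var, "after " + source))
                plan.append(("copy_in", var, "before " + sink))
        else:
            if any(e[3] == "input" for e in edges):
                plan.append(("DMA_in", var, "loop begin"))
            if any(e[3] == "output" for e in edges):
                plan.append(("DMA_out", var, "loop end"))
    return plan


plan = allocate_transfers([("S1", "S2", "B", "true")])
```

For the single true dependence on B, the sketch yields one copy_out after S1 and one copy_in before S2, matching the pair of operations described for the example.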

Data transfer performance (DTP) model
With the allocation of the data transfer, the memory access profile and hardware configuration of a specific heterogeneous many-core architecture are input into the DTP model. We first formulate the model for multithreaded applications using a specific hardware configuration. Next, we use the performance model to derive the optimal granularity and select the most profitable transfer strategy at that granularity.

Model formulation
The execution time of a multithreaded application is determined by the slowest thread; hence, we need to select the most profitable transfer strategy to minimize the execution time of the slowest thread. Furthermore, we assume that the execution time of computation is fixed; thus, reducing the execution time due to memory access is the only way to minimize the execution time of the slowest thread. Let T be the execution time due to memory access of the slowest thread in the multithreaded application. As mentioned above, the execution time due to memory access consists of data access latency (AccessLat), DMA transfer latency (DMALat) and NoC transfer latency (NoCLat).
Data access latency: Let $A = \{a_1, a_2, \ldots, a_n\}$ be the variables that need to access the SPM or off-chip memory. Let $s_i$ represent the size of $a_i$ ($1 \le i \le n$). Let $\alpha$ represent the load/store latency of the SPM, while $\beta$ represents the load/store latency of the off-chip memory. Both $\alpha$ and $\beta$ are defined by the hardware configuration. The data access latency is:
$$\mathrm{AccessLat}_i = \begin{cases} \alpha \times s_i, & \text{if } a_i \text{ is accessed in the SPM} \\ \beta \times s_i, & \text{if } a_i \text{ is accessed in off-chip memory} \end{cases} \tag{1}$$
For heterogeneous many-core architectures, $\alpha$ is much smaller than $\beta$. Therefore, data transfer operations can reduce the data access latency by transferring data from the off-chip memory to the SPM.
Data transfer granularity: As Table 2 shows, different data transfer granularities correspond to different DMA and NoC transfer speeds. We let $g$ represent the granularity of data transfer operations. To obtain the granularity $g$ in a loop, loop strip-mining and loop distribution are utilized during the code transformation. Thus, $g$ is subject to the following constraint:
$$0 < g \le \min_{1 \le i \le n} s_i \tag{2}$$
The cost per byte of DMA transfer can be defined relative to the granularity as:
$$v = f(g) \tag{3}$$
The cost per byte of NoC transfer can be defined relative to the granularity as:
$$u = h(g) \tag{4}$$
Normally, $v$ decreases as an inversely proportional function of $g$, while $u$ remains unchanged as $g$ increases for most heterogeneous many-core architectures.
To cover the size of each variable, each data transfer process requires several data transfer operations. This number of operations (or times) can be computed as:
$$times_i = \left\lceil \frac{s_i}{g} \right\rceil \tag{5}$$
DMA transfer latency: Each DMA transfer operation can be divided into an initialization and a transfer process. Let $I$ represent the initialization cost, and let $v$ be the cost per byte of DMA transfer at granularity $g$. At a transfer granularity of $g$, the latency of the DMA transfer per item is:
$$\mathrm{DMALat}_i = times_i \times (I + g \times v) \tag{6}$$
Both the initialization cost and the speed of the DMA transfer are determined by hardware parameters.
NoC transfer latency: The NoC transfer experiences contention in a link when several other transfers are simultaneously trying to utilize the same link. Because of the contention, all the threads need to transfer data via the NoC step by step. We let Hops represent the number of steps of the whole NoC transfer process, and let $u$ be the cost per byte of NoC transfer at granularity $g$. The latency of the NoC transfer per item is:
$$\mathrm{NoCLat}_i = times_i \times \mathit{Hops} \times g \times u \tag{7}$$
The variable Hops is determined by the analysis of the multithreaded application, while the speed of the NoC transfer is determined by hardware factors.
Execution time of memory access per data item: Let $E$ be the execution time of memory access for variable $a$. For each variable, we have three transfer strategies to select from. Let $E_{direct}$ represent the execution time of accessing data from off-chip memory directly. Let $E_{DMA}$ represent the execution time of bringing the data from off-chip memory via DMA and accessing it from the SPM. Let $E_{NoC}$ represent the execution time of obtaining data from other threads via NoC and accessing it from the SPM.
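As mentioned in the discussion of the DTA algorithm, we consider data transfer with thread-carried true dependences. With $\alpha$ and $\beta$ denoting the load/store latencies of the SPM and the off-chip memory, $E$ can be computed for variable $a_i$ as:
$$E_i = \begin{cases} E_{direct} = 2 \beta s_i, & \text{direct access} \\ E_{DMA} = 2 \alpha s_i + 2\,\mathrm{DMALat}_i, & \text{transfer via DMA} \\ E_{NoC} = 2 \alpha s_i + \mathrm{DMALat}_i + \mathrm{NoCLat}_i, & \text{transfer via NoC} \end{cases} \tag{8}$$
For each DTA, each data item needs to be written back to memory and read by another thread. Furthermore, if we transfer the data via NoC between threads, we still need to write it back to the off-chip memory by DMA.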
Total execution time of memory access: Since we propose MSDTM to derive the optimal granularity and select the most profitable data transfer strategy among direct access, DMA and NoC for each variable, the execution time due to memory accesses of the slowest thread $T$ is the sum of $E$ over all variables. The execution time of the slowest thread due to memory accesses can be computed as:
$$T = \sum_{i=1}^{n} E_i \tag{9}$$
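The formulation can be turned into a small executable sketch. Because the extracted text does not preserve the equations, the forms used here are assumptions that follow the prose: the number of operations is the size divided by the granularity, rounded up; a DMA operation costs the initialization cost plus the granularity times a per-byte cost; an NoC exchange costs Hops times the granularity times a per-byte cost; and each data item is written once and read once. All names and numeric values are illustrative, not measured hardware parameters.

```python
import math


def times(size, g):
    """Number of transfer operations needed to move `size` units at granularity g."""
    return math.ceil(size / g)


def dma_lat(size, g, hw):
    """Each DMA operation pays the start-up cost plus g units at v(g) cycles per unit."""
    return times(size, g) * (hw["init"] + g * hw["v"](g))


def noc_lat(size, g, hops, hw):
    """Each NoC exchange takes `hops` steps of g units at u(g) cycles per unit."""
    return times(size, g) * hops * g * hw["u"](g)


def item_time(size, hops, strategy, g, hw):
    """Memory-access time of one variable under a given strategy."""
    if strategy == "direct":
        return 2 * hw["beta"] * size            # one write and one read off-chip
    spm = 2 * hw["alpha"] * size                # one write and one read in the SPM
    if strategy == "dma":
        return spm + 2 * dma_lat(size, g, hw)   # write back, then read in
    if strategy == "noc":
        return spm + dma_lat(size, g, hw) + noc_lat(size, g, hops, hw)
    raise ValueError(strategy)


def total_time(items, g, hw):
    """items: list of (size, hops, strategy), one entry per variable."""
    return sum(item_time(size, hops, strategy, g, hw) for size, hops, strategy in items)


# Illustrative hardware description (assumed values).
hw = {
    "alpha": 4,                 # SPM load/store latency
    "beta": 278,                # off-chip load/store latency
    "init": 300,                # DMA start-up cost
    "v": lambda g: 12 / g,      # DMA cost per unit, inversely proportional to g
    "u": lambda g: 1.25,        # NoC cost per unit, independent of g
}
```

With this description, `total_time([(512, 8, "dma"), (512, 8, "noc")], 256, hw)` sums the per-variable times of two variables moved with different strategies at a common granularity.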

Deriving the optimal granularity
After the problem formulation, we use the performance model to provide guidelines for deriving the optimal granularity and selecting a profitable transfer strategy. To minimize the total execution time of memory access, we need to minimize the execution time of memory access per data item. This process can be represented as:
$$\min T = \sum_{i=1}^{n} \min E_i \tag{10}$$
In Equation 8, the load/store latencies of the SPM and the off-chip memory and the initialization cost $I$ are defined by the hardware configuration, while $s_i$ and Hops are computed by the application analysis. These parameters will not change during the data transfer optimizations. Functions $f$ and $h$ are also defined by the hardware configuration. Therefore, during data transfer optimizations, minimizing $times$ will lead to minimizing $E_{DMA}$ and $E_{NoC}$. Because the number of transfer operations needed for a variable of size $s_i$ does not increase as $g$ grows, it is minimized by the largest admissible granularity. The optimal granularity $g$ for most heterogeneous many-core architectures is therefore the minimal size of all the variables that need to be transferred in the loop body.
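The effect of the granularity can be checked numerically with a short sweep. The sketch assumes that the transfer part of one DMA operation is roughly constant (the per-byte cost is inversely proportional to the granularity), so that the cost is driven by the number of operations; the cost function, the candidate list and all numbers are illustrative assumptions.

```python
import math


def e_dma(size, g, alpha=4.0, init=300.0, transfer_per_op=12.0):
    """DMA-strategy time of one variable when g * v(g) is a constant (assumed)."""
    ops = math.ceil(size / g)
    return 2 * alpha * size + 2 * ops * (init + transfer_per_op)


sizes = [512, 1024]                     # hypothetical variable sizes
candidates = [8, 32, 128, 256, 512]     # granularities not exceeding min(sizes)

totals = {g: sum(e_dma(s, g) for s in sizes) for g in candidates}
best_g = min(totals, key=totals.get)
```

In this sweep the total decreases monotonically with the granularity, and the best candidate coincides with the smallest variable size.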

Comparison of strategies
With the optimal granularity $g$, the MSDTM can provide guidelines for selecting the most profitable strategy at each allocation of data transfer operations. Once $g$ is fixed at its optimal value, $E_{direct}$, $E_{DMA}$ and $E_{NoC}$ depend only on $s_i$ for a given Hops. The relationship between $E_{direct}$, $E_{DMA}$, $E_{NoC}$ and $s_i$ is plotted in Fig. 9. Their points of intersection $s'$ and $s''$ split the domain of $s_i$ into three sub-domains, in each of which a different strategy is the most profitable. The execution time of memory access per data item $E$ can therefore be computed as:
$$E = \min\{E_{direct}, E_{DMA}, E_{NoC}\}$$
When the variable Hops, which is obtained from the application analysis, changes, the intersections $s'$ and $s''$ change as well.
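The strategy comparison can be illustrated by locating the sizes at which the cheapest strategy changes. The sketch assumes a single variable moved in one transfer (granularity equal to its size) and uses made-up per-byte parameters; the resulting order of strategies and the crossover sizes are properties of these assumed numbers only and are not taken from Fig. 9.

```python
def strategy_costs(size, hops, alpha=0.5, beta=34.75, init=300.0,
                   dma_transfer=12.0, u=1.25):
    """Costs of the three strategies for one variable moved in a single transfer."""
    return {
        "direct": 2 * beta * size,
        "DMA": 2 * alpha * size + 2 * (init + dma_transfer),
        "NoC": 2 * alpha * size + (init + dma_transfer) + hops * size * u,
    }


def best_strategy(size, hops):
    costs = strategy_costs(size, hops)
    return min(costs, key=costs.get)


def crossovers(hops, max_size=1024):
    """Sizes at which the cheapest strategy changes, with the old and new strategy."""
    points = []
    previous = best_strategy(1, hops)
    for size in range(2, max_size + 1):
        current = best_strategy(size, hops)
        if current != previous:
            points.append((size, previous, current))
            previous = current
    return points


for hops in (4, 8):
    print(hops, crossovers(hops))
```

With these parameters the domain splits into three ranges, and both crossover sizes move when the hop count changes, which mirrors the dependence of the intersections on Hops.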

Code transformation
When the application analysis is completed, the optimal granularity is derived, and the most profitable strategy is selected, the MSDTM transforms the code to optimize the data transfer operations.
First, loop distribution and strip-mining are required to make the size of the loop suitable for the optimal granularity. Loop distribution can be used to convert a sequential loop to multiple parallel loops, while strip-mining is an optimization that converts the available parallelism into a form more suitable for the hardware by grouping the iterations into sets, each of which is treated as a schedule unit [18]. Then, DMA or NoC transfer operations are inserted for each allocation of data transfer according to the DTP model. Finally, we transform the subscripts of variables that need to be optimized to access the SPM.
Fig. 9 The dependence of execution time of memory access per data item on the size of data
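The combined effect of loop distribution and strip-mining on the number of transfer operations can be sketched on a toy loop. The two statements, the transfer log and the function names are hypothetical stand-ins for the kernel of the motivating example; the sketch only shows how one element-sized transfer per iteration becomes one strip-sized transfer per strip while the computed result stays the same.

```python
def original(a, coeff, transfer_log):
    """One transfer of a single element between S1 and S2 in every iteration."""
    n = len(a)
    b, c = [0] * n, [0] * n
    for i in range(n):
        b[i] = coeff * a[i]              # S1: produces b[i]
        transfer_log.append(1)           # element-sized transfer
        c[i] = b[i] + a[i]               # S2: consumes b[i]
    return c


def transformed(a, coeff, strip, transfer_log):
    """Strip-mined and distributed loop: one bulk transfer per strip."""
    n = len(a)
    b, c = [0] * n, [0] * n
    for start in range(0, n, strip):     # strip-mining: outer loop over strips
        end = min(start + strip, n)
        for i in range(start, end):      # loop distribution: S1 in its own loop
            b[i] = coeff * a[i]
        transfer_log.append(end - start)  # one transfer covering the whole strip
        for i in range(start, end):      # S2 in its own loop
            c[i] = b[i] + a[i]
    return c


data = list(range(64))
log_before, log_after = [], []
result_before = original(data, 2, log_before)
result_after = transformed(data, 2, 32, log_after)
```

With 8-byte elements, a strip of 32 elements corresponds to a 256 B transfer, so the 64 element-sized transfers of the original loop become two strip-sized transfers.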

Experimental evaluation
This section presents the experimental evaluation of our proposed MSDTM on Sunway TaihuLight.

Sunway TaihuLight
In contrast to other existing heterogeneous supercomputers, which include both CPU processors and PCIe-connected many-core accelerators, the computing power of Sunway TaihuLight is provided by heterogeneous many-core SW26010 processors that include both the management processing elements (MPEs) and computing processing elements (CPEs) in one chip. The general architecture of the SW26010 processor [9] is shown in Fig. 10.
The processor includes four core groups (CGs). Each CG includes one MPE, one CPE cluster with 8 × 8 CPEs, and one memory controller. Each CG has its own memory space, which is connected to the MPE and the CPE cluster through the memory controller. The processor connects to other outside devices through a system interface.
In terms of the memory hierarchy, each MPE has a 32 KB L1 instruction cache and a 32 KB L1 data cache, with a 256 KB L2 cache for both instructions and data. Each CPE has its own 16 KB L1 instruction cache and a 64 KB user-controlled SPM.
As Table 1 shows, while the MPE has access to an 8 GB main memory, the CPE can directly access the main memory through gld/gst instructions. In addition, the CPE can implement batch data transfer between the SPM and main memory via DMA commands. The efficiency of the DMA transfer is closely related to the amount of data transferred, the granularity of the DMA commands and the continuity of data in the memory. The ideal DMA transfer bandwidth of the processor is 134.4 GB/s. Register communication serves as the NoC for data transfer between CPEs. Since the CPEs are physically arranged in an 8 × 8 array, register communication can transfer data using only XY routing. In XY routing, an access moves along the row-axis first and then along the column-axis. Through register communication, each CPE can perform row or column broadcasting and can send data to another specific CPE.

Fig. 10 General architecture of the SW26010 processor
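XY routing on the 8 × 8 array can be modelled with a short Python sketch. Two points are assumptions made for illustration only: CPEs are numbered row-major from 0 to 63, and "along the row-axis first" is read as travelling within the source row before travelling within the destination column. The sketch models the route, not the register-communication instructions themselves.

```python
MESH = 8  # CPEs per row and per column in one CPE cluster

def coords(cpe_id):
    """Map a CPE id to (row, col); row-major numbering is an assumption."""
    return divmod(cpe_id, MESH)

def xy_route(src, dst):
    """Return the (row, col) positions visited from src to dst under XY routing."""
    (r, c), (dst_r, dst_c) = coords(src), coords(dst)
    path = [(r, c)]
    while c != dst_c:                 # first leg: travel within the source row
        c += 1 if dst_c > c else -1
        path.append((r, c))
    while r != dst_r:                 # second leg: travel within the destination column
        r += 1 if dst_r > r else -1
        path.append((r, c))
    return path

def hop_count(src, dst):
    """Number of links traversed, i.e. the Manhattan distance on the mesh."""
    return len(xy_route(src, dst)) - 1
```

Under this model, a transfer between opposite corners (`hop_count(0, 63)`) traverses 14 links, which is the longest route in the array.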

Experimental setup
The proposed MSDTM is implemented on the GCC compiler, which was first ported to Sunway TaihuLight. In the MSDTM implementation, we retain switches for manually adjusting the data transfer granularity and manually selecting the data transfer strategy. In addition, the MSDTM automatically obtains the optimal granularity and the most profitable strategy.
To evaluate the performance of the proposed MSDTM, we select test cases from the NAS parallel benchmark suite (NPB) [2] and SPEC benchmarks [16], such as EP, FT, IS, LU, MG, and SP from the NPB and lbm [25] from the SPEC. In addition, we choose two representative application kernels, Stencil and PhotoNs. Stencil [1,21,27] computations are the foundation of many large applications in scientific computing, while PhotoNs is a cosmic N-body numerical simulation software developed by the National Observatory. Before the evaluation, we manually port the benchmarks and kernels for Sunway TaihuLight.
In addition, we select a simple but representative application kernel, 1D-FFT [35], to verify that the granularity obtained by the MSDTM is optimal and that the selected strategy is the most profitable one.

A case study with FFT
The 1D-FFT kernel is implemented based on butterfly computing with an input data size of 8192 bytes. We partition the data into 128-byte blocks to run the kernel on 64 threads and into 1024-byte blocks to run it on 8 threads. In the 1D-FFT kernel, the size of the data in each thread limits the granularity of the data transfer. We manually set the granularity of the data transfer to 8, 16, 32, 64 and 128 bytes. In addition, we set extra granularities of 256, 512 and 1024 bytes for the 8-threaded version. We compare the execution time of the kernel at each granularity. For each granularity, we use the three transfer strategies mentioned above to optimize the data transfer. Figure 11 shows the measured values for the 8-threaded and 64-threaded kernels. We can observe that the execution time of the kernel decreases as the granularity of data transfer increases with DMA transfer or NoC transfer, while it remains basically unchanged with direct access. This result means that, in both the 8-threaded and the 64-threaded application, the best execution efficiency is obtained when the granularity of the data transfer is maximal. In other words, the optimal granularity of data transfer is the minimum of the sizes of all data in each thread. In addition, the execution time of the kernel with the strategy chosen by the DTP model is equal to the minimum time spent on the three strategies. This observation confirms that we can select the most profitable strategy at each granularity based on the DTP model.
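The partition sizes and the granularities tested in this case study follow from simple arithmetic, sketched below in Python. The helper names are hypothetical; the 8-byte starting point and the doubling step are taken from the granularities listed in the text.

```python
def per_thread_bytes(total_bytes, threads):
    """Size of the data block each thread owns after an even partition."""
    if total_bytes % threads != 0:
        raise ValueError("data size must divide evenly among threads")
    return total_bytes // threads

def candidate_granularities(total_bytes, threads, smallest=8):
    """Doubling granularities from `smallest` up to the per-thread block size."""
    limit = per_thread_bytes(total_bytes, threads)
    candidates, g = [], smallest
    while g <= limit:
        candidates.append(g)
        g *= 2
    return candidates

def largest_granularity(total_bytes, threads):
    """The granularity the case study found to be optimal: the largest candidate."""
    return max(candidate_granularities(total_bytes, threads))
```

For the 8192-byte input, this gives candidates of 8 to 128 bytes on 64 threads and 8 to 1024 bytes on 8 threads, matching the settings used in the experiment.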
Furthermore, the MSDTM selects not only the optimal granularity but also the most profitable data transfer strategy at that granularity. For the 8-threaded and 64-threaded 1D-FFT kernels, optimizing the data transfer with the MSDTM can yield speedups of . × and . × compared with the versions that use direct access.

Performance and energy evaluation
We evaluate the performance speedup of the proposed MSDTM compared to that of the original applications with direct memory accesses. In addition to the performance evaluation, we evaluate the energy reduction via a script supported by Sunway TaihuLight. The application with data transfer optimization by the MSDTM is executed on 8 threads, 16 threads and 64 threads. Figure 12 shows the performance improvement and energy reduction of the test cases executed on 8 threads, 16 threads and 64 threads. We can observe that the MSDTM performs well with respect to both performance improvement and energy reduction under all scenarios. However, as we can see from Fig. 12, the test cases we use perform the best when executed on 8 threads and the worst when executed on 64 threads. This is due to the DMA transfer bandwidth on Sunway TaihuLight, which results in a roofline-shaped curve of DMA transfer efficiency as the memory transaction granularity varies, as shown in Table 2. In other words, the efficiency of DMA transfer is bounded by a threshold of the memory transaction granularity, and the performance of DMA transfer does not improve further once this threshold is reached. The threshold is 128 bytes with 64 threads, and it increases as the number of threads decreases. One can thus expect better

As Fig. 13 shows, the MSDTM yields considerable acceleration of all the test cases. In particular, the acceleration ratio of MG is . × , while the acceleration ratio of Stencil is . × . These two test cases obtain a higher speedup because they have more thread-carried dependences than the others, which lead to more DMA or NoC transfers; an example is the overlapping of loop tiles in Stencil. In general, the MSDTM provides an average acceleration ratio of . × on 64 threads and an energy reduction of . ×.
Thus, we observe that the proposed MSDTM is effective in reducing the execution time and energy of the evaluated test cases.

Conclusions
In this work, we propose the MSDTM, a compile-time framework for optimizing multithreaded data transfer between SPM and the main memory on heterogeneous many-core architectures. This framework determines the allocation of data transfer operations via application analysis and dependence checking. Next, the DTP model is used to obtain the optimal granularity of data transfer and select the most profitable strategy. In the experimental evaluation, the proposed MSDTM improves the application execution time by 5.49× and achieves an energy saving of 5.16× on average.
Future work includes further optimizations for SPM data transfer operations, such as overlapping data transfer with kernel computation and combining the granularity of data transfer with the size of loop tiling to achieve higher efficiency.