AceMesh: a structured data driven programming language for high performance computing

Abstract

Asynchronous task-based programming models are gaining popularity as a way to address the programmability and performance challenges of contemporary large-scale high performance computing systems. In this paper we present AceMesh, a task-based, data-driven language extension targeting legacy MPI applications. Its language features include a data-centric parallelizing template and aggregated task dependences for parallel loops. These features not only relieve the programmer from tedious refactoring details, but also enable structured execution of complex task graphs, data locality exploitation based on data tile templates, and reduced system complexity when handling complex array sections. We present a prototype implementation, including task shifting, data management, and communication-related analysis and transformations. The language extension is evaluated on two supercomputing platforms. Compared with existing programming models, NPB/MG achieves speedups of up to 1.2X and 1.85X on TaihuLight and TH-2, respectively, while the Tend_lin benchmark attains more than 2X speedup on average and up to 3.0X and 2.2X on the two platforms, respectively.

Introduction

Current high-performance computing systems consist of distributed memory nodes, where each node contains a large number of cores. Applications have to be heavily optimized to harness all the resources provided by the modern hardware, exposing a high degree of parallelism inside each node, exploiting data locality, dynamically adapting to computing resources, and optimizing inter-node communications.

MPI is the de facto standard for distributed memory programming, and there are a large number of legacy MPI applications that need to be ported to modern HPC platforms. A common practice in the HPC community is MPI + X hybrid programming, where the two models specialize in exploiting inter-node and intra-node parallelism, respectively. X is usually a standard programming language/interface such as OpenMP or OpenACC, which provides portability across different multicore platforms.

Over the last decade, data-driven task-parallelism models, or Asynchronous Task-based Programming (ATaP) models, have attracted growing interest. In this programming paradigm, programmers express an algorithm in terms of tasks and their accesses to shared data, while inter-task dependences and task scheduling are realized automatically by the programming system. The parallel execution of such a model can be seen as a directed acyclic graph (DAG) whose nodes are pieces of sequential code and whose edges are control and data dependences between them. ATaP models not only increase programmer productivity, but also improve parallel scalability by eliminating global synchronization among tasks, and they provide opportunities for hiding communication latency.

This model still has problems, however. For array-based MPI applications, users must perform many manual transformations of loops, data, and communications. Since the model lacks a global view of the data-task relationship, users are easily distracted from important issues such as data locality and dependence-edge reduction, and the programming system may miss optimizations such as locality-aware task and data placement, either because global information is unavailable or because it is too costly to collect at runtime.

The paper presents AceMesh, a task-based, data-driven language extension targeting legacy MPI applications. Through a simple example, we show why the typical ATaP model is low-level and fragmented from the programming perspective. AceMesh provides aggregated task definition that exposes task dependences in a more abstract way, relieving users from the tedious work of choosing proper tile offsets and lowering array sections to data tiles. Its data-centric parallelizing template, together with composite tasks, not only improves programmability and reduces the system complexity incurred by complex array sections, but also enables structured execution of DAG programs. The paper discusses performance-motivated programming rules on communication tasks in the AceMesh language extension. A prototype implementation is presented, including task shifting, data management, and communication-related analysis and transformations. The programming interface is evaluated on two supercomputing platforms, showing better performance than popular programming models.

The article is structured as follows. In Sect. 2, we give an overview of the AceMesh language extension and describe the motivations and design details of its important language features. In Sect. 3, a prototype implementation is presented, including the related compiler and runtime techniques. In Sect. 4, we evaluate AceMesh using two applications on two HPC platforms and give detailed performance analysis. Section 5 describes the related work and, finally, Sect. 6 concludes the work.

AceMesh language extension

Data structures in scientific applications can be categorized into mainly two kinds: those using tile-based data structures and those using unified array-based structures. In the former kind of application, computation on one or several tiles is by nature a structured code block, forming a natural unit of task parallelism, and data dependences occur among tile computations. In the latter category, different operators sweep through the same data space in the form of different loop nests, and there is no natural unit of task parallelism in such loop nests. This paper focuses on the latter type of application.

Motivations

We take 2-dimensional Jacobi iteration as an illustrative example in Fig. 1. With the compiler directives omitted, it is a traditional 2-dimensional (2D) Jacobi MPI program. Here, instead of using two 2D arrays, we use a single 3D array to store both the source and the destination arrays. 1D domain decomposition is used for MPI parallelization, and only one pair of data exchanges is shown to save space.

Fig. 1 Original jacobi2d with AceMesh directives

We first try to parallelize the program into data-driven form using OpenMP. As shown in Fig. 2, we tile the i-loop to expose fine-grained tasks and enclose the MPI calls in data-flow tasks. The low-level nature of OpenMP programming shows up in three ways.

Fig. 2 OpenMP version of jacobi2d with deadlock hazard and performance penalty

Firstly, it is nontrivial to expose fine-grained tasks from parallel loops. The loop tiling in line 5 of Fig. 2 looks straightforward, but for some parallel loops, inter-task data tile false sharing may be introduced if the tile offset is not carefully chosen (a counterexample is given in the last paragraph of Sect. 3.1), which decreases task-level parallelism.

Secondly, it is nontrivial to write correct in/out clauses. OpenMP 4.0 allows array sections as list items of in/out clauses; an array section is a designated subset of the elements of an array, denoted by the subscript triplet [lo]:[up]:[stride] in Fortran. But the OpenMP depend clause carries a restriction: "List items used in depend clauses of the same task or sibling tasks must indicate identical storage locations or disjoint storage locations [3]". This puts a lot of tedious work on the users: they must mentally partition the array space into tiles, map each referenced memory region to multiple array tiles, and express each of them as an array section. In line 7 of Fig. 2, one output data tile and three input data tiles are carefully designated, and extra computations are introduced in lines 18 and 19 to compute the exact array sections used in line 20.
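To make the tedium concrete, the fragment below is a minimal sketch of how one tile of the jacobi2d i-loop could be expressed with OpenMP 4.0 task dependences (it is not the exact code of Fig. 2); ts, istart, iend and the neighbour-section bounds lo_prev/up_prev/lo_next/up_next are illustrative names that the programmer has to compute so that all sections stay tile-aligned, and declarations are omitted.

    do istart = 1, N - 2, ts
       iend = min(istart + ts - 1, N - 2)
       ! extra computations so that the neighbour sections stay tile-aligned
       lo_prev = max(istart - ts, 0);  up_prev = istart - 1
       lo_next = iend + 1;             up_next = min(iend + ts, N - 1)
       !$omp task depend(in:  A(0:N-1, lo_prev:up_prev, src)) &
       !$omp&     depend(in:  A(0:N-1, istart:iend,     src)) &
       !$omp&     depend(in:  A(0:N-1, lo_next:up_next, src)) &
       !$omp&     depend(out: A(1:N-2, istart:iend,     dest))
       do i = istart, iend
          do j = 1, N - 2
             A(j, i, dest) = 0.25d0*(A(j-1, i, src) + A(j+1, i, src) &
                                   + A(j, i-1, src) + A(j, i+1, src))
          end do
       end do
       !$omp end task
    end do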

The reason OpenMP places this restriction on array sections lies in implementation complexity and inefficiency. Array-section intersection would be a frequent operation during on-the-fly dependence calculation and dynamic task mapping. Algorithms that are fast at insertion, deletion, and matching of multi-dimensional array sections are generally imprecise, and they may still have non-negligible space and time complexity in the worst case. As task grain sizes shrink in the future, the problem will become more severe.

Thirdly, because the OpenMP runtime lacks MPI awareness, users face correctness and performance pitfalls. The program in Fig. 2 can deadlock, and users must add artificial dependences to avoid it. Furthermore, communication tasks may block the core they run on, which hurts parallel performance.

All in all, OpenMP users have to perform tedious code transformations to exploit DAG parallelism. The process is error-prone, since data tiles exist only in the users' minds, while the loop tiling and the data tiles referenced by each task are hard-wired into the application.

The AceMesh language extension handles the above problems in a systematic way. Based on the data tile directives it supports, the AceMesh compiler maps each referenced array section to a minimal set of data tiles. With the do tasktile directive, the compiler adopts a proper loop tiling to exploit fine-grained task parallelism efficiently, and it also takes care of the correctness and performance issues incurred by MPI-related tasks. Four groups of AceMesh directives are introduced in Fig. 1: line 2 declares how array A is tiled, the do directive in line 5 partitions the loop into many data-flow tasks, and the task directives in lines 12 and 15 define two communication-related tasks. Using AceMesh, the execution order of loop tiles is turned from breadth-first into depth-first, and overlap of communication and computation is realized automatically.
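Since Fig. 1 is not reproduced here, the fragment below is a speculative sketch of its directive structure, based only on the description above and on the clause forms introduced later in this section; the exact directive spellings, the tile size 64, the neighbour ranks up and down, the tag, and the MPI argument lists are illustrative assumptions rather than the authors' code.

    !$acemesh begin
    !$acemesh arrayTile dimtile(*, 64, *) arrvar(A)      ! tile A along its second (i) dimension
    do iter = 1, niter
       !$acemesh do tasktile(64) in(A(0:N-1, i-1:i+1, src)) out(A(1:N-2, i, dest))
       do i = 1, N - 2
          do j = 1, N - 2
             A(j, i, dest) = 0.25d0*(A(j-1, i, src) + A(j+1, i, src) &
                                   + A(j, i-1, src) + A(j, i+1, src))
          end do
       end do
       ! one pair of the halo exchange, wrapped in communication tasks
       !$acemesh task in(A(0:N-1, 1, dest))
       call mpi_send(A(0, 1, dest), N, MPI_DOUBLE_PRECISION, down, 99, &
                     MPI_COMM_WORLD, ierr)
       !$acemesh end task
       !$acemesh task out(A(0:N-1, 0, dest))
       call mpi_recv(A(0, 0, dest), N, MPI_DOUBLE_PRECISION, up, 99, &
                     MPI_COMM_WORLD, stat, ierr)
       !$acemesh end task
       ! src and dest indices are swapped here for the next iteration (omitted)
    end do
    !$acemesh end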

Overview

AceMesh is a directives-based, data-driven task parallel language extension, and Table 1 lists its main directives.

Table 1 Summary of AceMesh’s directives

AceMesh supports incremental parallelization: a begin/end directive pair focuses parallelization on specific code regions, and the enclosed region is called a DAG region in this paper. ArrayTile describes the dominating parallelizing template among concurrent tasks. AceMesh provides two kinds of task directives (the task and do directives) carrying data dependence information. The map clause is introduced on task or do directives for heterogeneous platforms; its argument can be master or acc, indicating that the defined task is executed on the master thread or offloaded to accelerators.

In an AceMesh program, parallelism is implicitly created as a thread pool when the application starts. The source code inside a DAG region falls into two categories: tasking code and task registration/generating code. Tasking code appears inside the dynamic extent of a task directive and is outlined into a function whose execution is deferred; when and where to execute it is decided by the runtime system. Task generating code is the remaining code and is executed by the master thread. Dependences going from tasking code to task generating code must be expressed via the taskwait directive.
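As a small illustration (the taskwait directive is named above; the !$acemesh sentinel, the end-task spelling, and the variable names are assumptions of ours), task generating code that reads a value produced by a task needs an intervening taskwait:

    !$acemesh task out(s)
    s = sum(A)                 ! tasking code: execution is deferred
    !$acemesh end task
    !$acemesh taskwait         ! wait before the master thread reads s
    write(*,*) 'checksum = ', s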

AceMesh allows the definition of multiple levels of parallelism: directives that define tasks, declare arrayTile, or apply synchronization can also appear inside a task/do directive. Some code can be regarded both as task generating code with respect to a nested task dependence graph and, at the same time, as the tasking code of its parent task.

Aggregated task definition

For parallel loops, we propose do directive (Fig. 3) to expose fine-grained tasks in an aggregated way.

Fig. 3 Do directive with tasktile clause

The tasktile clause describes how the iteration space of the associated loops is divided into tiles, and each tile is encapsulated into a data-flow task. If the if clause evaluates to false, the tasktile clause is disabled and the whole loop nest degrades to a single task. The nested clause of the do/task directive indicates that there is a task dependence graph inside the task.

The in/out/inout clauses describe the data dependences of each task, and their list items allow variables and array sections. Each subscript of an array section is a triplet '[lo_expr]:[up_expr]:[step]', and both lo_expr and up_expr can be affine expressions of the associated loop indices that appear in the tasktile clause. A special * represents the whole dimension extent. To build the in/out clause, the programmer just needs to find the affine coefficient and the least and greatest offsets among all the related array subscripts. Array sections in the in/out clauses can be simplified to the variable form, since the array section can usually be deduced by the compiler. If the loop is too complex to analyze, a conservative array section is introduced and a warning is given to the user. In Fig. 1, the complete form of the in/out clauses in line 5 is in(A(0:N − 1,i − 1:i + 1,src)) out(A(1:N − 2,i,dest)). The in/out array sections are instantiated with the loop index range of each task.

The compiler translates the do directive into task generating code and a task function. For the do directive in Fig. 1, the task generating code is shown in Fig. 4. The four arguments of the task function are packed into the array argv in line 10, and the task object is generated in line 11. The referenced data tiles must also be registered, and the compiler distinguishes between different kinds of data tiles: complete data tiles are registered directly in lines 7 and 9, while incomplete data tiles are registered through another function, func1_reg (whose definition is omitted here). The task function is shown in Fig. 5.

Fig. 4 Task generating codes in the translated jacobi2d
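To give a flavour of the translation, the fragment below is a schematic sketch of such task generating code; the runtime entry points (acemesh_task_new, acemesh_reg_read, acemesh_reg_write, acemesh_task_submit) and the helper addr_of_tile are hypothetical names standing in for the real AceMesh runtime interface, not its actual API, and declarations are omitted.

    do it = ilo, ihi, ltile                          ! one data-flow task per loop tile
       argv(1) = it                                  ! pack the four task arguments
       argv(2) = min(it + ltile - 1, ihi)
       argv(3) = src
       argv(4) = dest
       tsk = acemesh_task_new(task_func, argv, 4)
       call acemesh_reg_write(tsk, addr_of_tile(A, it, dest))   ! complete tile, registered directly
       call acemesh_reg_read (tsk, addr_of_tile(A, it, src))    ! complete tile, registered directly
       call func1_reg(tsk, A, it, src)               ! incomplete (boundary) tiles of the in-section
       call acemesh_task_submit(tsk)
    end do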

Fig. 5 Outlined task function of jacobi2d's computation loop

Semantic difference between the two kinds of task definition. Both the do directive and the task directive define tasks, but they differ in two ways. (i) The tasktile clause specifies the chunk sizes of a typical task chunk, but the first chunk may not be full-sized; an offset for the first chunk is decided by the programming system. (ii) Since there are no data dependences among tasks generated from the same parallel loop, the runtime system can optimize dependence resolution and task mapping, and the compiler gets a global view of the data overlapping pattern among tasks to facilitate task and data placement.

ArrayTile directive and structured DAG execution

Affine loops are common in scientific applications, and array subscripts are usually unary linear expressions of loop indices, so we can build a relationship between parallel loops and array dimensions. When we partition a parallel loop into multiple tasks, the referenced array regions are implicitly tiled along certain dimensions at the same time. The arrayTile directive acts like a data-centric parallelizing template: it specifies how the arrays are logically split into tiles and how each tile is accessed by different tasks, and most of the parallel loops within its scope align their parallel schemas with these array tiles.

The directive specifies an array tile shape and has the following form.

!$acemesh arrayTile dimtile (size_list) [dim(shape_list)] arrvar ()|default [if()]

Array names can be listed in the arrvar clause or implicitly declared using the default clause. The dimtile clause lists the tile size of each array dimension, and '*' is a special tile size meaning that only one chunk exists in that dimension. The dim clause explicitly declares dimension shapes for allocatable arrays. Arrays that are not specified by any arrayTile directive have only one tile, and private arrays of a task obviously need not be tiled.
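As an illustrative usage (the array names and tile sizes are ours), the first directive below tiles a 2D array only along its second dimension, while the second uses the dim clause to supply the shape of an allocatable 3D array:

    !$acemesh arrayTile dimtile(*, 64) arrvar(B)                    ! one chunk in dim 1, tiles of 64 in dim 2
    !$acemesh arrayTile dimtile(*, *, 8) dim(nx, ny, nz) arrvar(C)  ! allocatable C, tiled only in dim 3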

Data dependences on data tiles. The AceMesh compiler turns each referenced array section into a series of data tiles, each of which has a non-empty intersection with the original array section. By representing each data tile with a single address, runtime dependence testing is turned from a section-intersection problem into an address-based one, which greatly simplifies the implementation.

Data tile remapping. Each arrayTile directive can be viewed as a special definition: it may specify a data tile shape for an array that differs from that of a previous directive, incurring data tile remapping. The compiler, together with the runtime system, locates each remapping point and adds a special synchronization task there. The task can be regarded as the following directive.

!$acemesh task out (new_dtiles, old_dtiles) {null_task}

Here, new_dtiles are all the representative addresses of the new data tiles, while old_dtiles are those of the old data tiles.

Data tile template selection. To decide a proper data tile template for each array, users should analyze the DAG region globally. First, for each loop nest, users collect parallel schema candidates (loops to be partitioned) and deduce the implied data tile templates. Then, over a long loop sequence, users judge which data tile template dominates for each array, and whether and where a remapping point should be introduced.

Data tile based locality exploitation. The AceMesh compiler adopts quasi-static task affinity, supplemented by dynamic work stealing. Each data tile is assigned an affinity by the runtime system according to the platform topology. The compiler decides a task's affinity according to the data tiles with the highest access weights, thus exploiting data locality across loops. For loops whose parallel schema does not coincide with the related data tile template, the compiler simply distributes the tasks evenly for load balancing.

Nested task dependence graphs and their resource assignment. An AceMesh task can itself be a task dependence graph; we call such a task a composite task. A composite task cannot start building its task dependence graph until it is dispatched. But how many cores should be assigned to each composite task in a decentralized execution environment? Without a global view of how array variables are accessed among concurrent tasks, it is hard for the runtime system to make a proper decision in advance, and concurrent execution of composite tasks may increase competition for computing cores. Based on the data tile templates and the task's data access information, its affinity is the set of cores to which the accessed data tiles have affinity.

Figure 6 is a code snippet greatly simplified from the filtering phase of the benchmark tend_lin. On line 2, a do directive defines many composite tasks with a chunk size of 6. Since array DT is tiled along two dimensions, the number of data tiles that each composite task accesses is [(endlatdyn − beglatdyn + 1)/4]*6, and this is the number of cores that it needs.

Fig. 6 Composite tasks and resource requirements
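A rough sketch in the spirit of Fig. 6 (the loop names, bounds, dimension order of DT, and the filter kernel are illustrative assumptions): the outer do directive creates composite tasks of 6 altitude levels each, and the nested do directive tiles latitudes by 4, matching DT's data tiles.

    !$acemesh do tasktile(6) nested inout(DT(*, *, k))
    do k = 1, nlev
       !$acemesh do tasktile(4) inout(DT(*, j, k))
       do j = beglatdyn, endlatdyn
          call filter_one_latitude(DT(:, j, k))     ! illustrative filter kernel
       end do
    end do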

Safe and parallel execution of inter-process communication

Executing inter-process communication in a data-driven form naturally overlaps communication and computation. AceMesh allows both collective communications and two-sided point-to-point communications to appear inside tasks. Users do not need to convert communications into non-blocking ones: the AceMesh compiling system makes the transformation automatically, promotes the post operations (communication initiations, such as mpi_irecv) as early as possible, and preempts tasks that wait for MPI communication.

When defining communication tasks in AceMesh, some performance-related programming rules should be obeyed.

Sharing of resources and communication serialization. Inter-process communication reuses several kinds of resources, such as communicators, message tags (integers), variables that hold communication requests, and communication buffers. Among them, communicators and message tags do not by themselves incur data dependences, but they bring implicit ordering constraints among communication calls in the original program according to the MPI specification.

There are two kinds of ordering constraints among MPI communications. First, all processes must call collective operations in the same order per communicator. Second, for two-sided point-to-point messages, the envelope <source, destination, communicator, tag> is used to match messages; it guarantees that two matching messages are received in the same order in which they were sent, which makes the communication behavior deterministic.

To relieve the users from tackling such orderings, AceMesh programming system automatically adds artificial dependences to realize these ordering constraints.

Using more communication-related resources provides more flexibility for advancing communications. But communicator construction incurs overhead, so it should be used carefully. Tag substitution relies on accurate send-receive matching, and automatic static analysis is not sufficient to infer matching sends and receives in a parallel program (Preissl et al. 2008), so tag transformation must also be left to the programmer. To differentiate tags, users may use a tag pool (such as a self-incrementing counter) to allocate tags, but this may introduce true dependences among tasks that further serialize point-to-point communications.

Programming rule 1. Programmers may use a tag pool to differentiate communication tags, but the dynamic incrementing mechanism must not be placed inside any task block, since it would introduce true dependences among tasks, prevent compiler optimization, and finally serialize all the point-to-point communications.
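A minimal illustration of rule 1 (the variable names are ours): the tag is drawn from the pool in task generating code, and only its value is used inside the task block.

    tag = tag_base + tag_count        ! task generating code: advance the tag pool here,
    tag_count = tag_count + 1         ! never inside the task block below
    !$acemesh task in(sendbuf(1:n))
    call mpi_send(sendbuf, n, MPI_DOUBLE_PRECISION, dst, tag, MPI_COMM_WORLD, ierr)
    !$acemesh end task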

Deadlock avoidance. Naively running a well-formed (originally deadlock-free) MPI program in a data-driven manner may produce deadlocks, even if the ordering constraints are maintained. There are no data dependences among send and receive operations within one process, and this freedom may cause problems. Figure 7 shows a scheduling snapshot that may occur in a three-process task-parallel application, where a deadlock arises. Such a situation, carefully avoided by the original MPI program, is commonly encountered by a data-driven task-parallel program. The AceMesh programming system does not block the core when a task waits for a communication to complete, and thus avoids such deadlocks.

Fig. 7 Deadlock snapshot of a task-based 3-process program

There are several deadlock-avoidance mechanisms, but they have either non-negligible overhead or limitations on task definition. Multiple MPI communications may be encapsulated in one AceMesh task, but two requirements should be met. We use task block to denote the structured block associated with a task directive.

Programming rule 2. There are no data dependences among blocking communications inside the same task.

Programming rule 3. Blocking communications (except mpi_wait) should lie at the exit of task blocks.

If rule 2 is not met, the task may pause and resume several times during its execution, which brings overhead. Violation of rule 3 may also introduce pause-and-resume behavior; mpi_wait is allowed to appear anywhere in the task block since the compiler automatically moves it away. A well-formed AceMesh task will not incur any pause-and-resume during its execution.
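A well-formed communication task under these rules might look as follows (the buffer, rank, and tag names are illustrative, and declarations are omitted): the non-blocking post and its mpi_wait may appear anywhere, and the single blocking send neither depends on the received data nor precedes any other statement of the task block.

    !$acemesh task out(uhalo(1:n)) in(ubound(1:n))
    call mpi_irecv(uhalo, n, MPI_DOUBLE_PRECISION, up, 10, MPI_COMM_WORLD, req, ierr)
    call mpi_wait(req, stat, ierr)    ! may appear anywhere; the compiler relocates it
    ! the blocking communication lies at the exit of the task block (rule 3) and
    ! does not depend on the data received above (rule 2)
    call mpi_send(ubound, n, MPI_DOUBLE_PRECISION, down, 11, MPI_COMM_WORLD, ierr)
    !$acemesh end task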

Implementation

The AceMesh compiler is a source-to-source compiler built upon the ROSE compiler infrastructure. It consists of three main parts: the frontend, the midend, and the unparser. Figure 8 gives the flow chart of the midend: (1) variable management, including data tile analysis and local variable promotion; (2) tasktile loop analysis, including array section analysis, tile offset selection, array section lowering, and task affinity analysis; (3) task block analysis, which extracts MPI ordering keys; (4) registration code generation, which produces code for task parameter passing and for registering task objects and data accesses; and (5) task function generation. After the translation, a native compiler compiles the translated code and creates an executable binary linked against the AceMesh runtime library.

Fig. 8 Flowchart of AceMesh compiler's midend

The AceMesh runtime system has two main parts, (1) task dependence graph building (including dependence resolution); (2) task dependence tracking and task scheduling, including MPI-aware scheduling with deadlock avoidance.

Task shifting and false sharing elimination

Since each data tile is represented by its first element's address, false sharing of data tiles among different tasks results in excessive dependence edges. Parallelism decreases dramatically if such excessive edges are imposed among tasks from the same do directive.

We use a task shifting algorithm to find a proper tasktile offset for each do directive in order to eliminate the above false sharing. Even if the algorithm fails, we do not add extra edges among these tasks: for each data tile that causes false sharing, we add an artificial task (which writes that data tile), build dependences from all the tasks in question to this special task, and from it to later tasks.

Modulo offset of range chunking. Each index range [lo, up] can be chunked (or tiled) into multiple chunks (we use tile from now on). Besides the tile size, each tiling scheme is associated with a modulo offset c, c ∈ [0, tilesize), indicating the modulo of the first index in each tile. Thus each tile's range is [tilesize*(i − 1) + c : tilesize*i + c − 1] ∩ [lo, up], where i is an integer. Figure 9 illustrates a tiling scheme for the index range [3:17], where the number under each cell is the index, the number above the cell is its modulo, and the bold vertical lines separate different tiles. There are 5 tiles in this case, and the first tile has only 2 elements.

Fig. 9 An index range with tilesize = 4, modulo_offset = 1
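Concretely, for the tiling of Fig. 9 (tilesize = 4, c = 1), tile i covers [4*(i − 1) + 1 : 4*i] ∩ [3, 17], which yields the five tiles [3:4], [5:8], [9:12], [13:16], and [17:17]; the first tile holds only the two indices 3 and 4, as in the figure.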

In AceMesh, array tile’s modulo_offset is fixed at 1 for Fortran language (0 for C language), while mudulo_offset of tasktiles are adjustable.

Notations and preconditions. We consider affine loops with unit loop step, check one partitioned loop at a time, and focus on the references to a single array. Suppose the array dimension's tile size is dtile, the corresponding loop's tile size is ltile, and the array references share the same affine coefficient k, with smallest affine offset o1 and largest offset o2. There exists an integer m such that k*ltile = m*dtile. The modulo offset that we search for is c ∈ [0, ltile).

Task shifting algorithm. We use Eq. 3 to search for a proper modulo offset; the derivation runs as follows. Let access_dtiles(ha, t) denote all the data tiles of array ha that task t either reads or writes. A sufficient condition for array ha not to incur false sharing is:

$$\mathrm{access\_dtiles}(\mathrm{ha}, t_i) \cap \mathrm{access\_dtiles}(\mathrm{ha}, t_{i+1}) = \emptyset \quad \text{for any two adjacent tasks } t_i \text{ and } t_{i+1}.$$
(1)

According to the notion of modulo offset, the largest index of ha that task t_i accesses is:

$$k*\mathit{ltile}*x + k*c + o2 - k, \quad \text{where } x \text{ is an integer,}$$

and ha’s smallest index that ti+1 accesses is:

$$k*\mathit{ltile}*x + k*c + o1 .$$

Now we can turn Eq. 1 into another sufficient condition:

$$\exists\, c \in [0, \mathit{ltile}),\ \exists\, r \in [\,k*c + o2 - k,\ k*c + o1 - 1\,],\ \exists\, z \in \mathbb{Z}:\ r = \mathit{dtile}*z.$$
(2)

According to c’s value range, r falls into a finite range, in which values that are multiple of dtile form a set MD. Then Eq. 2 equals to Eq. 3.

$$\exists\, q \in [\,o2 - k,\ o1 - 1\,],\ \exists\, r \in \mathrm{MD}:\ (r - q) \bmod k = 0.$$
(3)

In Eq. 3, (r − q)/k is a modulo offset that does not incur any false sharing. Figure 10 shows a loop highly simplified from the interp loop in NPB/MG, which has two different write references to array u. According to Eq. 3, q ∈ {−2} and MD = {0, 8}, so there are two solutions for c: c = 1 or c = 5.

Fig. 10 Task shifting offset of loop j is 1 or 5
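As a worked check of Eq. 3 for this example, assume (an inference of ours from q ∈ {−2} and the two stated solutions, since Fig. 10 is not reproduced here) that the two write references share the affine coefficient k = 2. Then

$$r = 0:\ (0 - (-2)) \bmod 2 = 0,\ c = (0 + 2)/2 = 1; \qquad r = 8:\ (8 - (-2)) \bmod 2 = 0,\ c = (8 + 2)/2 = 5,$$

so Eq. 3 is satisfied for both elements of MD, giving exactly the two admissible offsets c = 1 and c = 5.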

The task shifting algorithm works well for common cases; a counterexample runs as follows. If we change the loop body of Fig. 10 to u(i,3*j − 1) = u(i,3*j + 1), the j-loop is still a doall loop, but according to Eq. 3, q ∈ {−2} and MD = ∅. Even if we change the tile size to 9, MD is still ∅. So no modulo offset can eliminate the false sharing according to our task shifting algorithm.

Data management

Lifetime extension. Real-world applications contain many subroutines, and many of their local variables are shared among different tasks and across different parallel loops. Since the execution of tasks may be deferred, the lifetimes of these local variables must be extended until all tasks referencing them have finished. In AceMesh, a memory pool for variable promotion accompanies each task dependence graph, and the pool is not released until the end of the graph's execution. The compiler identifies all local variables that need to be promoted, allocates space for them at task generating time, and substitutes the related references.

Implementation on communication tasks

Realizing ordering constraints. The compiler introduces artificial virtual variables to realize the communication ordering constraints, as shown in Table 2. The compiler collects all the communication ordering keys inside a task t, builds a backward program slice according to these references, promotes the slice out of the task block, instruments runtime calls to obtain the addresses of the virtual variables, and registers them as output variables. The programming rules in Sect. 2.5 make this code motion legal, since communicators are not modified inside tasks and the tag-incrementing mechanism lies outside of any task code.

Table 2 Virtual variables for maintaining MPI orderings

Safe scheduling of communication tasks. To guarantee safe and parallel execution of communication tasks, the compiler is integrated with an MPI-aware task scheduler. A communication task block goes through four compiling steps: (i) mark it as a block_type task if it contains a post or blocking operation (except native wait operations); (ii) for each block_type task, delete all native mpi_wait calls and reconstruct the mpi_wait in the task along with the post communication, or simply turn blocking communications into non-blocking ones; (iii) move all mpi_wait calls to the end of the task and transform them into mpi_test; (iv) insert runtime calls to push each uncompleted request to the runtime system.
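As an illustration of steps (ii)-(iv), a blocking receive inside a block_type task might be rewritten roughly as follows; acemesh_push_request is a hypothetical name for the runtime call that hands an uncompleted request to the scheduler, and the argument names are ours.

    ! before: call mpi_recv(buf, n, MPI_DOUBLE_PRECISION, src, tag, comm, stat, ierr)
    call mpi_irecv(buf, n, MPI_DOUBLE_PRECISION, src, tag, comm, req, ierr)  ! (ii) turned non-blocking
    call mpi_test(req, done, stat, ierr)                                     ! (iii) wait moved to the end, turned into a test
    if (.not. done) call acemesh_push_request(req)                           ! (iv) pending request pushed to the runtime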

At runtime, a communication thread is dedicated to executing block_type tasks, and it alternates between two modes: executing tasks and polling the pending requests of uncompleted tasks. Each block_type task goes through several phases: (i) it is dispatched to the communication thread; (ii) after execution, a task without pending requests completes and all its dependences are released, while a task with pending requests changes its state to 'finished' but not 'completed'; (iii) only when all its pending requests have completed can such a task complete.

Evaluation

AceMesh clearly has an ease-of-use advantage. In this section we evaluate AceMesh on two different platforms, showing that it can achieve better performance than popular programming models by improving cache reuse, spreading memory bandwidth requirements (by co-scheduling different types of tasks), exploiting task parallelism globally across the whole DAG region, and hiding MPI communication latency.

Evaluated platforms. We use the following two systems in our experiments.

The first is the SunWay TaihuLight supercomputer. Each of its compute nodes is a single heterogeneous SW26010 CPU (see Fig. 11), composed of four core groups (CGs) connected via a low-latency on-chip network. Each CG consists of a management processing element (MPE) and a mesh-based accelerator array with 64 computing processing elements (CPEs), for a total of 260 cores per CPU. The MPE is a general-purpose processor, while the CPEs are simplified cores without caches, each equipped with 64 KB of scratch pad memory (SPM). Both are single-threaded cores.

Fig. 11 The architecture of SW26010 (Xu et al. 2017)

The second platform is the TianHe-2 (TH-2) system, and each of its nodes is a 2-socket Xeon E5-2692V2 system running at 2.2 GHz. Each processor features 12 cores and 24 hardware threads, with a 32 KB L1 data cache, a 256 KB L2 cache, and a 30 MB shared L3 cache. The nodes are connected via the TH Express-2 network. Since the AceMesh compiler is currently not NUMA-aware, we use at most 12 cores per process.

AceMesh versions. For each benchmark, the AceMesh versions on different platforms are almost the same, except that map clauses are added to tile/do directives to realize heterogeneous computation on SunWay TaihuLight. On TianHe-2, the AceMesh versions only use 11 computing threads in each process, since an extra communication thread is needed by the AceMesh runtime.

Reference versions. Since it is almost impossible to refactor a real-world application like tend_lin into data-flow style using OpenMP 4.0, we choose popular programming languages/interfaces as references: OpenMP 3.0 on TH-2 and OpenACC on SunWay TaihuLight. On TH-2, the Intel compiler ifort-13.0.0 is used, and the OpenMP versions are tested in two configurations, with 11 threads and with 12 threads. On SunWay TaihuLight, we checked each parallel loop in the OpenACC versions to ensure that it is fully optimized to use the SPM as much as possible. We tune all the directive parameters (such as OpenMP's loop schedule parameters and OpenACC's tile parameters) for each parallel version.

Evaluation of NPB/MG

The NPB MG benchmark uses a V-cycle multigrid algorithm to solve Poisson's equation on a discrete 3D grid with periodic boundary conditions. In the V cycle, the computation starts from the finest refinement level, goes down level by level toward the bottom, and then comes back up to the top. Among the four critical operations, restriction and residual evaluation operate at a single grid level, while interpolate and project work between adjacent grid levels. Domain decomposition is applied along each dimension (X, Y, Z), so point-to-point communication is needed to update every process's boundary values for each dimension. Communications along the three dimensions are serialized for communication coalescing, which narrows the performance gains achievable through communication overlap.

Problem sizes. Three inputs are tested on the SunWay platform, while only sizes C and D are evaluated on TH-2. Size C has 20 time steps and a finest grid of 512³, size D has 50 steps and a 1024³ grid, and size E also has 50 steps and a 2048³ grid.

About the reference versions. We parallelize the outermost loop using static loop schedule. We tried dynamic schedules for the OpenMP version, but found no performance improvements. We also parallelize the pack/unpack loop of the X dimension communication.

About the AceMesh versions. All three main arrays are declared arrayTiled in two dimensions in order to create more opportunities for communication overlap. The three groups of arrays share the same data tile shape, and the tasktile arguments of the five main computation loops are consistent with the arrays' data tiles.

Evaluation on SunWay TaihuLight. To ensure a fair comparison, we check the accumulated pure kernel computation time on the CPEs (excluding DMA time) on the finest two grid levels; the time difference between the OpenACC version and the AceMesh version is within 1%.

Figure 12a shows the AceMesh/OpenACC speedups on problem size C, using all 64 CPEs of the accelerator. The main performance advantage of AceMesh comes from communication hiding. The speedups grow from 1.14 to 1.19 as the number of processes increases, since the communication time grows. It should be noted that the out-of-order scheduling of AceMesh also helps to improve DMA bandwidth utilization: in Fig. 12b, the accumulated DMA time of the AceMesh version is much less than that of its counterpart, with speedups ranging from 1.28 to 1.45.

Fig. 12 NPB/MG-C results on SunWay TaihuLight. a Relative speedups on size C. b DMA time comparison on size C

Figure 13 shows results across different problem sizes, with up to 4096 processes. The AceMesh/OpenACC speedups are above 1.12, but the best speedup is attained in the 8-process configuration. If we focus on the most time-consuming, finest two levels, the speedups are always above 1.20 and increase with the number of processes. After investigation, we find that although the configurations have the same subdomain size per process, they differ in the depth of refinement: problem size E has 3 more coarse levels than size B, and these coarse levels are communication-intensive with little opportunity for communication overlap, so size E gains less speedup from AceMesh.

Fig. 13 Relative speedups of NPB/MG on SunWay TaihuLight across different inputs

Evaluation on TH-2. On TH-2, with the same number of processes, the application becomes communication-intensive: communication accounts for 32% and 44% of the time for the 32- and 64-process OpenMP executions, so AceMesh has more room for communication-computation overlap. Figure 14 gives the TH-2 results on problem sizes C and D. For both problem sizes, the AceMesh/OpenMP speedups grow with the number of processes, and the speedups are higher on size C than on size D. For size C, compared with 11-thread OpenMP, the speedups climb sharply from 1.1 to 1.66 and finally reach 2.2. PAPI profiling shows that L3 cache misses drop by 34%, 59%, and 89% for 32, 64, and 128 processes, respectively, compared with the OpenMP versions; AceMesh's depth-first task scheduling benefits both the L3 and the L2 cache. Compared with 12-thread OpenMP, the speedups drop a little but still reach 1.85 at 128 processes for size C and 1.18 for size D, even though the OpenMP version uses one more computing core.

Fig. 14 NPB/MG results on TH-2. a Relative speedups on size C. b Relative speedups on size D

Communication scalability. Focusing on size C, communication scalability is good on the SunWay platform, with communication ratios ranging from 4% (8 processes) to 14% (128 processes). On TH-2, the ratios grow from 16% to 64%, and larger communication ratios provide more opportunities for communication hiding.

Evaluation of Tend_lin

Tend_lin is a benchmark extracted from IAP AGCM-4, a high-resolution Atmospheric General Circulation Model (AGCM) code developed by Zhang et al. (2009). It is the hotspot of IAP AGCM-4's dynamic core, which solves the baroclinic primitive equations. To obtain the four basic prognostic variables, the zonal wind (U), the meridional wind (V), the pressure (P), and the temperature (T), the finite difference equations are divided into advection and adaptation processes, of which the adaptation process (the tend_lin subroutine) is far more time-consuming. We built a standalone benchmark, Tend_lin, in which the original tend_lin subroutine is called multiple times to simulate the real circumstances in IAP AGCM.

The benchmark uses the same 2D decomposition scheme as IAP AGCM-4, where the latitude and altitude levels of the mesh are partitioned among parallel processes and the longitude dimension is left sequential. Tend_lin computes DPsa, DU, DV, and DT, which are the tendencies of the mode variables PT, UT, VT, and TT. Each variable goes through two processing phases: the computation phase (which we call the stencil phase below) and the filtering phase. In the first phase, there are many multi-point 3D stencil computations and several point-to-point communications, along with three collective communications including mpi_allgatherv.

Filtering is applied at different levels. Among the four arrays, the filters of DPsa and DT incur nearest-neighbor communications, while those of DU and DV do not. 20 different tags are used in this phase to avoid imposing too many ordering constraints among communications. There is plenty of parallelism implied in the filtering phase: first, the filtering of the four arrays is mutually independent; second, the filtering of different levels of the same array is independent; and third, the filtering of different latitudes is also independent.

Problem sizes. We consider two resolutions of the model: 0.5° × 0.5° (50 km) with a 3D grid of 720 × 361 × 30, and 0.25° × 0.25° (25 km) with a 3D grid of 1440 × 721 × 30. Overall, the application has a relatively low computation-to-communication ratio, and communication may occupy 70% of the total time for the 128-process OpenMP version on TH-2.

About the AceMesh versions. All the physical variables are declared arrayTiled in the latitude dimension with the same tile size, and the three tendency variables (DT, DU, DV) are further arrayTiled (with tile size = 1) in the altitude dimension. Most of the parallel loops are tasktiled consistently along the latitude dimension, while a few loop nests parallelize the outermost loop instead because their latitude loop is not suitable for task parallelism. For each tendency variable, the filtering of each level is a nested task dependence graph, as shown in Fig. 6.
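A sketch of the corresponding declarations (the dimension order of the arrays and the latitude tile size 4 are illustrative assumptions; per the description above, the altitude tile size of the tendency variables is 1):

    !$acemesh arrayTile dimtile(*, 4, *) arrvar(PT, UT, VT, TT)   ! tiled in the latitude dimension only
    !$acemesh arrayTile dimtile(*, 4, 1) arrvar(DT, DU, DV)       ! additionally tiled per altitude level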

About the reference versions. The OpenMP and OpenACC versions usually exploit parallelism only within a local loop nest, and there is not enough parallelism in the latitude dimension at large process counts, so we use the collapse(2) clause to expose more parallelism. For the OpenMP version, we further use nested 'parallel' regions in the filtering phase, allowing multiple regions to run concurrently. For the OpenACC version, nested parallelism is not supported on SunWay.

Parameter tuning. We also tune process domain decomposition parameters for tend_lin. For 64-process scale, the best version of AceMesh and that of OpenMP may take different factorizations, 4*16, 8*8, or 16*4.

Evaluation on SunWay TaihuLight. Figure 15 gives the AceMesh/OpenACC performance on SunWay TaihuLight for both inputs. AceMesh performance on the 0.5° input can be up to 3.0X that of OpenACC, the biggest speedup on the 0.25° input is 2.86X, and the average speedups on both inputs are above 2.0X. To understand the relative speedups, we measure the stencil phase and the filtering phase separately. For the filtering phase, the performance advantage of AceMesh grows with the number of processes: the number of tasks in each loop decreases as the process count increases, and OpenACC cannot keep the 64 cores of the accelerator busy, while AceMesh can exploit the three levels of parallelism implied among different arrays, different levels, and different latitudes. It should be noted that AceMesh shows speedups below 1 in the filtering phase at 16 and 32 processes; the reason is that there is already plenty of parallelism in each parallel loop in those cases, and AceMesh's runtime overhead for task graph building hinders its performance. For the stencil phase, the AceMesh versions always perform faster than their OpenACC counterparts, but the relative speedups are not monotonic in the number of processes.

Fig. 15 Tend_lin results on TaihuLight. a AceMesh/OpenACC speedups (0.5° resolution). b AceMesh/OpenACC speedups (0.25° resolution)

Evaluation on TH-2. Figure 16 gives the performance comparison on TH-2, and the results are quite similar to those on TaihuLight, i.e., the 0.5° input sees larger speedups than the 0.25° input. Since there are only 12 cores in each MPI process, the starvation of parallelism is relatively mild. The relative speedups of the filtering phase reach at most 1.93X and 3.37X on the two inputs, respectively. Still, the 8-process version shows a speedup below 1 in the filtering phase, which indicates that the graph building overhead is non-negligible. The speedups of the stencil phase are mild and stable. The final speedup is 1.5X on average for both inputs, and the highest speedups reach 2.11X and 2.23X for the two inputs, respectively.

Fig. 16 Tend_lin results on TH-2. a AceMesh/OpenMP speedups (0.5° resolution). b AceMesh/OpenMP speedups (0.25° resolution)

Communication scalability. In tend_lin, the scalability of communication is poor. On the SunWay platform, the communication ratio grows from 38% (16 processes) to 61% (128 processes) and 69% (1024 processes) for the 0.25° input, and it can reach 84% at the 1024-process scale for the 0.5° input. On TH-2, the ratio can reach 69% at the 128-process scale for the 0.5° input.

Related work

Asynchronous Task-based Programming models are gaining popularity as a way to address the programmability and performance challenges of contemporary multicore and heterogeneous systems. ATaP models vary in the type of parallelism they support, from purely fork-join as in OpenMP 5.0 (OpenMP Architecture Review Board 2018) to directed-acyclic-graph based, as in OmpSs (Duran et al. 2011) and StarPU (Augonnet et al. 2011).

One of the main advantages of ATaP models is their potential to automatically overlap computation with communication. Many such models have been developed for large-scale, distributed-memory systems, such as Charm++ (Acun et al. 2014), HPX (Kaiser et al. 2014), Tarragon (Cicotti 2011), and Legion (Bauer et al. 2012), but programs must be rewritten from scratch, so they are not friendly to legacy applications. With the help of user directives, Nguyen et al. (2017) translate legacy MPI programs into Tarragon programs in order to realize communication overlap.

Hybrid programming using MPI + ATaP is another approach to modernizing legacy codes, and our work falls into this category. The ATaP runtime must be extended to be aware of certain MPI primitives. Marjanović et al. (2010) propose a restart directive and a restart mechanism to avoid MPI-related deadlocks. Sala et al. (2019) present the TAMPI (Task-Aware MPI) library and two runtime APIs that allow a task to be suspended once an MPI operation blocks inside it, and to resume its execution or finalize once the communication completes. AceMesh's MPI-aware scheduling mechanism is similar to TAMPI's approach. Castillo et al. (2019) extend MPI to expose information about internal events to the tasking runtime, making their interaction more efficient. None of the above work handles the communication ordering constraints; the related code refactoring is left to the user.

Some ATaP models allow multi-dimensional array sections in data dependence expressions; although this broadens the applicability of the programming model, it also incurs relatively large overhead. Perez (2014) extends the OmpSs runtime system to support overlapping array sections and finds that, in the worst cases, task management overhead can be 2.3 times higher than with a section-unaware runtime. Many compiling systems (Ghosh et al. 2013; Podobas et al. 2014) choose to implement section-unaware runtimes.

In ATaP models, tasks are expressed separately with no global view of the data usage pattern, so task and data mapping is deferred until dispatching time. The DEP approach (dependence easy placement) (Drebes et al. 2014, 2016; Virouleau et al. 2016) tackles memory locality by automatically mapping tasks according to where their input and output data are allocated. Barrera et al. (2018) perform graph partitioning dynamically over the task dependence graph to minimize data transfers across NUMA nodes. In our work, by contrast, both memory locality and cache locality can be handled earlier, at task generating time. Furthermore, the tile-based array section description may also simplify DEP's mapping process. Broquedis et al. (2010) suggest dynamically generating structured trees (bubbles) out of OpenMP programs and mapping them to hierarchical hardware according to collected information such as task-data relationships, runtime feedback, and scheduling hints. AceMesh's arrayTile directive, together with the aggregated task definition directive, makes it possible to build an implicit tree of tasks at generation time, facilitating bubble-like schedulers for the ATaP model.

Conclusion

This paper presents AceMesh, a data-driven programming language extension, together with its implementation and evaluation. It provides data tile directives as parallelizing templates and an aggregated approach to expressing dependent tasks, which simplifies programming. We also show that this tile-based approach does not hurt task-level parallelism and enables structured execution of complex task graphs. We introduce a prototype implementation. The evaluation on two supercomputers shows how the programming system helps parallelize real-world applications and how it remarkably improves performance over existing programming languages/interfaces.

There are several directions for future work. One is how to use data tile templates on NUMA machines and, further, to support dynamic task and data placement. Another is to investigate more optimizations that facilitate programming, such as automatic tag range amplification under certain programmer directives, tolerating pause-and-resume communication tasks, and more intelligent task scheduling in favor of communication overlap.

References

  1. Acun, B., Gupta, B., Jain, N., Langer, A., Menon, H., Mikida, E., Ni, A., Robson, M., Sun, Y., Totoni, E., Wesolowski, L., Kale, L.: Parallel programming with migratable objects: Charm++ in Practice. SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, 2014, pp. 647–658, doi: 10.1109/SC.2014.58.

  2. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.: StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurr. Comput. Pract. Exper. 23(2), 187–198 (2011). https://doi.org/10.1002/cpe.1631


  3. Barrera, I.S., Moretó, M., Ayguadé, E., Labarta, J., Valero, M., Casas, M.: Reducing data movement on large shared memory systems by exploiting computation dependencies. In Proceedings of the 2018 International Conference on Supercomputing (ICS ’18). ACM, New York, NY, USA, pp. 207–217. https://doi.org/10.1145/3205289.3205310

  4. Bauer, M., Treichler, S., Slaughter, E., Aiken, A.: Legion: expressing locality and independence with logical regions. In Proceedings of the 2012 ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’12). IEEE Computer Society, Los Alamitos, CA, USA, Article 66, p. 11.

  5. Broquedis, F., Aumage, O., Goglin, B., Thibault, S., Wacrenier, P., Namyst,R.: Structuring the execution of OpenMP applications for multicore architectures. 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), Atlanta, GA, 2010, pp. 1-10.

  6. Castillo, M., Jain, N., Casas, M., Moreto, M., Schulz, M. Beivide, R., Valero, M., Bhatele, A.: Optimizing computation-communication overlap in asynchronous task-based programs. In Proceedings of the ACM International Conference on Supercomputing (ICS ’19). Association for Computing Machinery, New York, NY, USA, pp. 380–391. https://doi.org/10.1145/3330345.3330379

  7. Cicotti, P.: Tarragon: a programming model for latency-hiding scientific computations. PhD thesis, Department of Computer Science and Engineering, University of California, San Diego (2011)

  8. Drebes, A., Heydemann, K., Drach, N., Pop, A., Cohen, A.: Topology-aware and dependence-aware scheduling and memory allocation for task-parallel languages. ACM Trans. Archit. Code Optim. 11(3), 1–25 (2014). https://doi.org/10.1145/2641764


  9. Drebes, A., Pop, A., Heydemann, A., Cohen, A., Drach, N.: Scalable task parallelism for NUMA: a uniform abstraction for coordinated scheduling and memory management. In International Conference on Parallel Architectures and Compilation (PACT ’16). ACM, New York, NY, USA, pp. 125–137. https://doi.org/10.1145/2967938.2967946

  10. Duran, A., Ayguadé, E., Badia, R.M., Labarta, J., Martinell, L., Martorell, X., Planas, J.: OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Process. Lett. 21(2), 173–193 (2011)


  11. Ghosh, P., Yan, Y., Chapman, B.: A prototype implementation of OpenMP task dependency support. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2013. LNCS, vol. 8122, pp. 128–140. Springer, Heidelberg (2013)

  12. Kaiser, H., Heller, T., Adelstein-Lelbach, B., Serio, A., Fey, D.: HPX: a task based programming model in a global address space. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS ’14). ACM, New York, NY, USA, Article 6, p. 11.

  13. Marjanović, V., Labarta, J., Ayguadé, E., Valero, M.: Overlapping communication and computation by using a hybrid MPI/SMPSs approach. In Proceedings of the 24th ACM International Conference on Supercomputing, 2010, pp. 5–16, doi: 10.1145/1810085.1810091

  14. Nguyen, T., Cicotti, P., Bylaska, E., Quinlan, D., Baden, S.: Automatic translation of MPI source into a latency-tolerant, data-driven form. J. Parallel Distrib. Comput. 106, 1–13 (2017). https://doi.org/10.1016/j.jpdc.2017.02.009


  15. Perez, J.M.: A dependency-aware parallel programming model. PhD thesis. Universitat Politècnica de Catalunya, Barcelona (2014)

  16. Podobas, A., Brorsson, M., Vlassov, V.: TurboBLYSK: scheduling for improved data-driven task performance with fast dependency resolution. In: DeRose, L., de Supinski, B.R., Olivier, S.L., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2014. LNCS, vol. 8766, pp. 45–57. Springer, Cham.

  17. Preissl, R., Schulz, M., Kranzlmuller, D., de Supinski, B., Quinlan, D.: Using MPI communication patterns to guide source code transformations. In Computational Science ICCS 2008, Volume 5103 of Lecture Notes in Computer Science, pp. 253–260. Springer, Berlin/Heidelberg (2008).

  18. OpenMP Architecture Review Board: OpenMP application program interface. Version 5.0. Nov. 2018. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf

  19. Sala, K., Teruel, X., Perez, J.M., Peña, A.J., Beltran, V., Labarta, J.: Integrating blocking and non-blocking MPI primitives with task-based programming models. Parallel Comput. 85, 153–166 (2019). https://doi.org/10.1016/j.parco.2018.12.008


  20. Virouleau, P., Broquedis, F., Gautier, T., Rastello, F.: Using data dependencies to improve task-based scheduling strategies on NUMA architectures. In Euro-Par 2016: Parallel Processing. Springer, Cham, pp. 531–544. https://doi.org/10.1007/978-3-319-43659-3_39

  21. Xu, Z., Lin, J., Matsuoka, S.: Benchmarking SW26010 many-core processor. In Proceedings—2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017, pp. 743–752, June 30, 2017

  22. Zhang, H., Lin, Z., Zeng, Q.: The computational scheme and the test for dynamical framework of IAP AGCM-4. Chin. J. Atmos. Sci. 33, 1267–1285 (2009)



Acknowledgements

This work was supported by the National Key R&D Program of China (Grant No. 2017YFB02-02002) and the Innovation Research Group of NSFC (Grant No. 61521092).

Author information

Correspondence to Li Chen.


Keywords

  • High performance computing
  • Programming model
  • MPI
  • Task parallel
  • Data driven
  • Task dependence