XcalableMP Programming Model and Language

XcalableMP (XMP) is a directive-based language extension of Fortran and C for distributed-memory parallel computers, and can be classified as a partitioned global address space (PGAS) language. One of the remarkable characteristics of XMP is that it supports both global-view and local-view parallel programming. This chapter describes the programming model and language specification of XMP.

In such circumstance, to develop a new parallel programming model that enables easy parallelization of existing serial programs and design a new language based on it, "the XMP Specification Working Group" was established in 2008. This group utilized the lessons from the experience of HPF to define a new parallel language XcalableMP (XMP). The group was reorganized to one of the working groups of PC Cluster Consortium in 2011.
It is learned from the lessons of HPF that more automatic processing of compilers increases the gap between a program and its execution, and, as a result, decreases the usability of the language. In XMP, the programmers specify explicitly the details of parallel programs on the basis of compiler directives to make their execution easy to understand. In particular, they can specify explicitly communication, synchronization, data mapping, and work mapping to facilitate performance tuning. In addition, XMP supports features for one-sided communication on each process, which was not available in HPF. This feature might enable programmers to implement parallel algorithms more easily.
In this chapter, an overview of the programming model and language specification of XMP is shown. You can find the latest and complete language specification of XMP in: XcalableMP Specification Working Group, XcalableMP Specification Version 1.4, http://xcalablemp.org/download/spec/xmp-spec-1.4.pdf (2018).

Target Hardware
The target of XcalableMP is distributed-memory multicomputers (Fig. 1). Each compute node, which may contain several cores, has its own local memory (shared by the cores, if any), and is connected with the others via an interconnection network. Each node can access its local memory directly and remote memory (the memory of another node) indirectly (i.e. via inter-node communication). However, Fig. 1 Target hardware of XMP it is assumed that accessing remote memory may be much slower than accessing local memory.

Execution Model
The execution entities in an XMP program are referred to as XMP nodes or, more simply, nodes, which has its own memory and can communicate with each other.
An XcalableMP program execution is based on the Single Program Multiple Data (SPMD) model, where each node starts execution from the same main routine, and continues to execute the same code independently (i.e. asynchronously) until it encounters an XcalableMP construct (Fig. 2).
A set of nodes that executes a procedure, statement, loop, a block, etc. is referred to as its executing node set, and is determined by the innermost task, loop, or array directive surrounding it dynamically, or at runtime. The current executing node set is an executing node set of the current context, which is managed by the XcalableMP runtime system on each node.
The current executing node set at the beginning of the program execution, or entire node set, is a node set that contains all the available nodes, which can be specified in an implementation-defined way (e.g. through a command-line option).
When a node encounters at runtime either a loop, array, or task construct, and is contained by the node set specified (explicitly or implicitly) by the on clause of the directive, it updates the current executing node set with the specified one and executes the body of the construct, after which it resumes the last executing node set and proceeds to execute the subsequent statements.
In particular, when a node in the current executing node set encounters a loop or an array construct, it executes the loop or the array assignment in parallel with the other nodes, so that each iteration of the loop or element of the assignment is independently executed by the node in which the specified data element resides.
When a node encounters a synchronization or a communication directive, synchronization or communication occurs between it and the other nodes. That is, such global constructs are performed collectively by the current executing nodes. Note that neither synchronization nor communication occurs unless these constructs are specified.

Data Model
There are two classes of data in XcalableMP: global data and local data. Data declared in an XcalableMP program are local by default.
Global data are distributed onto a node set by the align directive (see Sect. 2.4). Each fragment of distributed global data is allocated in the local memory of a node in the node set.
Local data comprises all data that are not global. They are replicated within the local memory of each of the executing nodes.
A node can access directly only local data and sections of global data that reside in its local memory. To access data in remote memory, explicit communication must be specified in such ways as global communication constructs and coarray assignments (Fig. 3).

Partitioned Global Address Space
XMP can be classified as a partitioned global address space (PGAS) language, such as Co-Array Fortran [1], Unified Parallel C [2], and Chapel [3].
In such PGAS languages, multiple executing entities (i.e. threads, processes, or nodes in XMP) share a part of their address space, which is, however, partitioned and a portion of which is local to each executing entity.
The two programming models, global-view and local-view, that XMP supports to achieve high performance and productivity on PGAS are explained below.

Global-View Programming Model
The global-view programming model is useful when, starting from a serial version of a program, the programmer parallelizes it in a data-parallel style by adding directives with minimum modification. Based on this model, the programmer specifies the distribution of data among nodes using the data distribution directives. The loop construct assigns each iteration of a loop to the node at which the computed data is located. The global-view communication directives are used to synchronize nodes, maintain the consistency of shadow areas of distributed data, and move sections of distributed data globally. Note that the programmer must specify explicitly communication to make all data references in their program local using appropriate directives.
In many cases, the XcalableMP program following the global-view programming model is based on a serial program, and it can produce the same result, regardless of the number of nodes (Fig. 4).
There are three groups of directives for this model: • Data mapping, which specifies the data distribution and mapping to nodes • Work mapping (parallelization), which specifies the work distribution and mapping to nodes. • Communication and synchronization, which specify how a node communicates and synchronizes with the other nodes.
Because these directives are ignored as a comment by the compilers of base languages (Fortran and C), an XcalableMP program can usually be compiled by them to ensure that they run properly.

Local-View Programming Model
The local-view programming model is suitable for programs that implement an algorithm and a remote data reference that are to be executed by each node (Fig. 5).  For this model, some language extensions and directives are provided. The coarray notation, which is imported from Fortran 2008, is one such extension, and can be used to explicitly specify data on which node is to be accessed. For example, the expression of A(i) [N] in XcalableMP Fortran is used to access an array element of A(i) located on the node N. If the access is a reference, then a one-sided communication to read the value from the remote memory (i.e. the get operation) is issued by the executing node. If the access is a definition, then a one-sided communication to write the value to the remote memory (i.e. the put operation) is issued by the executing node.

Mixture of Global View and Local View
In the global-view model, nodes are used to distribute data and works. In the localview model, nodes are used to address remote data in the coarray notation. In application programs, the programmers should choose an appropriate data model according to the characteristics of their program. Figure 6 illustrates the global view and the local view of data.
Data can have both a global view and a local view, and can be accessed in both of the views. XcalableMP provides a directive to give the local name (alias) to global data declared in the global-view programming model to enable them to also be accessed in the local-view programming model. This feature is useful to optimize a certain part of a program by using explicit remote data access in the local-view programming model.

Base Languages
The XcalableMP language specification is defined on the basis of Fortran and C as the base languages. More specifically, the base language of XcalableMP Fortran is Fortran 90 or later, and that of XcalableMP C is ISO C90 (ANSI C89) or later with some extensions (see below).

Array Section in XcalableMP C
In XcalableMP C, the base language C is extended so that a part of an array, that is, an array section or subarray, can be put in an array assignment statement, which is described in Sect. 1.5.2, and some XcalableMP constructs. An array section is built from a subset of the elements of an array, which is specified by a sequence of square-bracketed integer expressions or triplets, which are in the form of: [ base ] : [ length ] [ : step ] When step is positive, the triplet specifies a set of subscripts that is a regularly spaced integer sequence of length length beginning with base and proceeding in increments of step up to the largest. The same applies to negative step too.
When base is omitted, it is assumed to be 0. When length is omitted, it is assumed to be the number of remainder elements of the dimension of the array. When step is omitted, it is assumed to be 1.  array section of the whole of A

Array Assignment Statement in XcalableMP C
In XcalableMP C, the base language C is also extended so that it supports array assignment statements just as Fortran does. With such statement, the value of each element of the result of the right-hand side expression is assigned to the corresponding element of the array section on the lefthand side. When an operator or an elemental function is applied to array sections in the right-hand side expression, it is evaluated to an array section that has the same shape as that of the operands or arguments, and each element of which is the result of the operator or function applied to the corresponding element of the operands or arguments. A scalar object is assumed to be an array section that has the same shape as that of the other array section(s) in the expression or on the left-hand side, and where each element has its value.
Note that an array assignment is a statement, and therefore cannot appear as an expression in any other statements.
In the example below, an array assignment statement in the fourth line copies the five elements from B[0] to B [4] into the elements from A [5]

Interoperability
Most of the existing parallel applications are written with MPI. It is not realistic to port them over to XMP because each of them consists of millions of lines.
Because XMP is interoperable with MPI, users can develop an XMP application by modifying a part of an existing one instead of rewriting it totally. Besides, when developing a parallel application from scratch, it is possible to use XMP to write a complicated part of, for example, domain decomposition while they use MPI, which could be faster than XMP, to write a hot-spot part that need to be tuned carefully. In addition, XMP is interoperable with OpenMP and Python (see Chap. 5).
It might be difficult to develop an application with just one programming language or framework since it generally has its own strong and weak points. Thus, an XMP program is interoperable with those in other languages to provide both high productivity and performance.

nodes Directive
The nodes directive declares a node array, which is an array-like arrangement of nodes in a node set. A node array can be multi-dimensional.

Note
The ordering of the elements in a node array follows that of a normal array in the base language, C or Fortran.

XcalableMP C #pragma xmp nodes p[ * ]
XcalableMP Fortran !$xmp nodes p( * ) An asterisk can be specified as the size in the nodes directive to declare a dynamic node array. In the above code, one-dimensional dynamic node array p is declared with an asterisk as the size. The actual size of a dynamic node array is determined at runtime to fit the size of the current executing node set. For example, when the programmer runs the sample code with three nodes, the node array p includes three nodes.
The programmer can also declare multi-dimensional dynamic node arrays with an asterisk.

Note
The programmer can put an asterisk only in the last dimension, in XMP/Fortran, or the first dimension, in XMP/C, of the node array.

Hint
The dynamic node array may interfere with compiler optimizations. In general, programs with static ones achieve better performance.
The programmer can declare a node subarray derived from an existing node array. Node subarrays can be used, for example, to optimize inter-node communication by reducing the number of nodes participating in the communication.
XcalableMP In line 1, a node array p including 16 nodes is declared. In line 2, a node subarray q corresponding to the first half of p is declared. In line 3, a twodimensional node subarray r corresponding to the latter half of p is declared.
The programmer can declare an n-dimensional node subarray derived from an m-dimensional one (Fig. 7).

Note
The semantics of an asterisk in a node reference is different from that in a declaration.

template Directive
The template directive declares a template, which is a virtual array that is used as a "template" of parallelization in the programs and to be distributed onto a node array.
XcalableMP C #pragma xmp template t[10] XcalableMP Fortran !$xmp template t(10) This template directive declares a one-dimensional template t having ten elements. Templates are indexed in the similar manner to arrays in the base languages. For the above examples, the template t is indexed from zero to nine (i.e. t[0] · · · t[9]), in XMP/C, or one to ten (i.e. t(1) · · · t(10)), in XMP/Fortran. Hint In many cases, a template should be declared to have the same shape as your target array.
XcalableMP C #pragma xmp template t [:] XcalableMP Fortran !$xmp template t(:) In the above examples, a colon instead of an integer is specified as the size to declare a one-dimensional dynamic template t. The colon indicates that the size of the template is not fixed and to be fixed at runtime by the template_fix construct (Sect. 2.6).

distribute Directive
The distribute directive specifies a distribution of the target template. Either of block, cyclic, block-cyclic, or gblock (i.e. uneven block) can be specified to distribute a dimension of a template.

Block Distribution
XcalableMP C #pragma xmp distribute t[block] onto p XcalableMP Fortran !$xmp distribute t(block) onto p The target template t is divided into contiguous blocks and distributed among nodes in the node array p (Fig. 8). Let's suppose that the size of the template is N and the number of nodes is K. If N is divisible by K, a block of size N/K is assigned to each node; otherwise, a block of size ceil(N/K) is assigned to each of N/ceil(N/K) nodes, a block of size mod(N, K) to one node, and no block to (K − N/ceil(N/K) − 1) nodes. The block distribution is useful for regular computations such as a stencil one.

Note
The function ceil(x) returns a minimum integer value greater than or equal to x, and mod(x, y) returns x modulo y.

Cyclic Distribution
The target template t is divided into chunks of size one and distributed among nodes in the node array p in a round-robin manner (Fig. 9). The cyclic distribution is useful for the case where the load on each element of the template is not balanced. XcalableMP

Block-Cyclic Distribution
The target template t is divided into chunks of size w and distributed among nodes in the node array p in a round-robin manner (Fig. 10). The block-cyclic distribution is useful for the case where the load on each element of the template is not balanced but the locality of the elements is required.   (3)) onto p
XcalableMP C #pragma xmp nodes p [3] #pragma xmp template t The programmer can specify an asterisk instead of a mapping array in the gblock distribution to defer fixing the actual distribution. In such a case, the actual distribution will be fixed at runtime by using the template_fix construct.

Distribution of Multi-Dimensional Templates
The programmer can distribute a multi-dimensional template onto a node array.
XcalableMP The distribute directive declares the distribution of a two-dimensional template t onto a two-dimensional node array p. Each dimension of the template is divided in a block manner and each of the rectangular region is assigned to a node (Fig. 12).
The programmer can specify a different distribution format in each of the dimension of a template (Fig. 13).
XcalableMP   When an asterisk is specified in a distribute directive as a distribution format, the target dimension is "non-distributed." In the following example, the first dimension is distributed in a block manner and the second dimension is nondistributed (Fig. 14).

align Directive
The align directive specifies that an array is to be mapped in the same way as a specified template. In other words, an align directive defines the correspondence of elements between an array and a template, and each of the array element is allocated on the node where the corresponding template element is assigned.
The array a is decomposed and laid out so that each element a(i) is colocated with the corresponding template element t(i) (Fig. 15).
The align directive can also be used for multi-dimensional arrays (Fig. 16 The programmer can align an n-dimensional array with an m-dimensional template for n > m (Fig. 17).
XcalableMP When an asterisk is specified as a subscript in a dimension of the target array in the align directive, the dimension is "collapsed" (i.e. not distributed). In the sample program above, the first dimension of the array a is distributed onto the node array p while the second dimension is collapsed.
The programmer also can align an n-dimensional array with an m-dimensional template for n < m (Fig. 18 When an asterisk is specified as a subscript in a dimension of the target template in the align directive, the array will be "replicated" along the axis of the dimension. In XMP/C, a[0:4] will be replicated and allocated on p[0][0] and p[0] [1] while, in XMP/Fortran, a(1:4) will be allocated on p(1,1) and p(2,1).

Dynamic Allocation of Distributed Array
This section explains how distributed (i.e. global) arrays are allocated at runtime. The basic procedure is common in XMP/C and XMP/Fortran with a few specific difference.
XcalableMP : a = xmp_malloc(xmp_desc_of(a), N); In XMP/C, first, declare a pointer of the type of the target array; second, align it as if it were an array; finally, allocate memory for it with the xmp_malloc() function. xmp_desc_of() is an intrinsic/built-in function that returns the descriptor of the XMP object (i.e. nodes, templates, or global arrays) specified by the argument.

allocate(a(N))
In XMP/Fortran, first, declare an allocatable array; second, align it; finally, allocate memory for it with the allocate statement.
For multi-dimensional arrays, the procedure is the same as that for onedimensional ones, as follows: XcalableMP Note If the size of the template is not fixed until runtime, the programmer has to fix it at runtime with the template_fix construct.

template_fix Construct
The template_fix construct fixes the shape and/or the distribution of an unfixed template.
XcalableMP C #pragma xmp nodes p [4]  XcalableMP Fortran !$xmp nodes p(4) !$xmp template t(:) !$xmp distribute t(block) onto p real, allocatable :: a(:) 5 integer :: n !$xmp align a(i) with t(i) n = 100 !$xmp template_fix t(n) 10 allocate(a(n)) In the above sample code, first, a template t whose size is unfixed (":") is declared; second, a pointer a, in XMP/C, or an allocatable array a, in XMP/Fortran, is aligned with the template; third, the size of the template is fixed with a template_fix construct; finally, the pointer or the allocatable array is allocated with the xmp_malloc() built-in function in XMP/C or the allocate statement in XMP/Fortran, respectively.

Note
The template_fix constructs can be applied to a template only once.
This construct can also be used to fix a mapping array of a template that is distributed in "gblock( * )" at declaration.

task and tasks Construct
The task construct defines a task that is executed by a specified node set. The tasks construct asserts that the task constructs it surrounds can be executed in parallel.
Note that a new node set is generated by each task construct. Let's consider inserting a bcast construct into the task. This bcast construct is executed by the node set specified by the on clause of the task construct. Thus, the node p [1] broadcasts the value of num to p [2] and p [3] in XMP/C, and p(2) to p(3) and p(4) in XMP/Fortran.
The bcast construct in the above code is equivalent to that in the following code, where it is executed by a new node set q that is explicitly declared.
For example, consider inserting xmpc_node_num() or xmp_node_num() into the task in the first program.
Note A new node set should be collectively generated by all of the executing nodes at the point of a task construct unless it is surrounded by a tasks construct. Therefore, in the above example, p[0] in XMP/C and p(1) in XMP/Fortran must process the task construct.

tasks Construct
Let's consider that each of two tasks invokes a function.
XcalableMP C #pragma xmp nodes p [4] #pragma xmp task on p[0:2] call func_a() 5 !$xmp end task !$xmp task on p(3:4) call func_b() !$xmp end task In the above example, the two tasks cannot be executed in parallel because the on clauses must be evaluated by all of the executing nodes (Fig. 20). In such a case, the programmer must specify a tasks construct surrounding the tasks to execute them in parallel (Fig. 21).
XcalableMP C #pragma xmp nodes p [4]  Because the node sets specified by the on clauses of the task constructs surrounded by a tasks construct are disjoint, they can be executed in parallel.

loop Construct
The loop construct is used to parallelize a loop.
XcalableMP The loop directive above specifies that the iteration i of the following loop is executed by the node that owns the template element t[i] or t(i), which is specified in the on clause.
Such a loop must satisfy the following two conditions: 1. There is no data/control dependence among the iterations. In other words, the iterations of the loop can be executed in any order to produce the same result. 2. Elements of distributed arrays, if any, are accessed only by the node(s) that own(s) the elements.
The programs below are examples of a right loop directive and a loop statement. Condition 1 is satisfied because i is the only one index of the distributed array a that is accessed within the loop, and condition 2 is also satisfied because the indices of the template in the on clause of the loop directive are identical to that of the distributed array (Fig. 22).   In this case, conditions 1 and 2 are satisfied and therefore it is possible to parallelize them. In XMP/C, p[0] processes the indices from one to four and p [1] from five to eight. In XMP/Fortran, p(1) processes the indices from two to five and p(2) from six to nine (Fig. 23).
Next, is it possible to parallelize the below loops in which the index of the distributed array is different?
XcalableMP C #pragma xmp loop on t[i] for(int i=1;i<9;i++) a[i+1] = i; XcalableMP Fortran !$xmp loop on t(i) do i=2, 9 a(i+1) = i enddo In this case, condition 1 is satisfied but 2 is not, and therefore it is not possible to parallelize them. In XMP/C, p[0] tries to access a[5] but does not own it. In XMP/Fortran, p(1) tries to access a(6) but does not own it (Fig. 24).

Reduction Computation
The serial programs below are examples of a reduction computation.  If the above loops are parallelized just by adding a loop directive, the value of the variable sum varies from node to node because it is calculated separately on each node (Fig. 25). The value should be reduced to produce the right result.
XcalableMP Then, to correct the error in the above code, add a reduction clause to the loop directive as follows (Fig. 26)  An operator and target variables for reduction computation are specified in a reduction clause. In the above examples, a "+" operator and a target variable sum are specified for the reduction computation to produce a total sum among nodes.
Operations that can be specified as an operator in a reduction clause are limited to the following associative ones.

Parallelizing Nested Loop
Parallelization of nested loops can be specified similarly to a single one, as follows.

array Construct
The array construct is for work mapping of array assignment statements. This construct can also be applied to multi-dimensional arrays.

XcalableMP C #pragma xmp align a[i][j] with t[i][j]
: Note The template appearing in the on clause must have the same shape as the arrays in the following statement. The right-hand side value in this construct must be identical among all nodes because the array construct is a global (i.e. collective) operation.

shadow Directive and reflect Construct
Stencil computation frequently appears in scientific simulation programs, where, to update an array element a[i], its neighboring elements a[i-1] and a[i+1] are referenced. If a[i] is on the boundary region of a block-distributed array on a node, a[i+1] may reside on another (neighboring) node.
Since it involves large overhead to copy a[i+1] from the neighboring node to update each a[i], a technique of copying collectively the elements on the neighboring node to the area added to the distributed array on each node is usually adopted. In XMP, such additional area is called "shadow."

Declaring Shadow
Shadow areas can be declared with the shadow directive. In the example below, an array a has shadow areas of width one on both the lower and upper bounds.
XcalableMP C #pragma xmp nodes p [4]  In the Fig. 27, shaded elements are those that each node owns and white ones are shadow.
Note Arrays distributed in a cyclic manner cannot have shadow.
In some programs, it is natural that the widths of the shadow area on the lower and upper bounds are different. There is also a case where the shadow area exists  (2) only on either of the bounds. In the example below, it is declared that a distributed array a has a shadow area of width one only on the upper bound (Fig. 28).
XcalableMP C #pragma xmp nodes p [4]  The values on the left-and right-hand sides of a colon designate the widths on the lower and upper bounds, respectively.

Updating Shadow
To copy data to shadow areas from neighboring nodes, use the reflect construct. In the example below, the shadow areas of an array a that are of width one on both the upper and lower bounds are updated (Fig. 29).
XcalableMP  Similarly, in XMP/Fortran, node p(2) sends an element a(5) to the shadow area on the upper bound on node p(1) and a(8) to the shadow area on the lower bound on p(3); p(1) sends an element a(4) to the shadow area on the lower bound on p(2), and p(3) sends a(9) to the shadow area on the upper bound on p(2).
The default behavior of a reflect directive is to update the whole of the shadow area declared by the shadow directive. However, there are some cases where a specific part of the shadow area is to be updated to reduce the communication cost at a point of the code.
To update only a specific part of the shadow area, add the width clause to the reflect directive.
The values on the left-and right-hand sides of a colon in the width clause designate the widths on the lower and upper bounds to be updated, respectively. In the example below, only the shadow area on the upper bound is updated (Fig. 30).
XcalableMP C #pragma xmp reflect (a) width(0:1) XcalableMP Fortran !$xmp reflect (a) width(0:1) Note If the widths of the shadow areas to be updated on the upper and lower bounds are equal, that is, for example, width(1:1), you can abbreviate it as width(1).
Note It is not possible to update the shadow area on a particular node because reflect is a collective operation.
The reflect directive does not update either the shadow area on the lower bound on the leading node or that on the upper bound on the last node. However, the values in such areas are needed for stencil computation if periodic boundary conditions are used in the computation.
To update such areas, add a periodic qualifier into the width clause. Let's look at the following example where an array a having shadow areas of width one on both the lower and upper bounds appears (Fig. 31).
XcalableMP C #pragma xmp reflect (a) width(/periodic/1:1) XcalableMP Fortran !$xmp reflect (a) width(/periodic/1:1) The periodic qualifier has the following effects, in addition to that of a normal reflect directive: in XMP/C, node p[0] sends an element a[0] to the shadow area on the upper bound on node p [3], and p [3] sends a[15] to the shadow area on the lower bound on p[0]; in XMP/Fortran, node p(1) sends an element a(1) to the shadow area on the upper bound on node p(4), and p(4) sends a(16) to the shadow area on the lower bound on p(1).
The shadow directive and reflect construct can be applied to arrays distributed in multiple dimensions. The  XcalableMP Fortran !$xmp nodes p(3,3) !$xmp template t(9,9) !$xmp distribute t(block,block) onto p real :: a(9,9) 5 !$xmp align a(j,i) with t(j,i) !$xmp shadow a(1,1) : !$xmp reflect (a) The central node receives data from the surrounding eight nodes to update its shadow areas (Fig. 32). The shadow areas of the other nodes are also updated, which is omitted in the figure.
For some applications, data from ordinal directions are not necessary. In such a case, the data communication from/to the ordinal directions can be avoided by adding the orthogonal clause to a reflect construct (Fig. 33).
XcalableMP C #pragma xmp reflect (a) orthogonal XcalableMP Fortran !$xmp reflect (a) orthogonal Note The orthogonal clause is effective only for arrays more than one dimension of which is distributed. Fig. 32 Example of multi-dimensional shadow (1) Besides, you can also add shadow areas to only specified dimension (Fig. 34).

gmove Construct
The programmers can specify a communication of distributed arrays in the form of assignment statements by using the gmove construct. In other words, with the gmove construct, any array assignment between two arrays (i.e. global data movement) that may involve inter-node communication can be specified.
There are three modes of gmove; "collective mode," "in mode," and "out mode."

Collective Mode
The global data movement involved by a collective gmove is performed collectively, and results in implicit synchronization among the executing nodes.  [3]. Similarly, in XMP/Fortran, p(1) sends b(1)-b(4) to p(3)-p(4), and p(2) sends b(5) to p(4) (Fig. 35).  While array a is distributed in a cyclic manner, array b is distributed in a block manner.
By using this method, the distribution of an array can be "changed" during computation.
XcalableMP C #pragma xmp nodes p [4]   In this example (Fig. 37), the elements of an array b that is distributed in a block manner are copied to the corresponding elements of an array a that is distributed in a generalized-block manner. For the arrays a and b, communication occurs if the corresponding elements reside in different nodes (arrows illustrate communication between nodes in the figures).
In the assignment statement, if a scalar (i.e. one element of an array or a variable) is specified on the right-hand side and an array section is specified on the left-hand side, a broadcast communication occurs for it.
XcalableMP C #pragma xmp nodes p [4]  In this example (Fig. 38), in XMP/C, an array element b[0] of node p[0] will be broadcasted to the specified array section on node p[2] and p [3]. Similarly, in XMP/Fortran, an array element b(1) of node p(1) will be broadcasted to the specified array section on node p(3) and p(4).
Not only distributed arrays but also replicated arrays can be specified on the right-hand side.
XcalableMP C #pragma xmp nodes p [4]   In this example, a replicated array b is locally copied to distributed array a without communication.

In Mode
The right-hand side data of the assignment, all or part of which may reside outside the executing node set, can be transferred from its owner nodes to the executing nodes with an in gmove.
XcalableMP C #pragma xmp nodes p [4] #pragma xmp template t [4]   In this example, the task directive divides four nodes into two sets, the first-half and the second-half. A gmove construct that is in an in mode copies data using a get operation from the second-half node to the first-half node (Fig. 40).

Out Mode
For the left-hand side data of the assignment, all or part of which may reside outside the executing node set, the corresponding elements can be transferred from the executing nodes to its owner nodes with an out gmove construct.  A gmove construct that is in out mode copies data using a put communication from the first-half nodes to the second-half nodes (Fig. 41).

barrier Construct
The barrier construct executes a barrier synchronization.
XcalableMP C #pragma xmp barrier XcalableMP Fortran !$xmp barrier You can specify a node set on which the barrier synchronization is to be performed by using the on clause. In the example below, a barrier synchronization is performed among the first two nodes of p.

reduction Construct
This construct performs a reduction operation. It has the same meaning as the reduction clause of the loop construct, but this construct can be specified anywhere executable constructs can be located (Fig. 42).
XcalableMP C #pragma xmp nodes p [4] : sum = xmpc_node_num() + 1; #pragma xmp reduction (+:sum) on p[2:2] The operators you can use in the reduction construct are as follows: XcalableMP Note In contrast to the reduction clause of the loop construct, which precedes loops, the reduction construct does not accept operators of firstmax, firstmin, lastmax, and lastmin.
Note Similar to the reduction clause, the reduction construct may generate slightly different results in a parallel execution from those in a sequential execution, because the results depend on the order of combining the value.

bcast Construct
The bcast construct broadcasts the values of the variables on the node specified by the from clause, that is, the root node, to the node set specified by the on clause.
If there is no from clause, the first node of the executing node set is selected as the root node. If there is no on clause, the current executing node set of the construct is selected as the executing node set.
In the example below, the first node of the node set p, that is, p[0] or p(1), is the root node (Fig. 44).

wait_async Construct
Communication directives (i.e. reflect, gmove, reduction, bcast, and reduce_shadow) can perform asynchronous communication if the async clause is added. The wait_async construct is used to guarantee the completion of such an asynchronous communication.
XcalableMP C #pragma xmp bcast (num) async (1) : #pragma xmp wait_async (1) XcalableMP Fortran !$xmp bcast (num) async (1) : Since the bcast directive has an async clause, communication may not be completed immediately after the bcast directive. The completion of that communication is guaranteed with the wait_async construct having the same value as that of the async clause. Therefore, between the bcast construct and the wait_async constructs, you may not reference the target variable of the bcast directive.
Hint Asynchronous communication can be overlapped with the following computation to hide its overhead.
Note Expressions that can be specified as tags in the async clause are of type int, in XMP/C, or integer, in XMP/Fortran.

reduce_shadow Construct
The reduce_shadow directive adds the value of a shadow object to the corresponding data object of the array.
XcalableMP  For the above example, in XMP/C, a[3] on p[0] has a value of eight, and a[4] on p[1] has a value of ten. Similarly, in XMP/Fortran, a(4) of p(1) has a value of eight, and a(5) on p(2) has a value of ten (Fig. 47).
The programmers can add the periodic modifier to the width clause to reduce shadow objects to the corresponding data object periodically.

Introduction
The programmer can use coarrays to specify one-sided communication in the localview model.
Depending on the environment, such one-sided communication might achieve better performance than global communication in the global-view model. However, it is more difficult and complicated to write parallel programs in the local-view model because the programmer must specify every detail of parallelization, such as data mapping, work mapping, and communication.
The coarray feature in XMP/Fortran is upward-compatible with that in Fortran 2008; that in XMP/C is defined as an extension to the base language.
An execution entity in local-view XMP programs is referred to as an "image" while a node in global-view ones. These two words have almost the same meaning in XMP. In XMP/C, the programmer declares a coarray by adding ":[ * ]" after the array declaration. In XMP/Fortran, the programmer declares a coarray by adding "[ * ]" after the array declaration.

Coarray Declaration
Note Based on Fortran 2008, coarrays should have the same size among all images.
Coarrays can be accessed in expressions by remote images as well as the local images.

Put Communication
When a coarray appears in the left-hand side of an assignment statement, it involves put communication. The integer in the square bracket specifies the target image index. The image index is zero-based, in XMP/C, or one-based, in XMP/Fortran. xmpc_this_image() in XMP/C and this_image() in XMP/Fortran return the current image index.
In the above example, in XMP/C, an image zero puts b[3:3] to a[0:3] on image one; in XMP/Fortran, an image one puts b(3:5) to a(1:3) on image two. Figure 49 illustrates the put communication performed in the example.

Get Communication
When a coarray appears in the right-hand side of an assignment statement, it involves get communication.
XcalableMP  Figure 50 illustrates the get communication performed in the example.

Hint
As illustrated above, get communication involves an extra step to send a request to the target node. Put communication achieves better performance than get because there is no such extra step.

XcalableMP Fortran sync all
At "sync all," each image waits until all issued one-sided communication is complete and then performs barrier synchronization among the all images. In the above example, the left image puts data to the right image and both nodes invoke sync all. When both nodes return from it, the execution continues to the following statements (Fig. 51).

XcalableMP Fortran sync images (image-set)
Each image in the specified image set waits until all one-sided communication issued is complete, and performs barrier synchronization among the images.

XcalableMP Fortran sync memory
Each image waits until all one-sided communication is complete. This function/statement does not imply barrier synchronization, unlike sync all and sync images, and therefore can be locally executed.

Procedure Interface
Procedure calls in XMP are almost the same as those in the base language. Procedure calls between other languages or to external libraries are also allowed if the base language supports them.
In the example below, a function/subroutine sub1() calls another function/subroutine sub2() with a distributed array x as an argument.
XcalableMP To handle a parameter or dummy argument as a global data in the callee procedure, the programmer need to explicitly distribute it with an align directive (Fig. 52).
If no align directive is specified in the callee procedure for a parameter or dummy argument that is declared as a global data in the caller procedure, it is handled as if it were declared in the callee procedure as a local data on each node, as follows (Fig. 53).

Overview
XMPT is the tool interface of XMP and inspired by OMPT, which is the tool interface of OpenMP [4]. Hence, XMPT is designed as event-based and callbackbased as OMPT; that is, for each event at runtime, the corresponding callback is invoked. One or more XMPT events are defined corresponding to each of XMP constructs and coarray-related actions (e.g. remote write/read and synchronization).
XMPT is preliminarily implemented in the Omni XMP compiler chapter "Implementation and Performance Evaluation of Omni Compiler", and used in MUST [5] and experimentally in Extrae [6]. More details of the application of XMPT in MUST are described in [7].

Initialization
Tool developers can provide the xmpt_initialize function in which they register a callback for each of the XMPT events of interest, as follows. In the above example, the tool developer implements callbacks callback_bcast_begin and callback_bcast_end that interact with his/her tool.
When an XMP program starts execution, the XMP runtime implicitly invokes xmpt_initialize, if provided, to set up the callbacks.

xmpt_event_sync_images_all_end xmpt_event_sync_images_begin xmpt_event_sync_images_end
When one of the XMPT events for which callbacks are registered occurs at runtime, the corresponding callback is invoked by the XMP runtime. For example, if callbacks are registered for events xmpt_event_bcast_begin and xmpt_ event_bcast_end as in the example in the previous section, the callbacks callback_bcast_begin and callback_bcast_end are invoked immediately before and after each of bcast constructs, respectively.
The XMP runtime passes therein all the information about the construct, including the mapping of the target global arrays, to the callback as its parameters. Thus, the tool is able to extract necessary information from the arguments.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.