Reference Work Entry

Encyclopedia of Parallel Computing

pp 1379-1391

OpenSHMEM - Toward a Unified RMA Model

  • Stephen W. Poole, Oak Ridge National Laboratory
  • Oscar Hernandez, Oak Ridge National Laboratory
  • Jeffery A. Kuehn, Oak Ridge National Laboratory
  • Galen M. Shipman, Oak Ridge National Laboratory
  • Anthony Curtis, Computer Science Department, University of Houston
  • Karl Feind, SGI

Definition

OpenSHMEM is a standards-based partitioned global address space (PGAS) one-sided communications library. There is a long-standing and successful family of SHMEM APIs but there is no standard and implementations differ from each other in various subtle ways, hindering acceptance, portability, and in some cases, program correctness. We discuss the differences between SHMEM implementations and contrast SHMEM with other extant libraries supporting RMA semantics to provide motivation for a standards-based OpenSHMEM with the requisite breadth of functionality.

The Message Passing Interface (MPI) [1] is currently the most widely used communication model for large-scale simulation-based scientific parallel applications. Of these applications, a large number rely on two-sided communication mechanisms. Two-sided communication mechanisms require both sides of the exchange (source and destination) to actively participate, such as in MPI_SEND and MPI_RECV. While many algorithms benefit from the coupling of data transfer and synchronization that two-sided mechanisms provide, there exists a substantial number of algorithms that do not benefit from this coupling, and in fact may be hindered by the induced overhead of the synchronization.

The one-sided communication mechanisms are capable of decoupling the data transfer from the synchronization of the communication source and target. Remote memory access (RMA) is a one-sided communication mechanism that allows data to be transferred from one process memory space to another (remote) process memory space. The RMA operation is described entirely by one process (the active side) without the direct intervention of the other process (the passive side). Irregular communication patterns, in which data source and data target are not known a priori (as demonstrated in the GUPs [2] benchmark), often benefit from one-sided communication models such as RMA.

One of the most widely used one-sided communication models is SHMEM which stands for Symmetric Hierarchical MEMory access, so named for providing a view of distributed memory through the use of references to symmetric data storage on each sub-domain of a scalable system. While some sources (incorrectly) identify SHMEM as “SHared MEMory access,” it should be noted that SHMEM is unrelated to AT&T System V Shared Memory as implemented by the shmat, shmctl, shmget, etc. UNIX system calls. SHMEM [3] has a long history as a parallel programming model, having been used extensively on a number of products since 1993, including Cray T3D, Cray X1E, the Cray XT3/4, SGI Origin, SGI Altix, clusters based on the Quadrics interconnect, and to a very limited extent, Infiniband-based clusters.

  • History of SHMEM
    • Cray SHMEM
      • SHMEM first introduced by Cray Research Inc. in 1993 for the Cray T3D
      • Cray is acquired by SGI in 1996
      • Cray is acquired by Tera in 2000 (MTA)
      • Platforms: Cray T3D, T3E, C90, J90, SV1, SV2, X1, X2, XE, XMT, XT
    • SGI SHMEM
      • SGI purchases Cray Research Inc., and SHMEM is integrated into SGI’s Message Passing Toolkit (MPT)
      • SGI currently owns the rights to SHMEM and OpenSHMEM
      • Platforms: Origin, Altix 4700, Altix XE, Altix ICE, Altix UV
      • SGI was purchased by Rackable Systems in 2009
      • SGI and Open Source Software Solutions, Inc. (OSSS) signed a SHMEM trademark licensing agreement in 2010
    • Other Implementations
      • Quadrics (Vega UK, Ltd.), Hewlett Packard, GPSHMEM, IBM, QLogic, Mellanox
      • University of Houston, University of Florida

Despite being supported by a variety of vendors, there is no standard defining the SHMEM memory model or programming interface. Consistencies (where they exist) and extensions across the various implementations have been driven by the needs of an enthusiastic user community. The lack of a SHMEM standard has allowed each implementation to differ in both interface and semantics from vendor to vendor and even product line to product line, which has to this point limited broader acceptance. For an introduction to the SHMEM programming model, please see any of the following: [4, 5], or [6].

The SHMEM API and SHMEM in general encompass the following ideas:
  • Single Program/Multiple Data (SPMD) style

  • Characteristics: One-sided, data passing, RDMA, RMA, PGAS

  • Put, Get, Atomic Updates (AMO) to reference remote memory or memories

  • Remote memory/arrays are symmetric

  • SHMEM decouples data transfer from synchronization

  • Barriers and polling synchronization are used

  • Collectives

  • Explicit control over data transfer

  • Low latency communications

  • Library programming interface to PGAS

  • SHMEM is a suitable under-layer for some PGAS implementations

  • SHMEM is a viable alternative to MPI

In addition to SHMEM, a number of other parallel communication libraries currently support the one-sided communication model via RMA. These RMA APIs exhibit much commonality but also several core differences. In this entry, we will compare the most widely used RMA APIs for high performance computing to the SHMEM model.

The following RMA models will be considered in this entry:
SGI SHMEM:

SGI SHMEM [7] (SGI-SHMEM) has one of the richest sets of RMA operations and is currently supported on the NUMAlink-based and Infiniband-based Altix product lines.

Cray SHMEM - Unicos MP:

Cray SHMEM on the Unicos MP [8] (MP-SHMEM) is very similar to SGI-SHMEM and is currently supported on the X1E supercomputer.

Cray SHMEM - Unicos LC:

Cray SHMEM on the Unicos LC [9] (LC-SHMEM) is a subset of SHMEM on the Unicos MP but lacks a number of key items such as full data-type support. Unicos LC is currently available on the Cray XT 3/4/5 supercomputers.

Quadrics SHMEM:

Quadrics SHMEM [10] (Q-SHMEM) supports most of the communication mechanisms available in Unicos MP but lacks a number of key items that enhance usability.

Cyclops-64 SHMEM:

Cyclops-64 SHMEM [11] (C64-SHMEM) is a SHMEM API which supports the Cyclops-64 architecture. Most of the core features of Cray SHMEM are available with some additional interfaces specific to the Cyclops-64 architecture.

GASNet:

GASNet [12] is a lower level interface with minimal RMA support intended to be used as a communication mechanism for other parallel programming models and environments. As such, it is missing many of the usability features of the SHMEM family of RMA APIs. GASNet also differs from the other communication models due to its unique support for active messages.

ARMCI:

ARMCI [13] is another lower level interface designed to support higher level protocols and programming models such as Global Arrays.

MPI-2:

MPI-2 [14] provides a one-sided interface that has not been widely adopted for a number of reasons [15] but was intended to provide a portable alternative to SHMEM and other one-sided communication models.

In order to effectively compare these RMA APIs, we will define the following areas of functionality:
Symmetric Objects:

Symmetric objects provide a mechanism to address remote variables using the address of the corresponding local variable.

Remote Write:

Put operations across a number of primitive data-types as well as contiguous, strided, or indexed memory locations

Remote Read:

Get operations across a number of primitive data-types as well as contiguous, strided, or indexed memory locations

Synchronization:
Two basic types:
  1. Task Synchronization – synchronization operations between tasks, such as barriers and locks
  2. Memory Synchronization – memory polling, and ordering of remote and local memory reads and writes
Atomic Memory Operations:

Compare/Mask Swap, fetch-and-add, increment

Reductions:

Reductions across a number of primitive data-types and reduction operations such as and, or, min, and max

Collective Communication:

Broadcast, collect, and fcollect operations

Symmetric Objects

In SGI-SHMEM, MP-SHMEM, LC-SHMEM, and Q-SHMEM some data objects are said to be “symmetric” across multiple PEs. For a single-executable program launched on multiple PEs, all statically allocated objects are symmetric across all PEs. Additionally, objects allocated dynamically via symmetric allocation (by calling SHMEM’s shmalloc() collectively) are also symmetric across all PEs. These symmetric data objects greatly simplify RMA operations by allowing the active side to initiate a remote operation on the passive side by specifying the address of the local symmetric data object. The RMA operation proceeds as if the address specified were the address of the passive side’s symmetric data object. In the absence of symmetric data objects, the address of the passive side’s data object must be communicated to the active side for use in the RMA operation. In SHMEM, such objects are known as asymmetric objects. Asymmetric objects include stack variables as well as objects dynamically allocated by nonsymmetric means (e.g., malloc()). Care must be taken in referencing asymmetric objects, as the local (active side) address calculation will likely not result in the correct address on the remote (passive) side, though some implementations will not fail if asymmetric addresses strictly refer to local (active side) objects, e.g., the source for a PUT or the target for a GET. Remote access to asymmetric objects is an extension that is not supported by SGI’s SHMEM or OpenSHMEM V1.0.
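The idea that one local address names the corresponding object on every PE can be sketched in a toy simulation (plain Python, not the SHMEM C API; all class and method names here are illustrative):

```python
# Toy model: every PE owns a private address space, and a "symmetric"
# object exists under the same name in each of them, so the active side
# can address remote data simply by naming its local copy.
class PE:
    def __init__(self, pe_id):
        self.id = pe_id
        self.mem = {}  # name -> list of values; one address space per PE

class Machine:
    def __init__(self, npes):
        self.pes = [PE(i) for i in range(npes)]

    def declare_symmetric(self, name, nelems):
        # like a static array or shmalloc(): allocated on every PE at once
        for pe in self.pes:
            pe.mem[name] = [0] * nelems

    def put(self, target_name, source_values, dest_pe):
        # the active side names its *local* symmetric object; the runtime
        # resolves the same name on the passive side without its involvement
        self.pes[dest_pe].mem[target_name][: len(source_values)] = source_values

m = Machine(4)
m.declare_symmetric("target", 4)
for dest in range(1, 4):  # PE 0 pushes data to every other PE
    m.put("target", [10, 11, 12, 13], dest)
```

An asymmetric object, by contrast, would exist under a name known only to one PE, so the active side could not resolve it remotely without first receiving its address.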

Symmetric data objects are not supported by the other RMA APIs. GASNet requires that the active side specify the address of the data object on the remote side. ARMCI provides a collective memory allocation mechanism that allows each process to allocate memory locally and share the memory location with its peers. While convenient, this is no different from allocating memory locally and gathering the addresses from other PEs. SHMEM provides similar functionality via shmalloc with one difference: the memory allocated by shmalloc is symmetric (the symmetric heap). This allows RMA operations on remote memory by simply specifying addresses within the local symmetric memory. All SHMEM implementations discussed in this entry provide symmetric memory allocation. MPI-2 uses a hybrid approach by requiring processes to participate in the creation of a “window” for RMA operations. The active side can then specify the target’s location for the RMA operation via a displacement in the window.

It is important to note that symmetric data objects require either special hardware, compiler, or library support. The use of the address of a local data object in an RMA operation to a remote PE indicates that the data object is symmetric. A (pre-)compiler can detect this usage and take appropriate action to ensure that the remote data object is properly addressed. SGI-SHMEM, MP-SHMEM, LC-SHMEM, and Q-SHMEM support symmetric data objects, asymmetric data objects, and symmetric memory allocation. C64-SHMEM supports only asymmetric data objects and symmetric memory allocation.

Code listing 1 (a slightly modified example from [8]) illustrates the use of symmetric data objects. Note the declaration of the “static short target” array and its use as the remote destination in shmem_short_put. The use of the “static” keyword results in the target array being symmetric on all PEs in the program. Each PE is able to transfer data to the target array by simply specifying the local address of the symmetric data object which is to receive the data. This aids programmability, as the address of target on PEs other than 0 need not be exchanged with the active side (PE 0) prior to the RMA operation – an important reduction in network data-flow, particularly for large systems with irregular data flows. Conversely, the declaration of the “short source” array is asymmetric. Because the put handles the references to the source array only on the active (local) side, the asymmetric source object is handled correctly. Note that C64-SHMEM does not support this mechanism and relies on the use of symmetric memory allocation. Other code examples are included in the examples section.
OpenSHMEM - Toward a Unified RMA Model. Fig. 1

RMA operation using a symmetric data object

OpenSHMEM - Toward a Unified RMA Model. Table 1

Symmetric data objects and memory allocation

| Type | SGI | MP | LC | Q | C64 | GASNet | ARMCI | MPI-2 |
|------|-----|----|----|---|-----|--------|-------|-------|
| Symmetric Data Objects | Yes | Yes | Yes | Yes | No | No | No | No |
| Symmetric Memory Allocation | Yes | Yes | Yes | Yes | Yes | No | No | No |

 

Remote Write/Read

Remote Read/Write operations are the basic building blocks of any RMA API. Generally, these are referred to as put/get operations, and in the most basic form allow one processing element (PE) to transfer data from a local memory location to a remote memory location belonging to another PE, or vice versa.

Remote Write:

In a Remote Write operation (PUT), the initiating (active side) PE is the source and the remote (passive side) PE is the target. The active side PE specifies the local (active side) PE’s (source) memory from which the data will be sent, and the remote (passive side) PE’s (target) memory to which the data will be written. MP-SHMEM, LC-SHMEM, Q-SHMEM, and C64-SHMEM all provide the semantic that after shmem_put returns, the data specified for transfer has been buffered; this local completion on the active side does not equate to remote completion of the put on the passive side. LC-SHMEM, Q-SHMEM, and C64-SHMEM also provide a non-blocking PUT operation, shmem_put_nb, which may return before the data is buffered for transfer. Local completion of non-blocking RMA writes is guaranteed either when shmem_test_nb returns success or when shmem_wait_nb returns. Unlike MPI requests, it is invalid to wait on a non-blocking operation after test returns success.
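The distinction between local completion (source buffer reusable) and remote completion (data visible at the target) can be sketched in a small simulation; this is an illustrative model in Python, not the SHMEM C API, and the class name is hypothetical:

```python
# Sketch of the blocking-put guarantee: the put may return once the data
# is buffered (local completion); remote completion is only forced by a
# later synchronization such as quiet or a barrier.
class PutModel:
    def __init__(self):
        self.remote = {}     # simulated passive-side memory
        self.in_flight = []  # puts buffered but not yet delivered

    def shmem_put(self, name, values):
        # copy now, so the caller may immediately reuse the source buffer
        self.in_flight.append((name, list(values)))

    def shmem_quiet(self):
        # force remote completion of everything buffered so far
        for name, values in self.in_flight:
            self.remote[name] = values
        self.in_flight.clear()

m = PutModel()
src = [1, 2, 3]
m.shmem_put("x", src)
src[0] = 99                             # safe: the put already copied the data
visible_before_quiet = "x" in m.remote  # False: only locally complete so far
m.shmem_quiet()                         # now the data is remotely visible
```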

Remote Read:

In a Remote Read operation (GET), the initiating (active side) PE is the target and the remote (passive side) PE is the source. The active side PE specifies the remote (passive side) PE’s (source) memory from which the data will be retrieved, and the local (active side) PE’s (target) memory to which the data will be written. In MP-SHMEM, LC-SHMEM, Q-SHMEM, and C64-SHMEM, shmem_get returns after data has been delivered to the initiator PE’s memory. LC-SHMEM, Q-SHMEM, and C64-SHMEM provide a non-blocking GET operation, shmem_get_nb, which may return before the data specified for transfer has been delivered to the active side. Local completion of non-blocking RMA Read operations is guaranteed either when shmem_test_nb (or shmem_poll_nb on C64-SHMEM) returns success or when shmem_wait_nb returns.

Table 2 details the primitive data-types supported by Remote Read/Write. Not surprisingly, ARMCI and GASNet only support byte level transfers as they are lower level APIs meant to provide a basis for higher level languages or APIs. The 16 bit, 4 byte, etc., “data-types” are simply syntactic sugar for the generalized byte level data-type supported by all APIs. None of the SHMEM libraries directly support unsigned integers or higher precision types beyond long long and double. Unsigned types can be used in put/get operations by using the appropriate byte size transfer at the cost of convenience.
OpenSHMEM - Toward a Unified RMA Model. Table 2

RMA PUT/GET data-type support

| Type | SGI | MP | LC | Q | C64 | GASNet | ARMCI | MPI-2 |
|------|-----|----|----|---|-----|--------|-------|-------|
| byte | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| short | Yes | Yes | Yes | Yes | No | No | No | Yes |
| int | Yes | Yes | Yes | Yes | No | No | No | Yes |
| long | Yes | Yes | Yes | Yes | Yes | No | No | Yes |
| long long | Yes | Yes | Yes | Yes | Yes | No | No | No |
| float | Yes | Yes | Yes | Yes | No | No | No | Yes |
| double | Yes | Yes | Yes | Yes | Yes | No | No | Yes |
| long double | Yes | Yes | No | Yes | No | No | No | Yes |
| integer (Fortran) | Yes | Yes | Yes | No | No | No | No | Yes |
| logical (Fortran) | Yes | Yes | Yes | No | No | No | No | Yes |
| real (Fortran) | Yes | Yes | Yes | No | No | No | No | Yes |
| complex (Fortran) | Yes | Yes | Yes | No | No | No | No | Yes |
| 16 bits | No | Yes | Yes | No | No | No | No | No |
| 4 bytes | Yes | Yes | Yes | Yes | No | Yes | Yes | No |
| 32 bits | Yes | Yes | Yes | Yes | No | No | No | No |
| 8 bytes | Yes | Yes | Yes | Yes | No | No | No | No |
| 64 bits | Yes | Yes | Yes | Yes | No | No | No | No |
| 128 bits | Yes | Yes | Yes | Yes | No | No | No | No |

 
OpenSHMEM - Toward a Unified RMA Model. Table 3

RMA PUT/GET access methods

| Type | SGI | MP | LC | Q | C64 | GASNet | ARMCI | MPI-2 |
|------|-----|----|----|---|-----|--------|-------|-------|
| Contiguous | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Indexed | No | Yes | Yes | No | No | No | No | Yes |
| Strided | Yes | Yes | Yes | Yes | No | No | Yes | Yes |
| Generalized I/O Vectors | No | No | No | No | No | No | Yes | Yes |
| Arbitrary Discontinuous | No | No | No | No | No | No | No | Unknown |

 
In addition to data-type support, these RMA APIs offer different access patterns for PUT/GET operations.
Contiguous Memory Access:

Contiguous memory is specified by the active side as both the source and destination for a PUT or GET operation.

Strided:

Strided access allows transfers of elements of a memory region with a stride between them. This is a compact representation suitable for regular data layouts such as multi-dimensional arrays.

Indexed:

This access pattern provides more flexibility than strided access as there is no requirement for a constant stride. This added flexibility requires that an index array be used to describe the elements within the source and target arrays, and is therefore less compact than the strided data description.

Generalized I/O Vectors:

Generalized I/O vectors allow transfers to/from discontinuous regions that are made up of contiguous elements of like length. An array of pointers is specified for both the initiator and the target. Each entry in the array specifies the starting address of a contiguous memory region of length K. Both arrays are of length N.

Arbitrary Discontiguous:

This access mode allows transfers to/from discontinuous regions with minimal restrictions. Contiguous elements within the discontinuous regions may be of any byte length. The total byte length of all contiguous elements must be equal at both the initiator and the target. This access mode uses a pair of arrays to describe the memory locations and the length of each element at the memory location. Four arrays are therefore necessary to describe the memory layout for both the initiator and the target.

Note that MPI-2 supports Indexed, Strided, and Generalized I/O Vectors through the use of derived data-types.
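The compactness trade-off between the strided and indexed descriptions above can be illustrated with a small sketch (hypothetical Python helpers, not SHMEM calls): a strided descriptor is just a (start, stride, count) triple, while an indexed description enumerates every element.

```python
# A strided descriptor (start, stride, count) is a compact special case
# of the general index-array description used by indexed access.
def strided_indices(start, stride, nelems):
    return [start + i * stride for i in range(nelems)]

def gather(array, indices):
    # what an indexed GET would fetch from the source array
    return [array[i] for i in indices]

row_major = list(range(12))  # a 3x4 matrix flattened in row-major order
# one column of the matrix is a strided selection with stride 4
column_1 = gather(row_major, strided_indices(1, 4, 3))
```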

Synchronization

While SHMEM provides a barrier semantic which intertwines synchronization with remote completion, it also provides three somewhat more subtle tools for enforcing ordering semantics on the sequence of memory updates arriving at a PE: fence, quiet, and wait.

Fence :

SHMEM Fence ensures ordering of PUT operations to a specific PE. In SGI-SHMEM, MP-SHMEM, and LC-SHMEM, the fence operation forces ordering of the completion of PUT operations on a remote PE. Specifically, while the order of posting of PUT operations on the remote PE is not generally guaranteed, the fence does provide a guarantee that on the remote (passive) side the PUTs issued from a particular source PE before the fence will be completed before the PUTs issued by that source PE after the fence are completed. Note that this does not guarantee that the remote (passive) side will have completed the PUTs when the fence operation returns on the active side. Q-SHMEM and C64-SHMEM support shmem_fence via shmem_quiet and add the additional semantic that all outstanding PUT operations will be delivered to the target PE prior to return from the function. Figure 2 illustrates the semantics of shmem_fence.

ARMCI provides a fence operation similar to that of Q-SHMEM in that it blocks until all PUTs are delivered to the remote PE. GASNet provides blocking and non-blocking versions of PUT. A blocking PUT will not return until data is delivered to the remote PE. A wait on a non-blocking PUT will not return until data is delivered to the remote PE. Fence (and Quiet) are therefore not necessary in GASNet, as both are implicit in the semantics of blocking and non-blocking PUT operations. The GASNet blocking interface combines the semantics of RMA initialization, ordering, and remote completion. Applications that do not need the additional semantics may incur additional overhead. GASNet’s non-blocking PUT operations separate RMA initialization from ordering and remote completion, but fail to separate ordering from remote completion.

MPI-2 specifies an MPI_WIN_FENCE operation, but it differs dramatically from fence in SHMEM. MPI_WIN_FENCE is a collective operation over the group of the window, requiring both the active and passive sides to participate, whereas SHMEM fence is called only on the active side. In addition, MPI_WIN_FENCE ensures that all outstanding RMA calls on a window (regardless of origin) are synchronized. Fence in SHMEM may be thought of strictly as an ordering semantic and not as a PE synchronization semantic as in MPI. Figure 3 illustrates the semantics of MPI_Win_fence.
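The ordering-without-completion guarantee of fence can be modeled as epochs of puts: puts within an epoch may land in any order, but a fence guarantees everything before it lands before anything after it. This is an illustrative Python simulation (the class and method names are hypothetical, not SHMEM API):

```python
import random

# Model of fence ordering: each fence starts a new "epoch"; puts within
# an epoch may be delivered in any order, but epochs complete in order.
class OrderedDelivery:
    def __init__(self):
        self.epochs = [[]]

    def put(self, name, value):
        self.epochs[-1].append((name, value))

    def fence(self):
        self.epochs.append([])  # ordering point: no completion implied

    def delivery_order(self, rng):
        log = []
        for epoch in self.epochs:   # epochs are delivered in order...
            batch = epoch[:]
            rng.shuffle(batch)      # ...but puts inside one may race
            log.extend(batch)
        return log

n = OrderedDelivery()
n.put("data", 41)  # payload first
n.fence()          # guarantee: payload lands before the flag below
n.put("flag", 1)
```

However the network reorders puts within an epoch, the payload is always visible at the target before the flag, which is exactly the property a put-payload/put-flag protocol needs.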

Quiet :

SHMEM quiet ensures ordering of PUT operations to all PEs. In SGI-SHMEM, MP-SHMEM, and LC-SHMEM, the quiet operation ensures the order of PUT operations. All incoming PUT operations issued by any PE and local load and store operations started prior to the quiet call are guaranteed to be complete at their targets and visible to all other PEs prior to any remote or local access to the corresponding target’s memory or any synchronization operations that follow the call to shmem_quiet. Q-SHMEM and C64-SHMEM extend the semantic in that data is delivered at the remote PE at the return of shmem_quiet. This additional semantic intertwines remote completion and ordering which may impose unnecessary overhead to applications that only require the ordering semantic and not remote completion. Figure 4 illustrates the semantics of shmem_quiet.

ARMCI provides Quiet through a fence_all operation similar to Q-SHMEM and C64-SHMEM in that data is guaranteed to be delivered at the remote PEs when fence_all returns. Again, the ordering semantic is intertwined with remote completion.

Barrier:

The SHMEM barrier is a collective synchronization routine in which no PE may leave the barrier prior to all PEs entering the barrier. SGI-SHMEM, MP-SHMEM, LC-SHMEM, and Q-SHMEM additionally require the semantic that all outstanding PUT operations are remotely visible prior to the return of the barrier operation. This additional semantic intertwines PE synchronization and remote completion of outstanding write operations. C64-SHMEM does not add this additional semantic.

Wait:

SHMEM wait waits for a memory location or variable on the local PE to change. This allows synchronization between two PEs. SGI-SHMEM, MP-SHMEM, LC-SHMEM, Q-SHMEM, and C64-SHMEM all support shmem_wait; the other RMA implementations do not provide a similar semantic.
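The classic put-then-wait handshake can be sketched with two threads standing in for two PEs (an illustrative Python model, not the SHMEM API): the passive side polls a local flag, which the active side updates only after the payload, so observing the flag implies the payload has arrived.

```python
import threading
import time

# Shared dict stands in for the passive PE's symmetric memory.
mem = {"payload": None, "flag": 0}

def active_side():
    mem["payload"] = 42  # "put" the payload first
    mem["flag"] = 1      # then "put" the flag (a fence would enforce this order)

def passive_side(out):
    while mem["flag"] == 0:   # shmem_wait-style polling on local memory
        time.sleep(0.001)
    out.append(mem["payload"])  # safe to read once the flag has changed

out = []
t = threading.Thread(target=passive_side, args=(out,))
t.start()
active_side()
t.join()
```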

OpenSHMEM - Toward a Unified RMA Model. Fig. 2

Example of fence (a) SGI-, MP-, and LC-SHMEM (b) Q- and C64-SHMEM

OpenSHMEM - Toward a Unified RMA Model. Fig. 3

Example of MPI_Win_fence

OpenSHMEM - Toward a Unified RMA Model. Fig. 4

Example of quiet (a) MP-SHMEM, LC-SHMEM (b) Q-SHMEM

A possible improvement in these semantics would involve separating RMA initialization, RMA ordering, remote completion, and PE synchronization.

OpenSHMEM - Toward a Unified RMA Model. Table 4

Synchronization methods

| Synchronization | SGI | MP | LC | Q | C64 | GASNet | ARMCI | MPI-2 |
|-----------------|-----|----|----|---|-----|--------|-------|-------|
| Fence | Yes | Yes | Yes | Yes | Yes | No | Yes | No |
| Quiet | Yes | Yes | Yes | Yes | Yes | No | Yes | No |
| Barrier | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Wait | Yes | Yes | Yes | Yes | Yes | No | No | No |

 
OpenSHMEM - Toward a Unified RMA Model. Table 5

Atomic operations

| Atomic Op | SGI | MP | LC | Q | C64 | GASNet | ARMCI | MPI-2 |
|-----------|-----|----|----|---|-----|--------|-------|-------|
| Swap | Yes | Yes | Yes | Yes | Yes | No | Yes | No |
| Compare and Swap | Yes | Yes | Yes | Yes | Yes | No | No | No |
| Mask and Swap | No | Yes | No | Yes | Yes | No | No | No |
| Fetch and Add | Yes | Yes | Yes | Yes | Yes | No | Yes | No |
| Fetch and Increment | Yes | Yes | Yes | No | Yes | No | No | No |

 
OpenSHMEM - Toward a Unified RMA Model. Table 6

Reductions

| Reduction | SGI | MP | LC | Q | C64 | GASNet | ARMCI | MPI-2 |
|-----------|-----|----|----|---|-----|--------|-------|-------|
| And | Yes | Yes | Yes | Yes | Yes | No | No | Yes |
| Max | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Min | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Or | Yes | Yes | Yes | Yes | Yes | No | No | Yes |
| Product | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Sum | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Xor | Yes | Yes | Yes | Yes | Yes | No | No | Yes |

 
OpenSHMEM - Toward a Unified RMA Model. Table 7

Collectives

| Collective | SGI | MP | LC | Q | C64 | GASNet | ARMCI | MPI-2 |
|------------|-----|----|----|---|-----|--------|-------|-------|
| Broadcast | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Collect | Yes | Yes | Yes | Yes | No | No | No | Yes |
| fcollect | Yes | Yes | Yes | Yes | Yes | No | No | Yes |

 

Atomic Memory Operations

An atomic memory operation (AMO) provides the capability to perform simple operations on the passive side of the RMA operation. The SHMEM model provides five common AMO capabilities.

Swap:

Writes the value specified by the initiator PE to the target PE location and returns the target value prior to the update.

Compare and Swap:

Compares the conditional value specified by the initiator PE to the target PE location, and if they are equal, it writes the value specified by the initiator PE to the target PE location. The return value is the target PE location’s value prior to update.

Mask and Swap:

Writes the value specified by the initiator PE to the target PE location updating only the bits in the target location as specified in the mask value. The return value is the target PE location’s value prior to update.

Fetch and Add:

Atomically adds the value specified by the initiator PE to the target PE location’s value. The return value is the target PE location’s value prior to the update.

Fetch and Increment:

Special case of Fetch and Add with implicit add value of 1.
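The defining property shared by these five AMOs is that each returns the target's value *before* the update. Their single-location semantics can be sketched in Python (illustrative function names, not the SHMEM C API, which performs these updates on a remote PE's memory):

```python
# Each function atomically (here, trivially, in one thread) updates
# mem[key] and returns the value the location held before the update.
def swap(mem, key, value):
    old, mem[key] = mem[key], value
    return old

def compare_swap(mem, key, cond, value):
    old = mem[key]
    if old == cond:  # write only if the current value matches cond
        mem[key] = value
    return old

def mask_swap(mem, key, mask, value):
    old = mem[key]
    mem[key] = (old & ~mask) | (value & mask)  # update only the masked bits
    return old

def fetch_add(mem, key, value):
    old = mem[key]
    mem[key] = old + value
    return old

def fetch_inc(mem, key):
    return fetch_add(mem, key, 1)  # special case: implicit add value of 1

mem = {"ctr": 5, "bits": 0b1010}
```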

Extended Atomic Operation Support:

Cyclops-64 includes a general-purpose atomic operation: long shmem_atomic_op(long *target, long value, long pe, atomic_t op), which performs the atomic operation specified by op on the specified target. The number of atomic operations supported is quite large and corresponds to the Cyclops-64 intrinsic operations.

OpenSHMEM - Toward a Unified RMA Model. Fig. 5

Trivial hello world in SHMEM

OpenSHMEM - Toward a Unified RMA Model. Fig. 6

Trivial hello world in UPC

OpenSHMEM - Toward a Unified RMA Model. Fig. 7

Trivial hello world in Message Passing Interface (MPI)

A more complete list of AMOs, along with their definitions (taken from [10]), is shown in the OpenSHMEM section of this entry.
OpenSHMEM - Toward a Unified RMA Model. Fig. 8

Circular shift in SHMEM

Rather than providing atomics as defined above, MPI-2 provides MPI_ACCUMULATE. While MPI_ACCUMULATE provides atomic updates of local and remote variables, the value of the variable prior to the update is not returned to the caller. Atomic swap (and variants) are not supported by MPI-2. The following operations are available for MPI_ACCUMULATE:
  1. Maximum
  2. Minimum
  3. Sum
  4. Product
  5. Logical and
  6. Bit-wise and
  7. Logical or
  8. Bit-wise or
  9. Logical xor
  10. Bit-wise xor

Reductions

Reductions perform an associative binary operation across a set of values on multiple PEs. MP-SHMEM, LC-SHMEM, and Q-SHMEM all provide the same reduction operations differing only in the supported data-types as detailed previously. GASNet provides no reduction operations. ARMCI supports some of the reduction operations of SHMEM. MPI-2 supports all the SHMEM reduction operations as well as reductions based on logical operations.
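The reduction semantics described above, one contribution per PE combined by an associative operation, with the result deposited on every PE in the style of SHMEM's *_to_all routines, can be sketched as follows (an illustrative Python model; the function name is hypothetical):

```python
from functools import reduce

# Combine one value per PE with an associative binary op, and give every
# PE a copy of the result (to_all-style semantics, simulated).
def reduce_to_all(op, per_pe_values):
    result = reduce(op, per_pe_values)
    return [result] * len(per_pe_values)  # each PE receives the same result

sums = reduce_to_all(lambda a, b: a + b, [1, 2, 3, 4])  # one value per PE
maxes = reduce_to_all(max, [7, 3, 9, 1])
```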

Collective Communication

Collective communication is uniformly supported across SGI-SHMEM, MP-SHMEM, LC-SHMEM, and Q-SHMEM. C64-SHMEM and other communication libraries differ in their support for these (or similarly defined) operations.
OpenSHMEM - Toward a Unified RMA Model. Fig. 9

Circular shift in MPI

OpenSHMEM - Toward a Unified RMA Model. Fig. 10

Circular shift in UPC

Broadcast:

Transfers data from the root PE to the active set of PEs. The active set of PEs is specified using a starting PE and a \(\log_2\) stride between consecutive PEs.
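The (starting PE, log2 stride, set size) description of an active set expands to PEs at power-of-two spacing; a one-line sketch (hypothetical Python helper, not a SHMEM call):

```python
# Members of the active set are pe_start + i * 2**log_pe_stride
# for i in 0 .. pe_size-1.
def active_set(pe_start, log_pe_stride, pe_size):
    return [pe_start + i * (1 << log_pe_stride) for i in range(pe_size)]

members = active_set(2, 1, 4)  # stride 2**1 = 2, giving PEs 2, 4, 6, 8
```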

Collect:

Gathers data from the source array into the target array across all PEs in the active set. Each PE contributes its source array, which is concatenated into the target array on all PEs in the active set in ascending PE order. Source arrays can differ in size from PE to PE. MPI-2 provides MPI_Allgatherv, which provides similar semantics to Collect.

fcollect:

A special case of Collect in which all source arrays are of the same size, which allows for an optimized implementation. MPI-2 provides MPI_Allgather, which provides similar semantics to fcollect.
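The concatenation-in-PE-order semantics of Collect can be sketched in a few lines (an illustrative Python model; with equally sized contributions the same sketch describes fcollect):

```python
# Concatenate each PE's contribution in ascending PE order; conceptually,
# every PE in the active set ends up holding the same target array.
def collect(sources):
    # sources[pe] holds that PE's source array; sizes may differ
    target = []
    for contribution in sources:
        target.extend(contribution)
    return target

result = collect([[0], [10, 11], [20, 21, 22]])  # unequal sizes: Collect
```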

Tool Support

Few performance tools have been developed specifically for SHMEM and partitioned global address space languages. The challenge arises because traditional two-sided communication analyses do not apply to one-sided accesses to “symmetric,” or shared, variables, since the library calls are decoupled from point-to-point synchronization. An event-based performance analysis tool [16] was developed to support both SHMEM and UPC, using a wrapper-function approach to gather SHMEM events. Tools such as CrayPat, TAU, and SCALASCA use similar approaches to profile and trace SHMEM calls. However, the wrapper-function approach might not work well for future SHMEM-aware compilers, since some might inline and replace SHMEM calls with direct shared memory accesses on certain architectures. GASP [17] has been used as an alternative mechanism to gather performance events for GASNet, where instrumentation is implicit in the communication library API. None of these approaches has been sanctioned for standardization or officially adopted for SHMEM; however, work is under way.

Commercially available debuggers [18, 19] and compile-time static checkers [20, 21] have limited support for SHMEM and UPC, let alone tools that optimize them. In the case of OpenSHMEM, the semantics of the library allow compilers or static checkers to report incorrect use of SHMEM library calls. For example, compiler-based tools can easily check whether arguments to the data access calls meet the “symmetric” storage requirement, check whether variables are accessed within bounds, and detect incorrect type usage in shmem_put32/64 calls. Such semantic-checker tools can be invoked as preprocessors to help the user reduce the number of errors detected at runtime. Research is needed to find proper ways to optimize SHMEM in compilers. Traditional data-flow frameworks such as SSA (static single assignment) have been extended in tools for MPI to handle two-sided communication, and a few handle one-sided communication, but none target OpenSHMEM programs specifically; these technologies have yet to be introduced in production compilers. The OpenSHMEM community needs to promote the development of tools that facilitate the use of SHMEM programs, as few such tools exist today.

Toward an OpenSHMEM Standard

When comparing the various implementations of SHMEM and other RMA APIs, we see that semantics vary dramatically. Not only do SHMEM RMA semantics differ from those of other RMA implementations (as expected), but different SHMEM implementations also differ from each other. Support for primitive data types varies. Discontiguous RMA operations and atomic RMA operations are not uniformly supported. Synchronization and completion semantics differ substantially, which can cause a program that is valid on one architecture to be completely invalid on another. Symmetric data objects, which dramatically aid the programmer, are unique to the SHMEM model but are not uniformly supported. A number of capabilities are available in implementations that diverge from the SGI SHMEM baseline. Below is a brief list of potential future additions to OpenSHMEM:

  1. Support for multiple host channel adapters (rails)
  2. Support for non-blocking transfers
  3. Support for events
  4. Additional atomic memory operations and collectives
  5. Potential options for NUMA/hybrid architectures
  6. Additional communication transport mechanisms
  7. OpenSHMEM I/O library enhancements
  8. OpenSHMEM tools and compiler enhancements
  9. Enabling exascale applications

An OpenSHMEM standard can address the lack of uniformity across the available SHMEM implementations. Standardization levels can be established to provide a base level that all SHMEM implementations must support in order to meet the standard, with higher levels available for additional functionality and platform specific features.

In 2008, an initial dialogue was started between SGI and a small not-for-profit company called Open Source Software Solutions, Inc. (OSSS). The purpose of the dialogue was to determine whether an agreement could be reached for SGI’s approval of an open-source version of SHMEM that would serve as an umbrella standard under which to unify all of the existing implementations into one cohesive API. As part of this discussion, OSSS held a birds-of-a-feather (BOF) session on OpenSHMEM at SC08. The BOF was well attended by the vendors interested in supporting OpenSHMEM/SHMEM and by many interested parties in the SHMEM user community. The unanimous opinion of the attendees favored continuing the process toward developing OpenSHMEM as a community standard. The final agreement between SGI and OSSS was signed in 2010; it allows OSSS to use the name OpenSHMEM and to directly reference SHMEM going forward. The base version, OpenSHMEM V1.0, is based on the SGI man pages and preserves the original “look and feel” of SGI SHMEM. OpenSHMEM version 1.0 has been released, and input for version 2.0 is being actively solicited.

There are a number of enhancements under discussion for version 2.0 and future releases. Some of the features specific to one implementation or another, as listed in the preceding tables, will be incorporated into later versions of OpenSHMEM. OpenSHMEM will be supported on a variety of commodity networks (including InfiniBand from both Mellanox and QLogic) as well as several proprietary networks (Cray, SGI, HP, and IBM). The OpenSHMEM mail reflector is hosted at ORNL and can be joined by sending a request via the OpenSHMEM list server at https://email.ornl.gov/mailman/listinfo/openshmem. Future enhancements and RFIs will be sent to developers and other interested parties via this mechanism. Source code examples, validation and verification suites, performance analyses, and OpenSHMEM compliance information will be hosted at the OpenSHMEM Web site at http://www.openshmem.org, and the OpenSHMEM standard will be owned and maintained by OSSS.

With the inexorable march toward exascale computing, programming methodologies such as OpenSHMEM will certainly find their place in enabling extreme-scale architectures. By decoupling data motion from synchronization, OpenSHMEM exposes the potential for synergistic application improvements: it scales more readily than two-sided models and minimizes data motion, affording the possibility of concomitant savings in application power consumption. These gains, along with portability, programmability, productivity, and adoption, will secure OpenSHMEM a place on future extreme-scale systems.

Acknowledgments

This work was supported by the United States Department of Defense and used resources of the Extreme Scale Systems Center located at Oak Ridge National Laboratory.

Copyright information

© Springer Science+Business Media, LLC 2011