
1 Introduction

Shared memory, which enables processes to access the same physical memory regions, is popular for implementing intra-node communication. The best-known shared memory implementation is POSIX shmem [13], which requires allocating a new memory region so that other processes can gain access to it. When implementing MPI inter-process communication, POSIX shmem requires two memory copies: one from the send buffer to a shared memory region and another from the shared memory region to the receive buffer. XPMEM [9], which was first introduced on SGI supercomputers, enables other processes to access existing memory regions. In XPMEM, the owner process of a memory region first exposes it, and the process wishing to access it then attaches the exposed region. If the memory region in which the send buffer resides is accessible from the receiver process, a single memory copy suffices to transfer data from the send buffer to the receive buffer. Thus, XPMEM is used in many MPI implementations to make intra-node communication more efficient than that of POSIX shmem. Additionally, the XPMEM communication style enables efficient implementation of one-sided communication: once the addresses of the origin and target buffers are known, one-sided communications can take place without any intervention by the other process. XPMEM's big disadvantage, however, is the overhead of the attach operation. Consequently, most MPI implementations utilizing XPMEM make use of an XPMEM cache that reduces this overhead by caching attached memory regions [7].
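To make the two-copy path concrete, the following is a minimal sketch (not the code of any particular MPI library) of a message traveling through a POSIX shmem region; the region name, the fixed size, and the omitted synchronization are hypothetical simplifications.

```c
/* Minimal sketch: the two copies incurred when a message travels
 * through a POSIX shmem region.  Error handling and synchronization
 * are omitted; the region name and size are arbitrary. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/mpi_eager_buf"   /* hypothetical region name */
#define SHM_SIZE 4096

/* sender side */
void send_via_shmem(const void *sendbuf, size_t len)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SHM_SIZE);
    void *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(shm, sendbuf, len);       /* copy #1: send buffer -> shared region */
    /* ... notify the receiver ... */
}

/* receiver side */
void recv_via_shmem(void *recvbuf, size_t len)
{
    int fd = shm_open(SHM_NAME, O_RDWR, 0600);
    void *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(recvbuf, shm, len);       /* copy #2: shared region -> receive buffer */
}
```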

The shared address space is another mechanism that enables processes to share memory efficiently. As of today, SMARTMAP [5], PVAS [19], MPC [18] and Process-in-Process (PiP) [10] implement this model. As the name suggests, in this model processes are created to share the same address space. Once the processes are created, they can access any data in the address space, and there is no need for the special operations required in the shared memory model, such as the creation of shared memory regions in POSIX shmem or the expose and attach operations of XPMEM. Moreover, the shared address space model can provide all functionalities of the shared memory model, but with higher efficiency. There are more benefits to the shared address space. The address of an object is unique and does not depend on the accessing process. This means that complex data structures, such as linked lists or trees in which objects are linked by pointers, are also accessible without any extra effort. Even execution code can be shared. This characteristic of the shared address space model is very difficult to achieve with traditional shared memory, since traditional shared memory does not guarantee the same virtual address for mapped objects. The shared address space, however, also has a disadvantage compared to traditional shared memory: it may incur higher overhead when modifying the shared address space itself, such as when calling the m(un)map() or brk() system calls, which do not suffer from this overhead under traditional shared memory.

The shared memory and the shared address space models can be confusing because both allow access to the data of another process. However, the shared memory model only allows access to selected parts of another process's address space, while the shared address space model allows all processes to access everything in the same address space. The primary motivation of this paper is to clarify the difference between these two models.

The shared address space model can be defined as one in which execution entities share an entire address space, not only parts of it. For example, the well-known multi-thread execution model can also be thought of as a shared address space model. However, the multi-process and multi-thread execution models differ in many other respects, so comparing them would not isolate the basic differences between the two memory models.

This paper highlights the differences between the two memory models both qualitatively and quantitatively. The possible optimization techniques enabled by using the shared address space are not the main concern of this paper, but some of them will be discussed in Sect. 6.

In this paper, we chose PiP as an implementation of the shared address space model because PiP is implemented at user level and can provide the XPMEM API. We chose MPICH for evaluation purposes as it is a well-known MPI implementation. We modified MPICH to create the shared address space environment, but the modifications are kept minimal so that the basic performance differences can be demonstrated. Our approach is to compare four different MPICH configurations by using POSIX shmem, XPMEM, and PiP:

  • Shmem: configured to utilize POSIX shmem for intra-node communication,

  • PiP-Shmem: configured to utilize POSIX shmem but MPI processes are spawned by using PiP,

  • XPMEM: configured to utilize XPMEM for intra-node communication, and

  • PiP-XPMEM: configured to utilize XPMEM but XPMEM functions are implemented by using PiP.

The quantitative differences will be demonstrated by using the Intel MPI benchmark programs (IMB) and various HPC benchmarks. In summary, the main findings of this paper are as follows.

  • PiP-Shmem behaves almost the same as Shmem,

  • PiP-XPMEM outperforms XPMEM most clearly when the communication pattern is irregular and the XPMEM cache is therefore ineffective,

  • PiP(-XPMEM) incurs large overhead when calling the m(un)map() or brk() system calls, and

  • typical application behavior rarely triggers the above overhead, and there are cases where PiP-Shmem and PiP-XPMEM outperform Shmem and XPMEM.

To the best of our knowledge, this is the first paper to clarify the differences between the shared memory and shared address space models.

2 Background and Related Work

There are two widely used parallel execution models: multi-process and multi-thread. Although the MPI standard does not define which execution model is to be used to implement MPI, many MPI implementations are based on the multi-process model. OpenMP is a parallel language utilizing the multi-thread execution model. Usually, a process has its own address space, and nothing is shared among processes. Threads share almost everything, including static variables. One may conceive a third execution model that takes the best of both worlds. In this execution model, execution entities (processes or threads) share an entire address space, but each execution entity can run independently from the others, i.e., all variables are privatized, unlike in the thread model. Since an address space is shared, one can access the data owned by others whenever required. In short, nothing is shared in the process model, everything is shared in the thread model, and in this third execution model everything is shareable. Needless to say, the third execution model also provides the shared address space model. The term process may not be appropriate here because a process, as opposed to a thread, usually implies its own associated address space. Hereinafter, the term task is used in the contexts of the shared address space model and the third execution model.

In the shared address space model, an object is visible at the same virtual address from every task; this feature is called Consistent Address View (CAV) in this paper. The shared memory model does not have this feature, since the mapped address of a memory segment may differ from process to process. CAV enables the sharing of complex structures (e.g., linked lists or trees) and even execution code. In most cases, such linked lists and tree structures hold pointers to other related objects, and those objects may be widely scattered across the address space and may not fit in a single memory segment. With traditional shared memory, pointers in complex data structures cannot be dereferenced as they are, and the structures may not fit into one shareable memory segment. Thus, sharing a complex structure through shared memory is difficult.
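As an illustration of this point (not taken from any of the cited systems), the sketch below contrasts traversing a pointer-linked list under CAV with the offset-based workaround that traditional shared memory typically requires; the structure names are hypothetical.

```c
/* Illustrative sketch: traversing a linked list owned by another task.
 * With CAV the stored pointers are valid as-is; with traditional shared
 * memory the list must be placed in one segment and linked by offsets,
 * because the segment may be mapped at a different base in each process. */
#include <stddef.h>

struct node { int value; struct node *next; };

/* Shared address space (CAV): another task can simply follow the pointers. */
int sum_cav(const struct node *head)
{
    int sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Traditional shared memory: nodes must live inside one mapped segment and
 * store offsets relative to the local mapping base instead of pointers. */
struct onode { int value; ptrdiff_t next_off; };   /* -1 terminates the list */

int sum_shmem(const char *seg_base, ptrdiff_t head_off)
{
    int sum = 0;
    for (ptrdiff_t off = head_off; off != -1;
         off = ((const struct onode *)(seg_base + off))->next_off)
        sum += ((const struct onode *)(seg_base + off))->value;
    return sum;
}
```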

There are currently four major implementations of the third execution model: 1) SMARTMAP, 2) PVAS, 3) MPC, and 4) PiP. These are summarized in Table 1, where Base indicates the base of the implementation (process or thread), Partition indicates whether the address space is regularly partitioned, CAV indicates whether the implementation has the CAV feature, Multi-Prog. indicates whether multiple programs are supported, PIE indicates whether the executable must be built as a PIE (Position-Independent Executable), and Impl. Lv. indicates the implementation level, user or kernel.

Table 1. Shared Address Space Implementations

The implementation of SMARTMAP relies on the page table structure of the x86 architecture, and its page table has a unique format. A task is mapped twice in the shared address space: once for its own execution and once for access by the other tasks. Thus, to access another task's data an offset must be added to addresses, and CAV is therefore not supported by SMARTMAP.

A shared address space is partitioned in SMARTMAP and PVAS, and all memory segments of a task are packed into one of the partitions. Unlike SMARTMAP, PVAS is architecture independent, and it loads an executable image into one of the unused partitions. If a program A is loaded twice or more, the images of A must be loaded at different partitions; to enable this, the executable must be a PIE. This requirement is the same for PiP, although PiP does not partition the address space.

MPC takes a different approach from the others. Its implementation is based on Pthreads, and MPC makes threads behave like processes. Variable privatization is implemented by converting static variables to Thread Local Storage (TLS) variables; this translation is done by a dedicated compiler and linker. The biggest issue with this implementation is that user programs may create (OpenMP) threads and declare their own TLS variables. The converted TLS variables must be accessible by any (process-like) thread, while the user-declared TLS variables must be accessible only by the thread that declared them. Thus, the converted and user-declared TLS variables have different access scopes. To solve this issue, MPC implements two different TLS systems: one for the converted TLS variables and another for user-created threads. Despite this mitigation, MPC tasks still have limitations coming from the thread implementation, such as the fact that only a single program can be loaded and all tasks share a single file descriptor table.

In the XPMEM shared memory model, the procedure to access the address space of another process is: 1) the exposing process calls xpmem_make() in advance, 2) the accessing process calls xpmem_get(), and 3) the accessing process then calls xpmem_attach(). The xpmem_make() function specifies an address range of the caller process so that other processes can attach only memory regions within this range. The xpmem_get() function checks whether the calling process may access the exposed memory. Finally, the xpmem_attach() function must be called to specify the memory region to be shared. The times to call the XPMEM functions (and the POSIX shmem functions as well) have already been reported in [10], showing that the overhead of xpmem_get() (and of POSIX shmem) is very high.
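The three steps can be sketched as follows. This is a hedged example based on common XPMEM usage in MPI libraries, not code from any particular implementation; constants such as XPMEM_MAXADDR_SIZE and the permit values may differ between XPMEM versions, and error handling is omitted.

```c
/* Sketch of the three XPMEM steps described above (error handling omitted). */
#include <sys/types.h>
#include <xpmem.h>

/* Exposing (owner) process: expose its whole address range once. */
xpmem_segid_t expose_all(void)
{
    return xpmem_make(NULL, XPMEM_MAXADDR_SIZE, XPMEM_PERMIT_MODE, (void *)0600);
}

/* Accessing process: get access to the segment, then attach one region. */
void *attach_region(xpmem_segid_t segid, off_t offset, size_t size)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, (void *)0600);
    struct xpmem_addr addr = { .apid = apid, .offset = offset };
    return xpmem_attach(addr, size, NULL);   /* returns a local mapping */
}
```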

Hashmi investigated various optimization techniques for implementing MPI by using XPMEM [7], although the title of his thesis uses the term “Shared-Address-Space.” He proposed an XPMEM-based implementation of MVAPICH (an MPI implementation) [20] to improve P2P communication and collective operations. The XPMEM cache was proposed to mitigate the high attach overhead. He also proposed optimizations for handling MPI datatypes by using XPMEM.

SMARTMAP, PVAS and XPMEM are implemented at the kernel level. Consequently, it is very difficult to set up their environments on systems in operation if those systems do not support them already. In contrast, PiP and MPC are implemented at user level, and it is easy to run programs under these environments on any system. As shown in Table 1, PiP is the most practical implementation: it requires neither a dedicated OS kernel nor a dedicated language processing system, and it supports CAV. This is why we chose PiP for this paper.

The shared address space model is relatively new and has not been thoroughly investigated yet. This paper clarifies the basic characteristics of the shared address space model by comparing it with traditional shared memory. Possible applications of this model are discussed in Sect. 6.

3 Process-in-Process (PiP)

In PiP, a normal process can spawn child PiP tasks located in the same address space as the spawning process. The parent process is called the PiP root process and the spawned tasks are called PiP tasks. When implementing MPI with PiP, the process manager process is the PiP root and the MPI processes are PiP tasks. The PiP root process is also treated as a PiP task.

The PiP implementation relies on the dlmopen() Glibc function (not dlopen()) and the Linux clone() system call. dlmopen() loads a program into a new namespace. By using this function, programs can be loaded into the same address space twice or more whilst maintaining their variable privatization. To load the same program at a different location in the same address space, the program must be compiled and linked as a Position-Independent Executable (PIE). The clone() system call is used to create a task that shares the same address space but behaves like a process, i.e., it has its own file descriptor table, signal handlers, and so on. Thus, a PiP task behaves just like a normal process except for the shared address space.
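The following is not the actual PiP library code, but a minimal sketch of how these two mechanisms can be combined: dlmopen() loads a PIE into a fresh namespace of the same address space, and clone() with CLONE_VM (but without CLONE_FILES or CLONE_SIGHAND) creates a process-like task sharing that address space. The entry symbol task_entry is hypothetical.

```c
/* Not the actual PiP library code: a minimal sketch of the two mechanisms
 * PiP builds on. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>

static int task_main(void *arg)
{
    /* look up the entry point of the privately loaded image */
    void *handle = arg;
    int (*entry)(void) = (int (*)(void))dlsym(handle, "task_entry"); /* hypothetical symbol */
    return entry ? entry() : 1;
}

int spawn_task(const char *pie_path)
{
    /* load the PIE into a new namespace within this address space */
    void *handle = dlmopen(LM_ID_NEWLM, pie_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) return -1;

    /* share the address space (CLONE_VM) but keep process-like attributes:
     * no CLONE_FILES or CLONE_SIGHAND, so the new task gets its own file
     * descriptor table and signal handlers */
    char *stack = malloc(1 << 20);
    return clone(task_main, stack + (1 << 20), CLONE_VM | SIGCHLD, handle);
}
```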

Let us explain the variable privatization of PiP with a concrete example. Assume that a program has a statically allocated variable x and it spawns two PiP tasks derived from this program. Each PiP task has its own variable x at a different location in the same virtual address space, and each task accesses its own variable x. Thus, the spawned tasks can run independently without any collisions when accessing variables. This behavior is substantially different from that of the multi-thread model, where all static variables are shared. To access the variable x owned by another task, PiP provides several ways to pass the address of an object to another PiP task. If the address of an object to be accessed is known, a task can simply access it without performing any extra operations.

This variable privatization makes it much easier for programs to share an address space than with the multi-thread model. When a sequential program is run under the multi-thread model, static variables must be protected from simultaneous access. In the shared address space model, however, all static variables are privatized and there is no need for such protection. Thus, multiple instances of a program, or multiple programs, can run in the shared address space without any modification of their source code.

There is one big limitation of PiP that comes from the current Glibc implementation. The number of namespaces that dlmopen() can create is limited to 16. This number is the size of a statically allocated array of namespaces and is hard-coded in Glibc, and it is too small considering the number of CPU cores on a node. We therefore patched Glibc so that more than 16 PiP tasks can be created. Regardless of whether the patched Glibc is used, the current PiP implementation is purely user-level and requires neither a kernel patch nor a specific kernel module. It should be noted that the patched Glibc can coexist with the existing Glibc; it is used only when a PiP program is compiled and linked with the PiP compiler wrapper scripts.

The PiP library also provides the XPMEM functions and the XPMEM header file so that programs using XPMEM can easily be converted to PiP-aware programs. Furthermore, a converted program can run without installing the XPMEM kernel module. The XPMEM functions implemented in the PiP library do almost nothing and are therefore very efficient, because PiP tasks can already access any data in the address space.
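The sketch below is not the PiP library's actual code; it merely illustrates why the XPMEM calls can become near no-ops under a shared address space: here the segment id simply remembers the exposed base address, and attaching reduces to computing an address that is already valid for every PiP task.

```c
/* Not the PiP library's actual code: a hedged sketch of XPMEM functions
 * reduced to near no-ops under a shared address space. */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

typedef long xpmem_segid_t;
typedef long xpmem_apid_t;
struct xpmem_addr { xpmem_apid_t apid; off_t offset; };

xpmem_segid_t xpmem_make(void *vaddr, size_t size, int permit_type, void *permit_value)
{
    (void)size; (void)permit_type; (void)permit_value;
    return (xpmem_segid_t)(uintptr_t)vaddr;   /* nothing to expose; remember the base */
}

xpmem_apid_t xpmem_get(xpmem_segid_t segid, int flags, int permit_type, void *permit_value)
{
    (void)flags; (void)permit_type; (void)permit_value;
    return (xpmem_apid_t)segid;               /* nothing to check */
}

void *xpmem_attach(struct xpmem_addr addr, size_t size, void *vaddr)
{
    (void)size; (void)vaddr;
    return (char *)(uintptr_t)addr.apid + addr.offset;   /* already accessible */
}
```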

4 Shared Memory Vs. Shared Address Space

4.1 Page Tables and Page Faults

In modern CPUs and OS kernels, an address space is essentially implemented as a page table located inside the kernel. The page table holds the mapping information from virtual addresses to physical addresses for every memory page in use. Every process has its own address space (left figure of Fig. 1), which implies that every process has its own page table. When a shared region is created by using POSIX shmem, the physical memory pages that are shared are mapped in the page tables of all processes sharing the region. In XPMEM, an existing memory region to share must first be exposed and then attached by the other process.

Fig. 1. Page table structure difference

Let us take a closer look at the creation of new mappings. In many modern OSes, including Linux, there are two steps: one to create a skeleton of the mapping upon request, followed by a (minor) page fault when the memory page is accessed for the first time. This page fault triggers the creation of a page table entry for the page. Each step is accompanied by non-negligible overhead. Furthermore, in the shared memory model these steps happen in every process accessing the shared memory region. Thus, the shared memory model may suffer both from the setup overhead and from a large number of page faults.

To the contrary, in the shared address space model there is only one page table regardless of the number of tasks (right figure of Fig. 1). Once a page table entry is created, the corresponding memory page can be accessed by any task sharing the address space without triggering further page faults. Nowadays the number of tasks (processes) in a node used for parallel execution can be large, since the number of CPU cores keeps increasing. Thus, the overhead of page table setup and page faults can be far lower than in the shared memory model.
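The first-touch minor page faults described above can be observed with getrusage(), as in the following sketch; in the shared memory model each attaching process pays these faults itself, whereas in the shared address space model the page table entries created by one task serve all tasks.

```c
/* Sketch: observing first-touch minor page faults on a mapped region. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define NPAGES 1024
#define PAGE   4096

int main(void)
{
    char *p = mmap(NULL, NPAGES * PAGE, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    struct rusage before, after;

    getrusage(RUSAGE_SELF, &before);
    for (long i = 0; i < NPAGES; i++)
        p[i * PAGE] = 1;                     /* first touch of every page */
    getrusage(RUSAGE_SELF, &after);

    printf("minor faults for first touch: %ld\n",
           after.ru_minflt - before.ru_minflt);
    return 0;
}
```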

4.2 Modifications to Page Tables

However, the shared address space also has a disadvantage. Suppose that there are four processes sharing the same address space (right figure of Fig. 1). In theory, the size of the shared page table is the sum of the sizes of the page tables of the four independent processes. This page table is bigger than the per-process page tables of the shared memory model, so walking it can take longer. Additionally, since the page table is shared by the four processes, it must be protected from simultaneous modification by locking, which makes page table modifications even more expensive.

There are two widely used system calls, m(un)map() and brk(), that modify the page table when memory regions are (de)allocated. The mmap() system call is also used to allocate a shared memory region. The brk() system call extends the heap segment and is used by the malloc() routines. The detailed m(un)map() overhead was already analyzed and reported in the original PiP paper [10]; the overhead of PiP is almost the same as that of Pthreads, which is another implementation of the shared address space model.
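The relationship between malloc() and brk() can be observed with the small sketch below; running it under strace with brk tracing enabled shows the underlying brk() system calls. Note that glibc serves very large allocations with mmap() instead.

```c
/* Sketch: observing how malloc() grows the heap through brk().
 * sbrk(0) returns the current program break. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *brk_before = sbrk(0);
    for (int i = 0; i < 1024; i++)
        (void)malloc(1024);              /* many small allocations */
    void *brk_after = sbrk(0);

    printf("program break grew by %ld bytes\n",
           (long)((char *)brk_after - (char *)brk_before));
    return 0;
}
```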

Table 2. Shared memory and shared address space

Summary of the Differences

Table 2 shows the summary of a qualitative comparison between the shared memory model and the shared address space model.

5 Evaluation

The objective of this evaluation is to assess whether or not the shared address space can provide performance advantages over shared memory, given the advantages and disadvantages described in the previous section. Unfortunately, the benchmark programs evaluated in this paper make no use of CAV, which is one of the most distinctive features of the shared address space; we discuss this further in Sect. 6.

We chose MPICH (Version 3.4.1) for evaluation. Although various optimizations based on PiP are possible, we kept the modifications to MPICH minimal to highlight the basic difference between the shared memory and shared address space models. To make MPICH PiP-aware, we modified the Hydra process manager of MPICH to spawn PiP tasks instead of creating normal processes. MPICH was configured in four ways:

  • Shmem: MPICH is configured to use POSIX shmem for intra-node communication,

  • XPMEM: MPICH is configured to use XPMEM for intra-node communication, if possible,

  • PiP-Shmem: MPICH is configured to spawn PiP tasks, and PiP tasks allocate POSIX shared memory regions for intra-node communication (although shmem is not needed with PiP), and

  • PiP-XPMEM: in addition to PiP-Shmem, the XPMEM code is enabled but implemented by the PiP library, and the XPMEM cache code is bypassed.

Some difference between XPMEM and PiP-XPMEM is expected because XPMEM incurs the overhead of attaching the memory of other processes as well as the overhead of the XPMEM cache used to reduce the number of attach calls. PiP-Shmem and PiP-XPMEM, however, might incur higher mmap() and/or brk() overhead.

P2P and RMA performance was measured using the Intel MPI Benchmarks (IMB) [11]. Six mini-apps, HPCCG [8], miniGhost [3], LULESH2.0 [1, 12], miniMD [14], miniAMR [2] and mpiGraph [15], were chosen to cover various parallel execution and communication patterns.

We confirmed that XPMEM is used in MPICH for P2P functions (Send/Recv, Isend/Irecv and Sendrecv) with message sizes of 4 KiB or larger, and for some RMA calls (Get/Put and Accumulate). The same condition applies to PiP-XPMEM, since the threshold setting was left unchanged.

5.1 Experimental Environment

To measure the four MPICH configurations, Shmem, PiP-Shmem, XPMEM and PiP-XPMEM, we needed access to a cluster where XPMEM was already installed. Unfortunately, only a limited number of compute nodes of the Oakforest-PACS supercomputer [12] could be set up with such an environment. Table 3 describes our evaluation environment.

Table 3. Experimental platform information

In all cases in this section, MPI processes are bound to CPU cores with the -bind-to rr (round-robin) MPICH runtime option; no other runtime option is specified. The performance numbers of the benchmark programs compiled and linked with the Intel compiler and Intel MPI are also shown in some cases, just for reference. All MPICH libraries and mini applications are compiled using GCC. All measurements were repeated ten times and the average numbers are reported.

5.2 Intel MPI Benchmark (IMB) Performance

To measure and compare P2P performance in this subsection, all benchmark numbers in IMB-MPI1 and IMB-RMA were measured using only a single node. Most benchmark results did not show any large difference between Shmem and PiP-Shmem, or between XPMEM and PiP-XPMEM. Here, the Exchange results of IMB-MPI1 (Fig. 2) and the All_put_all results of IMB-RMA (Fig. 3) are shown. In the Exchange benchmark, the MPI processes form a ring topology and each MPI process sends messages to its neighbors by calling MPI_Isend(), MPI_Irecv() and MPI_Wait(). In the All_put_all benchmark, each MPI process puts data to all the other MPI processes. Remember that XPMEM is effective only for P2P communication with message sizes of 4 KiB or larger and for some RMA operations, including get and put.
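For reference, the Exchange communication pattern can be sketched roughly as follows; this is not the IMB source, and the single send buffer and completion with MPI_Waitall() are simplifications.

```c
/* Rough sketch of the Exchange communication pattern (not the IMB source):
 * each rank posts nonblocking sends and receives to both ring neighbors
 * and waits for completion.  recvbuf must hold 2*msgsize bytes. */
#include <mpi.h>

void exchange_step(char *sendbuf, char *recvbuf, int msgsize, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    MPI_Request req[4];
    MPI_Irecv(recvbuf,           msgsize, MPI_CHAR, left,  0, comm, &req[0]);
    MPI_Irecv(recvbuf + msgsize, msgsize, MPI_CHAR, right, 0, comm, &req[1]);
    MPI_Isend(sendbuf,           msgsize, MPI_CHAR, left,  0, comm, &req[2]);
    MPI_Isend(sendbuf,           msgsize, MPI_CHAR, right, 0, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```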

In Fig. 2, Shmem (not shown in the figure because it is the baseline, always one) and PiP-Shmem exhibited very similar latency curves, except for the dip at a message size of 64 KiB. This dip appears because the latency of Shmem is exceptionally large at that message size. XPMEM and PiP-XPMEM exhibit almost the same latency as each other, and much better latency than Shmem and PiP-Shmem when the message size is 4 KiB or larger, which is the threshold for calling the XPMEM functions.

Fig. 2. IMB-MPI1 Exchange (-np 32 -ppn 32)

Fig. 3. IMB-RMA All_put_all (-np 32 -ppn 32)

Fig. 4. IMB-EXT Window (-np 32 -ppn 32)

In Fig. 3, Shmem and PiP-Shmem exhibited almost the same performance, except for the dip at 1 KiB. Again, this dip comes from the exceptional Shmem latency at this message size and affects the ratios of the other MPI configurations. Comparing XPMEM and PiP-XPMEM, PiP-XPMEM exhibited much better latency in the range from 4 KiB to 64 KiB. We believe this performance advantage of PiP-XPMEM over XPMEM comes from XPMEM cache misses. Hashmi showed that the XPMEM cache miss overhead is visible only for smaller message sizes [7], and Fig. 3 matches his report.

There is a pitfall in the IMB-RMA benchmark: the measured time does not include the time to create an RMA window. The time to call MPI_Win_create() can be measured by another program, IMB-EXT, in the IMB suite. Figure 4 shows the results of this program. By and large, PiP-Shmem and PiP-XPMEM took more than twice as long as Shmem and XPMEM, respectively. This PiP overhead is considered to come from contention among mmap() calls and from the larger page table. In general, the RMA window creation function is called at the initialization stage of an MPI program and not frequently thereafter, so this overhead is diluted in real applications.
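The cost of window creation can be measured separately in the spirit of the IMB-EXT Window benchmark, as in the following sketch (not the IMB code); the window size is arbitrary.

```c
/* Sketch (not the IMB-EXT code): timing MPI_Win_create() separately from
 * the RMA operations themselves. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    size_t winsize = 1 << 20;                /* arbitrary window size */
    void *base = malloc(winsize);
    MPI_Win win;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Win_create(base, (MPI_Aint)winsize, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    double t1 = MPI_Wtime();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_Win_create took %f usec\n", (t1 - t0) * 1e6);

    MPI_Win_free(&win);
    free(base);
    MPI_Finalize();
    return 0;
}
```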

Table 4. Benchmark parameters

5.3 Mini App Performance

Table 4 lists the mini applications used in this subsection and the parameters used to run them. The column Maj. MPI send indicates the MPI send function called most frequently in each application. The Perf. Index column indicates which number reported by the application is used for the performance comparison.

In the following application evaluation, single-node and multi-node performance are shown and compared. The number of MPI processes for LULESH2.0 must be a cubic number, so LULESH2.0 is run with 27 processes on a single node and 125 processes on multiple nodes (five nodes). All other applications run with 32 processes on a single node and 256 processes on eight nodes.

Figure 5 shows the single-node performance ratios of the mini apps relative to Shmem. As seen, PiP-Shmem, XPMEM and PiP-XPMEM perform within a few percent of each other. Most notably, mpiGraph running with PiP-XPMEM outperforms Shmem by 3x and XPMEM by 1.5x.

Fig. 5. Application Performance Comparison

Fig. 6. Application Performance Comparison (multiple nodes)

Figure 6 shows the multi-node performance numbers. Unlike Fig. 5, the large performance gain of PiP-XPMEM on mpiGraph is hardly seen. Further, XPMEM outperforms PiP-XPMEM and Shmem also outperforms PiP-Shmem in miniMD. In miniAMR, PiP-Shmem and PiP-XPMEM outperform Shmem and XPMEM by about 5%, respectively.

In the next subsection, we will try to analyze these situations in terms of the XPMEM cache miss, the number of page faults, the number of brk() calls, and the communication patterns.

Detailed Analysis

Figure 9 shows the number of XPMEM cache accesses and the XPMEM cache miss ratio of each application. The XPMEM cache works as follows: first the cache table is searched, and if there is no matching entry, xpmem_get() and xpmem_attach() are called to attach the memory region, and the attached region is then registered in the cache. The upper bars in the figure indicate the number of XPMEM cache accesses and the lower bars indicate the cache miss ratios. In the mpiGraph application, the XPMEM cache miss ratio is exactly 10%, the highest among all applications.
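The lookup-then-attach flow can be illustrated as follows; this is not the MPICH/MVAPICH cache code, and the cache structure and lookup key are hypothetical.

```c
/* Illustration only: the lookup-then-attach flow of an XPMEM attach cache
 * as described above. */
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <xpmem.h>

struct cache_entry {
    xpmem_segid_t segid;
    off_t         offset;
    size_t        size;
    void         *attached;            /* local address of the attached region */
    struct cache_entry *next;
};

void *cached_attach(struct cache_entry **cache,
                    xpmem_segid_t segid, off_t offset, size_t size)
{
    /* hit: reuse the previously attached region */
    for (struct cache_entry *e = *cache; e != NULL; e = e->next)
        if (e->segid == segid && e->offset == offset && e->size >= size)
            return e->attached;

    /* miss: attach and register the region in the cache */
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, (void *)0600);
    struct xpmem_addr addr = { .apid = apid, .offset = offset };
    void *mem = xpmem_attach(addr, size, NULL);

    struct cache_entry *e = malloc(sizeof(*e));
    e->segid = segid; e->offset = offset; e->size = size;
    e->attached = mem; e->next = *cache;
    *cache = e;
    return mem;
}
```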

Fig. 7. # Page Faults (single node)

Fig. 8. Page Fault Ratio (single node)

Figure 7 shows the number of page faults and Fig. 8 shows the ratio of page faults during the executions. The numbers of page faults with PiP-Shmem and PiP-XPMEM are always lower than those of Shmem and XPMEM, respectively, as expected. Most notably, the number of page faults of XPMEM on mpiGraph is 2x higher than that of Shmem and close to 10x higher than that of PiP-XPMEM.

Fig. 9. # XPMEM Cache Access and Cache Miss Ratio (single node)

Fig. 10. Number of brk() calls

The number of mmap() system calls includes the calls made when loading a program and the required shared libraries, so comparing the numbers of mmap() calls may be imprecise. Instead, the number of brk() system calls is measured in this paper. We measured the overhead of the brk() system call with 32 tasks on a single node. The overhead in the PiP-Shmem case is very high, almost 200x that of the Shmem case. This large overhead may affect application performance.

Figure 10 shows the number of brk() calls per node. These numbers were measured by using the Linux strace command. brk() is always called in pairs: one call to obtain the current heap address and another to extend the heap segment. So, the actual number of page table modifications is half of the numbers shown in the figure. Unfortunately, the strace command is not built as a PIE and cannot be run in the PiP environment. However, the numbers would be the same as those in the graph, since the MPICH modifications for PiP-awareness do not affect the number of brk() calls. These numbers are almost independent of whether one node or multiple nodes are used and of whether XPMEM is used, except for LULESH2.0, for which the numbers on multiple nodes are higher than those on a single node.

Fig. 11. Communication patterns

Figure 11 shows cumulative graphs of send frequency over message size. These numbers were obtained by using the PMPI interface. The communication pattern of miniMD depends on the number of ranks, while the other applications are almost independent of the number of nodes. In miniMD, the larger the number of ranks, the smaller the message sizes, because the miniMD parameters are set up for strong scaling.

Let us examine all these evaluation results. The single-node performance of mpiGraph is a good showcase of why PiP-XPMEM works better than XPMEM: 1) a high XPMEM cache miss ratio, 2) a high number of page faults, 3) a moderate number of brk() calls, and 4) exchanges of large messages for which MPICH can utilize the XPMEM functionality.

LULESH2.0 and miniAMR behaved similarly in terms of a high number of XPMEM calls, a low XPMEM cache miss ratio, and a high number of brk() calls. The high number of brk() calls should be a disadvantage for PiP-Shmem and PiP-XPMEM; however, almost no performance penalty is observed. On the contrary, PiP-Shmem and PiP-XPMEM slightly outperform Shmem and XPMEM, which can be explained by the high number of page faults they avoid (Fig. 7).

HPCCG and miniGhost performed consistently regardless of whether PiP and/or XPMEM were used. The number of page faults was reduced by using PiP-Shmem or PiP-XPMEM, but this advantage appears to be canceled by the brk() overhead. The advantage of using XPMEM over Shmem on HPCCG, miniGhost and miniMD may have been spoiled by the 1% XPMEM cache miss ratio (Fig. 9).

The multi-node miniMD performance of PiP-Shmem is slightly worse than that of Shmem. miniMD behaved almost the same as HPCCG and miniGhost in terms of the number of page faults, the XPMEM cache miss ratio, and the number of brk() calls. A big difference between miniMD and those applications lies in the communication pattern (the miniMD graph in Fig. 11): the message sizes when running on 256 ranks are smaller than those on 32 ranks. The smaller the message size, the larger the relative impact of the brk() overhead.

6 Discussion

So far, the performance of PiP has been evaluated by using MPICH with only the minimal modifications needed to make it PiP-aware. However, there is a lot of room for optimizing MPI implementations by using the shared address space model. In the POSIX shmem usage of an MPI implementation, there are two mmap() calls: one in the process that allocates a shared memory region and another in each process that attaches the region in order to access it. In the shared address space model, however, a single call to allocate memory is enough; once the memory region is allocated, it is accessible without an additional mmap() call to attach it. If an MPI implementation is optimized to take full advantage of the shared address space, the number of mmap() calls can be halved. Although each page table modification is more expensive in the shared address space model, the smaller number of modifications may lead to smaller overall overhead.
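A hedged sketch of this contrast is shown below: with POSIX shmem the allocating process and every accessing process each call mmap(), whereas under a shared address space one anonymous mmap() by a single task suffices and only the resulting pointer needs to be passed around. Function names are hypothetical.

```c
/* Sketch: where the mmap() calls occur in the two models. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* shared memory model: the allocator maps once ... */
void *shmem_alloc(const char *name, size_t size)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, size);
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* ... and every other process maps again to attach. */
void *shmem_attach(const char *name, size_t size)
{
    int fd = shm_open(name, O_RDWR, 0600);
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* shared address space model: one mapping, visible to all tasks as-is. */
void *sas_alloc(size_t size)
{
    return mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```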

The shared address space may improve not only intra-node communication performance, but also inter-node communication performance. Ouyang et al. have been working on MPI optimizations that use PiP to improve inter-node, rather than intra-node, communication. In [17], they proposed CAB-MPI, in which communication queue entries of other MPI processes are stolen to balance the communication load among MPI processes within a node. In another paper [16], they proposed Daps, in which idle MPI processes steal the asynchronous progress work of busy MPI processes instead of creating a dedicated asynchronous progress thread. The CAV provided by the shared address space model plays a very important role in implementing CAB-MPI and Daps. The message queue (i.e., send or receive queue) is implemented as a linked list in many cases, and CAV enables CAB-MPI to access the message queues of other processes without significant modification of the existing queue structures. Implementing the asynchronous progress stealing of Daps is more complicated than implementing CAB-MPI. In MPICH, low-level communication functions are called via function pointers to decouple the device-independent code from the device-dependent code. In Daps, those functions must also be callable by a different MPI process, and the CAV property provided by the shared address space model enables this.

Although this paper focused on the differences between shared memory and the shared address space and reported advantages in intra-node MPI communication, there are many other potential applications that would benefit from the shared address space. The shared address space can also be applied to various communication libraries (e.g., OpenSHMEM [4]) and parallel programming languages (e.g., PGAS languages).

In-situ analysis, visualization, and multi-physics applications require two or more programs to run simultaneously, cooperate, and exchange information. What if these programs ran in a shared address space environment? Data exchange among the programs could be more efficient than the conventional approaches (data exchange via files or a coupling library) because the data of the other program can be accessed directly. There are many open issues in the field of coupling multiple programs; however, we believe the shared address space can be an answer to the question of how to connect programs efficiently.

Garg, Price and Cooperman proposed a checkpoint-restart system named Mana, which is agnostic to MPI implementations and network devices and uses a dedicated thread to save the memory image [6]. Although the authors claimed that their approach is hard to implement with PiP, we think checkpoint-restart is a very attractive and challenging application of the shared address space model.

7 Summary

This paper has provided a detailed comparison between the shared memory and shared address space models. Although these two models appear similar since both allow access to data owned by other processes, their underlying mechanisms are notably different. From a qualitative point of view, the shared address space model may incur fewer page table modifications and page faults than the shared memory model. On the other hand, the shared address space model may incur larger overhead when modifying page tables, e.g., when calling the m(un)map() and brk() system calls. This overhead comes from sharing the page table among processes and from the page table being larger than in the shared memory model. From a quantitative point of view, evaluations were conducted by using P2P benchmark programs and mini application benchmarks. PiP was chosen as an implementation of the shared address space model, and an MPI implementation (MPICH) was modified in a minimal way to adopt the shared address space model. Four MPI configurations were compared: 1) (POSIX) Shmem, 2) XPMEM, 3) PiP-Shmem and 4) PiP-XPMEM. Shmem and XPMEM are the representatives of the shared memory model; PiP-Shmem and PiP-XPMEM are the representatives of the shared address space model.

The P2P benchmarks show that both models perform comparably. The RMA benchmarks reveal that RMA operations in the shared address space model may outperform those in the shared memory model; however, RMA window creation in the shared address space model is almost twice as costly as in the shared memory model. Most mini applications also perform comparably. Most notably, mpiGraph with PiP-XPMEM outperformed Shmem by 3x and XPMEM by 1.5x.

The shared memory model is a mature technology that has received a lot of attention in the literature. In contrast, the shared address space model is newer, and only a few investigations have been conducted so far. We believe that, given the opportunities to improve current HPC system software, the shared address space model is worth investigating. This paper has made the first steps in this direction.

PiP is open-source software freely available at https://github.com/procinproc/procinproc.github.io. The PiP package includes the patched Glibc, a PiP-aware GDB, an installation program (named pip-pip), and more. PiP can also be installed by using Spack (https://github.com/spack) under the package name process-in-process.