Simplifying non-contiguous data transfer with MPI for Python

Python is becoming increasingly popular in scientific computing. The package MPI for Python (mpi4py) allows writing efficient parallel programs that scale across multiple nodes. However, it does not support the transfer of non-contiguous data as produced by slicing, a well-known feature of NumPy. In this work, we therefore evaluate several methods to support the direct transfer of non-contiguous arrays in mpi4py. This significantly simplifies the code, while performance remains essentially unchanged. In a PingPong-, Stencil- and Lattice-Boltzmann-Benchmark, we compare the common manual copying, a NumPy-Copy design and a design based on MPI derived datatypes. In one case, the MPI derived datatype design achieved a speedup of 15% in a Stencil-Benchmark on four compute nodes. Our designs are superior to naive manual copies, but for maximum performance, manual copies with pre-allocated buffers or MPI persistent communication are the better choice.


Introduction
In recent years, Python has become more and more important as a programming language for scientific applications. According to the TIOBE Index of April 2023, Python is even the most popular programming language [1], and IEEE Spectrum also ranks Python as the top programming language [2]. The main reason for Python's popularity is its ease of use, especially when compared to other programming languages.
However, because Python is an interpreted language, its performance lags behind that of compiled languages like C or C++. Therefore, many large-scale scientific applications still use C, C++ or even FORTRAN.
To counteract this, various libraries and packages are being developed to allow more efficient use of compute resources, which is necessary for high-performance computing. NumPy [3] uses C-like arrays to store data and provides fast functions implemented in C to speed up computation. The SciPy [4] library is based on NumPy and provides a rich set of functions for scientific computing. It typically implements time-consuming loops in C or FORTRAN and uses sophisticated wrappers to package existing optimized scientific algorithms. Most of these tools only support single-node execution, but in high-performance computing, the computation of large-scale applications is usually distributed across multiple nodes. mpi4py [5] is a Python wrapper around the Message Passing Interface (MPI), the de facto standard for parallel processing and data transfer in HPC clusters. mpi4py can be used with all common MPI libraries and works together with Python packages such as NumPy, allowing Python applications to be parallelized across multiple cluster nodes.
Scientific applications often work on 2D or 3D domains that can be represented as NumPy arrays. Using slices, NumPy provides a powerful tool for accessing non-contiguous domains, for example by accessing only the data of one column in a two-dimensional domain. While contiguous arrays are supported in mpi4py, efficient transfer of non-contiguous arrays forces the developer to create MPI derived datatypes (DDT). Without them, an mpi4py transfer will fail if a non-contiguous array is the source or destination. This breaks with the familiar way of working with NumPy and means extra work for the developer, who has to create contiguous buffers and copy data back and forth. Therefore, the goal of this work is to support the transfer of non-contiguous (NumPy) buffers directly in mpi4py, simplifying the work for application developers while still providing high performance. We look at various approaches, including DDTs, to best support this. This can further simplify the development of parallel scientific applications in Python, without the developer having to worry about implementation details.

Related work
This work covers various aspects of MPI and Python for HPC, so we divide this section accordingly.

Python and mpi4py for HPC
Multiple HPC Python frameworks, such as CuPy [6] or Numba [7], are compared with a multi-domain benchmark suite for NumPy in [8]. This is important because it shows the effect that frameworks can have on speeding up NumPy.
mpi4py is compared to Charm4Py in [9] by means of CPU and GPU microbenchmarks. mpi4py has a lower Python overhead for CPU-only communication, but a higher Python overhead for GPU-direct communication. In one benchmark, Charm4Py is faster with load balancing and overdecomposition enabled, but lacking full RDMA support, mpi4py performs better overall. Evaluating possible alternatives to mpi4py is important because it allows us to better understand where we stand in terms of performance optimization.
In [10], Python extensions for the OSU Micro-Benchmarks are used to evaluate the performance of MPI implementations on HPC systems using Python and mpi4py. The benchmarks show that the small Python overhead is most noticeable with small messages and that GPU-aware Python buffers perform better with CuPy or PyCUDA compared to Numba. We use this work in Sect. 5.1 to compare the performance of mpi4py and C-MPI on two HPC systems.
The Data-Centric Python framework (DaCe) [11] defines a flow-based graph representation that translates Python code into high-performance programs for CPUs, GPUs and FPGAs. DaCe can communicate strided data by automatically creating a Vector-DDT if the data is non-contiguous. Our DDT solution supports a wider range of non-contiguous memory layouts by using more powerful MPI derived datatypes like Subarray and Indexed-Block.

MPI derived datatypes
We are fundamentally limited by the DDT performance of MPI implementations, so the following articles are closely related to our work. Some show performance problems, while others show possible solutions. The work in [12] highlights performance portability issues for DDTs. It shows that DDT performance can vary by a factor of three between four MPI implementations (MPICH 3.2, OpenMPI 1.6.4 and 2.1.2, MVAPICH2 2.2 and 2.2b, Intel MPI 2017), also depending on the communication protocols and the chosen datatype. Similar results are presented in [13], which also compares four MPI implementations (OpenMPI 2.0.1, MVAPICH2 2.2, NEC MPI 1.3.1, IBM MPI based on MPICH2 1.5).
The Fast and Low-overhead Communication (FALCON) designs [14] use zero-copy mechanisms to get rid of the intra-node DDT overhead. When implemented in MVAPICH2, the communication latency was significantly reduced and the bandwidth improved in several application benchmarks, compared to common MPI implementations (MVAPICH2-X 2.3rc1, OpenMPI 3.1.2, Intel MPI 2018 and 2019).
TEMPI [15] improves DDT performance on GPUs by optimizing pack and unpack operations. It is compiled into a dynamic library using the system MPI header, and MPI applications can remain unmodified when the preloading mechanism is used. On the HPC system Summit with Spectrum MPI 10.3.1.2, TEMPI demonstrated an MPI_Pack speedup between 5.7 and 242,000 and an MPI_Send speedup of up to 59,000.
The work in [16] pursues the same goal; it eliminates the expensive driver overhead of GPU kernels in existing communication libraries by using a single optimized GPU kernel to directly load and store non-contiguous data between GPUs. Compared to common CUDA-aware MPI implementations, a speedup of up to 2.5 for PCIe and up to 4.7 for NVLink systems was shown. The proposed designs were added to MVAPICH2-GDR 2.3.2.
Experiments with non-contiguous Sends in [17] compare manual packing and DDTs in C. They show that the Vector- and Subarray-DDTs provide the same performance and that the One-sided Communication performance is low. We can confirm these results by measuring identical performance for Vector-, Subarray- and Indexed-Block-DDTs in a PingPong-Benchmark.
These works show that using DDTs in MPI may not always be the best solution for non-contiguous data transfer in Python. Therefore, in this work we evaluate different techniques to find a good and scalable solution.

Background
In this section, we will discuss some of the background necessary for this work.

MPI and mpi4py
MPI (Message Passing Interface) is the de facto programming standard for communication in distributed high-performance systems. There are several implementations of the standard, such as OpenMPI [18] or MPICH [19], which provide bindings for C and FORTRAN. A correct MPI program should run on different MPI implementations without modification. In our work, we use mpi4py [5] as a Python wrapper for MPI. Compared to C-MPI, the runtime performance of mpi4py is reduced due to the interpreter overhead, but the advantage of mpi4py is rapid application development: many advantages of Python can be used, making development generally faster and less prone to errors. A simple MPI program, shown in Listings 1 and 2, illustrates some differences between mpi4py and C-MPI. An important aspect is that mpi4py uses an object-oriented approach, which is an important feature since the C++ bindings have been removed from MPI. mpi4py does not need to pass the communicator as a function argument; instead, send and receive functions are methods of the communicator class (see lines 8 and 10 in Listing 1). Another difference between mpi4py and C-MPI is the automatic initialization and finalization of the MPI environment.
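A minimal sketch of such a program, assuming a two-process run, illustrates both points: communication routines are methods of the communicator object, and no explicit initialization or finalization calls are needed.

    from mpi4py import MPI  # MPI is initialized automatically on import
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    data = np.empty(8, dtype=np.float64)
    if rank == 0:
        data[:] = 42.0
        comm.Send(data, dest=1, tag=0)   # Send is a method of the communicator
    elif rank == 1:
        comm.Recv(data, source=0, tag=0)
    # MPI is finalized automatically at interpreter exit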

mpi4py provides upper- and lowercase communication routines. The former communicate (contiguous) memory buffers and the latter plain Python objects, using pickling and unpickling, where a Python object hierarchy is converted (and reconstructed) into a byte stream to allow data transfer. Non-contiguous memory buffers are supported, but require the developer to create MPI derived datatypes. The lowercase functions support many Python objects and can therefore be used to initialize and set up applications. The uppercase variants can only work with Python objects that provide raw memory buffers, like NumPy arrays. In an HPC environment, it is advisable to use uppercase methods exclusively in performance-critical areas, as they provide superior performance.
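A brief sketch of both variants (the transferred objects are chosen arbitrarily for illustration):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    # lowercase: pickles arbitrary Python objects (flexible, but slow)
    if comm.Get_rank() == 0:
        comm.send({"step": 1, "params": [0.1, 0.2]}, dest=1)
    else:
        config = comm.recv(source=0)

    # uppercase: communicates raw memory buffers directly (fast)
    arr = np.ones(1024, dtype=np.float64)
    if comm.Get_rank() == 0:
        comm.Send(arr, dest=1)
    else:
        comm.Recv(arr, source=0)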

MPI derived datatypes
Although we use the Python API of mpi4py, this section describes the C API of MPI, as it is consistent with the official documents of the MPI Standard [20] and is probably more familiar to most readers. We will point out the differences when necessary.
Besides the MPI basic datatypes, which are contiguous in memory and correspond to the datatypes of C and FORTRAN, MPI also supports more complex memory layouts by using derived datatypes (DDT) [20, p. 119]. MPI provides several types of DDTs that are constructed from the MPI basic datatypes. There are many functions to construct custom datatypes, but we will only discuss the ones shown in Fig. 1.
MPI_Type_contiguous is the simplest datatype constructor and replicates an existing datatype multiple times in a contiguous block of memory (Fig. 1a). It takes two input arguments: the number of elements (count) and the old datatype. For data layouts with even spacing, e.g. the column of a matrix stored in row-major order, a suitable DDT can be created with MPI_Type_vector. The vector datatype takes as input a block length, a stride and a count, as shown in Fig. 1b. For more complex layouts, indexed datatypes can be used, as shown in Fig. 1c. Such a datatype is created with MPI_Type_indexed and takes as input a list of block lengths and a list of displacements; the count parameter specifies the size of both lists. A special case is the MPI_Type_create_indexed_block constructor: for this datatype, all blocks have the same block length, so no list is needed, only a single block length parameter. Last but not least, for data that can hold different types, such as a C structure, you can use MPI_Type_create_struct, as shown in Fig. 1d [21, p. 3]. As shown later, our DDT design uses both the Subarray datatype, which is an extended version of the Vector datatype, and the Indexed-Block datatype.
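In mpi4py, these constructors are exposed as methods of the Datatype class. A minimal sketch, assuming an 8x8 row-major array of doubles:

    from mpi4py import MPI

    # Column of an 8x8 row-major double matrix: 8 blocks of one element,
    # with a stride of 8 elements between consecutive block starts
    column_t = MPI.DOUBLE.Create_vector(8, 1, 8)
    column_t.Commit()

    # The same column described as a subarray of the full 8x8 array
    subarray_t = MPI.DOUBLE.Create_subarray([8, 8], [8, 1], [0, 0])
    subarray_t.Commit()

    # Irregular layout: blocks of one element at explicit displacements
    indexed_t = MPI.DOUBLE.Create_indexed_block(1, [0, 3, 5, 6])
    indexed_t.Commit()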

NumPy array slicing
Standard Python data structures store objects without any memory contiguity. For example, a Python list is an array of pointers to objects, and the underlying list data does not form a contiguous block of memory, which limits access performance. For this reason, NumPy defines an array interface standard with contiguous buffers, which has become very popular in many high-performance Python libraries. These arrays can also be accessed using slices [8, p. 65]. In Python, the term slice describes an object that contains part of a sequence. A slice can be specified either as a slice object pylist[slice(0, 2, None)] or as colon-separated numbers pylist[0:2].
The numbers describe the start (0), stop (2) and step size (1) of a slice; default values are used for any missing argument. A Python slice like pylist[0:2] always returns a copy of the first two elements and can only address the first dimension of a nested sequence.
NumPy extends this concept to multiple dimensions by separating the slices with commas. Unlike Python, NumPy does not provide a copy of the data, but a view of the original data. Therefore, slices are often non-contiguous in memory. NumPy's slicing syntax makes it easy to describe portions of an array, and the term slice can refer to a single slice object as well as the view of an array it describes. The examples in Fig. 2 illustrate how this works. All examples are based on a two-dimensional NumPy array a with the shape (6, 6) and are self-explanatory.
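A small sketch in the spirit of Fig. 2 shows the view semantics and the resulting non-contiguity:

    import numpy as np

    a = np.arange(36, dtype=np.float64).reshape(6, 6)

    col = a[:, 0]                     # view of the first column, no copy is made
    print(col.base is a)              # True: col shares memory with a
    print(col.flags['C_CONTIGUOUS'])  # False: the elements are not adjacent
    print(col.strides)                # (48,): 48 bytes between consecutive elements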
Computing time can be drastically reduced if Python loops can be replaced by NumPy slices, thus implicitly enabling vectorization. However, a fast uppercase mpi4py communication call will fail if a non-contiguous view of the data is used as the communication buffer. Therefore, the data must be copied into a contiguous array before the transfer.
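A minimal sketch of this standard workaround (a two-process run is assumed):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    a = np.zeros((6, 6))

    # comm.Send(a[:, 0], dest=1)         # would fail: buffer is not contiguous
    tmp = np.ascontiguousarray(a[:, 0])  # explicit contiguous copy
    comm.Send(tmp, dest=1)               # the copy is transferred instead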

Designs
This section describes the implementation of non-contiguous array support for mpi4py. In preliminary experiments, five different designs were tested to find good candidates, three of which were based on One-sided Communication; one of them uses MPI's active target synchronization with Fence, Win-create and Put. Overall, the performance of One-sided Communication was low (see Appendix A), which matches the results shown in [17]. From these experiments, we learned that MPI derived datatypes (DDT) and NumPy copies are best suited for non-contiguous data transfers. Therefore, we will not examine the designs that use One-sided Communication in detail, but will focus on the designs that provide the best performance.
The following designs add support for non-contiguous data to almost all Point-to-Point communication routines, but collective routines like MPI_Bcast or MPI_Allreduce only work with the NumPy-Copy design. We also experimented with automatically decomposing a global domain into equal-sized blocks with MPI_Scatterv, and reassembling it with MPI_Gatherv, as needed by many parallel programs, but we could not integrate this into mpi4py in an easy-to-use way.

Overhead for contiguous transfers
To allow the transfer of non-contiguous data, each Send or Receive requires a check to see if the array is C-contiguous.This adds overhead even for contiguous data, which should not be neglected.
Throughout this section, we present various optimizations to reduce the overhead introduced by this check. In Sect. 6, we will discuss the overhead of the different designs.
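The check itself is cheap; a sketch of what it amounts to in NumPy terms (the function name is an assumption for illustration):

    import numpy as np

    def needs_staging(buf: np.ndarray) -> bool:
        # NumPy tracks contiguity in the array flags, so this check is O(1);
        # it is the per-call cost the designs below try to minimize
        return not buf.flags['C_CONTIGUOUS']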

MPI derived datatype design
Our first design is built around DDTs and uses MPI_Type_create_subarray or MPI_Type_create_indexed_block, depending on the layout of the non-contiguous data.
Figure 3 shows the DDT selection flow when a developer calls an mpi4py communication function. If the buffer is contiguous in memory, the code follows the normal call path until the MPI function is called. Since datatype creation is expensive, we have created a non-thread-safe cache to reuse datatypes. Therefore, we first check whether the needed datatype is stored in our cache. A cache lookup creates the tuple (shape, strides, dtype) as a Python dictionary key, as these attributes are sufficient to map a NumPy slice to an MPI datatype.
If the DDT is found in the cache, the corresponding MPI function is called using the cached datatype. Otherwise, a stride comparison of the NumPy slice and its base array decides which DDT to create. Each stride of a[4:, 4:] is contained in a.strides (48, 8), so a subarray datatype is created. If a single stride is not found in a.strides, as is the case for a[2::2, ::2].strides (96, 16), an indexed block datatype is created. Both examples are taken from Fig. 2.
For its creation, the subarray datatype needs an array of subsizes, which is built from the strides and shape attributes of the two arrays. The indexed block datatype needs a list of displacements, which is created with NumPy's as_strided function from an index array and the slice's shape and strides; the block length used is always 1.
The caching logic, which returns the tuple (buf, count, datatype), is shown in Listing 3.
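The following minimal sketch conveys the idea behind that listing; the helpers strides_match_base, create_subarray_ddt and create_indexed_ddt are hypothetical stand-ins for the selection logic of Fig. 3, and offset handling is omitted:

    _ddt_cache = {}  # not thread-safe, as noted above

    def get_cached_ddt(view):
        """Map a NumPy slice to (buf, count, datatype) for an MPI call."""
        key = (view.shape, view.strides, view.dtype.str)
        ddt = _ddt_cache.get(key)
        if ddt is None:
            if strides_match_base(view):          # hypothetical even-stride test
                ddt = create_subarray_ddt(view)   # wraps Create_subarray
            else:
                ddt = create_indexed_ddt(view)    # wraps Create_indexed_block
            ddt.Commit()
            _ddt_cache[key] = ddt
        return (view.base, 1, ddt)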

Fig. 3 DDT selection flow
The DDT design supports the most common NumPy slices, but lacks support for structured NumPy arrays. For these, mpi4py provides the utility module dtlib for manual datatype translation between NumPy and MPI datatypes.

NumPy-Copy design
For the second design, we use NumPy functions to copy non-contiguous data, as these functions are also highly optimized. To maximize the performance of our copy-based implementation, we first compare different NumPy functions to see which is the fastest way to copy a non-contiguous NumPy array into contiguous memory, since NumPy provides several ways to do this. We compare these methods in Fig. 4, which shows that the runtime differences are most noticeable when copying small arrays. The percentages above the bars show the slowdown compared to the fastest function.
In the diagram, a stride of 16 bytes corresponds to the notation data[::2], where the array data consists of 64-bit floating-point numbers. In this case, a view of data containing every second array element is passed to the examined NumPy functions, and the view's data consumes either 2 KiB or 256 KiB of memory.
Various tests with the Python module timeit have shown the lowest runtime for np.ndarray.copy, followed by np.ascontiguousarray. The other functions are slower because of various overheads in NumPy, like the processing of internal function calls and function arguments or the support for custom array containers.
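A measurement of this kind can be reproduced with a few lines (array size and iteration count are chosen arbitrarily here):

    import timeit
    import numpy as np

    data = np.empty(512, dtype=np.float64)  # 4 KiB array
    view = data[::2]                        # 2 KiB non-contiguous view

    for label, fn in [
        ("ndarray.copy      ", lambda: view.copy()),
        ("ascontiguousarray ", lambda: np.ascontiguousarray(view)),
        ("np.array          ", lambda: np.array(view)),
    ]:
        print(label, timeit.timeit(fn, number=100_000))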
A prior manual check of whether the data is contiguous takes very little time compared to running np.ascontiguousarray in every case. For each non-contiguous MPI_Send, this function is called once, and for each non-contiguous MPI_Recv, we create an empty array by calling numpy.empty_like(data). After the data has been received into the empty buffer, the original buffer is populated with the newly received data.
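Sketched for the receive side (assuming a non-contiguous destination view dst and a two-process run):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    a = np.zeros((6, 6))
    dst = a[:, 0]                 # non-contiguous destination view

    tmp = np.empty_like(dst)      # contiguous scratch buffer of the same shape
    comm.Recv(tmp, source=0)      # receive into the contiguous buffer
    dst[...] = tmp                # copy the received data back into the view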
To reduce the overhead of the non-contiguous check, we provide two NumPy-Copy designs that put the non-contiguous data handling in different places. One design is called numpy-copy-ifelse because it uses the default if-else contiguity check. The other design is called numpy-copy and does not need this check; instead, it uses an exception handler to handle non-contiguous data. This works because the low-level Python function PyObject_GetBuffer throws an exception when called with a non-contiguous buffer object. This has the advantage that the if-else check is not required for every communication call, and it is the only difference between the two designs. The code behind # non-contiguous handling (lines 5 and 15) uses the argument SRN to handle not only mpi4py Send and Recv calls, but also persistent Send_init calls. The self argument refers to an mpi4py-internal message object of type _p_msg_p2p and is used to store message buffers.
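A minimal sketch of the exception-based idea, written here as a wrapper outside mpi4py for illustration (the real handling lives inside mpi4py's message code; the function name is an assumption):

    import numpy as np
    from mpi4py import MPI

    def send_numpy_copy(comm, buf, dest, tag=0):
        try:
            # Fast path: contiguous buffers are passed through unchanged
            comm.Send(buf, dest=dest, tag=tag)
        except BufferError:
            # Slow path: the buffer request failed for the non-contiguous
            # view, so stage it through a contiguous copy
            comm.Send(np.ascontiguousarray(buf), dest=dest, tag=tag)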

Optimized persistent design
In our first experiments, we learned that it is advantageous to manually copy data into contiguous send/recv buffers that can be reused, because this avoids reallocating memory. We will also show this in the results section. We tried several approaches to add direct buffer reuse to mpi4py, but found that it is difficult to add a mechanism that is safe, not just reliable for certain cases, and still does not add too much overhead to the application. So we looked for the next best way to allow buffer reuse within MPI.
MPI persistent communication reduces the communication overhead between the MPI process and the MPI communication controller by using separate routines for initialization and communication [20, p. 94]. The initialization creates an MPI request, which can be used for several successive communication operations with the same parameters (send/recv buffer, receiver/transmitter, datatype etc.). In mpi4py, this object is of type Prequest.
This design combines persistent communication with the non-contiguous handling of the numpy-copy design. From a developer's perspective, it is used like normal persistent communication. We use the Prequest object to store the contiguous buffer, which allows the buffer to be reused.
The disadvantage of this implementation is that it requires changes to applications, since classical communication must be converted to persistent communication; the other designs do not require this and the code can be used unchanged.
However, the advantage is that the developer does not have to worry about whether the communication buffer is contiguous, and the allocation overhead is paid only once.
Since persistent communication is not commonly used, we illustrate its use in the following non-contiguous PingPong example.
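A sketch of such a PingPong, assuming our optimized-persistent design so that the non-contiguous column view can be passed to Send_init and Recv_init directly (plain mpi4py would reject it):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    a = np.zeros((64, 64), dtype=np.float64)
    col = a[:, 0]                    # non-contiguous column view

    # One-time setup: persistent requests with fixed parameters
    peer = 1 - rank
    send_req = comm.Send_init(col, dest=peer, tag=rank)
    recv_req = comm.Recv_init(col, source=peer, tag=peer)

    for _ in range(1000):
        if rank == 0:
            send_req.Start(); send_req.Wait()
            recv_req.Start(); recv_req.Wait()
        else:
            recv_req.Start(); recv_req.Wait()
            send_req.Start(); send_req.Wait()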


Experimental setup
In general, the performance of a given DDT is not portable between different MPI implementations [12, p. 1]. Therefore, we run the benchmarks against multiple MPI implementations on two HPC systems, as shown in Table 1. However, our goal is to compare our designs in different environments, not to compare the performance of different MPI implementations.

OSU Micro-Benchmarks
The OSU Micro-Benchmarks are a popular tool for comparing different MPI implementations, and Python support has recently been added [10]. While an MPI comparison is not the goal of this work, it is helpful to know the performance implications of using mpi4py instead of C-MPI. The benchmark results for JUSUF and CLAIX are shown in Fig. 5. We only show the results for message sizes up to 64 KiB, as the overhead becomes less significant for larger message sizes. The call overhead introduced by mpi4py compared to C-MPI is less than 1 µs for message sizes smaller than 32 KiB, regardless of the computer system or MPI implementation. However, you should still keep in mind that on some systems this overhead is larger than the actual transfer time with MPI, which illustrates some of the problems that still exist with Python. The overhead for larger messages varies quite a bit, and there is no general trend.

PingPong-Benchmark
A PingPong-Benchmark is the simplest possible benchmark and measures communication latency. To estimate the overhead of our approach, we run the benchmark for both contiguous and non-contiguous data.
For contiguous data, we send the first row of a two-dimensional (n, n) array back and forth. For non-contiguous data, process 0 sends the first column of that array and process 1 receives it into the first column of its own array, and vice versa. We compare our designs with a version that uses manual copying outside of MPI, with and without pre-allocated buffers.
The following code examples show the difference between our design and the benchmark variant that uses the traditional method of first copying the data to a contiguous buffer. Our design allows for a simplification of the code.
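A condensed sketch of the two variants (array size and process layout are assumptions; the direct-slice variant presumes one of our designs):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    a = np.zeros((1024, 1024))

    # Traditional method: stage the column through a contiguous buffer
    tmp = np.ascontiguousarray(a[:, 0])
    comm.Send(tmp, dest=1)
    comm.Recv(tmp, source=1)
    a[:, 0] = tmp

    # Our designs: the non-contiguous view is communicated directly
    comm.Send(a[:, 0], dest=1)
    comm.Recv(a[:, 0], source=1)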
To get statistically robust benchmark results, the PingPong-Benchmark is designed similarly to the experiments in [13]. Those experiments show that it is sufficient to use a single basic datatype like MPI_DOUBLE, because the communication performance does not depend on the basic datatype. We execute mpirun ten times, and the number of iterations is based on the message size to speed up benchmarking. The results are median values, because this way outliers have less effect [13, pp. 101-103].
All designs use a single PingPong for warmup. For the mpi-ddt design, this excludes the time needed for DDT creation from the measurement, because the datatype cache is already filled when the measurement starts.

Stencil-Benchmark
Compared to the PingPong-Benchmark, the Stencil-Benchmark is closer to real-world applications. In many parallel simulation schemes, such as iterative stencil loops, scientists need to add extra cells to the borders of process-local arrays to get correct results. All processes communicate these border cells to their neighbor processes at runtime.
As an example, we use a 3D Stencil with three-dimensional domain decomposition. First, a simulation array of size (64, 64, 64) is created and divided according to the number of processes using MPI_Dims_create and MPI_Cart_create. It has no significant effect whether the MPI process topology is periodic or non-periodic, so we choose a non-periodic one.
The non-blocking routines MPI_Isend and MPI_Irecv in conjunction with MPI_Waitall are used to exchange border cells to avoid deadlocks.
The process-local array has six borders, but only two of them are contiguous in memory. Each border is selected using NumPy slices and sent individually to the corresponding neighbor; the transferred messages are very similar to the PingPong messages. For the manual data transfer, we add an optimized version of manual-copy, where pre-allocated buffers store the contiguous copies. Again, we run mpirun ten times, with each run collecting 50 runtimes of 20 iterations of a Seven-point Stencil and writing the median value to a file. The default MPI process placement is used for this benchmark.
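A condensed sketch of one axis of this exchange, assuming our designs so that the non-contiguous border slices can be passed to Isend/Irecv directly (missing neighbors are MPI.PROC_NULL, which MPI treats as a no-op; the local array size is simplified to a cube):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    dims = MPI.Compute_dims(comm.Get_size(), 3)         # MPI_Dims_create
    cart = comm.Create_cart(dims, periods=[False] * 3)  # MPI_Cart_create

    n = 64 // dims[0] + 2                               # +2 for border cells
    local = np.zeros((n, n, n), dtype=np.float64)

    lo, hi = cart.Shift(1, 1)  # (lower, upper) neighbor ranks along axis 1
    reqs = [
        cart.Isend(local[1:-1, 1, 1:-1], dest=lo, tag=0),     # lower inner plane
        cart.Irecv(local[1:-1, -1, 1:-1], source=hi, tag=0),  # upper border cells
        cart.Isend(local[1:-1, -2, 1:-1], dest=hi, tag=1),    # upper inner plane
        cart.Irecv(local[1:-1, 0, 1:-1], source=lo, tag=1),   # lower border cells
    ]
    MPI.Request.Waitall(reqs)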

Lattice-Boltzmann-Benchmark
A paper published in 2019 presents a parallel Lattice-Boltzmann implementation using mpi4py [22], and its code is used, for example, in petrochemical research [23]. The authors provide several variants of the lid-driven cavity problem (a square with three rigid sides and a tangentially moving top side), which is a basic 2D validation example for fluid dynamics solvers.
The opt1 code uses Python only, and the opt2 code uses a faster C++ collision kernel. By default, the communication uses MPI_Sendrecv along with the usual manual-copy approach without pre-allocated buffers. We replace the communication function with one that does not require explicit copy calls, and we also add another variant that uses persistent communication.
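Sketched with assumed array and neighbor names (the real code in [22] differs in detail; the direct-view Sendrecv presumes one of our designs):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    left, right = (rank - 1) % size, (rank + 1) % size
    f = np.zeros((9, 300, 300))  # distribution functions, directions first

    # Default (manual-copy): stage the non-contiguous boundary column
    sendbuf = np.ascontiguousarray(f[:, :, 1])
    recvbuf = np.empty_like(sendbuf)
    comm.Sendrecv(sendbuf, dest=left, recvbuf=recvbuf, source=right)
    f[:, :, -1] = recvbuf

    # Replacement: pass the non-contiguous views directly
    comm.Sendrecv(f[:, :, 1], dest=left, recvbuf=f[:, :, -1], source=right)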
The benchmark is not run with manual-copy-prealloc, as this would require an extensive refactoring of the Lattice Boltzmann code, compromising comparability with other designs.The default MPI process placement is used for this benchmark.

Results
The diagrams in this section show results for JUSUF and CLAIX side by side for better readability, but any comparison of the two systems or their MPI implementations may lead to false conclusions. Instead, we focus on the performance of the different designs.
The manual-copy solution serves as a reference because it is the standard way to transfer non-contiguous data. OpenMPI 4.1.1 does not work correctly on CLAIX, so we cannot use it.
The first finding from the PingPong-Benchmark is that the lowercase mpi4py methods are about 30 µs slower on all MPI implementations and on both systems, regardless of whether the data is contiguous or not. Offsetting the non-contiguous data, such as sending the second or third column, has no effect, and it makes no significant difference whether PingPong runs on one or two compute nodes or uses a 2D or 3D array.
For contiguous data, Fig. 6 shows the 2D results. The fastest solution uses MPI persistent communication. It is called optimized-persistent because it uses the non-contiguous handling of numpy-copy, which means that the developer does not have to distinguish between non-contiguous and contiguous communication. However, as mentioned above, the developer has to set up the persistent communication manually, i.e. each Send is initialized with MPI_Send_init, and for each PingPong, MPI_Start and MPI_Wait are called.
The numpy-copy design is very close to the performance of manual-copy because no exception is thrown for contiguous data, while the numpy-copy-ifelse and mpi-ddt designs are slower because both require a non-contiguous check, which adds overhead. We also vary the border width from one to four elements; for contiguous data, there is nothing interesting to see. This is different for non-contiguous borders: for larger borders, the mpi-ddt design improves on CLAIX, as shown in Fig. 7.
We skip the absolute latency results here for the sake of clarity.The numpy-copy design is slower than numpy-copy-ifelse because Python's exception handling adds some overhead.
Both manual-copy and manual-copy-prealloc perform the same, so using pre-allocated buffers has no effect in this case. The worse performance of the DDT design with OpenMPI on JUSUF is related to the performance of MPI derived datatypes in OpenMPI, as all other designs show performance similar to the other tests. Using another MPI implementation on the same machine shows better performance in most cases. Therefore, we will skip OpenMPI in the next benchmarks.
The PingPong-Benchmark shows that basically all non-contiguous solutions are faster than manually copying the data, especially for small to medium-sized messages. To gain insight into mixed contiguous and non-contiguous data transfers, we need to look at the Stencil-Benchmark. Figure 8 shows the 3D Stencil results, which again show that optimized-persistent is about 30% faster than manual-copy. In addition, we see the benefit of using pre-allocated arrays in a manual design, which achieves a 10% performance gain compared to the normal manual version. In the Stencil-Benchmark, numpy-copy-ifelse is faster than numpy-copy because there are many non-contiguous transfers that hit the slow exception handler.
The MPI process division generated by MPI_Dims_create prefers to assign more processes to the first dimension, where contiguous data is transferred. One might assume that this benefits numpy-copy, and indeed it does, but overall the exception handler has a higher runtime cost for the non-contiguous transfers than the if-statement has across all transfers.
On CLAIX, the performance of manual-copy appears to degrade as the number of compute nodes increases, while on JUSUF it scales more linearly, comparable to the non-contiguous designs. Overall, our design simplifies the transfer of non-contiguous data and can provide a performance improvement over a naive implementation without pre-allocated buffers. However, the performance of the transparent design falls short of a hand-optimized version that uses pre-allocated buffers.
The Lattice-Boltzmann-Benchmark uses only non-contiguous transfers because the x and y coordinates are stored in the second and third dimensions of a three-dimensional array. The first dimension stores the direction of a fluid particle and is not involved in the communication step.
The performance results are shown in Fig. 9 (Lattice-Boltzmann results for 10 steps, array size (300, 300) and type MPI_DOUBLE; each compute node uses all of its processor cores). We observe that numpy-copy-ifelse performs closest to manual-copy. All designs that allow transparent transfer of non-contiguous data are no more than 5% slower. The performance of numpy-copy is poor because each message is non-contiguous and the exception handler is hit every time a data transfer is initiated.
The Stencil- and Lattice-Boltzmann-Benchmarks use small messages, roughly equivalent to PingPong with 8 or 16 elements per message. In the PingPong results, mpi-ddt is slower than numpy-copy-ifelse, which matches Lattice-Boltzmann. However, in the Stencil-Benchmark, mpi-ddt is faster than numpy-copy-ifelse. One reason is that Lattice-Boltzmann uses MPI_Sendrecv while Stencil uses the non-blocking MPI_Isend and MPI_Irecv for communication, which allows better use of our optimization. For this reason, the persistent design also shows the best performance, as it replaces the blocking Sendrecv call with non-blocking persistent communication calls.

Conclusion
In a Python HPC context, our work simplifies source code by eliminating the need for manual copying, and the developer does not have to worry about whether the data is contiguous or not.For example, each dimension of an iterative stencil loop can be sent and received in the same way.
We have shown that the simplification is not "bought" with any significant performance loss; the Stencil-Benchmark even showed a performance gain of a few percent. By using persistent communication, the performance can be improved in all use cases.
At this time, we recommend trying mpi-ddt and numpy-copy-ifelse, which are best suited for automatic non-contiguous data transfer. Performance is highly dependent on the MPI implementation and its support for derived datatypes. For maximum performance with mpi4py, we recommend MPI persistent communication or manual copying with pre-allocated buffers. Compared to normal persistent communication, our optimized-persistent solution automatically copies non-contiguous data into a contiguous buffer.

Fig. 4 Performance of different NumPy functions on JUSUF

Fig. 6 Contiguous PingPong sends the first row of the local array within a single compute node (MPI_DOUBLE)

Table 1 Hardware and software used in the experiments