1 Introduction

While current supercomputers provide hundreds of PFLOP/s [1], the speed of I/O has not grown as much. Furthermore, applications often do not fully exploit the parallelism of parallel file systems. Especially for voxel-based simulations on regular grids, the data required for checkpointing grows proportionally with the domain size. Hence, running large-scale simulations often implies large amounts of data which need to be written to disk, necessitating high-performance I/O for current and future HPC applications. Such applications find use in many research areas, e.g., engineering, weather and climate research or materials science, as they allow deeper insight into the physical mechanisms and provide predictions for future processes and the design of process chains. Past HPC optimization efforts have focused primarily on computation and communication, whereas I/O has tended to be neglected. Considering all this, investigating and improving I/O for real applications is of great importance. The objective of this paper is to give the end-user as well as the application developer heuristics to tune the striping on Lustre file systems such that high-performance I/O is achieved. The paper first recapitulates the state of the art in parallel I/O and then describes the software and hardware used for the performance measurements. Experiments are then conducted in which the time taken for a parallel write call to finish is measured, investigating the parameters of block size, processor count and striping configuration. Based on the analysis of the experimental results, heuristics employing the varied parameters are determined with which the write rate is substantially improved, by up to a factor of 32. Finally, it is shown that the heuristics are transferable between similar I/O strategies as well as between similar Lustre setups.

2 State of the art in parallel I/O

The I/O stack of a parallel application is shown in Fig. 1. At the top of the parallel I/O stack, an application solves a numerical problem on a multi-dimensional discrete grid. At each grid point, a number of quantities such as velocity, concentrations, temperature and order parameters may be stored. This data structure matches the numerical model and the data structures of the code, not the layout of the file or the hardware on which the data is actually stored.

High-level I/O libraries such as HDF5 or PnetCDF may be employed to facilitate data exchange with other scientists. Below this layer, MPI I/O provides the portable middleware for parallel file I/O. MPI I/O is part of the MPI-2 standard [2] and was introduced in 1997. The I/O forwarding layer bridges the gap between the application and the file system and may aggregate the I/O. Parallel file systems present a single, unified high-performance storage space while abstracting away the many storage devices and servers behind it. While there are many different parallel file systems, essentially only Lustre and GPFS/IBM Spectrum Scale are used in the top 100 supercomputers [1]. The focus of this paper is on Lustre, as it allows the end-user to manipulate the striping of files, an option GPFS does not offer and which therefore does not need to be investigated there. At the very bottom of the I/O stack is the actual hardware, classical spinning hard disk drives (HDDs) or solid-state drives (SSDs), on which the user data is actually stored [3, 4].

Fig. 1

I/O stack of a parallel application. The higher levels are designed to provide the developer with an interface to implement I/O, while the lower levels are designed to maintain the access to the hardware [4]

Almost all of the above-mentioned layers have adjustable parameters to enhance I/O performance. In this paper, we are primarily concerned with the parameters available to the user. Similar questions were investigated in [5,6,7]. In [5], experiments were conducted on the Texas Advanced Computing Center’s Stampede system, which employs Lustre. The authors investigated the influence of the stripe count, the number of aggregators and the stripe size for a fixed number of processes and a fixed file size. They found that if the stripe count and the number of aggregators are not chosen appropriately together, performance drops abruptly. This problem is avoided by choosing the number of aggregators greater than or equal to the stripe count. No significant influence of the stripe size was found in [5] for the chosen process count and file size. Based on these findings, the authors implemented a parallel I/O library called “TACC’s Terrific Tool for Parallel I/O” to improve I/O performance. In [6], the authors investigated the performance of the HDF5 and NetCDF-4 parallel I/O libraries on a Lustre system. Among other things, they found that, for both libraries, the highest I/O performance was reached at transfer sizes equal to or larger than the stripe size. The authors of [7] conducted experiments on various systems with several I/O benchmarks. The experiments form the basis of a performance model employed in an autotuning framework to optimize I/O performance. Parameters include the file size, stripe size, stripe count and number of aggregators for a fixed process count.

2.1 I/O access patterns

The I/O access pattern determines the achievable performance of a parallel I/O application. Thakur et al. [8] defined various I/O access patterns and provided a classification of the different ways to implement I/O with MPI I/O. They classify the access patterns into four levels, level 0 to level 3, and explain why users should implement level 3 MPI I/O access patterns for performance reasons. In the following, we present this classification of I/O access patterns and discuss the advantages and disadvantages of three typical I/O access patterns for parallel applications.

2.1.1 Classification of I/O patterns

Based on the work of [8], the four levels of parallel I/O are recapitulated. In level 0, the application uses Unix-style I/O: each process performs independent I/O and, to write a local array to disk, issues an independent write for each row of the local array. Level 1 uses collective I/O functions: all processes write to a shared file using a collective call, but without knowledge of what the other processes do. In level 2, to describe the non-contiguous file access pattern, each process creates an MPI-derived data type, defines a file view and performs independent write calls. In level 3, as in level 2, MPI-derived data types are used to describe the non-contiguous access pattern and a file view is defined, but a collective write call is used to perform the I/O.
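To illustrate the level-3 pattern, the following minimal C sketch (an illustration, not code from the paper; the function name and the 2D block decomposition are assumptions) describes a rank's non-contiguous region of a global array with a derived datatype, sets a file view and issues a single collective write:

```c
#include <mpi.h>

/* Minimal sketch of a level-3 access: each rank writes its block of a
 * global 2D array with one collective call (illustrative example only). */
void write_level3(MPI_Comm comm, const char *path, const double *local,
                  int gsizes[2], int lsizes[2], int starts[2])
{
    MPI_Datatype filetype;
    MPI_File fh;

    /* Derived datatype describing this rank's (non-contiguous) region of
     * the global array as it is laid out in the file. */
    MPI_Type_create_subarray(2, gsizes, lsizes, starts, MPI_ORDER_C,
                             MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* File view: the non-contiguous access pattern is made known to MPI I/O. */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Level 3: a single collective call writes the whole local block. */
    MPI_File_write_all(fh, local, lsizes[0] * lsizes[1], MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}
```

In contrast, a level-0 implementation would issue one independent write per row and rank, giving the MPI library no opportunity to optimize the access.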

2.1.2 Mapping data onto the file system

There are mainly three methods of mapping file I/O calls onto a file system in parallel programs. In the first method, each MPI process creates and writes to a separate file; this is the one-file-per-process method or N:N model, as depicted in Fig. 2a. The implementation of this method is simple because no MPI communication is needed. The drawback is that a large number of files is hard to manage [3]. Furthermore, the method does not scale to a large number of MPI processes, as the number of metadata operations for file creation becomes a bottleneck and the number of simultaneous disk accesses creates contention for file system resources. Recently, newer parallel file systems, e.g., GekkoFS [9] or DAOS [10], have been developed to alleviate these problems, but they have not yet found widespread adoption. In addition, reading the data back into the application is only simple for the same number of processes; reading it back with a different number of processes is complicated and error-prone [3].

The second method is the so-called spokesperson model or 1:1 model. One MPI process receives all data from the other MPI processes and writes it to one file in the file system, as shown in Fig. 2b. This approach leads to poor performance for a large number of MPI processes because of the congestion caused by the all-to-one communication pattern. A second limiting factor is the memory space available for one process to handle the data of all other MPI processes [3]. Even worse, one process cannot saturate the available bandwidth of a parallel file system. Because of these drawbacks, the spokesperson pattern does not scale, i.e., the time increases linearly with the amount of data and increases with the number of MPI processes. In order to alleviate some of the performance issues, it is possible to define groups of processes which aggregate their data to one process (master) within each group. Each group master writes a separate file, which leads to an N:M model with \(M<N\), as shown in Fig. 2c. This approach increases I/O performance relative to the 1:1 model, but inherits problems of the N:N model, such as the increasing number of metadata operations and a more complex implementation.

Parallel I/O to a shared file performed by all MPI processes, as depicted in Fig. 2d, can overcome the limitations of the one-file-per-process model and the spokesperson model. In the single shared file (SSF) model (N:1 model), one file is opened by all MPI processes and each MPI process performs I/O to a unique portion of the single file. The file may be physically distributed among disks but appears to the program as a single logical file. With sufficient I/O hardware, a parallel file system and an efficient MPI I/O implementation, this model scales to a large number of MPI processes [3]. Figure 2e shows a modification of the SSF model in which only a subgroup of MPI processes performs the I/O to the SSF.
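The simplest realization of the SSF (N:1) model lets each rank write its data at a rank-dependent offset of the shared file. The following sketch (illustrative; the fixed block size per rank is an assumption) shows this with a collective call, so that two-phase I/O can be applied:

```c
#include <mpi.h>

/* Minimal sketch of the N:1 single-shared-file model: every rank writes
 * block_bytes at a rank-dependent offset of one shared file. */
void write_ssf(MPI_Comm comm, const char *path,
               const char *block, MPI_Offset block_bytes)
{
    int rank;
    MPI_File fh;

    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Each rank owns a unique, contiguous portion of the single file. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block_bytes,
                          block, (int)block_bytes, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```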

Fig. 2

Different I/O access patterns for I/O in MPI-parallel applications [3]

Besides mapping the data of a checkpoint onto files, there is another degree of freedom in how to store time series. This can be done with a one-file-per-checkpoint approach, e.g., VTK, or with many checkpoints per file, as employed in the proprietary format used by Pace3D [11].

2.2 Two-phase MPI I/O

In collective I/O, all MPI processes, or a subgroup of processes, perform the I/O operations in one collective MPI call, which provides a comprehensive view of the data movement across all processes. With the SSF model, shown in Fig. 2d, e, collective I/O is possible. A very successful process collaboration strategy is two-phase I/O. It was first proposed by [12] and is used in the MPI I/O implementation ROMIO [8]. In the first phase, called the request aggregation phase or communication phase, a subset of MPI processes is selected as I/O aggregators. The file is divided into non-overlapping sections, called file domains, and each file domain is assigned to an aggregator. The non-aggregators send the data they want to write, or their read requests, to the aggregators. In the second phase, called the file access phase or I/O phase, the aggregators perform the write or read operations on the file system [3, 8].

Figure 3 shows an example of how two-phase I/O works and how it improves I/O performance. A \(5\times 8\) two-dimensional array is partitioned in a block–block pattern among four MPI processes. As each local array is non-contiguous in the file, many small, non-contiguous file operations would be necessary if every MPI process wrote its portion of the file independently. A collective I/O operation circumvents this by collecting the data from all MPI processes on a smaller number of aggregators, which write contiguous blocks to the file, thus improving I/O performance.
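ROMIO exposes collective buffering through standard MPI I/O hints. The following sketch (illustrative; the hint values are examples, and on the Cray systems used here the number of aggregators is instead derived from the stripe count as described in Sect. 3.2) shows how collective buffering and the number of aggregators can be requested:

```c
#include <mpi.h>

/* Sketch: influencing two-phase I/O via ROMIO hints. Whether and how the
 * hints are honored depends on the MPI implementation. */
MPI_File open_with_cb_hints(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable"); /* force collective buffering */
    MPI_Info_set(info, "cb_nodes", "8");            /* request 8 aggregators      */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
```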

Fig. 3

A collective file operation with a two-phase I/O for a \(5\,\times \,8\) two-dimensional array, distributed among four MPI processes in a block–block pattern and the data layout in the file. The MPI processes 0 and 2 are selected as I/O aggregators [3]

3 Experimental environment

In this section, we will describe the hardware and the software environment on which all experiments were performed.

3.1 Cray XC40 Hazel Hen

All experiments are performed on the Cray XC40 supercomputer (Hazel Hen) at the High-Performance Computing Center Stuttgart (HLRS). The system consists of \(7\,712\) compute nodes, each with two sockets and \(128\hbox { GiB}\) of main memory. The nodes are connected via the high-performance Cray Aries interconnect. Each socket is equipped with a 12-core Intel® Xeon® E5-2680V3 (Haswell) processor with a base frequency of \(2.5\,\hbox {GHz}\), so each node has 24 cores. This results in a homogeneous massively parallel system with \(185\,088\) compute cores and approximately \(987\,\hbox {TiB}\) of main memory. The Cray XC40 at HLRS has a peak performance of \(7.4\,\hbox {PFLOP}/\hbox {s}\) and achieves \(5.64\,\hbox {PFLOP}/\hbox {s}\) in the LINPACK benchmark. In the HPCG benchmark, the system at the HLRS reached \(0.138\,\hbox {PFLOP}/\hbox {s}\). The HPC system is connected to three Lustre file systems with a capacity of about \(13.5\,\hbox {PiB}\) in total. The technical details of the file system on which the experiments were performed are presented in the next section. The employed MPI I/O implementation is the Cray MPI-I/O, which is based on the MPICH ROMIO implementation [8].

3.2 The Lustre file system at HLRS

Fig. 4

Configuration of the Lustre file system ws9 at the HLRS

The Cray XC40 supercomputer at the HLRS is connected to three Lustre file systems. In this section, we only present the technical details of the file system ws9, which we used for the experiments, and explain the basic options to adapt the Lustre file system settings to the I/O pattern of the application. ws9 is a Lustre-based file system, specifically a Cray Sonexion 2000 Data Storage system. The schema of the file system configuration and its connection to the Cray XC40 are shown in Fig. 4. Each compute node runs a Lustre client in order to access the Lustre file system.

The compute nodes are connected via the high-performance Cray Aries interconnect with a maximum bandwidth of \(8\,\hbox {GiB}/\hbox {s}\) for node-to-node communication. Hazel Hen has several service nodes, which connect to both the Aries high performance network (HPN) and to external networks. Fifty-four of these service nodes are used as Lustre routers (LNET routers). The LNET routers are connected to an InfiniBand switch, as are the 54 Object Storage Servers (OSS). Each OSS has one Object Storage Target (OST) with a capacity of about \(169\,\hbox {TiB}\). The connections between the LNET routers and the InfiniBand switch, as well as between the switch and the Object Storage Servers, are based on an FDR InfiniBand network with a bandwidth of \(14\,\hbox {GiB}/\hbox {s}\). The number of OSTs sets a limit on the achievable parallel I/O throughput, which would ideally be the throughput of one OST times the number of OSTs.

The data of the files are stored on the OSTs. Each OST consists of 41 HDDs organized as a GridRaid, a technique for distributing data across multiple HDDs. The OSS provides file I/O services and network request handling for its OSTs [13]. Two OSS and two OSTs together form one Scalable Storage Unit (SSU). For failover inside an SSU, each OSS is connected via Serial Attached SCSI (SAS) to both OSTs. If one of the OSS in an SSU fails, the other OSS takes control of both OSTs. One SSU has a throughput of \(7.5\,\hbox {GB}/\hbox {s}\) for read and write, respectively, when using the IOR (Interleaved or Random) benchmark; correspondingly, one OST has a bandwidth of \(3.75\,\hbox {GB}/\hbox {s}\) [14]. As there are 54 OSTs available on ws9, the maximum available I/O bandwidth is 202.5 GB/s. The GridRaid is used to reduce the rebuild time in case of a drive failure and to improve the performance when running in a degraded state [14].

To obtain parallelism, and thus I/O performance, a file is distributed among a number of OSTs, which is called file striping. Figure 5 shows the principal idea of file striping. The user can define, for directories or for a single file, the size of the pieces into which the file is divided (stripe size) and the number of OSTs across which the file is distributed (stripe count). The stripe size and the stripe count are the basic options for the user to adapt the file system settings to the I/O pattern of their application and improve the I/O performance. The stripe count can be varied from 1 to the maximum number of OSTs (54), whereas the stripe size can be varied between 64 KB and 4 GB in increments of 64 KB [13]. The number of aggregator nodes during two-phase I/O is automatically set to \(\texttt {stripe count} \times \texttt {multiplier}\) by the Cray MPI-I/O library.
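Striping is typically set on a directory with the lfs setstripe command before the file is created; alternatively, the reserved MPI I/O hints striping_factor and striping_unit can be passed when creating the file. A minimal sketch (illustrative; the chosen values are examples, and whether the hints are honored depends on the file system and MPI implementation):

```c
#include <mpi.h>

/* Sketch: requesting a Lustre striping for a newly created file via the
 * reserved MPI I/O hints (roughly equivalent to "lfs setstripe -c 16 -S 4m"
 * on the target directory; honoring the hints is implementation-defined). */
MPI_File create_striped_file(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");    /* stripe count: 16 OSTs */
    MPI_Info_set(info, "striping_unit", "4194304"); /* stripe size: 4 MiB    */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
```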

Fig. 5

A file is automatically divided into pieces and distributed among a number of OSTs (file striping). Users can adapt the size of the pieces (stripe size) and the number of OSTs on which the file is being distributed (stripe count) [13]

The metadata of the files are stored in seven Metadata Targets (MDT), and seven Metadata Servers (MDS) provide access for the Lustre clients to the metadata.

4 Methods

In this section, the HPC application Pace3D and its I/O method are described. Following this, a small test application mimicking the I/O method of Pace3D is described. Finally, the experiments and measurements are detailed.

4.1 The Pace3D framework

The massively parallel Pace3D framework (“Parallel Algorithms for Crystal Evolution in 3D”) [11] has been developed to study the microstructure evolution in different materials. The simulation framework is based on the phase-field method [15] and contains multi-physical coupling to fields such as temperature, concentration or stresses to include their effect on the microstructure evolution. The solver contains modules for diffuse interface approaches (Allen–Cahn, Cahn–Hilliard) [15], grain growth [16,17,18], grain coarsening [19, 20], sintering [21, 22], solidification [23, 24], mass and heat transport, fluid flow (Lattice–Boltzmann, Navier–Stokes) [25], mechanical deformation (elasticity, plasticity) [26,27,28], magnetism [29], electrochemistry [30] and wetting [31, 32]. The equations are discretized in space with a finite difference scheme and in time with different time integration methods such as explicit Euler schemes and implicit Euler via conjugate gradient methods [22, 33]. To efficiently compute the complex evolution equations, the solver is parallelized using domain decomposition based on MPI. Selected models are also manually vectorized and achieve up to \(32.5\,\%\) of single-core peak performance on the Hazel Hen, as shown in [22]. Explicit kernels within the solver scale up to 98304 cores with 97% parallel efficiency [22]. In [22], early results of the present paper were used to massively improve the I/O performance for large-scale simulations.

The framework utilizes several proprietary data formats to store the 3D voxel-based data. The principal structure is shown in Fig. 6: Each file has a header containing general data, and each checkpoint (henceforth called frame) consists of frame-specific metadata, the data frame itself and an additional block of frame-specific metadata. The data frame contains the data for each voxel, which may be one of the following: a single-precision floating point value (4 bytes), a single-precision floating point vector (12 bytes) or a block of phase-field values (64 bytes). For all kinds of data, the write process for each file works as follows: The first part of the metadata is written by a single process. Afterwards, each process converts its double-precision data to single-precision data and writes it into an output buffer. During this conversion, further field- and frame-specific data are also determined. Whenever the output buffer is filled entirely during this process, it is flushed to disk via MPI_File_write_all. The output buffer is scaled by the local amount of data, ensuring that MPI_File_write_all is called the same number of times from each MPI rank. Once the field has been traversed entirely, the output buffer is flushed in parallel and the final part of the metadata is written by a single process. For the Pace3D measurements, the output buffer size is equal to the biggest local output size, and hence, there is only one call to MPI_File_write_all per frame.
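The buffered-write pattern described above can be summarized by the following sketch. It is a simplification of the described behavior, not the actual Pace3D code: metadata handling, the additional field-specific bookkeeping and the file views/offsets describing each rank's portion of the file are omitted.

```c
#include <mpi.h>
#include <stddef.h>

/* Simplified sketch of the described buffered write (not the actual Pace3D
 * implementation): double-precision field data is converted to single
 * precision into an output buffer, which is flushed collectively whenever it
 * is full. The buffer length is scaled to the local data size, so every rank
 * issues the same number of MPI_File_write_all calls. */
void write_frame(MPI_File fh, const double *field, size_t n_values,
                 float *buffer, size_t buffer_len)
{
    size_t filled = 0;

    for (size_t i = 0; i < n_values; ++i) {
        buffer[filled++] = (float)field[i];   /* double -> float conversion */
        if (filled == buffer_len) {
            /* Buffer full: flush it collectively. */
            MPI_File_write_all(fh, buffer, (int)filled, MPI_FLOAT,
                               MPI_STATUS_IGNORE);
            filled = 0;
        }
    }
    /* Final (possibly partial) flush, still collective on all ranks. */
    MPI_File_write_all(fh, buffer, (int)filled, MPI_FLOAT, MPI_STATUS_IGNORE);
}
```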

The write time is measured via clock_gettime calls placed around the entire I/O function. Thus, the results for Pace3D include the double-to-float conversion and its accompanying memory operations as well as the actual I/O operations.

Fig. 6

The voxel format used by Pace3D. Each file consists of a header followed by frames representing the state of a field at a certain timestep. Each frame consists of metadata, followed by the actual field represented as N elements of either a 4-byte float, a 12-byte vector or a 64-byte block

4.2 mpiiotester

In order to decouple the I/O from any application effects, a small test program called mpiiotester was developed. The goal of this program is to replicate the I/O pattern of the HPC application Pace3D without suffering from start-up and calculation noise. Its source code can be found at:

https://git.scc.kit.edu/xt5201/mpiiotester/-/tree/master

It allows both reading and writing to files on a level-3 I/O pattern with different I/O styles as explained in Appendix. The resulting times can either be output per process or aggregated into a single average time. For the investigation of independent writes by single processes, a header may also be written and timed alongside the normal output.

The write time measurement is done via MPI_Wtime calls around the relevant sections, allowing per-process timing. For the following results, the relevant section is simply the call to MPI_File_write_all. Thus, the time spent in this call is measured, which is the time an end-user would want to minimize or, equivalently, whose associated write rate they would want to maximize.
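A minimal sketch of this timing approach (illustrative only; the actual implementation is available in the repository linked above):

```c
#include <mpi.h>

/* Sketch of the per-process timing around the collective write, from which
 * the write rate WR = data written / time elapsed is later derived. */
double timed_write(MPI_File fh, const void *buf, int count, MPI_Datatype type)
{
    double t0 = MPI_Wtime();
    MPI_File_write_all(fh, buf, count, type, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();
    return t1 - t0;   /* per-process time spent in the collective write */
}
```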

4.3 Measurement details

The previously described applications measure the time elapsed during their specific I/O actions. Since the amount of data written to disk is known, the write rate (WR) can be calculated as \(\texttt {WR} = \frac{\texttt {data written}}{\texttt {time elapsed}}\). Note that the measurement does not include the file opening and closing since these are done only once for many writes. If the write operation is only conducted once per file, as in the VTK format, the opening and closing have to be included in the measurements as well, as shown in Appendix A. Besides the application-specific timers, Cray MPICH provides per-file write rates, which are enabled via the environment variable MPICH_MPIIO_TIMERS=1. Further statistics per file are gathered with MPICH_MPIIO_STATS=1. Generally, there is a significant difference between the write rates calculated via the application timers and those reported by the Cray MPICH library. However, the differences are of a systematic kind and show a similar response behavior when a parameter is changed. Thus, the heuristics derived based on the results of one timer are similar to those of the other timers. In the rest of this paper, the application-specific timer is chosen since it can isolate the write time for repeated writes into the same file, which allows the gathering of more data per run.

The measurements are first collected by a series of scripts and then aggregated into a SQLite database. This database is accessed via a Jupyter notebook, from which the analysis is done as well. Reduced versions of the databases and Jupyter notebooks employing them are available at:


https://git.scc.kit.edu/xt5201/io-on-hazelhen/-/tree/master

The following results section contains variations of several I/O parameters: the number of employed processes, the local domain size per process (block size), the stripe count, the stripe size and the aggregator node multiplier of the novel Lustre Lock-ahead (LLA) locking mechanism [34]. The ranges of the parameters can be found in Table 1, but note that the space spanned by these was subsampled based on prior results. The experiments without LLA were conducted at least 30 times per configuration at varying times of day and on varying weekdays, with each run containing 20 write operations (frames). Between frames, a single time step of a simple, explicit kernel is calculated; this time is not measured since calculation time is not of interest in this paper. The rank-wise write rates for each frame are averaged by taking their median, which yields 20 write rate values per run. As explained in Appendix, the first of these timings is dropped, with the rest providing 19 data points per run; hence, each point is the median over at least 570 values. For the experiments with LLA and thus large core counts, only 1–3 runs were conducted per configuration due to time constraints, which is reflected in their much larger confidence intervals.

The investigated local domain sizes range from \(20^3\) voxels to \(200^3\) voxels per process, which represents the span of domain sizes typically employed in Pace3D. With one double-precision float per voxel, the data written per process ranges between 64 KB and 64000 KB, which will henceforth be called the block size. Data at the lower end of block sizes might very well stay cached during the experiments. But this in turn means that the user does not need to care about the I/O configuration, as the data will be flushed in the background while computation resumes.

Table 1 Investigated parameters and their respective limits

5 Results

In the following, the experiments conducted with both the mpiiotester and Pace3D are detailed. First, the simpler mpiiotester is used to investigate the general write performance. It is shown that the default striping is insufficient to fully exploit the parallel file system, especially for simulations employing more than one compute node. The striping configuration in terms of stripe count and stripe size is varied for different experimental configurations in terms of block size and processor count. Based on these results, heuristics for both the stripe count and the stripe size are derived. With both heuristics determined, the influence of the multiplier with active LLA is investigated. Finally, the derived heuristics are validated by employing Pace3D, and a satisfactory match is observed.

5.1 mpiiotester performance

This section presents the performance results on ws9 on Hazel Hen, with plots showing the median performance and its 95% confidence interval (CI), unless noted otherwise. Confidence intervals are employed for visual checks on whether a parameter variation caused significant differences or not. We note that the measurements were done at a time when ws9 was not generally accessible, i.e., the measurements were largely done in isolation in terms of file I/O but were still affected by network noise. A thorough analysis of the raw data is given in Appendix, which determines vital information about the distribution to enable a quantitative analysis.

5.1.1 Stripe count

The single-node write performance of the default striping on ws9 is investigated by running the application described in Sect. 4.2 (mpiiotester) with the single-file output style and without a header. The default striping consists of a stripe count of 8 OSTs as well as a stripe size of \(1\,\hbox {MiB}\). In Fig. 7, the influence of the data written per MPI process, henceforth called the block size, is shown. A general trend of increasing performance with increasing block size can be seen, with many configurations showing only small but still significant improvements at block sizes exceeding \(4096\,\hbox {KB}\). Furthermore, at least 16 cores seem to be required to saturate the write rate for a given block size.

Fig. 7

Single-node performance of mpiiotester, \(\log\)-scaled x-axis. An increase in block size yields higher write performance up to a block size of \(4096\,\hbox {KB}\), beyond which the effect is small but significant. At least 16 cores out of 24 seem to be necessary in order to saturate the write rate for a given block size

Fig. 8

Multi-node performance of mpiiotester, \(\log\)-scaled x-axis. The single-node performance is strictly below that of any multi-node configuration, and thus, the bandwidth was not fully utilized in the single-node case. However, the write rate does not scale beyond 48 cores

Turning to multi-node performance, Fig. 8 shows that the write rate increases when more than one node is used. Hence, the bandwidth was not fully utilized in the single-node case. However, the write rate does not scale beyond 48 cores. In order to achieve I/O scaling for multiple nodes, it is necessary to adjust the striping of the file. The striping consists of both the stripe count and the stripe size, of which the stripe count is investigated first. This is done by varying the stripe count on a single node in order to determine a heuristic for the number of OSTs per node for good performance. Figure 9 shows the results of this study for 24 cores, indicating a performance peak at 8 OSTs.

Fig. 9

Single-node performance of mpiiotester with variable striping count employing a full node corresponding to 24 cores. A performance peak is evident for block sizes above 64 KB. All results for block sizes above 512 KB are very close together, suggesting that the peak is independent of the block size for block sizes above 512 KB

This may suggest that, for any number of nodes, using 8 OSTs per node would show the highest observed performance, which is tested in a multi-node study utilizing multiples of 8 as the stripe count. The result of this study is shown for a block size of \(4096\,\hbox {KB}\) in Fig. 10. Two things are evident: The performance peak indeed moves to higher OST counts with more cores, but the peak itself also widens. Up to 4 nodes (96 cores), the 8-OSTs-per-node model describes the performance peak reasonably well, but for 8 nodes, the peak is reached before the predicted value of 64 OSTs. As ws9 only has 54 OSTs, the predicted position cannot even be reached; however, for small node counts, the estimate describes the peak adequately. Determining the OST count with the highest performance for each processor count and fitting a linear function to these points yields \(\#OST = 5.6N + 4.9\), with N representing the node count. This is valid for node counts up to 8, after which the maximum number of OSTs (54) should be used. A previous study [35] on the predecessor of ws9 with 168 OSTs showed that the peak is best described by the linear function \(\#OST = 5N + 3\), with N being the number of nodes. This function also fits the current data from ws9 but tends to underpredict the number of OSTs. By adjusting the striping, the multi-node write rate was improved from 2.5 GB/s to up to 9.5 GB/s, almost a fourfold improvement over the default striping.
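Expressed as a small helper (a sketch of the heuristic above; the function name is hypothetical and 54 is the OST limit of ws9):

```c
#include <math.h>

/* Sketch of the stripe count heuristic for ws9:
 * #OST = 5.6 N + 4.9, capped at the 54 available OSTs. */
int heuristic_stripe_count(int nodes)
{
    int count = (int)lround(5.6 * nodes + 4.9);
    return count > 54 ? 54 : count;
}
```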

Fig. 10

Multi-node performance of mpiiotester with variable striping count and a block size of \(4096\,\hbox {KB}\). As expected, the write rate rises with increasing OST count up to a maximum but drops off afterwards. With increasing core counts, the peak moves to higher OST counts and widens

5.1.2 Stripe size

In order to analyze the effect of the stripe size, a study with varying block sizes, stripe sizes and striping counts is done for a fixed core count of 24 (single node). This is done in order to maximize the throughput at the OSTs, as the network link is assumed not to be a limiting factor. Figure 11a shows the results for a fixed striping count of 8 and Fig. 11b for a block size of \(4096\,\hbox {KB}\). The plots introduce a new parameter \(\texttt {bsToss} = \frac{\texttt {block size}}{\texttt {stripe size}}\), the ratio of block size to stripe size, as we are interested in stripe sizes showing good performance for a specific block size. For block sizes above \(4096\,\hbox {KB}\), the write performance levels off at a ratio of 4, whereas for smaller block sizes the performance has a peak at a ratio of 1. These results are independent of the striping count being used in this case, as shown in Fig. 11b. This result can in fact be explained by considering the data gained via the environment variable MPICH_MPIIO_STATS=1. It yields, among other information, the number of system writes as well as how many of these writes were stripe-sized. Plotting the percentage of stripe-sized writes for the data from Fig. 11a yields Fig. 12. The highest performance is achieved at or close to 100% stripe-sized writes, which can be achieved by setting the stripe size to a small divisor of the block size. Thus, we suggest the following heuristic for the stripe size (SS) based on the block size (BS): \(SS = kBS, k \in \{1/1,1/2,1/4\}\), rounded to the nearest multiple of 64 KiB. The following studies will be using a ratio of 1/4, i.e., the data of one process is distributed among 4 stripes. For a block size of 64 KB, a stripe size of 64 KiB is employed since that is the minimum available stripe size.
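The stripe size heuristic can likewise be written as a small sketch (illustrative; the function name is hypothetical, the block size is given in bytes and the ratio k = 1/4 used in the following studies is assumed):

```c
/* Sketch of the stripe size heuristic SS = k * BS with k = 1/4, rounded to
 * the nearest multiple of 64 KiB and clamped to the minimum stripe size. */
long heuristic_stripe_size(long block_size_bytes)
{
    const long unit = 64L * 1024L;                  /* 64 KiB granularity */
    long ss = block_size_bytes / 4;                 /* k = 1/4            */
    long rounded = ((ss + unit / 2) / unit) * unit; /* nearest 64 KiB     */
    return rounded < unit ? unit : rounded;         /* minimum stripe size */
}
```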

Fig. 11

Effect of the stripe size on the write performance for a single node, \(\log\)-scaled x-axis. Generally, having a block size as big as or bigger than the stripe size yields good performance

Fig. 12

Variation of the percentage of stripe-sized writes for 8 OSTs on a single node with different block- and stripe sizes. The best performance is generally reached at or close to 100% stripe-sized writes, which can be achieved by setting the stripe size to a small divisor of the block size

5.1.3 Multiplier

As we have already seen, the performance for 8 nodes levels off around the maximum number of OSTs available, most likely due to lock contention between aggregators. In order to get higher throughput from the OSTs, Cray provides a novel locking mechanism called Lustre Lock-ahead (LLA) on ws9, which allows multiple aggregating I/O processes to write concurrently to the same OST [34].

This locking mechanism is activated via the MPI I/O hints cray_cb_write_lock_mode=2:cray_cb_nodes_multiplier=x, where x is the multiplier, i.e., the number of aggregators writing to the same OST. We first show, by varying both the OST count and the multiplier, that on ws9 the number of OSTs should be increased up to the maximum before activating LLA. The results are shown in Fig. 13 for a block size of \(4096\,\hbox {KB}\) and 3072 cores: For 27 OSTs to reach the same performance as 54 OSTs without any multiplier, a multiplier of 16 is required, whereas 54 OSTs with such a multiplier outperform the 27-OST case. Hence, the maximum performance is reached when using all available OSTs, and thus, in the following we only need to investigate the effect of different multipliers for the maximum number of OSTs.
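These Cray-specific hints are typically passed through the MPICH_MPIIO_HINTS environment variable; they can also be set programmatically, as in the following sketch (illustrative; the multiplier value 32 is an example):

```c
#include <mpi.h>

/* Sketch: activating Lustre Lock-ahead via Cray MPI-IO hints
 * (Cray-specific; shown here with a multiplier of 32). */
MPI_File open_with_lla(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "cray_cb_write_lock_mode", "2");   /* enable LLA          */
    MPI_Info_set(info, "cray_cb_nodes_multiplier", "32"); /* aggregators per OST */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
```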

Fig. 13

Effect of multiplier on different stripe counts for a block size of \(4096\,\hbox {KB}\) at 3072 cores. The stripe counts below the maximum show no performance benefit compared to using the maximum number of OSTs

Figure 14 shows the results of varying the block size, the core count and the multiplier. Note that a multiplier of 1 refers to keeping LLA turned off, as running LLA with a multiplier of 1 shows strictly worse write rates than without LLA. The multiplier was increased until either more aggregators than cores would have been used or a multiplier of 64 was reached. The results for the block sizes \(64\,\hbox {KB}\) and \(512\,\hbox {KB}\) should be taken with a grain of salt, as the writing time per frame for these was mostly below \(0.1\,\hbox {s}\). For block sizes of \(64\,\hbox {KB}\) and \(512\,\hbox {KB}\) and high core counts (\(>6144\)), smaller multipliers were not investigated as these were expected to take excessive amounts of CPU time. In the case of a \(64\,\hbox {KB}\) block size, using a multiplier of 4 when using more than 384 cores seems to give the peak write performance. Increasing the block size to \(512\,\hbox {KB}\), the picture becomes much less clear, with the peak write performance being reached at a multiplier of 4, 8, 16, 32 and 64 for 768, 1536, 3072, 6144 and \(\{12288,24576\}\) cores, respectively. The highest investigated core count, 49152, reaches the highest performance at its lowest investigated multiplier of 16. A further increase in block size to \(4096\,\hbox {KB}\) yields a less muddled picture, with core counts above 6144 clearly exhibiting a significant peak at a multiplier of 32. Finally, a block size of \(32768\,\hbox {KB}\) shows this effect even more pronouncedly, starting from 3072 cores. In total, large block sizes \(\ge 4096\,\hbox {KB}\) show pronounced performance increases (a factor of 2–8) when employing a multiplier of 32 to 64. Below this block size, a performance increase of up to a factor of 2 is possible, but the optimal multiplier depends non-trivially on the employed core count.

Fig. 14

Effect of the multiplier for 54 OSTs for different block sizes, \(\log _2\)-scaled x-axes. While for block sizes \(> 4096\,\hbox {KB}\) there seems to be a clear pattern for the highest observed performance, the lower block sizes show an erratic behavior

5.2 PACE3D performance

We have performed weak scaling runs similar to those in the previous section for the materials microstructure simulation framework Pace3D. There are two key differences to the mpiiotester configuration: a 3D domain decomposition (3DDD) is employed, and the time required for filling the internal I/O buffers is included in the internal timers. Thus, these results cannot be directly compared to the mpiiotester results; a priori, we can already expect the writing rate to be lower. Overall, the results showed a similar response behavior to the investigated parameters, and the writing rates were indeed smaller than for mpiiotester. Heuristics derived from the Pace3D I/O data for the stripe count and stripe size yield very similar results to those previously established for mpiiotester. Figure 15 serves as an executive summary, showing the performance of both Pace3D and mpiiotester on a large number of nodes when the multiplier is varied. Both plots show a dependence of the multiplier effect on the core count, with higher core counts being able to reach higher writing rates, given a sufficient multiplier. For the two highest process counts, a higher multiplier than in the previous study was also tested since no clear peak was visible. While this configuration showed a lower write rate, the confidence intervals overlap, suggesting that there is no significant difference between employing a multiplier of 32, 64 or 128 for these. Thus, a multiplier of 32 to 64 is likely to yield good performance for the more usual 3DDD as well, given a sufficiently large local domain.

Fig. 15

Comparison of the writing rates achieved by Pace3D and mpiiotester on a large number of nodes for a block size of \(32768\,\hbox {KB}\). The difference is mainly attributable to the 3DDD as well as to the I/O buffer filling being included in the measurement

6 Conclusion and outlook

By performing a large range of write performance measurements with a specialized application, heuristics for good parallel write performance on the ws9 file system of HLRS were determined. The striping configuration yielding the highest write rate while writing a spatially distributed array was found to depend on both the block size and the number of employed compute nodes. The stripe count, or equivalently the OST count, should be set to \(\#OST = 5.6N + 4.9\), with N being the number of employed compute nodes. The stripe size (SS) should be set to \(SS = kBS, k \in \{1/1,1/2,1/4\}\), with BS representing the number of bytes to be written per processor, rounded to the nearest multiple of 64 KiB. These striping heuristics enabled an up to fourfold improvement over the default striping configuration. Further performance was gained by employing Cray’s LLA locking method, which introduces a new parameter called the multiplier. Various choices of this multiplier were investigated. No general heuristic considering both the block size and the number of employed compute nodes could be derived. However, for block sizes equal to or above 4096 KB and above 3072 cores (128 nodes), a multiplier of 32 was found to show the best performance, yielding another factor of 2–8 over the optimized striping configuration. Thus, a total sustained throughput increase by a factor of 10–32 was gained for the single shared file model, reaching up to 85 GB/s in these experiments, which corresponds to \(42\%\) of the total I/O bandwidth of 202.5 GB/s. Finally, the striping heuristics were shown to be transferable from a similar Lustre setup [35] as well as to a general application code.

The transferability of heuristics between similar Lustre setups has been shown by this paper and previous work [35]. The next steps are to test this transferability further for different Lustre setups and to determine how to derive good I/O settings directly from a given Lustre setup, obviating the need for experimental runs. Furthermore, the effect of burst buffers needs to be considered, as prior research [36, 37] has shown these to provide great performance benefits. With the arrival of the next-generation file system at HLRS, both points will be investigated.