1 Introduction

This research examines the use of artificial neural networks (ANNs) for the optimization of parallel high performance computing (HPC) input/output (IO). HPC or super-computing machines are normally clusters of many-core CPU nodes and storage disks, interconnected via fast networks (e.g., InfiniBand, Intel OmniPath) as the communication medium at the hardware infrastructure level [1, 2]. Figure 1 shows a typical overview of an HPC cluster hardware structure: multiple compute nodes connected to each other and to multiple storage targets over a fast network, which is the general structure of any HPC cluster. Parallelization has been highly effective in exploiting the power of such a cluster via the message passing interface (MPI), a standard software library implemented in C/C++ [3]. MPI is a distributed memory parallel programming framework that comes with a broad set of functionalities to exploit a cluster’s parallelism. When an application using MPI is built, the MPI–IO part of the program is often left unoptimized, which causes poor IO throughput and, in turn, overall performance issues. Since the IO side must be manually tuned through different configurations for performance optimization, this creates extra overhead at the user end. Selecting particular settings is even more challenging for the user, who does not know the resulting IO performance in advance. To overcome this challenge, in this research we propose a technique to predict the IO bandwidth and auto-tune the related configuration parameters before an IO operation. This extra software layer between the user and the MPI–IO operation removes the burden of manually tuning the configuration settings. Additionally, the ANN based ML prediction model supports the selection of new value settings that are expected to improve parallel IO performance significantly.

Fig. 1 Basic view of a cluster hardware structure

Despite MPI being an efficient parallel computing framework, the parallel MPI–IO side of an HPC cluster suffers from application performance degradation. The first reason is the slow pace of advancement in IO storage hardware compared to computing hardware. The second reason is that, in HPC systems, parallel IO bandwidth depends on particular parameters on the software side. These critical parameters are related to the different components which support parallel IO on the clusters. Parallel IO and storage are normally managed at a low level by a specific parallel file system (PFS); in this research our PFS is the Lustre file system (LFS) [4]. The common factors which affect IO bandwidth performance are the number of MPI processes, the number of parallel disks, the file access patterns and other properties. These factors and IO optimizations are generally not taken into account by parallel application programmers, which results in poor IO and, consequently, overall application performance degradation. The parallel IO side is often neglected and left to manual tuning, which has prompted researchers to devise techniques for improving parallel IO on the software side, as has been done in the past [5, 6, 8]. These approaches suggest different strategies around data-alignment based configuration settings that improve IO but do not cover all scenarios, as explained in [9]. In this research, we require an approach that optimizes IO for as many scenarios as possible within MPI–IO applications. To achieve this, we propose to auto-tune parameter settings away from their current values by estimating the maximum possible bandwidth before any IO execution.

HPC combined with machine learning (ML) is crucial to areas such as Internet of Things (IoT) based smart city environments, which critically rely on efficient data IO and have shown recent advancement [10, 12, 17, 18]. In this research, an ML technique is applied to HPC MPI–IO operations for bandwidth performance optimization. It extends previous work that addressed MPI–IO bandwidth prediction only, over the related parameter settings [9]. In extending this previous research, our innovation lies in auto-tuning those parameters to optimized values on the basis of the MPI–IO bandwidth prediction. Previous research shows significant benefits of ML based IO performance prediction and auto-tuning over different parameter settings and experimental setups [19, 20, 21, 23]. This has provided the motivation to apply a similar approach to improve MPI–IO application bandwidth in this research scenario. Therefore, prediction models are generated through an ML process and used in this research to support the auto-tuning approach. The work presented in [24, 25, 26, 28] explains the basic working of ANNs as an ML technique, their application to improving file access times in HPC, and the significant prediction accuracy of ANNs, especially when the models are constructed in the PyTorch framework. This provided further motivation to use an ANN based ML technique with the PyTorch API. As a result, the ANN models generated in this research have proven significantly beneficial for optimizing MPI–IO performance through auto-tuning parameter settings against the default configuration test cases.

Initially, we re-generated and re-executed the MPI–IO READ/WRITE benchmarks for bandwidth profiling over the related parameters, as done in [9]. This was to ensure that we replicated similar IO bandwidth patterns when varying the configuration settings. We then re-generated separate ANN models for READ and WRITE operations, with prediction accuracy closely matching the previous work. The further work presented in this research is novel in that we have designed our auto-tuning strategy on top of those ANNs, which is the first key contribution of this research: an ANN ML based approach to select and auto-tune parameter settings to obtain optimized bandwidth. The results show notable bandwidth improvements, supported by a detailed statistical analysis, which represents the second key contribution of this research. Finally, given the significant IO bandwidth outcomes, the most beneficial configurations have been identified and highlighted in detail as the third key contribution.

The results presented later in this paper demonstrate that the proposed auto-tuning strategy using ANN models contributes to a significant gain in IO performance. The remainder of this paper is structured as follows: background and related research in Sect. 2, experimental design and implementation in Sect. 3, experimental results and evaluation in Sect. 4, a summary of the work in Sect. 5 and the conclusion in Sect. 6.

2 Background and related research

Earlier research studies involving ML based predictive IO modelling have been explored to determine how best to address poor IO performance. This is primarily achieved by tuning related parameters in certain cluster environments. These studies have motivated us to auto-tune our related parameters by predicting the maximum possible IO bandwidth for files striped across parallel networked Lustre disks in HPC clusters.

In [9], it was shown that the parameters relating to MPI, LFS and file properties can increase or decrease IO bandwidth. Therefore, ANN models were created to predict IO bandwidth for the MPI–IO READ and WRITE operation scenarios. The models’ prediction accuracy values were 62.5% and 72.1% for READ and WRITE operations, respectively. Parallel coordinate plots (HiPlot) were also used to visualize bandwidth changes across different parameter settings [29].

In [19], the study demonstrates that parameters such as the IO scheduler, the number of IO threads and the CPU frequency affect HPC–IO performance. The IO behaviour is predicted from these parameters through extrapolation and interpolation techniques, based on large-scale experiments using a data analytics framework. Performance evaluations were conducted by computing prediction accuracy on unseen testing system configurations. The system then used a Bayesian Treed Gaussian process variability map with different regression techniques, which supported parameter selection through HPC variability management and insights into current statistical methods.

The method demonstrated in [20] allows for the adaptive scheduling of parallel IO requests within an application running on an HPC system. This was achieved by tuning the time-window based parameters of the currently executing workload. This adaptive technique is implemented in the scheduler through reinforcement learning. It achieved 88% precision in runtime parameter selection after the system had observed and classified access patterns with neural networks for a few minutes. Subsequently, the system optimizes its IO performance for the remainder of the application’s lifetime, as stated in the study.

The work presented in [22] examines bandwidth predictive modelling for MPI collective WRITE operations via random forest regression. The prediction accuracy values are very high, ranging from around 82% to 99% depending on the maximum tree depth setting. The training and testing data sets were relatively small, which would require further examination; with greater variation in the data, the accuracy values could be significantly lower.

In [21], IO performance optimization of parallel applications was proposed for HDF5 format files. It was tested on a range of HPC clusters employing LFS and the General Parallel File System (GPFS). Auto-tuning played a key role on the basis of predictive IO modelling. The predictive models were trained with Lustre IOR and other benchmark data using a nonlinear regression technique, and IO performance notably increased when new parameters were selected via auto-tuning. Reference [23] also demonstrated predictive IO modelling for LFS IOR benchmarks, but using Gaussian process regression (GPR).

The work carried out in [26] predicts file access times on LFS storage disks. The file access times for a series of tests are recorded and used to develop prediction models. The evaluation shows that the generated ANNs yield 30% lower average prediction error compared to linear predictive modelling. The distribution of file access times and its evaluation were conducted with respect to the same parameters used to access the files. It was also found that typical file access times usually differed by an order of magnitude, depending on the different IO paths.

Additionally, some other research studies were inspected in the area of MPI application optimization using ML prediction and auto-tuning parameters, however, the IO side was mostly ignored [30, 31].

3 Experimental design and implementation

This research has been conducted by applying a series of steps in sequence: (1) identification of the key parameters, (2) generation of benchmark data, (3) generation of two ANN based prediction models, one each for READ and WRITE operations, (4) design and application of the auto-tuning strategy, (5) statistical analysis of the IO bandwidth improvements and (6) identification of the common configuration settings selected by the auto-tuning system model. Figure 2 shows the main components that summarise this sequence of steps together with the experimental setup, which is further elaborated in Listings 1 to 5.

Fig. 2 Research methodology of the experimental setup to train the ANN ML models and auto-tune the configuration parameters settings based on the IO bandwidth predictions

3.1 Identification of key and tunable parameters

Initially, we must identify the key configuration parameters and their fixed sets of possible value settings. This is an essential step prior to executing the series of benchmarks that READ/WRITE bytes of data from a file on disk. Each benchmark execution is completed in order to profile the bandwidth obtained for each READ/WRITE operation with a specific set of configuration values. This bandwidth profiling data is then split into training and testing sets in an 80:20 ratio of the total benchmark results, which allows us to train and validate our ANN models. It should be noted that configuration parameters are either tunable or non-tunable when considered for auto-tuning later.

Table 1 contains the identified key tunable and non-tunable parameters with their corresponding value settings specified for benchmark execution. The total number of value settings available to benchmark is 30,720, which gives an indication of the scale of configurations for the ML process. However, the IO operation listed in this table is not strictly a configuration parameter; it only specifies that these configurations are executed for READ and WRITE operations separately, and it is therefore marked as “Not Applicable” from the tunability point of view. Arising from this, we have a total of 15,360 possible configurations (30,720/2) for each operation type (READ and WRITE). This gives a straightforward process for generating the datasets relating to both the READ and WRITE bandwidth patterns, which can then be modelled by two separate ANN models, one per operation type.

To summarise, a single configuration setting row in a list of data set can be represented as follows:

$$\begin{aligned}&['number\_of\_MPI\_nodes=16',\\&'MPI\_processes\_on\_each\_node=1',\\&'lustre\_stripe\_count=4',\\&'lustre\_stripe\_size=1MiB',\\&'file\_size=32GiB','chunk\_size=0.5GiB',\\&'file\_access\_pattern=collective'], \end{aligned}$$

where it runs for a READ and WRITE operation separately. Similarly, all configuration settings in that list are fetched and benchmarked iteratively and separately for both IO operations (READ/WRITE).
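To make this enumeration concrete, the following Python sketch builds the complete configuration list as the Cartesian product of the per-parameter value lists. The value lists shown here are illustrative placeholders rather than the exact ranges of Table 1, so the resulting count will differ from the 15,360 quoted above.

# Illustrative sketch: enumerate every configuration row as in Sect. 3.1.
# The parameter value lists below are assumed placeholders, not Table 1's exact ranges.
from itertools import product

parameter_space = {
    "number_of_MPI_nodes":        [2, 4, 8, 16],
    "MPI_processes_on_each_node": [1, 2, 5, 10, 20, 40],
    "lustre_stripe_count":        [1, 2, 4, 8, 16],
    "lustre_stripe_size":         [1, 2, 4, 8],          # MiB
    "file_size":                  [16, 32],              # GiB
    "chunk_size":                 [0.25, 0.5, 1, 2],     # GiB
    "file_access_pattern":        ["collective", "non-collective"],
}

keys = list(parameter_space)
# One dict per configuration row, benchmarked once for READ and once for WRITE.
MPI_IO_Configs_List = [dict(zip(keys, values))
                       for values in product(*parameter_space.values())]
print(len(MPI_IO_Configs_List), "configurations per IO operation")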

Table 1 Tunable/non-tunable parameters with possible values settings

3.2 Generating MPI–IO benchmarks results data

In this section we describe the procedure for generating MPI–IO benchmarks in order to then use this as a training and testing set for our ANN models.

Listing 1 shows the method Generate_Benchmark_Results() used to generate IO bandwidth profiling data by benchmarking each configuration setting. It takes Configuration_Settings as an argument, which holds the list of value settings for each parameter. The first step (Line 2) is to generate a complete list of all possible configuration settings to be benchmarked. Each row of MPI_IO_Configs_List holds one set of configuration settings, as in the example given in Sect. 3.1.

During the execution of the MPI–IO benchmarks, the MPI processes read/write the corresponding file chunks. In the case of READ operations, the file is generated and striped beforehand over the specified stripe count of LFS disks with the specified stripe size, as stated on Lines 11 to 16. For WRITE operations, these Lustre settings are applied to an empty file beforehand.

When the file is written to disk it is distributed over the same stripe count of disks with the same stripe size value. A single MPI read/write execution can be collective or non-collective. In a collective operation each process accesses the address space of the other running parallel processes, making the access non-contiguous, whereas in a non-collective operation each MPI process accesses only its own address space and keeps the access contiguous.

Before every benchmark execution, the pre-benchmark settings are applied with regard to LFS striping, as mentioned earlier, and to the Darshan utility. Darshan is an HPC–IO characterization tool used to capture bandwidth during execution [32]; the corresponding setting is applied on Line 18.

At this point, the pre-benchmark settings allow us to measure the bandwidth of a specific configuration under a specific file striping. Once the execution of a configuration set completes on Line 21, the IO bandwidth captured by Darshan is parsed and retrieved on Line 24. The benchmarked bandwidth value for each IO execution, together with its configuration parameter values, is appended to the end of a file in YAML dictionary format on Line 25.

Afterwards, the post-benchmark setting is applied to that benchmarked file by deleting it to free the cache, on the last Line 28. This completes a single benchmark iteration over a configuration set of parameter settings for each READ and WRITE operation.

Listing 1 Generate_Benchmark_Results() pseudo-code
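As a companion to Listing 1, the sketch below outlines one way the benchmark driver could look in Python. The helper run_mpi_io_benchmark(), the benchmark binary ./mpi_io_bench and the Lustre file path are hypothetical illustrations; only the lfs setstripe command and the YAML-style result logging follow the description above, and Darshan profiling is assumed to be enabled in the environment.

import subprocess
import yaml

def run_mpi_io_benchmark(cfg, io_op, target):
    # Hypothetical helper: launch the MPI-IO benchmark binary and return the
    # bandwidth it reports on stdout (the paper instead parses Darshan logs).
    n_procs = cfg["number_of_MPI_nodes"] * cfg["MPI_processes_on_each_node"]
    cmd = ["mpirun", "-np", str(n_procs), "./mpi_io_bench",
           io_op, target, str(cfg["chunk_size"]), cfg["file_access_pattern"]]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return float(out.stdout.strip())          # assumed: benchmark prints MiB/s

def generate_benchmark_results(configs, io_op, results_path="benchmarks.yaml"):
    target = "/lustre/scratch/benchmark_file"  # assumed path on the LFS mount
    for cfg in configs:
        # Pre-benchmark: create a fresh file with the requested Lustre layout
        # (for READ runs the striped file would additionally be populated first).
        subprocess.run(["lfs", "setstripe",
                        "-c", str(cfg["lustre_stripe_count"]),
                        "-S", f"{cfg['lustre_stripe_size']}M", target], check=True)
        # Run the benchmark; Darshan captures the bandwidth during execution.
        bandwidth = run_mpi_io_benchmark(cfg, io_op, target)
        # Append the configuration and its bandwidth in YAML dictionary style.
        with open(results_path, "a") as f:
            yaml.safe_dump([{**cfg, "io_operation": io_op,
                             "bandwidth_MiB_per_s": bandwidth}], f)
        # Post-benchmark: delete the file to free cache and space.
        subprocess.run(["rm", "-f", target], check=True)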

Figure 3 shows the benchmark results which depict the bandwidth patterns for all configuration settings available. In general, the bandwidth values increase with respect to the increasing number of MPI nodes and processes per node.

Fig. 3 MPI–IO benchmarks results

3.3 Creation and development of ANN models

In this section, we present the training and development of the ANN models used for bandwidth prediction. These ANNs have two hidden layers with 256 and 128 nodes, respectively, as outlined in Table 2. In all models, the input layer has 7 nodes corresponding to the 7 input feature parameters from Table 1, and the output layer has a single node for the predicted IO bandwidth value. Table 3 shows the mapping of the 7 input layer nodes to the 7 configuration parameters; the single output node maps to the IO bandwidth value. The IO operation parameter from Table 1 is excluded because the models are created separately for READ and WRITE operations. Therefore, we have two distinct ANN models, \(ANN_{READ}\) and \(ANN_{WRITE}\). Table 4 lists the hyper-parameter values used during ANN training to support accuracy during validation.

The ANN models are developed and trained using PyTorch [27] through the pseudo-code ANN_Training() in Listing 2. The hyper-parameters values are applied according to Table 4. Prior to the ANNs training, the MPI–IO benchmarks results data is scaled using MaxAbsScaler [33]. Afterwards, data is shuffled and split between training and testing sets. For each ANN model 80% of benchmarks results are used for training purposes and 20% for testing.
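As a rough illustration of this preparation step, the sketch below scales the benchmark data with MaxAbsScaler and performs the shuffled 80:20 split; the array shapes and function name are assumptions.

import numpy as np
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split

def prepare_datasets(features: np.ndarray, bandwidths: np.ndarray):
    # features: (n_samples, 7) configuration parameters; bandwidths: (n_samples,)
    x_scaler, y_scaler = MaxAbsScaler(), MaxAbsScaler()
    X = x_scaler.fit_transform(features)
    y = y_scaler.fit_transform(bandwidths.reshape(-1, 1))
    # Shuffle and split into 80% training and 20% testing data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=0)
    return (X_train, y_train), (X_test, y_test), (x_scaler, y_scaler)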

Initially, Lines 5 to 14 initialize the ANN model with random weights, where 0.05% dropout is applied on the hidden layers in the case of training \(ANN_{WRITE}\), and 0.00% for \(ANN_{READ}\). Additionally, the Rectified Linear Unit activation function [nn.ReLU()] is applied from layer to layer in order to obtain the expected decimal value at the output [34]. The loss function is defined as MSELoss() on Line 16 to compute the loss between the actual and predicted values during training, which is the standard mean squared error, MSE [35]. The other hyper-parameters, a learning rate of 0.002 and a weight decay of \(1e{-5}\), are applied on Line 18, which configures Adam as our gradient descent optimizer. Then X and y are set as the input and output features from the training set on Lines 20 and 21.

Subsequently, Lines 23 to 34 contain the main loop which trains the ANN model until the specified number of iterations, defined as MAX_Limit, is reached. During each iteration, the bandwidth value is predicted on Line 25 and Line 27 computes the loss as the MSE value. The Adam optimizer zeros the gradients on Line 29 before back propagation occurs. The loss is then propagated backward on Line 31 and the optimizer updates the weights, with the model’s train function called on Line 34.

Once trained, the model is saved to a “.pt” file on Line 37. It is then loaded back into memory at prediction time to support the auto-tuning process later.
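A hedged PyTorch sketch of the architecture and training loop described above is given below. The 7-256-128-1 layer sizes, ReLU activation, MSE loss and Adam settings (learning rate 0.002, weight decay 1e-5) follow the text; the dropout interpretation, the iteration limit and the save path are assumptions.

import torch
import torch.nn as nn

def build_ann(dropout_p: float) -> nn.Sequential:
    # The text quotes "0.05% dropout" for ANN_WRITE and 0 for ANN_READ;
    # interpreted here as a dropout probability passed in by the caller.
    return nn.Sequential(
        nn.Linear(7, 256), nn.ReLU(), nn.Dropout(dropout_p),
        nn.Linear(256, 128), nn.ReLU(), nn.Dropout(dropout_p),
        nn.Linear(128, 1),
    )

def ann_training(X_train: torch.Tensor, y_train: torch.Tensor,
                 dropout_p: float = 0.0, max_limit: int = 1000,
                 save_path: str = "ann_model.pt") -> nn.Module:
    model = build_ann(dropout_p)
    loss_fn = nn.MSELoss()                       # mean squared error loss
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=0.002, weight_decay=1e-5)
    model.train()
    for _ in range(max_limit):                   # MAX_Limit training iterations
        y_pred = model(X_train)                  # forward pass: predict bandwidth
        loss = loss_fn(y_pred, y_train)          # loss between predicted and actual
        optimizer.zero_grad()                    # zero gradients before backprop
        loss.backward()                          # back-propagate the loss
        optimizer.step()                         # update the weights
    torch.save(model, save_path)                 # persist the model as a ".pt" file
    return model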

The ANN models are run on the test set, and the predicted values are re-scaled to their original ranges to compute accuracy against the profiled benchmark bandwidth values. The prediction accuracy, denoted P.A, for both ANN models is defined by the following equations:

$$\begin{aligned}&Let\;X = testing\;set, \\&let\;y = actual\;bandwidth\;values\;against\;X, \\&r = model(X)\;gives\;predicted\;values\;against\;X, \\&P.A = 100.0 - \frac{1}{n} \sum _{i=1}^{n} \left|\frac{(y_{i} - r_{i})}{y_{i}} \right|\times 100.0\%, \end{aligned}$$

where y and r are the n actual and predicted bandwidth values, respectively, and i indexes the rows of X, y and r. Thus, \(y_{i}\) is the i th actual bandwidth value and \(r_{i}\) is the i th predicted bandwidth value computed by running model() on \(X_{i}\), the i th set of configuration parameter values.
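The same calculation can be expressed as a short NumPy sketch, assuming the actual and re-scaled predicted bandwidths are available as 1-D arrays:

import numpy as np

def prediction_accuracy(y_actual: np.ndarray, y_predicted: np.ndarray) -> float:
    # P.A = 100% minus the mean absolute percentage error
    mape = np.mean(np.abs((y_actual - y_predicted) / y_actual)) * 100.0
    return 100.0 - mape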

The prediction accuracy of \(ANN_{READ}\) is \(\approx 63\)%, and that of \(ANN_{WRITE}\) is \(\approx 79\)%. Figure 4 shows the model predictions on 50 randomised configuration settings for each of the READ and WRITE operations. It is evident from these figures that both ANN models are well fitted for future predictions on unseen configuration data. Later on, these models play a vital role in auto-tuning the parameters required for an overall bandwidth performance gain.

Fig. 4 MPI–IO bandwidth predictions

Table 2 ANNs description table to model READ/WRITE bandwidth behaviour
Table 3 ANNs input and output nodes mapping to configuration and output parameters for READ and WRITE bandwidth predictions
Table 4 Hyper parameters applied to ANNs
Listing 2 ANN_Training() pseudo-code

3.4 Complete auto-tuning design applied to test cases

In this section, we outline the complete auto-tuning process and the procedure for collecting our statistical performance evaluation data. The ANNs described previously can be used for auto-tuning the configuration parameters.

It should be noted that a number of parameters are tunable with respect to the IO operation (READ or WRITE), while the remaining parameters are non-tunable, as indicated in the third column of Table 1.

The tunable parameters are marked “Yes” and the non-tunable parameters “No” with respect to the READ/WRITE operation. The common non-tunable parameters for both IO operations are the number of MPI nodes, the MPI processes on each node and the file size: the MPI nodes and processes cannot be altered once the application is initialized, and the file size cannot be changed since it is a user requirement. For READ operations the tunable parameters are the chunk size and the file access pattern, with Lustre stripe count and Lustre stripe size as two additional non-tunable parameters. Since the file to be READ is already distributed over a specific number of Lustre disks, re-striping it at runtime would not change the existing distribution on disk; it would require rewriting the file before reading it, which is not optimal behaviour as it adds unnecessary IO overhead. For WRITE operations, in contrast, the tunable parameters are the Lustre stripe count, Lustre stripe size, chunk size and file access pattern, since all of these can affect WRITE bandwidth at runtime.

Considering the tunability of the configuration parameters, the summarized steps of the execution flow are: (1) retrieve the currently applied set of parameter settings, (2) predict the IO bandwidth for the current settings and for all other possible settings of the tunable parameters specified in Listing 3, given the non-tunable parameter settings, (3) compare all the predicted bandwidth values, (4) select the settings with the maximum predicted IO bandwidth and (5) apply the selected settings to configure the tunable parameters. These steps are further elaborated in Listings 3 and 4.

3.4.1 Parameters selection

Listing 3 Select_Parameters() pseudo-code

In this section, we present the procedure for selecting the new parameter settings that are predicted to yield the maximum bandwidth.

The function Select_Parameters() in Listing 3 returns the new tunable configuration settings. It takes as arguments the given current settings, the IO operation (READ or WRITE) and the file path to the saved ANN model (“*.pt”). Line 1 imports the torch package, which is used to load the saved and trained PyTorch ANN model on Line 5. Lines 9 to 11 load the current settings into the variable X, set max_bandwidth to the value predicted by model() on the current settings X, and let max_settings hold the current settings. Lines 14 to 17 define the possible value settings of the tunable parameters for both IO operations, as also specified in Table 1.

The mechanism for selecting new configuration settings is to check every possible combination of tunable parameter values in the nested loops on Lines 20 to 23. The stripe_counts and stripe_sizes are checked only when the current IO operation is WRITE, on Lines 26 to 28. Predicting on Line 29 and comparing the new bandwidth value with the current max_bandwidth value yields the maximum possible bandwidth and its new configuration settings in max_settings, as on Lines 30 to 32. If the current IO operation is READ, the last two nested loops checking the stripe_count and stripe_size values are not required; in that case, the break statements on Lines 33 and 34 terminate those loops.

Finally, the new configuration predicting the maximum possible IO bandwidth is returned on Line 35. These new settings are returned to the code shown in Listing 4. According to the steps in Listing 3, it is possible that the parameters are not tuned to new values; this occurs when the existing configuration already predicts the maximum possible IO bandwidth among all the value settings in range.
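A possible Python rendering of this selection logic is sketched below. The candidate value lists and the to_model_input() encoding are illustrative assumptions (the paper additionally scales inputs with MaxAbsScaler, omitted here), and the loop sizes simply mirror the 4 x 2 (READ) and 4 x 2 x 4 x 5 (WRITE) candidate counts discussed later in Sect. 4.1.

import torch

CHUNK_SIZES     = [0.25, 0.5, 1.0, 2.0]   # GiB, assumed 4 candidate values
ACCESS_PATTERNS = [0, 1]                   # 0 = collective, 1 = non-collective
STRIPE_COUNTS   = [2, 4, 8, 16]            # assumed 4 candidate values
STRIPE_SIZES    = [1, 2, 4, 8, 16]         # MiB, assumed 5 candidate values

FEATURE_KEYS = ["number_of_MPI_nodes", "MPI_processes_on_each_node",
                "lustre_stripe_count", "lustre_stripe_size",
                "file_size", "chunk_size", "file_access_pattern"]

def to_model_input(settings: dict) -> torch.Tensor:
    # Map the 7 configuration parameters onto the model's 7 input nodes.
    return torch.tensor([[float(settings[k]) for k in FEATURE_KEYS]])

def select_parameters(current: dict, io: str, model_path: str) -> dict:
    model = torch.load(model_path)         # load the trained ".pt" ANN model
    model.eval()
    max_settings = dict(current)
    with torch.no_grad():
        max_bandwidth = model(to_model_input(current)).item()
        for chunk in CHUNK_SIZES:
            for pattern in ACCESS_PATTERNS:
                # Striping is only tunable for WRITE; READ keeps the current striping.
                counts = STRIPE_COUNTS if io == "WRITE" else [current["lustre_stripe_count"]]
                sizes  = STRIPE_SIZES  if io == "WRITE" else [current["lustre_stripe_size"]]
                for sc in counts:
                    for ss in sizes:
                        candidate = dict(current, chunk_size=chunk,
                                         file_access_pattern=pattern,
                                         lustre_stripe_count=sc,
                                         lustre_stripe_size=ss)
                        bandwidth = model(to_model_input(candidate)).item()
                        if bandwidth > max_bandwidth:
                            max_bandwidth, max_settings = bandwidth, candidate
    return max_settings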

3.4.2 Auto-tuning IO operations

Listing 4 Auto_tune_and_run_IO() pseudo-code

In this section we present the auto-tuning procedure on the MPI side, based on the new parameter values selection. After the selection of these new parameter value settings, they are used to tune the configurations before executing parallel IO.

Listing 4 gives an overview of how the auto-tuning task is carried out at the MPI application level in the function Auto_tune_and_run_IO(). It takes the current settings, the intended IO operation, the target file path or name f_name and the ANN model file path as arguments on Line 3. Line 5 initializes the MPI space and environment using MPI_Init(). Subsequently, Line 9 returns the current process ID or rank. The current settings, IO operation and model file path are passed to the method Select_Parameters() on Line 13, defined in Listing 3. The Select_Parameters() function executes only if the current process ID or rank is 0, as checked on Line 12. It returns, in max_settings, the new parameter values predicting the maximum possible bandwidth given the non-tunable parameters.

Until max_settings is returned, all other processes wait on Line 16. This is enforced by the MPI_Barrier() method, explained in [36]. Once the new values are returned to the rank 0 process, it sends the max_settings object as a message to the other MPI processes or ranks, which receive it in their respective local max_settings objects. This is done on Line 17 using the MPI broadcasting method MPI_Bcast(), also explained in [36].

At this point, all processes have received their max_settings, so each process updates its configuration with the new set of value settings (Lines 19 to 20). The access_pattern and chunk_size are required by all processes in order to read or write the file with the same access pattern and number of bytes. While stripe_count and stripe_size are updated for all processes, they are only required by the rank 0 process when the IO operation is WRITE. This is essential for avoiding a race condition on the same file when applying a new file striping order (Lines 22 to 24). The rank 0 process removes the previous stripe settings by deleting the existing empty file on Line 23 via the Remove_Previous_File() method, and then applies the new stripe settings to the file on Line 24 using the Apply_New_Lustre_Striping() method.

It should be noted here that all processes except rank 0 wait again on Line 27 until the rank 0 process has applied the new Lustre stripe settings to the file, which creates a new empty file on the same path with the new striping. When the WRITE operation subsequently executes across all processes, the file is written with the new settings applied. On completion, the bandwidth is recorded using Darshan, as mentioned earlier. Listing 5 shows how this method from Listing 4 is used to auto-tune the default configuration test cases.
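The paper implements this flow in an MPI based C++ program; purely for illustration, the sketch below expresses the same rank-0 selection, barrier and broadcast pattern with mpi4py. select_parameters() is the sketch from Sect. 3.4.1 and execute_mpi_io() is a hypothetical stand-in for the actual READ/WRITE kernel.

import os
import subprocess
from mpi4py import MPI

def auto_tune_and_run_io(current: dict, io: str, f_name: str, model_path: str):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    max_settings = None
    if rank == 0:
        # Only rank 0 runs the ANN based parameter selection.
        max_settings = select_parameters(current, io, model_path)
    comm.Barrier()                                   # other ranks wait for the selection
    max_settings = comm.bcast(max_settings, root=0)  # broadcast the new settings

    if io == "WRITE" and rank == 0:
        # Re-striping applies only to WRITE; rank 0 alone touches the file
        # to avoid a race condition on the shared path.
        if os.path.exists(f_name):
            os.remove(f_name)                        # drop the previous stripe settings
        subprocess.run(["lfs", "setstripe",
                        "-c", str(max_settings["lustre_stripe_count"]),
                        "-S", f"{max_settings['lustre_stripe_size']}M",
                        f_name], check=True)
    comm.Barrier()                                   # wait until striping is applied

    # All ranks read/write with the tuned chunk size and access pattern;
    # the bandwidth is captured externally by Darshan.
    execute_mpi_io(io, f_name, max_settings)         # hypothetical IO kernel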

3.4.3 Statistical data collection upon auto-tuning test cases

Listing 5 Stats_Data_Collection() pseudo-code

In this section we present the experimental setup for auto-tuning the test cases and collecting the key statistical values for bandwidth performance. The functionality in Listings 3 and 4 performs auto-tuning on the default configuration test cases listed in Table 5. These default test cases and their parameter value settings make a total of 1458 configuration settings to be tested. The tunability of the parameters is exactly the same as in Table 1.

The purpose of auto-tuning a number of default test cases is to analyze the improvement in IO bandwidth in terms of percentage, through the use of ANN models. The bandwidth values have been collected using the Darshan characterization tool, upon auto-tuning the test cases. Listing 5 presents the methodology to collect all this necessary statistical data for performance evaluation presented later in this paper.

Table 5 Default configurations test cases for auto-tuning

It should be noted that while we trained the ANN models on smaller file size and chunk size values, in our evaluation approach, we used larger file and chunk size values in our test cases. These are shown in our default configuration test cases to auto-tune in Table 5.

Multiplying the numbers of value settings results in 1458 possible combinations of the default configuration settings in one complete list. The collection of IO performance improvement statistics starts with the function Stats_Data_Collection(). This method takes three arguments: (1) Default_Configurations, the lists of all default parameter values, (2) io, the READ/WRITE operation and (3) model_path, the path of the ANN model (“.pt”) file to use. A complete list of possible default configuration settings is first generated in X on Line 5, such that each element of X is a set of parameter values as in the example in Sect. 3.1. Line 6 then computes the length of that list, n. On Line 7 the Repetitions count is set to 3, which controls the iterations of the nested loop on Line 12 and determines the repeated execution of the default configurations and the auto-tuned configurations, respectively.

Line 9 initiates the main loop to process the default and auto-tuned configuration settings for each set of parameters in X. Lines 10 and 11 create the default_bandwidth and tuned_bandwidth lists, which store the bandwidths of the default settings and of the auto-tuned settings after their execution. The bandwidth values of the default and auto-tuned configuration settings are saved to their respective lists three times, on Lines 14 to 27. The three repetitions mitigate individual variations in the bandwidth values and provide an average. After this inner loop, both bandwidths are averaged and appended to the old_bandwidths and new_bandwidths lists, respectively, on Lines 29 and 30. The bandwidth improvement and dis-improvement counts are also maintained in c_im and c_di on Lines 31 and 32, respectively.

Once the main loop is finished, all the default and auto-tuned bandwidths are averaged on Lines 34 and 35. The overall bandwidth improvement in percentages is calculated (Lines 37 to 38). Then all remaining statistics are calculated (Lines 40 to 51). These include percentages of improved and dis-improved test cases, maximum, minimum, median, standard deviation and variance in both default and auto-tuned bandwidth values for comparison. We provide a comprehensive study of performance in the next section using the data gathered in these steps.
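A condensed Python sketch of this collection loop is shown below; run_benchmark() is a hypothetical stand-in for executing one configuration (default, or auto-tuned via Listing 4) and returning its Darshan-measured bandwidth.

import statistics

def stats_data_collection(default_configurations, io, model_path, repetitions=3):
    old_bandwidths, new_bandwidths = [], []
    c_im = c_di = 0                                   # improvement / dis-improvement counts
    for cfg in default_configurations:
        default_runs, tuned_runs = [], []
        for _ in range(repetitions):                  # average over 3 repetitions
            default_runs.append(run_benchmark(cfg, io))                    # default settings
            tuned_runs.append(run_benchmark(cfg, io, auto_tune=True,
                                            model_path=model_path))        # auto-tuned settings
        d_avg = sum(default_runs) / repetitions
        t_avg = sum(tuned_runs) / repetitions
        old_bandwidths.append(d_avg)
        new_bandwidths.append(t_avg)
        if t_avg > d_avg:
            c_im += 1
        else:
            c_di += 1
    mean_old = statistics.mean(old_bandwidths)
    mean_new = statistics.mean(new_bandwidths)
    return {
        "overall_improvement_pct": (mean_new - mean_old) / mean_old * 100.0,
        "improved_pct":            c_im / len(old_bandwidths) * 100.0,
        "dis_improved_pct":        c_di / len(old_bandwidths) * 100.0,
        "default": {"mean": mean_old, "max": max(old_bandwidths),
                    "min": min(old_bandwidths),
                    "median": statistics.median(old_bandwidths),
                    "stdev": statistics.stdev(old_bandwidths)},
        "tuned":   {"mean": mean_new, "max": max(new_bandwidths),
                    "min": min(new_bandwidths),
                    "median": statistics.median(new_bandwidths),
                    "stdev": statistics.stdev(new_bandwidths)},
    }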

4 Experimental results and evaluation

In this section, we outline the runtime complexity analysis of using the ANN predictions for parameter selection, the improvements in READ/WRITE bandwidth achieved by auto-tuning the default configuration test cases in Table 5, and the common configuration settings applied by the system model during the auto-tuning process.

The auto-tuning of the test cases is conducted on 16 compute nodes of the Intel Xeon Gold 6148 (Skylake) cluster KAY, with 16 LFS disks employed as IO object storage targets (OSTs) [37]. The ANN models were trained and tested using PyTorch tensors on NVIDIA Tesla V100 GPU cards [27, 38].

4.1 Runtime complexity analysis of ANNs predictions

Since this parameter value selection procedure checks all combinations by brute force, its runtime cost should be analysed. Line 29 of Listing 3 runs the ANN feed-forward propagation pass to predict a value using model(X). In our case, there are three steps: (1) input layer to Hidden Layer 1, (2) Hidden Layer 1 to Hidden Layer 2 and (3) Hidden Layer 2 to output layer. Table 7 presents all the steps involved, alongside the number of computations in terms of multiplications (Mul.) and additions (Add.).

In the first step, there are roughly 3584 computations (multiplications and additions). Similarly, in the second step there are approximately 65,536 computations. This is the most expensive phase of the feed-forward propagation in terms of both computation and memory consumption, because the hidden layers have the greatest number of nodes and their weight matrices the largest memory allocation.

Finally, the third and last step performs 256 computations. Summing these gives 69,376 computations in total for a single feed-forward propagation pass to predict a value. In the case of a READ operation the forward propagation pass runs 8 (\(4\times 2\)) times, making a total of 555,008 computations to select the parameters predicting the maximum bandwidth. In the case of a WRITE operation the forward propagation pass runs 160 (\(4\times 2\times 4\times 5\)) times, which makes a total of 11,100,160 computations, according to the code logic of Listing 3.

How fast these millions of computations can be processed depends on memory resource usage during the feed-forward propagation. Table 6 shows the matrices and the memory space used during a forward propagation pass. These matrices are part of the ANN model, which is created with the default data type float32, a 32-bit or 4-byte floating point number. As can be seen, seven matrices are used in a feed-forward propagation pass.

The total memory required by the ANN model is 140,320 bytes, or almost 137 KiB. If the model were created using 64-bit (8-byte) double precision floating point numbers, this value would double to almost 274 KiB. With either datatype, these matrices are easily cacheable in the CPU cache memory. Therefore, running the millions of computations stated earlier takes a negligible execution time, on the order of \(10^{-4}\) seconds; the loading time of the required libraries dominates, taking around 30 s for a first-time execution and less than a second afterwards, as tested on a KAY compute node [37].
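These operation and memory counts can be reproduced with a few lines of Python, assuming the 7-256-128-1 architecture and 4-byte float32 parameters:

# Verify the computation and memory figures for the 7-256-128-1 ANN.
layers = [(7, 256), (256, 128), (128, 1)]

ops_per_pass = sum(2 * n_in * n_out for n_in, n_out in layers)  # multiplications + additions
print(ops_per_pass)          # 3584 + 65536 + 256 = 69376 per forward pass
print(8 * ops_per_pass)      # READ:  4 x 2 candidate settings  -> 555,008
print(160 * ops_per_pass)    # WRITE: 4 x 2 x 4 x 5 candidates  -> 11,100,160

# Memory: the input vector plus three weight matrices and three bias vectors.
elements = 7 + sum(n_in * n_out + n_out for n_in, n_out in layers)
print(elements * 4)          # 35,080 elements x 4 bytes = 140,320 bytes (~137 KiB)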

For simplicity, we coded the parameter selection logic as a separate Python script executed with the PyTorch module. It interfaces with the calling MPI based C++ program, applies the new configuration parameter settings and runs the READ or WRITE operation, as stated in Listing 4. Alternatively, the trained ANN model could be embedded in the same MPI based program using the C++ version of the PyTorch library; this is more complex than Python scripting, but could be even faster by avoiding the extra program loading within the already executing program.

Table 6 Matrices description and memory usage during ANNs feed forward propagation pass
Table 7 ANNs feed forward propagation pass to predict bandwidth

4.2 READ auto-tuning and common configurations analysis

In this section, we provide a detailed discussion of the auto-tuning performance with respect to READ operations.

Table 8 shows the READ improvements after executing the procedure presented in Listing 5. The first observation is the number of test cases that improved during testing: 1206, or 82.7% of the total 1458 cases. On further examination, the bandwidth values also show significant improvements. The mean bandwidth of the tuned settings is 62,630.5 MiB/s, which is 1.65\(\times\) greater than the mean bandwidth of 37,798.1 MiB/s for the default settings. This is a clear indication of how significant our tuning approach can be with respect to the critical metric of bandwidth. The overall READ bandwidth improvement is approximately 65.7%, a significant performance gain. The remaining statistics present the maximum, minimum, median and standard deviation of the bandwidth values for the default configurations against the tuned configurations.

Furthermore, Fig. 5 shows the default versus tuned bandwidths across all 1458 test configuration settings. It can be seen that the auto-tuned bandwidths are far better than the default bandwidths in many cases.

Table 8 READ auto-tuning improvement results
Fig. 5 READ improvements

The ANN predictions have driven the auto-tuning process and demonstrate this significant READ bandwidth optimization. It is also worth noting the common auto-tuned parameter values selected throughout the test cases. Table 9 shows the performance of the three common tuned parameter settings versus the default configuration test cases, stated in terms of the relative improvements and dis-improvements. The first tuned configuration uses a chunk size of 1 GiB and a non-collective file access pattern.

This setting was selected in the majority of test cases, 1332 times, resulting in 1156 occurrences of improvement and 176 occurrences of dis-improvement. Probabilistically, this configuration setting therefore has an 86.8% chance of improving bandwidth and a 13.2% chance of dis-improving it.

The second frequently tuned setting involved a chunk size of 2 GiB with the non-collective file access pattern. This setting was selected 114 times, yielding 48 improvements and 66 dis-improvements. In this case, the model prediction resulted in a 42% chance of improvement and a 58% chance of dis-improvement.

The third and last setting selected during auto-tuning was a 1 GiB chunk size with the collective file access pattern. It was selected the least often, showing 2 improvements and 10 dis-improvements, which indicates a 16.7% chance of bandwidth improvement. It is therefore the least favourable setting to apply in auto-tuning.

We note that the non-collective file access pattern was a common factor in the improvements recorded. Non-collective access was selected most often and proved successful with two different chunk size values throughout the test cases, yielding close to an 83% chance of improving bandwidth according to Table 9. Therefore, according to our findings, non-collective access is the most favourable pattern for maximising READ bandwidth.

Table 9 Common READ auto-tuned configurations

4.3 WRITE auto-tuning and common configurations analysis

This section presents a series of results related to MPI WRITE operation optimization. It is evident from Table 10 that the majority of test cases improved when auto-tuning was applied: 1353 improvements were observed, which is 92.8% of the total 1458 default configuration test cases. The remaining data shows the mean bandwidth values of the default and tuned configurations. The mean bandwidth of the tuned configurations is 9798.1 MiB/s, which is 1.83\(\times\) greater than the mean bandwidth of the default configurations (5346.9 MiB/s). These figures correspond to an overall bandwidth improvement of 83.2%, showing that the WRITE bandwidth has been significantly optimized by our ANN prediction based auto-tuning strategy.

We note that the maximum tuned bandwidth is greater than the maximum default bandwidth. This significant increase can be seen in the minimum and median bandwidth values from the default to tuned settings. The minimum value increased by almost 49\(\times\) and the median value by 2.2\(\times\) due to auto-tuning. The standard deviations remain comparable between both default and auto-tuned results.

To support these observations, Fig. 6 shows the default versus tuned bandwidths. The auto-tuned bandwidths are consistently higher than the default bandwidths in the majority of cases. Therefore, it is safe to conclude that auto-tuning based on the ANN predictions yields a significant performance gain in WRITE operation bandwidth.

Table 10 WRITE auto-tuning improvement results
Fig. 6 WRITE improvements

The configuration settings most commonly selected by our auto-tuning process for WRITE operations are shown in Table 11, which lists the five common settings applied during the auto-tuning of the default configuration test cases. For auto-tuning WRITE operations we have four tunable parameters: stripe count, stripe size, chunk size and file access pattern. The most common configuration, shown first in the table, was selected 864 times. It contains a stripe count of 16, a stripe size of 2048 MiB and the non-collective (contiguous) file access pattern. This setting resulted in 794 improvements and 70 dis-improvements, both the largest counts compared to the rest of the settings. Probabilistically, the chance of improving bandwidth with this setting is 91.9%, with an 8.1% chance of dis-improvement.

The second, third and fifth configurations in Table 11 show 103, 145 and 97 improvements, respectively. Because these settings were selected less frequently during auto-tuning, they show comparatively few dis-improvements: 5, 17 and 11. In terms of probability, the second configuration has a 95.4% chance of improving bandwidth with a 4.6% chance of dis-improving, which is even more promising than the first configuration, while the third and fifth settings have a \(\approx\) 90% chance of improving bandwidth and a \(\approx\) 10% chance of dis-improving it, marginally less favourable than the first configuration.

The fourth configuration has 214 improvements and only 2 dis-improvements. Probabilistically, it therefore has a 99% chance of improving bandwidth and only a 1% chance of dis-improving it. Interestingly, this configuration shares a feature with recently published results in this area: its selected stripe size of 1 MiB was identified and suggested as a means of increasing WRITE bandwidth from the parallel plots in [9].

Another factor to note is that a stripe count of 16 and the non-collective file access pattern are common to all of these selected configurations. The stripe count of 16 was the maximum in the range specified in Listing 3 for selection, so it can be inferred that the maximum available stripe count should be kept; in other words, bandwidth is expected to increase with an increasing stripe count. Since the file access pattern is set to non-collective in all of these settings, it appears best for maximising overall bandwidth.

From studying these settings, we can apply them or similar ones to maximise bandwidth throughput in the event of the prediction model not being available for parameter selection.

Table 11 Common WRITE auto-tuned configurations

5 Summary of the work

The results presented in this paper indicate strong performance gains arising from using ANN models to auto-tune HPC MPI–IO configurations. These improvements have been demonstrated with respect to bandwidth performance primarily.

The tuning process focuses on a number of key parameters that directly affect IO bandwidth. Since we are optimising at runtime, tunable parameters can be modified, e.g., the chunk size per MPI process and the Lustre striping parameters. Therefore, we identified the tunable parameters and settings for READ and WRITE operations separately.

The optimization of HPC IO using ML predictions has been a common approach in recent studies [19, 20, 21, 23, 26, 13, 14, 16]. The parameters can be modified or tuned before IO execution to improve IO bandwidth. Predictive IO modelling over multiple configurations is therefore a primary requirement of the problem at hand, and ANNs are a particularly convincing technique for predictive IO modelling due to their high forecasting accuracy [24, 26, 28].

In previous work, the key parameters were identified and benchmarked [9]. The research outlined in this paper extends that work by identifying the tunable and non-tunable parameters. The previous benchmarks were re-executed on the generated list of all possible configurations in Table 1 in order to regenerate the ANNs through an ML process, as described in Table 2, with their specific hyper-parameters in Table 4, for both READ and WRITE operations. The mapping of the input layer nodes to the configuration parameters is described in Table 3. The models predict bandwidth precisely and match the patterns of the changing configurations. The benchmark and prediction results are presented in Figs. 3 and 4.

Subsequently, the auto-tuning approach was applied on the default configuration test settings, as mentioned in Table 5, with respect to READ and WRITE operations.

The auto-tuning of configuration parameters based on the ANN predictions has shown significantly optimized IO bandwidth performance. The READ operation showed a 65.7% improvement in bandwidth with almost 83% of the test cases improved, whereas the WRITE operation showed an 83% improvement in overall IO bandwidth with almost 93% of the test cases improved.

Subsequently, we discussed the common configuration settings arising from auto-tuning for both READ and WRITE operations. The non-collective or contiguous file access pattern was selected in the majority of the READ test cases, and the stripe count was set to the maximum available value of 16 in the WRITE test cases in order to obtain the maximum possible bandwidth.

In the event that an ANN model is not available, these configurations, together with variations of the other parameter values, can be considered for tuning MPI–IO operations to achieve significant IO bandwidth optimization.

6 Conclusion

In this paper, we analyzed MPI–IO bandwidth performance improvements obtained by auto-tuning configuration parameters via ANN predictions. The paper explains the need for an efficient and adaptable optimization technique and extends the work from [9]. This aligns with established research which examines the challenges of HPC–IO performance through ML prediction [19, 20, 21, 23, 26, 13, 14, 16].

There are three main contributions outlined in this research paper. The first is the auto-tuning strategy based on the ANN MPI–IO bandwidth predictions over the various configuration parameters. The second is the statistical analysis of the overall auto-tuned cases with respect to bandwidth performance. The third is the identification of the common tunable parameter settings selected and used by the ANN based system model.

From the results presented in this paper, we have shown an overall READ bandwidth improvement of 65.7% with almost 83% of the test cases improved, whereas the overall WRITE bandwidth improvement reached up to 83% with almost 93% of the test cases improved.

Furthermore, from the configurations identified in this research, there are clear benefits to, firstly, using the non-collective file access pattern in both READ and WRITE cases and, secondly, using the maximum Lustre stripe count value in WRITE operations. Throughout our results, these observations were prominent in the most optimal settings.

In the event that an ML optimisation approach is not available, the configuration policies we have identified could easily be applied to address the bandwidth utilisation challenges noted in previous research. Furthermore, the approach and results outlined in this paper offer considerable advantages to practitioners in this area. We have demonstrated substantial improvements in HPC MPI–IO bandwidth performance, thereby enhancing the overall efficiency of large-scale data processing tasks.