1 Introduction

As an increasing number of fields generate and analyze large amounts of data, the demand for large-capacity storage has grown, and research on exa-scale storage is actively underway. Exa-scale storage must handle tens to hundreds of times the throughput, capacity, system size, power usage, number of files, and failure frequency of existing storage. Therefore, various studies are required to efficiently handle these rapidly increasing demands [9, 15, 20, 22].

In particular, replication has been widely used to ensure reliability in storage. Because replication copies data to many different devices, it requires more physical storage than the logical data size. For example, triple replication requires three times the physical storage of the data actually stored. When replication is used in exa-scale storage, the system will grow to several times the required capacity, which significantly increases the cost of deployment and management. Therefore, space efficiency is becoming notably important to reduce the system size in exa-scale storage [4, 14].

Erasure coding (EC) is a method of encoding data using an erasure code and recovering the original data by decoding upon data loss [25]. EC significantly improves space efficiency. For example, 8 + 2 EC tolerates the same number of failures as triple replication but more than doubles the space efficiency. EC is divided into client-based encoding and data-server-based encoding, depending on where the parity is calculated. GlusterFS [7] is a client-based encoding file system; HDFS [1] and Ceph [2] are data-server-based encoding file systems. In this study, data-server-based encoding was selected to minimize the network communication cost between the client and the data server, with thin clients in mind.
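The space-efficiency comparison above reduces to simple arithmetic, and the short Python sketch below restates it; the parameters are only those named in the text, and the helper names are ours, not from any storage system:

```python
# Raw capacity stored per logical byte: N-way replication vs. K + M EC.

def replication_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte under N-way replication."""
    return float(copies)

def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per logical byte under K + M erasure coding."""
    return (k + m) / k

# Triple replication and 8 + 2 EC both tolerate two device failures.
print(f"3-way replication: {replication_overhead(3):.2f}x raw capacity")
print(f"8 + 2 EC:          {ec_overhead(8, 2):.2f}x raw capacity")
```

Both configurations survive two device failures, but 8 + 2 EC stores 1.25 bytes of raw capacity per logical byte versus 3 bytes for triple replication.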

However, although EC is highly space efficient, it suffers from I/O performance degradation factors such as the Parity Calculation, the Data Distribution cost, the small I/O problem, read-modify-write, and degraded I/O (repair I/O) [8, 21, 31, 33]. Because of these performance issues, the storage industry recommends EC for Cold I/O workloads such as backup and archive [5, 8, 30]. In recent years, various technologies such as single instruction multiple data (SIMD) and multi-channel I/O have been introduced to address this problem.

Existing research has mainly focused on improving the performance of the Parity Calculation in erasure coding. In [3, 10, 23], the authors proposed parallel erasure coding to reduce the overall erasure coding processing time, using multi-core CPUs or GPUs for efficient parallel processing. In [16], the authors proposed recovery using a minimal amount of data. In [32], the authors proposed the Shingled Erasure Code (SHEC), which was designed to recover efficiently from multiple disk failures and could be customized by the user through various local parity group layouts.

Previous studies have not addressed the Data Distribution cost, the small I/O problem, and other issues that arise in actual storage. In this study, we identified the problems and bottlenecks that appear when erasure coding is used in real storage. First, we measured the write performance of various erasure codes to find those suitable for real scale-out storage. Then, we examined the weight of the Parity Calculation in EC and analyzed the write cost in real storage to find the most important degradation factor.

To this end, we measured the write time of each processing step in the EC volume at various I/O sizes and network speeds. We then analyzed the results to identify the key problems when EC is applied to real storage, predicted the write time of the EC volume in exa-scale storage, and identified the expected problems.

2 Load analysis of EC in real storage

In this section, we discuss the measurement of the write performance of various erasure codes on real scale-out storage. We identified erasure codes suitable for real scale-out storage and examined the weight of the Parity Calculation in EC.

2.1 Concept of EC

EC is a method of encoding data using an erasure code and recovering the original data by decoding upon data loss. Encoding means calculating parity from the data and is performed per encoding unit. EC divides the encoding unit into K original data blocks and generates M parity data blocks by encoding the K original data blocks; this is called K + M EC. Figure 1 shows an example of a 4 + 2 EC volume to illustrate the configuration of a K + M EC volume.

Fig. 1 Example of 4 + 2 EC volume

As shown in Fig. 1, in the 4 + 2 EC volume, the original data are stored in three stripes through three encoding operations. The encoding unit is divided into 4 original data blocks, and 2 parity data blocks are generated by the Parity Calculation. When data smaller than the encoding unit are written, the unit is zero-padded to the encoding-unit size before encoding. A stripe is the minimum storage unit of the EC volume: the set of original data blocks and parity data blocks associated with a single encoding operation. A chunk is a file of a specified size stored on a storage device such as a disk; it holds the data blocks at the same position across stripes. A chunk set is the set of chunks over which a single stripe is divided and stored, and a file consists of one or more chunk sets. In Fig. 1, the file “sample” consists of 2 chunk sets. A chunk set consists of 4 original data chunks and 2 parity data chunks, and a stripe consists of 4 original data blocks and 2 parity data blocks.
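The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not ExSTorus-FS code: the 1 KB block size is an assumption chosen so that 4 data blocks plus 2 parity blocks match the 6 KB stripe used later in this paper, and the single XOR parity stands in for a real erasure code, which would derive both parity blocks from a code such as Reed-Solomon:

```python
# Striping with zero padding for a 4 + 2 volume (illustrative sketch).

BLOCK = 1024   # bytes per block (assumed)
K = 4          # original data blocks per stripe

def split_into_stripes(data: bytes) -> list[list[bytes]]:
    """Split data into stripes of K data blocks, zero-padding the tail."""
    unit = K * BLOCK                          # encoding unit = K data blocks
    padded = data + b"\x00" * (-len(data) % unit)
    stripes = []
    for off in range(0, len(padded), unit):
        unit_bytes = padded[off:off + unit]
        stripes.append([unit_bytes[i * BLOCK:(i + 1) * BLOCK]
                        for i in range(K)])
    return stripes

def xor_parity(blocks: list[bytes]) -> bytes:
    """One XOR parity block; a real M = 2 code needs Reed-Solomon or similar."""
    out = bytearray(BLOCK)
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

stripes = split_into_stripes(b"sample" * 2000)    # 12,000 bytes -> 3 stripes
print(len(stripes), "stripes;", len(xor_parity(stripes[0])), "byte parity block")
```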

2.2 ExSTorus-FS

ExSTorus-FS [18] is a scale-out distributed file system based on FUSE [6] for exa-scale storage; it supports Torus-based network configurations, EC-based Hot I/O, and multiple MDSs [17]. ExSTorus-FS was developed on the basis of MAHA-FS, which provides cloud storage services at the scale of hundreds of petabytes [19].

ExSTorus-FS supports K + M EC volumes with K ∈ {2, 4, 8, 16} and M ∈ {1, 2, 4} [17], and provides the erasure codes Reed-Solomon (RS) [24], Cauchy Reed-Solomon (CRS) [28], Liberation (LIB) [27], AVX:CRS, and AVX2:CRS through the Jerasure 1.2 [26] and Intel ISA-L 2.10 [11, 12] libraries. CRS and LIB are performance-optimized variants of RS. AVX:CRS and AVX2:CRS are CRS implementations accelerated with SIMD [29] instruction sets.

ExSTorus-FS is a data-server-based encoding file system; as noted in Sect. 1, this design minimizes the network communication cost between the client and the data server when thin clients are used.

ExSTorus-FS consists of the metadata server (MDS), the data server (DS), and the client. The MDS manages the metadata of files and monitors and manages the system. The DS processes file I/O requests and periodically reports the status of its storage devices and its server load to the MDS. The client connects to ExSTorus-FS and processes user requests through a POSIX interface. Figure 2 shows the process of writing data in the EC volume of ExSTorus-FS.

Fig. 2 Process of writing data in the EC volume of ExSTorus-FS

As shown in Fig. 2, when a write is requested by the application, the client performs file layout allocation through the MDS. The MDS checks the EC volume configuration and allocates the chunks required for the requested file. The client analyzes the file layout and transmits the write data to the DS that plays the Master role; this DS is referred to as the Master. For each chunk set, one of its DSs is assigned as the Master, which is responsible for the Parity Calculation and Data Distribution. After the Master encodes the data, the data and parity are distributed to the DSs that play the Slave role; these DSs are referred to as Slaves. A Slave is responsible for recording data to storage and reports the result of the recording to the Master. The Master collects the processing results of all Slaves and reports the final result to the client. Figure 3 shows the recording from several clients.

Fig. 3 Process of recording from several clients

As shown in Fig. 3, write requests occur for four files: file1 and file2 are issued from client1, and file3 and file4 from client2. Each write request is sent to the Master determined by the layout of the corresponding file; DS1 acts as the Master for file1 and file3, DS2 for file2, and DS3 for file4.
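The control flow of Figs. 2 and 3 can be condensed into the toy sketch below. Every class and method name is a hypothetical stand-in chosen for illustration, not an ExSTorus-FS API, and the Parity Calculation is omitted so that the Master/Slave interaction stands out:

```python
# Toy, in-memory walk-through of the EC write path (hypothetical names).

class Slave:
    def __init__(self, name: str):
        self.name, self.store = name, []

    def record(self, block: str) -> bool:
        self.store.append(block)      # Slave: record the block to storage
        return True                   # ... and report the result upward

class Master(Slave):
    def write(self, slaves: list["Slave"], blocks: list[str]) -> bool:
        self.record(blocks[0])        # Master keeps one block locally
        # Data Distribution: send the remaining K + M - 1 blocks to Slaves.
        results = [s.record(b) for s, b in zip(slaves, blocks[1:])]
        return all(results)           # gather results, report to the client

master = Master("DS1")                               # Master fixed per chunk set
slaves = [Slave(f"DS{i}") for i in range(2, 7)]      # 4 + 2 -> 5 Slaves
print(master.write(slaves, [f"block{i}" for i in range(6)]))   # True
```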

2.3 Write performance for various erasure codes

In this section, we describe the sequential write performance of each erasure code. Experiments were performed on seven servers, each equipped with two Intel Xeon E5-2609 (2.40 GHz) CPUs, 64 GB of memory, 1 G/10 G Ethernet, and ten 4 TB 7200-rpm HDDs; the installed software included CentOS 7.0, ExSTorus-FS 1.0, and iozone 3.465. Experiments were performed on a 4 + 2 EC volume with a stripe size of 6 KB and a chunk size of 64 MB.

This experiment measured the write throughput and CPU usage while changing the erasure code among RS, CRS, LIB, AVX2:CRS, and NONE in the 4 + 2 EC volume of ExSTorus-FS. NONE indicates that no encoding function call was performed during write processing. These experiments were performed with 1 MDS, 6 DSs, and 1 client on 10 G Ethernet. For the performance comparison, we fixed the Master DS and performed writes in single-threaded and multi-threaded modes.

As shown in Fig. 4, AVX2:CRS performed most similarly to NONE. The performance of LIB and CRS was reduced by 15%, and that of RS by 29%, compared to NONE. In other words, the Parity Calculation accounted for 4–29% of the total I/O time. As shown in Fig. 5, AVX2:CRS, LIB, and CRS had CPU utilization similar to NONE, and RS differed from NONE by only 9%. The experimental results showed that hardware-accelerated (SIMD) encoding outperformed pure software encoding.

Fig. 4 Single-thread average throughput by erasure code

Fig. 5 Single-thread average CPU utilization by erasure code

The multi-thread sequential write performance test was performed 10 times using iozone [13], e.g., iozone -i 0 -r 128k -s 5g -t 10 -+n -w. To increase the parallel processing performance of ExSTorus-FS, we set the number of channels to 10 in the client. Figures 6 and 7 show the average throughput and average CPU usage of each erasure code in multi-thread sequential writing.

Fig. 6 Multi-thread average throughput by erasure code

Fig. 7 Multi-thread average CPU (4 cores) utilization by erasure code

As shown in Figs. 6 and 7, AVX2:CRS performed most similarly to NONE, and RS performed worst. LIB was similar to AVX2:CRS in throughput, but its CPU utilization was high. In contrast, CRS was similar to AVX2:CRS in CPU utilization but had lower throughput. In other words, the Parity Calculation accounted for 4–26% of the total I/O time.

In this section, we discussed measurements of the sequential write performance of various erasure codes on real storage in single-threaded and multi-threaded modes. These experiments revealed that EC encoding reduced throughput by up to 30% and increased CPU utilization by up to 20%. However, we confirmed that AVX2:CRS enables the storage system to perform EC with no significant performance degradation: its degradation due to the Parity Calculation was notably small (approximately 4%). In the following, we analyze the write process to identify the issues expected when EC is used in exa-scale storage.

3 Write cost of EC volume in exa-scale storage

In this study, we examined the effects of the Parity Calculation, the Data Distribution cost, and the small I/O problem among the various performance degradation factors. In this section, we discuss the analysis of the write time to check the availability of EC in exa-scale storage. Assuming that commodity servers are used to build exa-scale storage, we limited the exa-scale-specific change to the network bandwidth. To facilitate the analysis, we divided the write process into five steps. Figure 8 shows the write process for the EC volume.

Fig. 8 Write time classification for the EC volume

As shown in Fig. 8, the write process was divided into five steps: Client Overhead, Master Process, Parity Calculation, Data Distribution, and Slave Process. Client Overhead is the total write time minus the time spent in the DS. Master Process is the total DS processing time minus the Parity Calculation, Data Distribution, and Slave Process times. Parity Calculation is the processing time of the EC encoding function. Slave Process is the time for a Slave DS to receive and process data. Data Distribution is the time to send data to the Slaves and receive the processing results, excluding the Slave Process time.

We made several assumptions for deriving the formula to calculate the write time. First, the write was a single-thread sequential write. Second, the record size of I/O was fixed to R. Finally, the Master DS that performed the encoding was fixed. The write time (WT) is the sum of the times for each I/O processing step. Therefore, the time to write n records can be expressed as follows:

$$ \mathrm{WT} = \sum_{r=1}^{n} \left( \mathrm{CO}_r + \mathrm{MP}_r + \mathrm{PC}_r + \mathrm{DD}_r + \mathrm{SP}_r \right) $$
(1)

where COr is the Client Overhead time for record r, MPr is the Master Process time, PCr is the Parity Calculation time, DDr is the Data Distribution time, and SPr is the Slave Process time.
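Formula (1) transcribes directly into code. The sketch below is merely a restatement of the sum (field names follow the paper's abbreviations; it is not part of any measurement tooling):

```python
# Formula (1): total write time as the sum of five per-record step times.
from dataclasses import dataclass

@dataclass
class StepTimes:
    co: float  # Client Overhead
    mp: float  # Master Process
    pc: float  # Parity Calculation
    dd: float  # Data Distribution
    sp: float  # Slave Process

def write_time(records: list[StepTimes]) -> float:
    """WT = sum over all records of (CO + MP + PC + DD + SP)."""
    return sum(t.co + t.mp + t.pc + t.dd + t.sp for t in records)
```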

The Client Overhead covers client-side processing, which includes several tasks such as I/O request processing, chunk allocation, layout gathering, layout analysis, file status checks, data transfer, gathering of processing results, and file metadata updates. These tasks fall into three areas depending on the processing subject: first, Client Tasks processed inside the client, such as I/O request processing, layout analysis, and file status checks; second, MDS Tasks processed through communication with the MDS, such as chunk allocation, layout gathering, and file metadata updates; and third, DS Tasks processed through communication with the DS, such as data transfer and gathering of processing results.

Therefore, the Client Overhead can be divided into Client Tasks, MDS Tasks, and DS Tasks: CO is the sum of the Client Tasks time (CT), the MDS Tasks time (MT), and the DS Tasks time (DT). Hence, COr can be expressed as follows:

$$ \mathrm{CO}_r = \mathrm{CT}_r + \mathrm{MT}_r + \mathrm{DT}_r $$
(2)

where CTr covers internal processing without network communication and can therefore be replaced by a constant value according to the system specifications. MTr is small message-based network communication that costs less than data transmission, so it can likewise be replaced by a constant value according to the system specifications. DTr is the latency, dependent on the record size R, of transmitting data and waiting for the results. Thus, DTr is expressed by formula (3), where \( \mathrm{LM}_R \) is the transfer latency of a record of size R between the client and the Master, and r is the record index:

$$ \mathrm{DT}_r = r \cdot \mathrm{LM}_R $$
(3)

The Master Process is the time to perform Master-side tasks such as chunk-set locking, memory allocation, memory copy, and the encoding function call. Parity Calculation is the time for the Master to encode the data according to the erasure code. Slave Process is the time to perform Slave-side tasks such as data recording, logging, and I/O locking. As MPr, PCr, and SPr are internal processes without network communication, they can be replaced with constant values according to the system specifications.

Data Distribution is the time required to transfer data from the Master to the Slaves and to collect the processing results from the Slaves; DDr is defined as the corresponding latency. In particular, the Master transfers the data to the K + M − 1 Slaves (all DSs in the chunk set except the Master itself) and waits for all Slaves to complete processing, so DDr is the maximum of the per-Slave processing times. As a record of size R is divided into K parts according to the characteristics of EC, the size of the data block transmitted to each Slave is R/K. Hence, DDr can be expressed as follows:

$$ \mathrm{DD}_r = r \cdot \max\left( s_1 \cdot \mathrm{LS}_{R/K},\; s_2 \cdot \mathrm{LS}_{R/K},\; \ldots,\; s_{K+M-1} \cdot \mathrm{LS}_{R/K} \right) $$
(4)

where \( \mathrm{LS}_{R/K} \) is the transfer latency of a block of size R/K between the Master and a Slave. The write time of the EC volume in the storage can then be expressed as follows:

$$ \mathrm{WT} = \sum_{r=1}^{n} \left( \mathrm{CT}_r + \mathrm{MT}_r + r \cdot \mathrm{LM}_R + \mathrm{MP}_r + \mathrm{PC}_r + r \cdot \max\left( s_1 \cdot \mathrm{LS}_{R/K},\; \ldots,\; s_{K+M-1} \cdot \mathrm{LS}_{R/K} \right) + \mathrm{SP}_r \right) $$
(5)

where CTr, MPr, PCr, and SPr are constant values according to the system specifications, so (CTr + MPr + PCr + SPr) can be combined into a single constant Vr. Assuming that all Slaves have identical latencies in DDr, we obtain WT as follows:

$$ \mathrm{WT} = \sum_{r=1}^{n} \left( V_r + \mathrm{MT}_r + r \cdot \mathrm{LM}_R + r \cdot \mathrm{LS}_{R/K} \right) $$
(6)

According to formula (6), WT is affected by Vr, MTr, \( \mathrm{LM}_R \), and \( \mathrm{LS}_{R/K} \). In general, network communication accounts for a large portion of I/O in distributed storage. In particular, if K or M increases in EC, the cost of \( \mathrm{LS}_{R/K} \) increases because the number of Slaves grows and the per-Slave I/O size shrinks. However, with the high throughput expected of exa-scale storage, the proportion of network communication can be reduced: \( \mathrm{LM}_R \) and \( \mathrm{LS}_{R/K} \) become smaller as the network speed increases, which makes the proportions of Vr and MTr relatively larger. Therefore, in Sect. 4, we discuss the experimental verification of the weight of each write step and the prediction of future problems.
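To make the effect of network speed on formula (6) concrete, the sketch below evaluates it under purely hypothetical parameters: the constant costs V and MT, the record size, and the naive bandwidth-based latencies are all assumptions, not measurements. It only illustrates how a faster network shrinks the latency terms so that V and MT dominate:

```python
# Formula (6) under assumed parameters (illustration only, no real data).

def wt(n: int, v: float, mt: float, lm_r: float, ls_rk: float) -> float:
    """WT = sum_{r=1}^{n} (V_r + MT_r + r*LM_R + r*LS_{R/K})."""
    return sum(v + mt + r * lm_r + r * ls_rk for r in range(1, n + 1))

V, MT = 20e-6, 30e-6                 # assumed constant per-record costs (s)
R, K = 128 * 1024, 4                 # 128 KB record split into K = 4 blocks
for gbps in (1, 10, 100):
    lm = R * 8 / (gbps * 1e9)        # naive transfer latency of R
    ls = (R / K) * 8 / (gbps * 1e9)  # naive transfer latency of R/K
    print(f"{gbps:>3} Gb/s: WT(n=10) = {wt(10, V, MT, lm, ls) * 1e6:.1f} us")
```

Under these assumed numbers, the latency terms dominate at 1 Gb/s but shrink by two orders of magnitude at 100 Gb/s, leaving the constant terms V and MT to set the floor.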

4 Performance evaluations

In this section, we discuss the validation of the effectiveness of EC in storage by measuring the performance of real EC-based storage. In particular, we compared the performance on 1 G/10 G/40 G Ethernet to predict the performance of exa-scale storage and to identify possible future problems.

This experiment measured the running time of each processing step in the 4 + 2 EC volume of ExSTorus-FS. The EC volume had a 6 KB stripe, a 64 MB chunk, and the AVX2:CRS erasure code. These experiments were performed with 6 DSs, 1 MDS, and 1 client on 1 G/10 G/40 G Ethernet. The experiment used seven servers, the minimum number of nodes required for a 4 + 2 EC volume, to minimize performance interference when measuring the processing steps. For the performance comparison, we fixed the Master that performed the encoding.

4.1 Sequential write performance on 1 G Ethernet

In this section, we describe the measured write time of the EC volume in real distributed storage with 1 G Ethernet. The write time was divided into the five steps classified in Sect. 3. In this experiment, the dd command was executed while the I/O size was varied from 4 KB to 128 KB, e.g., dd if=/dev/urandom of=/exa/file1 bs=4K count=50000. The write time was the average over 10 executions of the dd command. Table 1 shows the average sequential write times per processing step with 1 G Ethernet.

Table 1 Average sequential write time per processing step on 1 G Ethernet

Figure 9 shows the execution times in Table 1 converted to unit times, i.e., divided by the number of writes (50,000).

Fig. 9 Unit time per processing step in 1 G Ethernet

As shown in Fig. 9, all unit times grew with the I/O size. The Client Overhead and Slave Process differed by approximately a factor of four between 4 KB and 128 KB, the Master Process and Parity Calculation by approximately a factor of 17, and the Data Distribution by approximately a factor of nine.

The Data Distribution, Client Overhead, and Slave Process operated per record, whereas the Master Process and Parity Calculation operated per stripe. Therefore, the execution times of the Master Process and Parity Calculation increased significantly because the number of encoding operations grows with the I/O size. The Data Distribution time also increased because larger I/O sizes require longer transmissions. Because the Client Overhead operated in write-back mode, its time did not increase significantly even when the I/O size increased.
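A back-of-the-envelope check illustrates the stripe-level scaling. Under the assumption that the 6 KB stripe of the 4 + 2 volume carries 4 KB of user data (four 1 KB blocks), the number of encoding operations per record grows with the I/O size:

```python
# Encoding operations per record (assumes 4 KB of user data per 6 KB stripe).
import math

DATA_PER_STRIPE = 4 * 1024           # bytes of user data per stripe (assumed)

for io_size in (4 * 1024, 128 * 1024):
    stripes = math.ceil(io_size / DATA_PER_STRIPE)
    print(f"{io_size // 1024:>3} KB record -> {stripes} encoding operation(s)")
```

This simple model predicts up to 32 times more stripe-level work for 128 KB than for 4 KB; the measured factor of about 17 is of the same order, with per-operation fixed costs plausibly absorbing the difference.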

Figure 10 shows the percentage of execution time per processing step at 4 KB and 128 KB for bottleneck identification.

Fig. 10 Percentage of execution per processing step in 1 G Ethernet

As shown in Fig. 10, the proportion of Data Distribution was the highest at approximately 93.1%, the proportion of Client Overhead was the second highest at approximately 3.1%, and the proportion of Parity Calculation was the lowest at approximately 1%. With 1 G Ethernet, the transmission speed was so low that the Data Distribution took up most of the total execution time, while the Parity Calculation was small enough to be ignored.

4.2 Sequential write performance on 10 G Ethernet

In this section, we describe the measured write time of the EC volume in real distributed storage with 10 G Ethernet. The write time was divided into the five steps classified in Sect. 3. In this experiment, the dd command was executed while the I/O size was varied from 4 KB to 128 KB, e.g., dd if=/dev/urandom of=/exa/file1 bs=4K count=50000. The write time was the average over 10 executions of the dd command. Table 2 shows the average sequential write times per processing step with 10 G Ethernet.

Table 2 Average sequential write time per processing step on 10 G Ethernet

Figure 11 shows the execution times in Table 2 converted to unit times, i.e., divided by the number of writes (50,000).

Fig. 11 Unit time per processing step in 10 G Ethernet

As shown in Fig. 11, all unit times grew with the I/O size. The Client Overhead and Slave Process differed by approximately a factor of 4.5 between 4 KB and 128 KB, the Master Process and Parity Calculation by approximately a factor of 13, and the Data Distribution by approximately a factor of three.

Figure 12 shows the percentage of execution time per processing step at 4 KB and 128 KB to identify the bottlenecks.

Fig. 12 Percentage of execution per processing step in 10 G Ethernet

As shown in Fig. 12, the proportion of Data Distribution rose to 78.7% at the smallest I/O size, and the proportions of the Client Overhead, Slave Process, Master Process, and Parity Calculation increased as the I/O size increased. The Client Overhead was the second highest at a maximum of approximately 19.2%, and the Parity Calculation was the smallest at a maximum of approximately 4.1%. As the network speed increased, the proportion of Data Distribution decreased, and those of the Client Overhead, Slave Process, Master Process, and Parity Calculation increased.

Figure 13 shows the execution time of each step at 1 G and 10 G for the 128 KB I/O size, which had the longest encoding time.

Fig. 13 Comparison of processing steps of 1 G and 10 G

As shown in Fig. 13, the Slave Process, Master Process, and Parity Calculation, which were not affected by the network speed, performed similarly at 1 G and 10 G. However, the Client Overhead and Data Distribution, which were affected by the network speed, varied between 1 G and 10 G. In particular, the Data Distribution time decreased as the network speed increased from 1 G to 10 G, whereas the Client Overhead time increased.

4.3 Prediction of write time of EC in exa-scale storage

In this section, we discuss the estimation of the write cost in exa-scale storage based on the experimental results presented in Sects. 4.1 and 4.2. In the EC performance results, the Master Process, Parity Calculation, and Slave Process were the same irrespective of the network performance, whereas the Data Distribution and Client Overhead were related to the network throughput. As noted in Sect. 1, exa-scale storage must handle tens to hundreds of times the throughput, capacity, system size, power usage, number of files, and failure frequency of existing storage; among these, network performance was the factor affecting EC performance the most.

We performed additional experiments at 40 G in system environments identical to those described in Sects. 4.1 and 4.2 to increase the reliability of the write cost estimates for exa-scale storage. Table 3 shows the average sequential write time of each processing step at 1 G/10 G/40 G for the 128 KB I/O size, which had the longest encoding time.

Table 3 Average 128 K sequential write time per processing step

Moreover, as shown in Fig. 13, we assumed that the Master Process, Parity Calculation, and Slave Process were the same irrespective of the network performance and that the Data Distribution and Client Overhead were related to the network throughput. Indeed, as Table 3 shows, the Data Distribution time decreased as the network speed increased, whereas the Client Overhead increased.

Figure 14 shows the percentage of execution time per processing step at 128 KB to identify the bottlenecks. Figure 15 shows the Data Distribution times predicted on the basis of the ratios observed at 1 G, 10 G, and 40 G.

Fig. 14 Percentage of execution per processing step in 1 G/10 G/40 G

Fig. 15 Prediction of Data Distribution time

As shown in Fig. 14, the proportion of Data Distribution decreased to 47.7% as the network speed increased, while the proportions of the Client Overhead, Slave Process, Master Process, and Parity Calculation increased. The Client Overhead was the second highest at a maximum of approximately 27%, and the Parity Calculation was the smallest at a maximum of approximately 6.6%.

We assumed that commodity servers and a 100 G network are used in the exa-scale storage for the write cost prediction. As shown in Table 3 and Fig. 14, the Master Process, Parity Calculation, and Slave Process at 1 G/10 G/40 G were almost the same irrespective of the network performance, whereas the Data Distribution and Client Overhead were related to the network throughput. The Client Overhead at 100 G was taken to be similar to that at 40 G because the performance difference was not large, whereas the Data Distribution varied considerably with the network throughput. Figure 15 shows the predicted Data Distribution time.

As shown in Fig. 15, the performance ratio of 1 G, 10 G, and 40 G was applied to 100 G to predict the execution time of Data Distribution.
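The exact extrapolation method is not stated in the text, so the sketch below shows one plausible way to apply the observed ratios: fit an inverse-bandwidth model DD ≈ a + b/bandwidth to the three measured points by least squares and evaluate it at 100 G. The dd_measured values are placeholders standing in for Table 3, not real data:

```python
# Hypothetical extrapolation of Data Distribution time to 100 G.

def fit_inverse_bandwidth(points: list[tuple[float, float]]) -> tuple[float, float]:
    """Least-squares fit of dd = a + b / bw over (bw_gbps, dd) pairs."""
    xs = [1.0 / bw for bw, _ in points]
    ys = [dd for _, dd in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b            # (a, b)

dd_measured = [(1, 93.0), (10, 12.0), (40, 4.0)]   # placeholders for Table 3
a, b = fit_inverse_bandwidth(dd_measured)
print(f"predicted DD at 100 G ~= {a + b / 100:.2f} (same units as input)")
```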

Figure 16 shows the percentage of execution time per processing step for the bottleneck identification.

Fig. 16 Prediction of write time of EC in exa-scale storage

As shown in Fig. 16, assuming 100 G, we found that the Data Distribution was reduced to approximately 17%, while the Parity Calculation, Master Process, Slave Process, and Client Overhead increased in relative terms. In particular, the Client Overhead increased by approximately 46% at 100 G. The proportion of Data Distribution sharply decreased from 93.1 to 17.2% but remained the largest, and the proportion of Parity Calculation sharply increased from 1 to 9.2% but remained the smallest.

More importantly, the Client Overhead rapidly increased by a factor of 15. Assuming that the Client Overhead keeps increasing at 100 G as it did at 40 G, its proportion grows sharply. As described in formula (2), the Client Overhead is composed of CT, MT, and DT. If CT stays roughly constant like the Master Process and DT shrinks like the Data Distribution, then MT must have increased rapidly.
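Formula (2) lets this inference be stated as simple subtraction: with the total Client Overhead measured and CT and DT estimated from their analogues (CT from the Master Process, DT from the Data Distribution trend), MT is the remainder. All numbers below are hypothetical placeholders:

```python
# MT inferred from formula (2): CO = CT + MT + DT  =>  MT = CO - CT - DT.
co_total = 100.0   # measured Client Overhead (hypothetical units)
ct_est = 10.0      # CT assumed roughly constant, like the Master Process
dt_est = 15.0      # DT assumed to shrink with bandwidth, like Data Distribution
mt_est = co_total - ct_est - dt_est
print(f"estimated MDS-communication cost MT = {mt_est}")
```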

5 Conclusions

In this study, we measured the I/O performance of various erasure codes when EC is used in storage and confirmed that encoding can be performed with little performance degradation when SIMD is used. In addition, we found that the network transmission cost was the largest performance hurdle, whereas the encoding cost accounted for only a small proportion of the EC write cost.

Finally, through EC cost estimates based on the bandwidth of exa-scale storage, we found that the Client Overhead could become a major problem in the future. In detail, assuming 100 G, the cost of the Client Overhead increased by approximately 46%, and the client process time was two times larger than the Data Distribution time. In future work, it will be necessary to analyze the Client Overhead in more detail and identify its bottlenecks.