
AIR: an approximate intelligent redistribution approach to accelerate RAID scaling


Abstract

Nowadays, videos and images are becoming the predominant format of stored data, and they take up far more space than conventional plain text. This rapid growth imposes strong scalability requirements on large data centers, where disk arrays (also referred to as "RAID") are the main devices used to store the data. To improve the scalability of RAID systems, several scaling approaches have been proposed that guarantee a uniform data distribution, decrease migration I/Os, and speed up the scaling process. However, typical approaches are offline and ignore the impact of concurrent application I/Os, which play an important role in RAID scaling (i.e., data migration, data distribution, etc.). For example, an exactly uniform data distribution among disks does not imply even I/O accesses to these disks. To address this problem, in this paper we propose an approximate intelligent redistribution (AIR) approach to accelerate RAID scaling. The main idea of AIR is to exploit the dynamic data access patterns of concurrent application workloads and to provide an approximate data distribution that guarantees uniform I/O accesses across the data disks. To achieve this goal, AIR uses machine learning algorithms to identify hot data in application workloads and applies an intelligent migration strategy to minimize data movement. In this way, AIR sharply cuts down the migration I/Os. To demonstrate the effectiveness of AIR, we conduct several simulations with DiskSim. The results show that, compared to traditional RAID scaling approaches such as FastScale and GSR, AIR saves up to 99.3% of the I/O cost and reduces the data migration ratio by up to 95.8%, which speeds up the scaling process by a factor of up to 30.3X.

Introduction

Redundant Arrays of Inexpensive (or Independent) Disks (RAID) Chen and Lee (1994), Oracle Corporation (2013), Patterson et al. (1988) is a prevailing choice in large scale data centers to supply both highly reliable and high performance storage services at acceptable spatial and monetary cost. Massive data are stored across large numbers of redundant disks so that users can access them in parallel. Naturally, with the explosion and accumulation of information, RAID scaling has become a popular option for typical data centers.

To increase the physical capacity of a RAID system, adding extra disks into the existing array is an economical and cost-effective solution. Furthermore, high scalability enables a large number of parallel accesses to storage devices, which is a common demand in scenarios such as cloud computing. It avoids extremely high downtime cost and ultimately improves the overall performance. Therefore, scalability plays an important role in RAID systems, which makes it necessary to develop efficient and reliable scaling methods.

A number of scaling schemes have been proposed in recent years, and some of them are already deployed in real data centers. Unfortunately, existing methods suffer from various problems inherent in their design, among which migration I/Os are a typical concern. Classic methods like Round-Robin (RR) Gonzalez and Corte (2004), Brown (2006), Zhang (2007) and Semi-RR Goel et al. (2002) are easy to implement but incur high I/O cost for data migrations and parity modifications Wu and He (2012), Zheng and Zhang (2011), Wu et al. (2012). Advanced methods like FastScale Zheng and Zhang (2011) and GSR Wu and He (2012) are predefined solutions that achieve minimal migration I/Os with efficient redistribution algorithms during the scaling process.

However, the main problem of existing approaches is that they are static solutions that ignore the dynamic changes of storage systems caused by application workloads. Specifically, given an erasure code and the corresponding layout, the migration process is predefined in previous methods. They achieve a balanced workload by assigning the same amount of data to each disk, i.e., an absolutely uniform distribution. Obviously, they ignore the data access patterns from upper level applications. For a typical application with skewed accesses to an array with an even distribution, the overall I/Os cannot be balanced. Furthermore, the accessed data can be redirected to the new disks, which has the potential to save migration I/Os while balancing the workload. Although several approaches like CRAID Miranda and Cortes (2014) identify hot data from dynamic workloads, they focus on cache or I/O acceleration, which is inefficient for I/O intensive applications.

To address the above problem, in this paper we propose an Approximate Intelligent Redistribution (AIR) approach. The main idea of AIR is to provide an approximately even data distribution among disks, based on a detailed analysis of real workloads. The migration order and path are adjusted dynamically on the fly to minimize the data migration and parity modification costs. After the approximate redistribution, each disk holds a similar (but not identical) amount of data, while globally the data accesses of various workloads are uniform. We make the following contributions in this paper:

  • We propose an Approximate Intelligent Redistribution (AIR) approach, which provides an approximate data distribution, speeds up the RAID scaling process, and cuts down the I/O overhead significantly.

  • We conduct evaluations and simulations of AIR in miscellaneous scenarios against several general schemes to demonstrate the effectiveness of AIR.

This paper is organized as follows. Section 2 briefly overviews the background and clarifies our motivation. AIR is illustrated in detail in Section 3. The evaluations are given in Section 4. Finally, we conclude this paper in Section 5.

Background and motivation

In this section, we discuss the background of scaling schemes and machine learning algorithms, the problems in existing approaches, and the motivation of our work. To facilitate the discussion, we summarize the symbols used in this paper in Table 1.

Table 1 Parameters in paper

Essential features for RAID scaling

Existing scaling approaches aim to achieve a uniform data distribution among the disks. However, to efficiently balance the numerous I/O accesses and reduce the cost of data migrations and parity modifications, an approximate data distribution is a better choice than the previous absolute distribution (illustrated in detail in Section 2.3). Therefore, in order to scale a disk array efficiently, and considering the conditions in existing scaling approaches Zheng and Zhang (2011), Wu and He (2012), Gonzalez and Corte (2004), we list several essential features for scaling below:

  • Minimal data migration: The expected data migration ratio for a uniform data distribution is \({m}/{(n+m)}\), where n is the original number of disks and m is the number of added disks. Minimizing data migration is remarkably important during the whole scaling process.

  • Minimal parity modification: In RAID-5, RAID-6 or triple disk fault tolerant arrays (3DFTs), the migration of data blocks can cause tremendous modifications on parities, which generate a large amount of I/Os during scaling. Therefore, we need to avoid parity modifications as much as possible.

  • Minimal computation cost: To modify parities, several computations need to be handled based on the construction of various RAID forms or erasure coding schemes, where a low computation cost is highly desired.

  • Uniform I/O accesses: For application workloads, I/O accesses should be allocated among the disks in a balanced manner. Thus, adapting to the dynamic changes of application workloads is critical for online scaling.

  • Approximate uniform data distribution: Each disk should hold a comparable amount of data to keep the workload approximately equally distributed.

  • High flexibility on scaling: The scaling approach should be available in different cases with various choices of n and m.

Existing scaling approaches in RAID

Existing approaches to improve the scalability of RAID systems include Round-Robin (RR) Brown (2006), Gonzalez and Corte (2004), Zhang (2007), Semi-RR Goel et al. (2002), FastScale Zheng and Zhang (2011), GSR Wu and He (2012), etc. To clearly illustrate the various strategies, the default data and parity layout in this paper is right-asymmetric.

Round-Robin (RR) The traditional RR Brown (2006), Gonzalez and Corte (2004), Zhang (2007) scaling approach is based on a round-robin order. This order migrates nearly all data except the first stripe (nearly 100% data migration), as shown in Fig. 1. RR is simple to implement on various RAID forms (e.g., RAID-0 and RAID-5) and has been used in some products Hitachi Data Systems (2011). Additionally, all parities need to be regenerated after data migration. Therefore, the overhead is high due to the large amount of data migration.

Fig. 1 Round-Robin method in RAID-5 system with a scale-up from 4 to 5 disks
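To make the cost of the round-robin order concrete, the following minimal Python sketch (our own illustration, not code from any of the cited systems) maps a logical block to its (disk, offset) position before and after adding a disk and counts how many blocks change location; the function names and block count are assumptions chosen only for this example.

```python
def rr_block_location(logical_block, num_disks):
    """Map a logical block to (disk, offset) under round-robin striping."""
    return logical_block % num_disks, logical_block // num_disks

def rr_migration_ratio(total_blocks, old_disks, new_disks):
    """Fraction of blocks whose physical location changes after RR scaling."""
    moved = sum(
        rr_block_location(b, old_disks) != rr_block_location(b, old_disks + new_disks)
        for b in range(total_blocks)
    )
    return moved / total_blocks

# Scaling a 4-disk array to 5 disks with 4096 blocks in total.
print(rr_migration_ratio(total_blocks=4096, old_disks=4, new_disks=1))
```

For a 4-to-5 disk scale-up, only the blocks of the first stripe keep their positions, so the ratio printed above is close to 1, matching the nearly 100% migration noted in the text.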

Semi Round-Robin Semi-RR Goel et al. (2002) was proposed to decrease the high migration cost of RR scaling, as shown in Fig. 2. Unfortunately, when multiple disks are added, the data distribution is not uniform after scaling. This can easily cause a load balancing problem, which is an important issue in disk arrays.

Fig. 2 Semi Round-Robin method in RAID-5 system with a scale-up from 4 to 5 disks

FastScale FastScale Zheng and Zhang (2011) is a RAID-0 Hetzler (2008), Goel et al. (2002) scaling approach with low overhead and high performance. However, as shown in Fig. 3, it is a solution only for RAID-0 and cannot be used in other RAID forms such as RAID-5.

Fig. 3 FastScale in RAID-0 system with a scale-up from 4 to 5 disks

GSR The Global Stripe-based Redistribution (GSR) Wu and He (2012) approach is designed to accelerate RAID-5 Gonzalez and Cortes (2004) scaling. The purpose of GSR is to minimize the data migration, parity modification and computation cost from a global view of all stripes, not limited to operations on any single data/parity block. Similar to FastScale, GSR can only be used in RAID-5 and cannot be adapted to other RAID forms (Fig. 4).

Fig. 4 GSR in RAID-5 system with a scale-up from 3 to 5 disks

ALV ALV Zhang (2010) takes a different approach compared to RR. As shown in Fig. 5, it changes the movement order of the migrated data and aggregates the small I/Os. However, ALV is essentially based on a round-robin order, so it cannot decrease the total I/Os caused by data migration and parity modification.

Fig. 5 ALV method in RAID-5 system with a scale-up from 4 to 5 disks

Minimal Data Movement (MDM) MDM Hetzler (2008) is an approach that eliminates parity modifications and computations. It also decreases the migration cost, but leads to new problems such as limited performance. Figure 6 illustrates an example of the MDM method.

In the MDM approach, all parity blocks are maintained, but the storage efficiency cannot be improved as more disks are added, and the layout after scaling becomes much more complex than typical RAID-5. The performance is limited because the number of data blocks in a parity chain remains unchanged.

Fig. 6 MDM method in RAID-5 system with a scale-up from 4 to 5 disks

Besides the previous approaches that focus on RAID-0 or RAID-5, there are many well-known scaling approaches and codes for RAID-6 and 3DFTs, such as BDR Jiang et al. (2016) and POS Wu et al. (2015). These approaches are combined with several classic erasure codes such as STAR Huang and Xu (2008), RS6 Zhang et al. (2015), Gu et al. (2019), TIP Zhang et al. (2015) and various other MDS codes Reed and Solomon (1960), Corbett et al. (2004), Plank (2008) in RAID-6, or with codes like AZ-Code Xie et al. (2019) and Approximate Code Jin et al. (2019). To adapt to the dynamic changes of application workloads, CRAID Miranda and Cortes (2014) is a general approach for disk arrays. Other adaptive industrial products integrated with scaling approaches include RAID-x Hwang et al. (2000) and AutoRAID Wilkes (1996).

Our motivation

We summarize the existing scaling approaches in Table 2. Although these methods can accelerate the scaling process, they have several disadvantages. Here we list the main drawbacks of common scaling methods:

  • High I/O cost. This is a serious problem in methods like RR and Semi-RR. First, these methods migrate nearly all data blocks during scaling. Second, the parity modifications result in high I/O cost as well.

  • Low flexibility to adapt to various RAID forms. Typical scaling approaches like FastScale only concern RAID-0 and cannot be used in other RAID forms like RAID-5.

  • Unreasonable load balancing. Except for CRAID, existing approaches take no consideration of upper level application workloads. Therefore, even if the disk array at the bottom level achieves a uniform data distribution, the overall I/Os are not balanced in the whole storage system.

According to the analysis of existing approaches, dynamic approximate load balancing based on application workloads is a feasible solution, which motivates us to propose the novel AIR approach in this paper.

Table 2 Summary on various scaling approaches

Approximate intelligent redistribution approach

In this section, we present our approximate intelligent redistribution (AIR) scaling method. We first give an overview of the method in Section 3.1 to briefly introduce its constitution. Then we present the four modules of AIR in Sections 3.2 to 3.5. Finally, we explain the algorithms of these modules in Section 3.6.

Overview

AIR is proposed to reduce the data migration, parity modification and computation cost from a global point of view on the various workloads in the storage array, covering both the upper level application workloads and the bottom level storage device workloads. AIR includes four modules: access trend prediction, block selection, migration rules, and new write treatment. In the following, we give a brief description of these four modules (Fig. 7).

Fig. 7 Overview of AIR

  • Access trend prediction: In order to obtain balanced I/O, we first need to know the current I/O access patterns of the disks and characterize them to predict future data accesses. In this module, we use an LSTM neural network to learn and predict the disk access trend.

  • Block selection: Knowing the access probabilities of all disks and blocks, we can select certain hot blocks for data migration, so that the extended disk(s) absorb a large share of the I/O accesses.

  • Migration rules: For the sake of minimizing the parity modification cost, we need to specify the data migrations; in other words, we need to set up proper migration rules for high scaling efficiency.

  • New write treatment: When the data distribution is seriously imbalanced, new write requests from applications are written into the new disks to save I/O cost and improve load balancing. AIR redirects new write requests to the extended disk(s) until an approximately uniform data distribution is achieved.

The following subsections explain each of these four modules in detail.

Access trend prediction

Former research points out that LSTM is well suited to natural language processing (NLP) Tai et al. (2015), Sundermayer et al. (2012), and in theory LSTM has an excellent ability to deal with sequential data. Thus we select LSTM and make some modifications to meet the demands of AIR.

Recurrent networks arrange the hidden state vectors \(h_{t}^{l}\) in a two-dimensional grid over layers and time. The bottom row of vectors is \(h_{t}^{0} = x_{t}\), and each vector in the top row \(h_{t}^{L}\) is used to predict an output vector \(y_{t}\). The hidden (intermediate) state vector \(h_{t}^{l}\) is computed with a recurrent formula based on \(h_{t}^{l-1}\) and \(h_{t-1}^{l}\) as below:

$$\begin{aligned} h_{t}^{l}=\textit{tanh}\ (W^{l} \bigl ({\begin{matrix} h_{t}^{l-1}\\ h_{t-1}^{l} \end{matrix}}\bigr )) \end{aligned}$$
(1)

The parameter matrix \(W^{l}\) varies between layers but is shared across time steps within the same layer. Additionally, the tanh function is applied element-wise to map the product into the interval \((-1, 1)\). As shown in Eq. 1, the inputs from the layer below in depth (\(h_{t}^{l-1}\)) and from the previous time step (\(h_{t-1}^{l}\)) are utilized.

In a common RNN, the gradients are prone to vanish, so the network cannot learn long-term dependencies efficiently. To solve this problem, LSTM keeps a similar repeating module, but instead of a single tanh neural network layer it maintains four layers that interact in a special way. In particular, a key component maintains the cell state \(c_{t}^{l}\). At each time step, the LSTM can choose to read from, write to, or reset the cell using explicit gating mechanisms. Equation 2 defines the forget gate, which decides what information is discarded before the state is passed on:

$$\begin{aligned} f_{t}=\sigma \ (W_{f} \bigl ({\begin{matrix} h_{t}^{l-1}\\ h_{t-1}^{l} \end{matrix}}\bigr )) \end{aligned}$$
(2)

Equations 3 and 4 decide what new information to store in the cell state. A sigmoid layer, called the input gate layer, decides which values to update, and a tanh layer creates a vector of new candidate values \(\tilde{C}_t\) that could be added to the state:

$$\begin{aligned} i_{t}= & {} \sigma \ (W_{i} \bigl ({\begin{matrix} h_{t}^{l-1}\\ h_{t-1}^{l} \end{matrix}}\bigr )) \end{aligned}$$
(3)
$$\begin{aligned} \tilde{C}_t= & {} \textit{tanh}\ (W_{C} \bigl ({\begin{matrix} h_{t}^{l-1}\\ h_{t-1}^{l} \end{matrix}}\bigr )) \end{aligned}$$
(4)

Equation 5 shows the update operation from the old memory cell to the latest one:

$$\begin{aligned} c_{t}^{l}=\ f_{t} *c_{t-1}^{l} + i_{t} *\tilde{C}_t \end{aligned}$$
(5)

Finally, we decide the output, which is a filtered version of the cell state. A sigmoid layer decides which parts of the cell state to output, and a tanh layer ensures that only the chosen parts are emitted:

$$\begin{aligned} o_{t}= & {} \sigma \ (W_{o} \bigl ({\begin{matrix} h_{t}^{l-1}\\ h_{t-1}^{l} \end{matrix}}\bigr )) \end{aligned}$$
(6)
$$\begin{aligned} h_{t}^{l}= & {} \ o_{t} *\textit{tanh}\ (c_{t}^{l}) \end{aligned}$$
(7)

Our model is composed of a single LSTM layer followed by an average pooling layer and a logistic regression layer. In AIR, the input to the model is the vectorized blocks. Each block has a unique T-dimensional vector V, where T is the maximum time period consistent with the parameters of the LSTM model. \(V_{i}[j]\) is the number of accesses of block i in the interval \((j-1) *t\) to \(j *t\), where t is the time step. Regarding the access count, read and write operations are treated differently: only read operations are counted, since every write operation creates a new block. Unlike conventional language processing schemes, which need to map characters to specific numbers, our input vectors are already numerical, so we only slightly modify them to fit the interface of our model.
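As an illustration of this prediction module, the following PyTorch sketch (our own simplified reconstruction, not the authors' code) builds the described stack of a single LSTM layer, average pooling over time, and a logistic-regression output, and normalizes the per-block scores into heat degrees; the hidden size, batch shape, and class name are assumptions chosen only for this example.

```python
import torch
import torch.nn as nn

class AccessTrendPredictor(nn.Module):
    """Single LSTM layer + average pooling + logistic regression (sketch).
    Sizes are illustrative and not taken from the paper."""
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (num_blocks, T, 1) -- per-block access counts over T time windows
        out, _ = self.lstm(x)                 # (num_blocks, T, hidden)
        pooled = out.mean(dim=1)              # average pooling over time
        return torch.sigmoid(self.fc(pooled)).squeeze(-1)  # per-block score

# Vectorized blocks: V[i][j] = read count of block i in time window j.
V = torch.rand(8, 16, 1)                      # 8 blocks, T = 16 windows
model = AccessTrendPredictor()
pred = model(V)                               # raw per-block scores
heat = pred / pred.sum()                      # normalized heat degrees D
```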

A recurrent network is capable of sequence-to-sequence mapping, so we can predict the data access counts in chronological order. Instead of focusing on a single value, we extend the analysis to the access counts over three time steps; this choice is also related to the time step t chosen before. Furthermore, we normalize the predictions over all present blocks. The normalized value, called the heat degree and written as D, indicates the probability that a block will be accessed in the near future. Figure 8 shows an overview of this step.

Fig. 8 Overview of access trend prediction

Block selection

Based on the results of access trend prediction, we select proper blocks to migrate. Different from existing scaling approaches, the migration process in AIR is dynamic in the following aspects:

  1. The number of migrated blocks is uncertain;

  2. The migration order and destination are adjusted according to real-time feedback.

In this step, we assume only one disk is added at a time, which is the basic operation for more complex scenarios. Obviously, \(\sum _{i}^{B}\sum _{j}^{N}D_{i,j} = 1\) and \(\sum _{i}^{B}D_{i,j} = D_{j}\). The optimal access probability distribution among disks assigns each disk the average probability \(Q = \frac{1}{N+1}\).

The blocks are classified into three main groups: super hot blocks, hot blocks and cool blocks. The classification is based on the access probability predicted in the previous step. Considering that the number of migrated blocks should be as small as possible while still achieving an approximately balanced disk array, we choose the hottest blocks satisfying \(1.3 *Q \ge D_{j} - D_{i,j} \ge 0.9 *Q\) among the disks in round-robin order. The parameters 1.3 and 0.9 bound an interval centered at 1.1 with a tolerance of 0.2, which means we allow an original disk to undertake somewhat more or somewhat less data accesses than the average value Q while staying approximately equal to it. After one traversal of the disks, the \(D_{i,j}\) of the chosen blocks is subtracted from \(D_{j}\). These chosen blocks form the hot block group, and the remaining blocks whose \(D_{i,j}\) is less than 0.001 form the cool block group.

Super hot blocks are ordered by the original \(D_{j}\) of their disks, because we want to reduce the workload of the busiest disk as much as possible. The accumulated \(D_{i,j}\) of the chosen blocks must not exceed \(\alpha *Q\), because we cannot expect to find blocks whose sum is exactly Q. Here \(\alpha\) is a threshold less than 1, and the choice of \(\alpha\) is based on the I/O ratio of the workload. After all super hot blocks are selected, we select the other blocks in the order of their heat degrees while maintaining the threshold on the new disk. Note that parity blocks are never selected, to avoid breaking the parity chains.
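The selection logic described above can be sketched as follows; this is a simplified Python illustration under our own assumptions (a single added disk, parity blocks already excluded from the input, and at most one block chosen per disk per pass), so it should not be read as the exact published algorithm.

```python
def select_blocks(D, alpha=0.75):
    """D[j][i]: predicted heat degree of block i on disk j (parity blocks excluded).
    Returns (block, disk) pairs to migrate to one new disk, under the 0.9Q..1.3Q rule."""
    n = len(D)
    Q = 1.0 / (n + 1)                          # target share per disk after scaling
    disk_heat = [sum(blocks) for blocks in D]
    new_disk_heat, selected = 0.0, []

    # Visit the busiest disks first, hottest blocks first within each disk.
    for j in sorted(range(n), key=lambda j: disk_heat[j], reverse=True):
        for i in sorted(range(len(D[j])), key=lambda i: D[j][i], reverse=True):
            h = D[j][i]
            if 0.9 * Q <= disk_heat[j] - h <= 1.3 * Q and new_disk_heat + h <= alpha * Q:
                selected.append((i, j))
                disk_heat[j] -= h
                new_disk_heat += h
                break                          # at most one block per disk per pass
    return selected
```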

We take a simplified RAID-5 example to clarify how the selection works. In Fig. 9, the access distribution of the disk array is 0.18, 0.350, 0.33 and 0.359, respectively.

Fig. 9 An example of block selection in RAID-5 array

As we can see, disk 2 is the hottest disk in the following cycles. We choose the hottest block in disk 2 and check whether disk 2 would become so cool that the selection is improper. Block (2, 2) is chosen first, but it is a parity block and must not be migrated. Thus we choose block (3, 2), and \(D_{2}\) decreases to 0.353, satisfying \(1.3 *Q \ge D_{j} \ge 0.9 *Q\) (here \(Q = \frac{1}{5} = 0.35\)), while \(D_{4}\) increases to 0.10, not exceeding \(\alpha *Q\) with \(\alpha = 0.75\). After this cycle, disk 3 is out of the acceptable range, so we focus on disk 3. At first, we choose the hottest block (0, 3). However, \(D_{3} - D_{0,3}\) is less than \(0.9 *Q\). Thus we examine the second hottest block (2, 3). Unfortunately, this block does not fit our demand either, because \(D_{4}\) would become larger than 0.15. Finally, block (1, 3) is chosen, all \(D_{j}\) are updated, and each one falls in the acceptable range. In the end, blocks (3, 2) and (1, 3) are chosen, as shown in Fig. 9.

Migration rules

The migration of the selected blocks should follow a proper order and certain rules.

First of all, the migration of several blocks should be done simultaneously, in other words processed in parallel, in order to accelerate the migration. Every new disk creates a new queue, and AIR balances I/O through global scheduling over the various queues.

Secondly, since the number of selected blocks is usually tiny compared with the whole disk array, the column of the new disk is much larger than the number of selected blocks, so we have some flexibility when choosing new places for them. For RAID-5 scaling, we also need to take parity blocks into consideration. Since parity in RAID-5 is computed within a stripe, a parity block depends only on the data blocks in the same stripe. Therefore, in order to reduce the cost of parity modification, the highest priority is to migrate a selected block within its own stripe if that place is not occupied. If the place is already occupied by another block, the migration destination is set to the topmost valid position.

Moreover, during the data migration period, the upper layer typically issues write requests which need to modify their corresponding parities. In order to preserve data reliability, when a parity block is about to be updated, we do not migrate the blocks in its parity chain.
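A minimal sketch of the destination rule described above is given below; the data structure (a per-stripe occupancy map for the added disk) and the function name are our own assumptions for illustration.

```python
def choose_destination(stripe, new_disk_slots):
    """Pick a slot on the new disk for a block selected from `stripe`.

    new_disk_slots: dict {stripe_id: occupied?} for the added disk.
    Prefer the block's own stripe to avoid extra parity recomputation;
    otherwise fall back to the topmost (smallest-id) free slot.
    """
    if not new_disk_slots.get(stripe, False):
        new_disk_slots[stripe] = True
        return stripe
    for s in sorted(new_disk_slots):
        if not new_disk_slots[s]:
            new_disk_slots[s] = True
            return s
    raise RuntimeError("new disk is full")
```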

An example of the migration rules is shown in Fig. 10: we migrate the two selected blocks into the added disk without changing their stripe numbers, because those stripes have free space on the added disk.

Fig. 10 An example of block migration in RAID-5 array

New write treatment

According to the Pareto principle Pareto and Page (1971), 20% of the blocks receive 80% of the accesses on a disk. Therefore, the hot blocks we select are far from fulfilling the uniform distribution on the extended disks that existing approaches provide; achieving a comparably balanced amount of data would require migrating a large number of blocks into the new disk.

During the data migration, the upper layer applications keep reading from and writing to the disks. For write requests, we preferentially write the new content directly into the extended disks in the same stripe, and the original block is tagged as expired. In this way, we redirect new write blocks to the added disks and save I/O cost. After enough write requests have been redirected (not exceeding half the capacity of the extended disks), an approximately uniform distribution is achieved.

Moreover, in order to avoid an unbalanced distribution caused by write redirection, we adjust the probability of redirection. Each incoming write request is redirected with probability p, and a small growth step s is added to p after each successful redirection. The initial value of p is decided by the I/O ratio of the workload and ranges from 1 to 5%.
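The write-redirection policy can be sketched as a small state machine; the sketch below is our own illustration, and the initial probability, growth step, and capacity values are placeholders rather than values taken from the paper.

```python
import random

class WriteRedirector:
    """Redirect a fraction of new writes to the extended disk (sketch).

    p grows by a small step s after every successful redirection and the
    redirection stops once roughly half of the new disk's capacity is used.
    """
    def __init__(self, p=0.03, s=0.001, new_disk_blocks=512):
        self.p, self.s = p, s
        self.budget = new_disk_blocks // 2     # at most half of the capacity

    def handle_write(self, block):
        if self.budget > 0 and random.random() < self.p:
            self.budget -= 1
            self.p += self.s                   # raise redirection probability
            return "new_disk"                  # original copy tagged as expired
        return "original_location"
```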

Algorithm overview

Here we introduce our algorithms with pseudo-code. The main part of our method comprises three small algorithms: hot block selection, data migration, and new write redirection, which correspond to the second, third, and fourth modules, respectively.

Algorithms 1–3 (pseudo-code): hot block selection, data migration, and new write redirection
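The pseudo-code figures are not reproduced here. As a rough picture of how the three algorithms fit together, the sketch below chains the illustrative helpers from the previous subsections (select_blocks, choose_destination, and WriteRedirector, all of which are our own names, not the paper's); the actual copy and write primitives of the array are replaced by print statements.

```python
def air_scale(D, workload_writes, num_stripes=512):
    """One AIR scaling step with a single added disk (illustrative sketch only)."""
    new_disk_slots = {s: False for s in range(num_stripes)}

    # 1. Hot-block selection based on the predicted heat degrees D.
    for block, disk in select_blocks(D):
        # 2. Migration following the stripe-preserving rule
        #    (the block index doubles as its stripe id in this sketch).
        dest_stripe = choose_destination(block, new_disk_slots)
        print(f"migrate block {block} of disk {disk} -> new disk, stripe {dest_stripe}")

    # 3. New-write redirection until an approximately uniform distribution.
    redirector = WriteRedirector(new_disk_blocks=num_stripes)
    for w in workload_writes:
        print(f"write {w} -> {redirector.handle_write(w)}")
```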

Evaluation

In this section, we evaluate AIR by testing its prediction accuracy and comparing its scaling performance with other approaches in RAID-0 and RAID-5 to show its advantages in scaling.

Evaluation methodology

We compare the AIR approach with the Round-Robin (RR) Gonzalez and Corte (2004), FastScale Zheng and Zhang (2011) and GSR Wu and He (2012) scaling approaches in RAID-0 or RAID-5, where FastScale can only be used in RAID-0 and GSR is mainly designed for RAID-5.

In a typical RAID-5, the minimal value of n is 3. Thus we choose n from 3 to 5, which is relatively small and simple for scaling, and we choose m from 1 to 3, because these are usually the basic operations of disk array scaling. The number of blocks per disk is \(B=512\), which simplifies the calculation.

The experiments were conducted with the disk simulation software DiskSim 4.0 Bucy et al. (2008), which is an efficient, accurate and highly configurable disk system simulator. In DiskSim, all disks are configured as the Quantum Atlas 10K with a storage capacity of 9.1 GB.

Besides the experimental environment, we need appropriate sample data to drive the experiments. Here we use several traces to simulate the I/O operations in disk arrays. To cover the complex situations that disks encounter at run time, different traces are used to test different kinds of scenarios.

Two kinds of traces were used as experiment samples. The first trace, named "MSN", was collected from an MSN Storage metadata server over a duration of 6 hours; we took a small part of it for the experiments. The second trace was collected from a Microsoft Exchange server in 2007. We chose two small parts of this trace, "Exchange.12-13-2007.01-37-PM.trace.csv" and "Exchange.12-13-2007.01-06-PM.trace.csv", which differ in their read percentages. Finally, we merged the "Exchange" and "MSN" traces to simulate two applications running simultaneously. For simplicity, we name the four traces "MSN", "Exchange1", "Exchange2" and "Merge", respectively (Table 3).

Table 3 Detail of the traces as the application workloads

In our comparison, we take mainly the following metrics in our experiments.

  1. Prediction accuracy: the accuracy of our predictions of disk accesses.

  2. Prediction time: the time spent on training and predicting disk accesses.

  3. Hot block selection: the prediction accuracy for the selected hot blocks.

  4. Data migration ratio (\(R_{d}\)): the ratio of the number of migrated data blocks to the total number of blocks.

  5. Scaling I/O cost: the total number of I/O operations during scaling.

  6. Computation cost: the total number of XOR operations for parity modifications during scaling.

  7. Load balancing ratio: the ratio between the highest and the lowest disk accesses in the array, which illustrates the balance of the distribution (a computation sketch for metrics 7 and 8 follows this list).

  8. I/O distribution standard deviation: the standard deviation of total I/O requests among all disks during the scaling time.

  9. Average I/O response time: the average response time for upper level applications.

  10. Scaling time: the total time spent on scaling.
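For clarity, the two distribution metrics (7 and 8) can be computed from per-disk request counts as in the short sketch below; the access counts shown are hypothetical and only illustrate the computation.

```python
import statistics

def load_balancing_ratio(disk_accesses):
    """Metric 7: ratio between the most- and least-accessed disk."""
    return max(disk_accesses) / min(disk_accesses)

def io_std_deviation(disk_accesses):
    """Metric 8: standard deviation of per-disk I/O counts during scaling."""
    return statistics.pstdev(disk_accesses)

# Hypothetical per-disk request counts collected during a scaling run:
accesses = [1210, 980, 1045, 1330, 890]
print(load_balancing_ratio(accesses))   # ~1.49
print(io_std_deviation(accesses))       # ~159
```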

The parity modification ratio and the computation cost are only evaluated for RAID-5 scaling because RAID-0 has no parity block.

Taking the data migration ratio in RAID-5 as an example, AIR migrates approximately \(\frac{m}{n+m} *\frac{M}{2}\) blocks, so \(R_{d}=\frac{m}{n+m} *\frac{1}{2}\), while that of GSR is \(\frac{m}{n+m}\). Thus AIR migrates about half as many blocks as GSR.

The scaling time cannot be evaluated directly with DiskSim, so in our experiments it is calculated as the I/O cost multiplied by the average response time. For the AIR approach, the prediction time is also counted in the scaling time.

For the other scaling methods, the scaling time \(t_{sca}\) is simply the product of the average response time \(t_{ave}\) and the number of I/Os n:

$$\begin{aligned} t_{sca} = t_{ave} * n \end{aligned}$$

In AIR, however, it also contains the time spent on the training and prediction phase \(t_{pre}\). Under this consideration, the total scaling time is expressed as:

$$\begin{aligned} t_{sca} = t_{ave} * n + t_{pre} \end{aligned}$$
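A quick numerical check of the formulas above, with illustrative values for the response time, I/O count, and prediction time (the specific numbers are not measurements from the paper):

```python
def migration_ratio_air(n, m):
    # AIR migrates roughly half of what a fully uniform redistribution needs.
    return m / (n + m) * 0.5

def migration_ratio_gsr(n, m):
    return m / (n + m)

def scaling_time(t_ave, io_count, t_pre=0.0):
    # t_sca = t_ave * n, plus t_pre for AIR.
    return t_ave * io_count + t_pre

# Example: scaling a 4-disk array by one disk.
print(migration_ratio_air(4, 1), migration_ratio_gsr(4, 1))   # 0.1 vs 0.2
print(scaling_time(t_ave=0.005, io_count=2000, t_pre=3.0))    # 13.0 seconds
```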

Numerical results

In this section, we give the numerical results of the AIR scaling approach compared with the existing scaling methods. The x-axis labels in the graphs follow the format (n, m).

Prediction accuracy

The prediction follows the procedure described in Section 3.2. Table 4 shows one of our calculations of trace access distribution prediction; it contains the predictions for the three-disk, four-disk and five-disk distributions.

Table 4 The comparison between the predicted access and the real access data with prediction accuracy on the trace MSN

We also measure the prediction accuracy under the other traces. The overall results in Table 5 show that our prediction model fits well.

Table 5 Overall result about trace access prediction accuracy

Prediction time

In this section, we present the prediction time of our method.

Table 6 presents the time spent on training and predicting data accesses. As this is highly dependent on the performance of the machine, we list the results for reference only, although the relative comparison can still be observed.

Table 6 Prediction time on four different traces (s)

Hot block selection

The selection of hot blocks is based on the prediction stage. Our numerical analysis proceeds in the same way as for prediction accuracy; the only difference is that the prediction accuracy is calculated over disks, while in this section it is based on the real access rate of the selected hot blocks. Table 7 gives an example of this process.

Table 7 Hot block access prediction on the trace MSN and merge with different scenarios

Data migration ratio (\(R_{d}\))

Figure 11 shows the results of data migration ratio. It is clear that AIR provides the minimal data migration ratio among all approaches.

  • RAID-0: The migration ratio of AIR is reduced by up to \(91.5\%\) compared with RR and \(50\%\) compared with FastScale.

  • RAID-5: AIR reduces the migration ratio by up to \(89.4\%\) compared with RR and \(34.5\%\) compared with GSR.

Fig. 11 Comparison among various scaling approaches in terms of data migration ratio

Scaling I/O cost

The evaluation results on I/O cost are shown in Fig. 12. As we can see, AIR has a tiny I/O cost compared with RR, GSR and FastScale, because the only I/O cost of AIR is the migration of a small number of hot blocks.

In Fig. 12, AIR saves the I/O cost by up to 99.3% under different configurations of disk arrays. Compared to FastScale, AIR achieves a larger improvement when the number of extended disks is smaller. It is obvious that AIR has less I/O cost than the other scaling methods.

Fig. 12 Comparison among various scaling approaches in terms of I/O cost

Computation cost

Figure 13 shows the number of XOR operations during the scaling process. It is evident that AIR has the least computation cost, saving up to \(95.8\%\) compared with RR and \(78.8\%\) compared with GSR, respectively.

Fig. 13 Comparison among various scaling approaches in terms of computation cost

Load balancing ratio

In this section, we evaluate the load balancing ratio of AIR, RR, FastScale and GSR under different traces, as shown in Fig. 14. The load balancing ratio describes the I/O distribution status among disks. As a result, we can see that in most cases the I/O distribution of AIR is comparable to or better than that of the other methods.

From Fig. 14, it is obvious that AIR performs better in terms of load balancing ratio, especially for the MSN and Merge traces. AIR usually achieves a more balanced distribution among disks, reducing the load balancing ratio by up to 25.2% compared with FastScale, 24.4% compared with RR, and 58.1% compared with GSR, respectively.

Fig. 14 Comparison among various scaling approaches in terms of load balancing ratio

I/O distribution standard deviation

In this section, we evaluate the I/O distribution standard deviation of AIR, RR, FastScale and GSR, as shown in Fig. 15. The standard deviation is another way to evaluate the I/O distribution. For RAID-5, AIR performs better than for RAID-0, because the parity modification operations are processed on the disks that existed before scaling.

In Fig. 15, we can see that AIR performs better in terms of distribution standard deviation when the initial disk array is larger.

AIR achieves a more balanced I/O distribution compared with FastScale, RR and GSR under different traces, and the standard deviation is reduced by up to 78.3%.

Fig. 15 Comparison among various scaling approaches in terms of standard deviation of access distribution

Average I/O response time

In this section, we evaluate the average I/O response time of AIR, RR, FastScale and GSR by running different traces, as shown in Fig. 16. As we can see, AIR has a lower response time than the other methods in both RAID-0 and RAID-5.

For the Merge trace, in which the requests arrive more intensively, the average response time increases greatly for the other scaling methods but only slightly for AIR. As shown in Fig. 16, RR responds more slowly when the number of initial disks increases, while AIR stays low and stable. AIR reduces the average response time by up to 70.3% compared with FastScale and RR, and by 65.3% compared with GSR, respectively.

Fig. 16 Comparison among various scaling approaches in terms of average response time

Scaling time

In this section, we present the total scaling time of different methods using different traces.

Figure 17 shows the comparison among the various approaches in terms of scaling time. According to Fig. 17, even though AIR incurs extra prediction time, its total scaling time is much smaller than that of the other scaling approaches. This means the prediction of data access patterns only slightly affects the whole scaling process.

In Fig. 17, we can see that the scaling time of the other methods increases with the number of initial disks or the number of extended disks, while the scaling time of AIR stays low. AIR saves the scaling time by up to 76.8% compared with FastScale, 96.7% compared with RR, and 87.4% compared with GSR, respectively.

Fig. 17 Comparison among various approaches in terms of scaling time (s)

Optimization rate

Considering that the various traces reflect different applications, we summarize the optimization rates of all metrics for RAID-0 and RAID-5 in Tables 8, 9, 10, 11 and 12.

Table 8 Optimisation rate of average response time with different traces in different scenarios
Table 9 Optimisation rate of I/O Cost with different traces in different scenarios
Table 10 Optimisation rate of scaling time with different traces in different scenarios
Table 11 Optimisation rate of migration rate in different scenarios
Table 12 Optimisation rate of computational cost in different scenarios

Analysis

From the numerical results above, we can conclude that AIR has several advantages listed as follows.

First, the predictions of disk accesses and hot blocks are accurate, and the prediction time is acceptable for the scaling process.

Second, AIR maintains a more balanced I/O distribution among disks while greatly reducing the data migration ratio, the I/O cost and the computation cost of scaling.

Third, AIR can accelerate the scaling by up to 30.3 times as it reduces the scaling time by up to 96.7%.

Conclusion

In this paper, an approximate intelligent redistribution (AIR) approach is proposed to speed up RAID scaling. AIR utilizes the information of data access patterns to identify hot data, which are the candidate blocks for migration. To minimize the I/O cost, AIR applies several migration rules under various scenarios and redirects new write requests to the extended disk(s). Therefore, an approximate data distribution is obtained after scaling. To demonstrate the effectiveness of the AIR approach, we conduct several simulations. Compared to classic scaling approaches, the results show that (1) AIR reduces the migration I/Os by up to 99.3%; (2) AIR reduces the overall data migration ratio by up to 95.8%; and (3) AIR accelerates the scaling process by up to 30.3X.

References

  1. Brown, N.: Online RAID-5 resizing. drivers/md/raid5.c in the source code of Linux Kernel 2.6.18. http://www.kernel.org/ (2006). Accessed 6 May 2019

  2. Bucy, J., Schindler, J., Schlosser, S., Ganger, G.: The disksim simulation environment version 4.0 reference manual (cmu-pdl-08-101). Parallel Data Laboratory (2008)

  3. Chen, P., Lee, E., et al.: RAID: high-performance, reliable secondary storage. ACM Comput. Surv. 26(2), 145–185 (1994)

  4. Corbett, P., et al.: Row-diagonal parity for double disk failure correction. In: Proc. of the FAST ’04 (2004)

  5. Goel, A., et al.: SCADDAR: An efficient randomized technique to reorganize continuous media blocks. In: Proc. of the ICDE’02 (2002)

  6. Gonzalez, J., Corte, T.: Increasing the capacity of RAID-5 by online gradual assimilation. In: Proc. of the SNAPI’04 (2004)

  7. Gonzalez, J., Cortes, T.: Increasing the capacity of RAID-5 by online gradual assimilation. In: Proc. of the SNAPI ’04 (2004)

  8. Gu, J., Wu, C., Xie, X., Qiu, H., Li, J., Guo, M., He, X., Dong, Y., Zhao, Y.: Optimizing the parity-check matrix for efficient decoding of RS-based cloud storage systems. In: The 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019). IEEE (2019)

  9. Hetzler, S.: Storage array scaling method and system with minimal data movement. US Patent 20080276057, June (2008)

  10. Hitachi Data Systems, Hitachi Virtual Storage Platform Architecture Guide. (2011) http://www.hds.com/assets/pdf/hitachi-architecture-guide-virtual-storage-platform.pdf?WT.ac=uk_hp_sp2r1. Accessed 6 May 2019

  11. Huang, C., Xu, L.: STAR: an efficient coding scheme for correcting triple storage node failures. IEEE Trans. Comput. 57(7), 889–901 (2008)

  12. Hwang, K., et al.: RAID-x: A new distributed disk array for I/O-centric cluster computing. In: Proc. of the HPDC ’00 (2000)

  13. Jiang, Y., Wu, C., Li, J., et al.: BDR: A balanced data redistribution scheme to accelerate the scaling process of XOR-based triple disk failure tolerant arrays. In: 2016 IEEE 34th International Conference on Computer Design (ICCD). IEEE Computer Society (2016)

  14. Jin, H., Wu, C., Xie, X., Li, J., Guo, M., Lin, H., Zhang, J.: Approximate code: a cost-effective erasure coding framework for tiered video storage in cloud systems. In: 48th International Conference on Parallel Processing (ICPP 2019), August 5–8, 2019, Kyoto, Japan. ACM, New York, NY, USA, 10 p. https://doi.org/10.1145/3337821.3337869 (2019)

  15. Miranda, A., Cortes, T.: Craid: online raid upgrades using dynamic hot data reorganization. In: Proc. of the USENIX FAST ’14 (2014)

  16. Oracle Corporation. A better RAID strategy for high capacity drives in mainframe storage. http://www.oracle.com/technetwork/articles/systems-hardware-architecture/raid-strategy-hi-capacity-drives-170907.pdf (2013). Accessed 3 May 2019

  17. Pareto, V., Page, A.N.: Translation of Manuale di economia politica (“Manual of political economy”), A.M. Kelley, (1971), ISBN 978-0-678-00881-2

  18. Patterson, D., et al.: A case for Redundant Arrays of Inexpensive Disks (RAID). In: Proc. of the ACM SIGMOD ’88 (1988)

  19. Plank, J.: The RAID-6 liberation codes. In: Proc. of the FAST ’08 (2008)

  20. Reed, I., Solomon, G.: Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math. 8, 300–304 (1960)

  21. Sundermayer, M., Schlüter, R., Ney, H.: LSTM neural networks for language modeling. In: INTERSPEECH (2012)

  22. Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. In: ACL (2015)

  23. Wilkes, J., et al.: The HP autoRAID hierarchical storage system. ACM Trans. Comput. Syst. 14(1), 108–136 (1996)

  24. Wu, S., Xu, Y., Li, Y., Zhu, Y.: POS: a popularity-based online scaling scheme for RAID-structured storage systems. In: Proc. of the ICCD '15 (2015)

  25. Wu, C., He, X.: GSR: a global stripe-based redistribution approach to accelerate RAID-5 scaling. In: 41st International Conference on Parallel Processing (2012)

  26. Wu, C., He, X., Han, J., et al.: SDM: a stripe-based data migration scheme to improve the scalability of RAID-6. In: IEEE International Conference on Cluster Computing. IEEE (2012)

  27. Wu, C., He, X., Li, J., et al.: Code 5-6: an efficient MDS array coding scheme to accelerate online RAID level migration. In: 2015 44th International Conference on Parallel Processing (ICPP). IEEE (2015)

  28. Wu, S., Xu, Y., Li, Y., et al.: POS: a popularity-based online scaling scheme for RAID-structured storage systems. In: 2015 33rd IEEE International Conference on Computer Design (ICCD). IEEE Computer Society (2015)

  29. Xie, X., Wu, C., Gu, J., Qiu, H., Li, J., Guo, M., He, X., Dong, Y., Zhao, Y.: AZ-Code: an efficient availability zone level erasure code to provide high fault tolerance in cloud storage systems. In: The 35th International Conference on Massive Storage Systems and Technology (MSST 2019). IEEE (2019)

  30. Zhang, G., et al.: SLAS: an efficient approach to scaling round-robin striped volumes. ACM Trans. Storage 3(1), 1–39 (2007)

  31. Zhang, G., et al.: ALV: a new data redistribution approach to RAID-5 scaling. IEEE Trans. Comput. 59(3), 345–357 (2010)

  32. Zhang, G., Li, K., Wang, J., Zheng, W.: Accelerate rdp raid-6 scaling by reducing disk i/os and xor operations. IEEE Trans. Comput. 64(1), 32–44 (2015)

  33. Zhang, G., Wu, G., Lu, Y., et al.: Xscale: online X-code RAID-6 scaling using lightweight data reorganization. IEEE Trans. Parallel Distrib. Syst. 27(12), 1–1 (2016)

  34. Zhang, Y., Wu, C., Li, J., Guo, M.: TIP-code: a three independent parity code to tolerate triple disk failures with optimal update complexity. In: Proc. of the IEEE/IFIP DSN ’15 (2015)

  35. Zheng, W., Zhang, G.: FastScale: accelerate RAID scaling by minimizing data migration. In: Proc. of the USENIX FAST’11 (2011)


Acknowledgements

We thank anonymous reviewers for their insightful comments. This work is partially sponsored by the National Key R&D Program of China (No. 2018YFB0105203), the Natural Science Foundation of China (NSFC) (No. 61972246), and the Natural Science Foundation of Shanghai (No. 18ZR1418500).

Author information

Correspondence to Chentao Wu.


About this article


Cite this article

Lin, Z., Guo, H. & Wu, C. AIR: an approximate intelligent redistribution approach to accelerate RAID scaling. CCF Trans. HPC (2020). https://doi.org/10.1007/s42514-020-00021-0


Keywords

  • RAID
  • Redistribution
  • Data migration
  • Scalability
  • Performance evaluation