After having defined some necessary preliminaries in the last section, we are now able to discuss the system model, the formal specification of the local and the global optimization models, and the heuristic optimization approach.
Each optimization approach places the chunks of the data objects on several cloud storages in a cost-efficient way without violating predefined service-level objectives (SLOs). These SLOs are defined for each data object by its owner. The SLOs are availability, durability, and vendor lock-in, as defined in Sect. 2.
System model
For the data object placement optimization, we provide a mixed-integer linear programming (MILP)-based local and a global data placement approach and a heuristic approach. In the following, we introduce the used variables, the cost model, and the used decision variables, before we discuss the optimization approaches.
Variables
In our system model, the set of all available storages is labeled S, where \(s \in S=\{s_1,s_2,\ldots \}\) denotes one storage of this set. N is the set of all data objects, and \(F~\in ~N=\{F_1,F_2,\ldots \}\) is one data object of this set. Next, \(f \in F=\{f_1,f_2,\ldots \}\) denotes a chunk of a data object F, with \(\vert S \vert ~\ge ~\vert F \vert \). As described in Sect. 2.2, the number of chunks of a data object depends on the erasure coding configuration used. The parameter \(\tau \) defines the amount of historical information of a chunk (e.g., the number of read and write operations) that is used for the optimization. Under the assumption that the usage pattern of a data object does not change over a period of time [16], our optimization approach uses this information to predict future data access.
As mentioned in Sect. 2.3, several storage providers define a block rate pricing model for the storage and traffic cost. In the system model, \(b_s\) defines one block of a block rate pricing model of storage s and \(b_s \in B^{T_\mathrm{out}}_s=\{b_1,b_2,\ldots \}\), where \(B^{T_\mathrm{out}}_s\) defines the set of all outgoing traffic price blocks. Analogously, \(B^\mathrm{sto}_s=\{b_1,b_2,\ldots \}\) defines the pricing blocks for the storage prices. Furthermore, \(b_s=(b_{L,s},b_{U,s},p_s)\) defines a triple that includes the lower bound of the pricing block \(b_{L,s}\), the upper bound \(b_{U,s}\), and the price of this block \(p_s\).
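The block triple \(b_s=(b_{L,s},b_{U,s},p_s)\) can be illustrated with a minimal Python sketch. The class name `PriceBlock`, the function `block_price`, the half-open interval convention, and all prices below are hypothetical and chosen only for illustration; they are not taken from a real provider or from the original model.

```python
from typing import NamedTuple, Sequence


class PriceBlock(NamedTuple):
    """One block b_s = (b_L, b_U, p) of a block rate pricing model."""
    lower: float  # b_{L,s}: lower bound of the block
    upper: float  # b_{U,s}: upper bound of the block
    price: float  # p_s: price per unit while usage lies in this block


def block_price(blocks: Sequence[PriceBlock], usage: float) -> float:
    """Return the price of the block whose range contains `usage`.

    Assumption: blocks are treated as half-open intervals [lower, upper)
    so that a value on a shared boundary belongs to exactly one block.
    """
    for block in blocks:
        if block.lower <= usage < block.upper:
            return block.price
    raise ValueError(f"usage {usage} is not covered by any pricing block")


# Hypothetical outgoing-traffic blocks (usage in GB, price per GB).
B_T_OUT = [
    PriceBlock(0.0, 10_000.0, 0.09),
    PriceBlock(10_000.0, 50_000.0, 0.085),
    PriceBlock(50_000.0, float("inf"), 0.07),
]
```

With these example blocks, a storage that has served 12,000 GB of outgoing traffic falls into the second block, so `block_price(B_T_OUT, 12_000.0)` yields the second block's price.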
Cost model
The cost model is used by the optimization to calculate the cost that accrues by storing a chunk. The cost of storing a chunk is composed of the storage cost, the traffic cost of the chunk, and the cost of the performed read and write operations.
The total cost that is charged if a chunk f is stored on storage s by considering the last \(\tau \) min of the chunk’s history is calculated by (1). The total cost is calculated by adding the used storage cost, the cost for read and write operations, and the cost for the used incoming and outgoing traffic. The single cost factors will be explained in the following.
$$\begin{aligned} c_{s,f}(\tau )= & {} c^{S}_{s,f}(\tau ) + c^{R}_{s,f}(\tau ) + c^{W}_{s,f}(\tau ) + c^{T_\mathrm{in}}_{s,f}(\tau )\nonumber \\&+ c^{T_\mathrm{out}}_{s,f}(\tau ) \end{aligned}$$
(1)
The storage cost that is charged if the chunk f is stored on s, based on the last \(\tau \) min of the chunk’s history, is calculated by (2). In the equation, the term \(p^{S}_{s, \gamma _{s,f}}\) denotes the current storage price. Since several cloud storage providers use the already discussed block rate pricing model, the current price can differ depending on the present usage of the storage. In \(p^{S}_{s, \gamma _{s,f}}\), this is taken into account by the term \(\gamma _{s,f}\), which calculates the present usage of the storage s and adds the size of f to the result if f is currently not stored on the storage. The resulting price is then multiplied by the chunk size calculated by \(\sigma _{f}(\tau )\), considering the last \(\tau \) min of the chunk’s history. This term also considers whether a BSU is defined for the storage. If this is the case and the chunk size is smaller than the BSU, then \(\sigma _{f}(\tau )\) uses the BSU instead of the actual chunk size.
If storage s is a long-term storage, the BTU time has to be considered as well. This is done by \(\hat{\sigma }_{f, \mathrm{BTU}} \cdot h_{f}\). The term \(\hat{\sigma }_{f, \mathrm{BTU}}\) calculates the storage size of the chunk f that is charged for the remaining BTU time.
$$\begin{aligned} c^{S}_{s,f}(\tau ) = p^{S}_{s, \gamma _{s,f}} \cdot (\sigma _{f}(\tau ) + \hat{\sigma }_{f, \mathrm{BTU}} \cdot h_{f}) \end{aligned}$$
(2)
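As a worked illustration of (2), the following Python sketch evaluates the storage cost for a single chunk. It assumes that the block-dependent price \(p^{S}_{s, \gamma _{s,f}}\) has already been resolved to a single number and that \(\sigma _{f}(\tau )\) and \(\hat{\sigma }_{f, \mathrm{BTU}}\) are given as plain sizes. The function names and all numbers are hypothetical.

```python
def charged_size(chunk_size: float, bsu: float = 0.0) -> float:
    """sigma_f(tau): the chunk size, raised to the BSU if the chunk is smaller."""
    return max(chunk_size, bsu)


def storage_cost(price: float, chunk_size: float, bsu: float = 0.0,
                 remaining_btu_size: float = 0.0, long_term: bool = False) -> float:
    """Evaluate (2): p * (sigma_f(tau) + sigma_hat_{f,BTU} * h_f)."""
    h_f = 1 if long_term else 0  # h_f = 1 only for long-term storages
    return price * (charged_size(chunk_size, bsu) + remaining_btu_size * h_f)


# Hypothetical example: a 0.5 GB chunk on a long-term storage with a BSU of
# 1 GB, 3 GB still to be charged for the remaining BTU, price 0.02 per GB.
example_cost = storage_cost(price=0.02, chunk_size=0.5, bsu=1.0,
                            remaining_btu_size=3.0, long_term=True)
```

In this example the BSU replaces the actual chunk size, so the charged amount is 0.02 · (1.0 + 3.0). On a standard storage, \(h_f = 0\) removes the BTU term.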
The cost charged for the write and read operations is calculated by (3) and (4). The terms \(r^{W}_{f}(\tau )\) and \(r^{R}_{f}(\tau )\) return the number of write and read operations on chunk f during the last time period \(\tau \), respectively. Further, \(p^{W}_{s}\) defines the price of a write operation and \(p^{R}_{s}\) the price of a read operation. Delete operations are handled analogously.
$$\begin{aligned} c^{W}_{s,f}(\tau )= & {} r^{W}_{f}(\tau ) \cdot p^{W}_{s} \end{aligned}$$
(3)
$$\begin{aligned} c^{R}_{s,f}(\tau )= & {} r^{R}_{f}(\tau ) \cdot p^{R}_{s} \end{aligned}$$
(4)
The outgoing and incoming traffic cost of a chunk f in the last time period \(\tau \) is calculated by (5) and (6). Since the description of the outgoing traffic cost is also applicable to the incoming traffic cost, we will only discuss the outgoing traffic cost defined in (5). \(t^\mathrm{out}_{f}(\tau )\) defines the number of bytes read from chunk f during the last time period \(\tau \). The term \(p^{T_\mathrm{out}}_{s,\beta _{s,f}}\) returns the outgoing traffic price of storage s. As in the storage cost calculation defined in (2), the block rate pricing model also has to be considered for the traffic cost calculation. Analogous to \(\gamma _{s,f}\) in (2), this is done by \(\beta _{s,f}\) in (5): \(\beta _{s,f}\) calculates the number of bytes read from storage s, including chunk f. If storage s is a long-term storage, data retrieval cost can be charged as well. This additional price is considered by \(p^\mathrm{ret}_{s,\beta _{s,f}}\). The variable \(h_f\) is the same as in (2).
$$\begin{aligned} c^{T_\mathrm{out}}_{s,f}(\tau )= & {} t^\mathrm{out}_{f}(\tau ) \cdot \left( p^{T_\mathrm{out}}_{s,\beta _{s,f}} + p^\mathrm{ret}_{s,\beta _{s,f}} \cdot h_{f}\right) \end{aligned}$$
(5)
$$\begin{aligned} c^{T_\mathrm{in}}_{s,f}(\tau )= & {} t^\mathrm{in}_{f}(\tau ) \cdot p^{T_\mathrm{in}}_{s,\beta _{s,f}} \end{aligned}$$
(6)
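The composition of the total cost in (1) from the operation cost (3), (4) and the traffic cost (5), (6) can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the data classes, field names, and numbers are hypothetical, the block-dependent traffic prices are assumed to be already resolved, and the storage cost from (2) is passed in as a precomputed value.

```python
from dataclasses import dataclass


@dataclass
class ChunkUsage:
    """History of one chunk f over the last tau minutes (hypothetical fields)."""
    reads: int        # r^R_f(tau)
    writes: int       # r^W_f(tau)
    bytes_out: float  # t^out_f(tau)
    bytes_in: float   # t^in_f(tau)


@dataclass
class StoragePrices:
    """Prices of one storage s, with block prices already resolved."""
    read_op: float
    write_op: float
    traffic_out: float
    traffic_in: float
    retrieval: float = 0.0   # p^ret, only charged on long-term storages
    long_term: bool = False  # determines h_f


def operation_cost(usage: ChunkUsage, prices: StoragePrices) -> float:
    """Sum of (3) and (4): write and read operation cost."""
    return usage.writes * prices.write_op + usage.reads * prices.read_op


def traffic_cost(usage: ChunkUsage, prices: StoragePrices) -> float:
    """Sum of (5) and (6): outgoing (incl. retrieval) and incoming traffic cost."""
    h_f = 1 if prices.long_term else 0
    outgoing = usage.bytes_out * (prices.traffic_out + prices.retrieval * h_f)
    incoming = usage.bytes_in * prices.traffic_in
    return outgoing + incoming


def total_cost(storage_cost: float, usage: ChunkUsage, prices: StoragePrices) -> float:
    """Evaluate (1) for a given storage cost c^S."""
    return storage_cost + operation_cost(usage, prices) + traffic_cost(usage, prices)
```

For example, a chunk with 100 reads, 10 writes, 2 GB of outgoing and 1 GB of incoming traffic on a long-term storage with free incoming traffic is charged the sum of its storage, operation, and outgoing traffic cost, where the retrieval price is added to the outgoing traffic price.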
Optional migration cost, which occurs if a chunk f has to be migrated from one storage to another, is calculated by (7) and (8). Which equation is used depends on the migration type. If a chunk f has to be migrated from one storage to another storage of the same provider and this provider defines a special migration price, (7) is used. If a chunk f has to be migrated from one storage to another storage of a different provider, (8) is used. In (8), \(p^{T_\mathrm{out}}_{s_1,\beta _{s_1,f}}\), \(p^{T_\mathrm{in}}_{s_2,\beta _{s_2,f}}\), and \(p^\mathrm{ret}_{s,\beta _{s,f}}\) are defined analogously to (5) and (6). In (7), \(p^{T_\mathrm{out,reg}}_{s_1,\beta _{s_1,f}}\) and \(p^{T_\mathrm{in,reg}}_{s_2,\beta _{s_2,f}}\) define the same prices but for region migration. \(\hat{\sigma }_{f}\) specifies the size of the chunk f. \(r^{R}_{s_1}\) and \(r^{W}_{s_2}\) define the number of required read and write operations. The terms \(p^{R}_{s_1}\) and \(p^{W}_{s_2}\) represent the same as in (3) and (4).
$$\begin{aligned} \begin{aligned} c^{M_\mathrm{reg}}_{s_{1},s_{2},f} =&\left( p^{T_\mathrm{out,reg}}_{s_1,\beta _{s_1,f}} + p^{T_\mathrm{in,reg}}_{s_2,\beta _{s_2,f}} + p^\mathrm{ret}_{s,\beta _{s,f}} \cdot h_{f}\right) \cdot \hat{\sigma }_{f} \\&+ r^{R}_{s_1} \cdot p^{R}_{s_1} + r^{W}_{s_2} \cdot p^{W}_{s_2} \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned} c^{M}_{s_{1},s_{2},f} =&\left( p^{T_\mathrm{out}}_{s_1,\beta _{s_1,f}} + p^{T_\mathrm{in}}_{s_2,\beta _{s_2,f}} + p^\mathrm{ret}_{s,\beta _{s,f}} \cdot h_{f}\right) \cdot \hat{\sigma }_{f} \\&+ r^{R}_{s_1} \cdot p^{R}_{s_1} + r^{W}_{s_2} \cdot p^{W}_{s_2} \end{aligned} \end{aligned}$$
(8)
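Since (7) and (8) share the same structure and differ only in the traffic prices used, both can be illustrated with one Python function. The sketch below is a simplified illustration: the function name, the parameter names, and the prices are hypothetical, and the block-dependent prices are assumed to be already resolved.

```python
def migration_cost(size: float, price_out: float, price_in: float,
                   reads: int, price_read: float,
                   writes: int, price_write: float,
                   price_retrieval: float = 0.0, long_term: bool = False) -> float:
    """Evaluate the common form of (7) and (8).

    For (8), pass the regular traffic prices of source s1 and target s2;
    for (7), pass the regional migration prices instead.
    """
    h_f = 1 if long_term else 0  # retrieval is only charged on long-term storages
    traffic = (price_out + price_in + price_retrieval * h_f) * size
    operations = reads * price_read + writes * price_write
    return traffic + operations


# Hypothetical example: a 2 GB chunk, one read on s1 and one write on s2.
cross_provider = migration_cost(2.0, price_out=0.09, price_in=0.0,
                                reads=1, price_read=0.0004,
                                writes=1, price_write=0.005)
same_provider = migration_cost(2.0, price_out=0.02, price_in=0.0,
                               reads=1, price_read=0.0004,
                               writes=1, price_write=0.005)
```

With these example prices, the migration within one provider is cheaper only because of the lower regional outgoing traffic price; the operation cost is identical in both cases.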
Decision variables
In our optimization model, \(x_{s,f} \in \{0,1\}\) defines if a chunk f is stored on storage s (\(x_{s,f} = 1\)) or not (\(x_{s,f}~=~0\)). The optimization model uses the variable \(g_{\tilde{S},F} \in \{0,1\}\), where \(\tilde{S} = \{s_1,s_2,\ldots ,s_n\}\) is a subset of the storage set S, i.e., \(\tilde{S} \subseteq S\), with \(\vert \tilde{S} \vert = \vert F \vert \). \(g_{\tilde{S},F} = 1\) indicates that all storages of the subset \(\tilde{S}\) have one chunk of F stored and \(g_{\tilde{S},F} = 0\) indicates that at least one storage of \(\tilde{S}\) does not store a chunk of F.
The decision variable \(h_f \in \{0,1\}\) denotes that a chunk f is currently stored on a long-term storage, indicated by \(h_f = 1\), or not, indicated by \(h_f = 0\). The system model further uses the decision variables \(z_{s_1,s_2}\) and \(y_{s_1,s_2}\) to indicate if two storages are the same or are different but have the same storage provider. The variable \(y_{s_1,s_2} \in \{0,1\}\) defines if the storages \(s_1\) and \(s_2\) are not identical but have the same storage provider, indicated by \(y_{s_1,s_2}=1\); \(y_{s_1,s_2} = 0\) otherwise. Analogously, \(z_{s_1,s_2} \in \{0,1\}\) defines if the storages \(s_1\) and \(s_2\) are not the same and have different storage providers, indicated by \(z_{s_1,s_2}=1\); \(z_{s_1,s_2} = 0\) otherwise.
The decision variables \(u^{T_\mathrm{out}}_{s,b_s} \in \{0,1\}\), \(v^{T_\mathrm{out}}_{s,b_s} \in \{0,1\}\), and \(o^{T_\mathrm{out}}_{s,b_s} \in \{0,1\}\) are used to indicate if the overall used outgoing traffic of a storage is in a specific pricing step of a block rate pricing model. \(u^{T_\mathrm{out}}_{s,b_s}~=~1\) indicates that the outgoing traffic of storage s is bigger than the lower boundary \(b_{L,s}\) defined in \(b_s\); \(u^{T_\mathrm{out}}_{s,b_s}~=~0\) otherwise. \(v^{T_\mathrm{out}}_{s,b_s} = 1\) indicates that the outgoing traffic of storage s is smaller than the upper boundary \(b_{U,s}\) defined in \(b_s\); \(v^{T_\mathrm{out}}_{s,b_s} = 0\) otherwise. Furthermore, \(o^{T_\mathrm{out}}_{s,b_s} = 1\) indicates that the traffic is between the lower and upper bound of \(b_s\), which means that \(u^{T_\mathrm{out}}_{s,b_s} = 1\) and \(v^{T_\mathrm{out}}_{s,b_s} = 1\) hold. \(o^{T_\mathrm{out}}_{s,b_s} = 0\) indicates that this is not the case. The variables \(u^\mathrm{sto}_{s,b_s} \in \{0,1\}\), \(v^\mathrm{sto}_{s,b_s} \in \{0,1\}\) and \(o^\mathrm{sto}_{s,b_s} \in \{0,1\}\) are used to assess the block for the used storage of s.
Local placement problem
After defining the system model, we are now able to discuss the local optimization problem. The local optimization problem defines the problem of finding a cost-efficient placement of one data object F. Therefore, the optimization problem is provided with the data object F and all available storages S. Additionally, the problem takes the time period \(\tau \) as input.
Objective function
(9) shows the objective function, which is set to minimize the overall cost to store F.
$$\begin{aligned} \begin{aligned}&\text {min } \sum _{f \in F}\sum _{s \in S} \;\big (\; c_{s,f}(\tau ) \cdot w_{s,f} + c^{M}_{{f}_{s},s,f} \cdot z_{{f}_{s}, s} \\&\qquad +c^{M_\mathrm{reg}}_{{f}_{s},s,f} \cdot y_{{f}_{s}, s} \big ) \cdot x_{s,f} \end{aligned} \end{aligned}$$
(9)
\(c_{s,f}(\tau ) \cdot w_{s,f}\) calculates the overall cost to store the chunk f on the storage s by taking the last \(\tau \) min of the chunk’s usage history into account. The term \(c_{s,f}(\tau )\) was discussed in Sect. 3.1.2. The term \(w_{s,f} \in \) [1,BTU] is a multiplier that helps to specify if the overall storage cost can be reduced by storing a chunk f on a long-term storage. If s is a long-term storage, the term is initialized with the value of the BTU, i.e., \(w_{s,f}\) = BTU. The algorithm decreases \(w_{s,f}\) each time the history shows no or only rare usage of the chunk, where the amount of historical information considered is defined by the BTU. This is done until all historical information is checked or until \(w_{s,f} = 1\). If s is a standard storage without any BTU, the value is always 1.
\(c^{M}_{{f}_{s},s,f} \cdot z_{{f}_{s}, s}\) and \(c^{M_\mathrm{reg}}_{{f}_{s},s,f}~\cdot ~y_{{f}_{s}, s}\) calculate the migration cost from one storage to another, without and with special migration prices. The term \({f}_{s}\) returns the storage on which the chunk f is currently stored. Finally, the decision variable \(x_{s,f}\) decides if the chunk f is stored on the storage s or not.
Constraints
The first constraint (10) ensures that the selected storage solution of a data object fulfills the required vendor lock-in factor \(l_F\). As mentioned above, the vendor lock-in factor is set for each data object as an SLO.
$$\begin{aligned} \frac{1}{\sum _{f \in F}\sum _{s \in S}x_{s,f}} \le l_F \end{aligned}$$
(10)
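Constraint (10) can be checked for a given assignment with a few lines of Python. The sketch below assumes that the assignment is given as a 0/1 matrix with one row per storage and one column per chunk; the function names are hypothetical. For \(l_F > 0\) and at least one assignment, the fraction in (10) is equivalent to the linear form \(l_F \cdot \sum _{f \in F}\sum _{s \in S}x_{s,f} \ge 1\), which the second function evaluates.

```python
from typing import Sequence


def lock_in_satisfied(x: Sequence[Sequence[int]], l_f: float) -> bool:
    """Check (10) as written: 1 / (sum of all x_{s,f}) <= l_F."""
    used = sum(sum(row) for row in x)
    return used > 0 and 1.0 / used <= l_f


def lock_in_satisfied_linear(x: Sequence[Sequence[int]], l_f: float) -> bool:
    """Equivalent linear form for l_F > 0: l_F * (sum of all x_{s,f}) >= 1."""
    used = sum(sum(row) for row in x)
    return l_f * used >= 1.0


# Hypothetical assignment: three storages, two chunks, each chunk on its own storage.
x_example = [
    [1, 0],
    [0, 1],
    [0, 0],
]
```

For this example, two assignments exist, so a lock-in factor of 0.5 is fulfilled while a stricter factor of 0.4 is not.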
Constraints (11) and (12) ensure that the required availability and durability of a data object are fulfilled. In the following, we will only describe (11) because the description of (12) is analogous.
For a given k, \(\sum _{{\tilde{S}}' \in r_{\tilde{S},k}} \big [\prod _{s \in {\tilde{S}}'} \hat{a}_s \cdot \prod _{s \in \tilde{S} \setminus {\tilde{S}}'}(1-\hat{a}_s)\big ]\) sums, over all subsets \({\tilde{S}}'\) of size k of the set \(\tilde{S}\), the probability that exactly the storages in \({\tilde{S}}'\) are available. The set of these subsets is represented by \(r_{\tilde{S},k}\), and \(\hat{a}_s\) defines the availability of s. Consequently, this part of the equation calculates the probability that exactly k storages are available simultaneously. To complete (11), we have to take into account that a system which uses erasure coding with a coding configuration of (m, n) can withstand up to \(n-m\) simultaneous storage failures. This is done by increasing k from m, i.e., the minimum number of required chunks of F depicted by \(F^m\), to \(\vert \tilde{S} \vert \). In the equation, this is achieved by \(\sum _{k=\vert F^m \vert }^{\vert \tilde{S} \vert }\). Finally, the result is compared to \(a_F~\cdot ~g_{\tilde{S},F}\), where \(a_F\) defines the required availability of the data object and \(g_{\tilde{S},F}\) defines if each storage in \(\tilde{S}\) has one chunk stored or not. \(g_{\tilde{S},F}\) is defined by constraints (13) and (14), which together define a logical AND.
$$\begin{aligned}&\sum _{k=\vert F^m \vert }^{\vert \tilde{S} \vert } \sum _{{\tilde{S}'} \in r_{\tilde{S},k}}\bigg [\prod _{s \in {\tilde{S}}'}\hat{a}_s \prod _{s \in \tilde{S} \setminus {\tilde{S}}'} \left( 1-\hat{a}_s\right) \bigg ] \ge a_F \cdot g_{\tilde{S},F} \end{aligned}$$
(11)
$$\begin{aligned}&\sum _{k=\vert F^m \vert }^{\vert \tilde{S} \vert } \sum _{{\tilde{S}'} \in r_{\tilde{S},k}}\bigg [\prod _{s \in {\tilde{S}}'}\hat{d}_s \prod _{s \in \tilde{S} \setminus {\tilde{S}}'} \left( 1-\hat{d}_s\right) \bigg ] \ge d_F\cdot g_{\tilde{S},F} \end{aligned}$$
(12)
$$\begin{aligned}&g_{\tilde{S},F} \ge \sum _{s \in \tilde{S}}\sum _{f \in F}x_{s,f} - (\vert F \vert - 1) \end{aligned}$$
(13)
$$\begin{aligned}&g_{\tilde{S},F} \le \sum _{f \in F}x_{\tilde{s},f} \qquad \forall \tilde{s} \in \tilde{S} \end{aligned}$$
(14)
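The left-hand side of (11) and the AND linearization in (13) and (14) can be illustrated with a short Python sketch. It enumerates all subsets explicitly, which is only practical for small storage sets. The function names and the availability values are hypothetical, and the sketch assumes that each storage of \(\tilde{S}\) holds at most one chunk of F, as required by (17).

```python
from itertools import combinations
from math import prod
from typing import Sequence


def at_least_m_available(avail: Sequence[float], m: int) -> float:
    """Left-hand side of (11): probability that at least m storages are available."""
    indices = range(len(avail))
    total = 0.0
    for k in range(m, len(avail) + 1):
        for up in combinations(indices, k):  # one subset S' of size k
            p_up = prod(avail[i] for i in up)
            p_down = prod(1 - avail[i] for i in indices if i not in up)
            total += p_up * p_down
    return total


def g_indicator(holds_chunk: Sequence[int]) -> int:
    """g = 1 only if every storage of the subset holds a chunk of F."""
    g = int(all(holds_chunk))
    n = len(holds_chunk)
    assert g >= sum(holds_chunk) - (n - 1)  # constraint (13)
    assert g <= min(holds_chunk)            # constraint (14)
    return g


def availability_ok(avail: Sequence[float], m: int, required: float,
                    holds_chunk: Sequence[int]) -> bool:
    """Check (11) for one subset of storages."""
    return at_least_m_available(avail, m) >= required * g_indicator(holds_chunk)
```

For three storages with an availability of 0.9 each and a (2, 3) erasure coding configuration, the probability that at least two storages are available is 3 · 0.9² · 0.1 + 0.9³ = 0.972. If one storage of the subset holds no chunk, g is 0 and the constraint is trivially fulfilled, so it only restricts subsets that are actually used.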
With (15), it is ensured that exactly \(\vert F \vert \) assignments from the chunks \(f \in F\) to the storages \(s \in S\) exist. Furthermore, (16) and (17) ensure that each chunk is stored on at most one storage and that each storage holds at most one chunk of F, and (18) defines the decision variable boundaries.
$$\begin{aligned}&\sum _{f \in F}\sum _{s \in S} x_{s,f} = \vert F \vert \end{aligned}$$
(15)
$$\begin{aligned}&\sum _{s \in S} x_{s,f} \le 1 \end{aligned}$$
(16)
$$\begin{aligned}&\sum _{f \in F} x_{s,f} \le 1 \end{aligned}$$
(17)
$$\begin{aligned}&g_{\tilde{S},F} \in \{0,1\}; \;\; x_{s,f} \in \{0,1\}\nonumber \\&z_{s_1, s_2} \in \{0,1\}; \;\; y_{s_1, s_2} \in \{0,1\} \end{aligned}$$
(18)
Global placement problem
In the following, we discuss the global optimization problem. In contrast to the local optimization problem, the global optimization problem is the problem of finding the cheapest placement for all data objects on all available storages S. The global optimization problem takes as input the set of storages S and the set N, which contains all data objects. Further, the parameter \(\tau \) is set to the length of the BTU, i.e., \(\tau \) = BTU. This way, the global optimization gets the historical information of the last BTU and can, therefore, precisely calculate if the cost can be decreased by storing the chunk on a long-term storage where the whole BTU is charged.
Objective function
The objective function of the global optimization problem optimizes the placement of all chunks \(f \in F\) of all data objects \(F \in N\) as shown in (19).
$$\begin{aligned} \begin{aligned}&\text {min } \sum _{s \in S} \bigg [ \sum _{F \in N} \sum _{f \in F} \big (c^{R}_{s,f}(\tau ) + c^{W}_{s,f}(\tau ) \\&\qquad \qquad +c^{M}_{f_s,s,f} \cdot z_{f_s, s} + c^{M_\mathrm{reg}}_{f_s,s,f} \cdot y_{f_s, s} \big ) \cdot x_{s,f} \\&\qquad \qquad +c^\mathrm{sto}_{s}(\tau ) + c^{T_\mathrm{out}}_{s}(\tau ) \bigg ] \end{aligned} \end{aligned}$$
(19)
While for the local optimization the consideration of the currently stored chunks of a storage was enough to identify the current price in a block rate pricing model, this is no longer sufficient for the global optimization. Because multiple chunks can be migrated at once, all chunks on a storage can change and, thus, also the pricing step. To account for this, the global optimization problem splits up the cost calculation \(c_{s,f}(\tau )\): it models those cost calculations that may include a block rate pricing model, namely the outgoing traffic cost \(c^{T_\mathrm{out}}_{s}(\tau )\) and the storage cost \(c^\mathrm{sto}_{s}(\tau )\), as constraints. Those calculations are done by (25) for the outgoing traffic cost and by (31) for the storage cost. The terms \(c^{R}_{s,f}(\tau )\) and \(c^{W}_{s,f}(\tau )\) calculate the read/write operation cost, as defined in (3) and (4). The migration cost is calculated by \(c^{M_\mathrm{reg}}_{f_s,s,f}\) and \(c^{M}_{f_s,s,f}\) as defined in (7) and (8).
Constraints
The global optimization uses the same constraints as the local optimization, i.e., (10) to (18). However, due to the fact that the global optimization considers the placement of all files at once, all local optimization constraints need to be defined for all \(F \in N\). Furthermore, the global optimization requires further constraints, to include the storage cost and outgoing traffic cost of a storage, which are discussed in the following.
Constraints (20) to (24) define if the outgoing traffic of storage s is in the range of a block rate pricing model step. (20) together with (21) defines if the used outgoing traffic of storage s is bigger than the lower boundary \(b_{L,s}\) of a pricing step \(b_s \in B^{T_\mathrm{out}}_s=\{b_1,b_2,\ldots \}\). \(t^\mathrm{out}_{f}(\tau )\) is defined as in (5), and M is a sufficiently large constant that is larger than the largest possible value of \(\sum _{F \in N} \sum _{f \in F} t^\mathrm{out}_{f}(\tau ) \cdot x_{s,f}\).
$$\begin{aligned} \begin{aligned}&b_{L,s} \le \sum _{F \in N} \sum _{f \in F} t^\mathrm{out}_{f}(\tau ) \cdot x_{s,f} + M \cdot \left( 1 - u^{T_\mathrm{out}}_{s,b_s}\right) \\&\quad \forall s \in S;\;\forall b_s \in B^{T_\mathrm{out}}_s;\;\exists b_{L,s} \in b_s \end{aligned} \end{aligned}$$
(20)
$$\begin{aligned} \begin{aligned}&b_{L,s} > \sum _{F \in N} \sum _{f \in F} t^\mathrm{out}_{f}(\tau ) \cdot x_{s,f} - M \cdot u^{T_\mathrm{out}}_{s,b_s} \\&\quad \forall s \in S;\;\forall b_s \in B^{T_\mathrm{out}}_s;\; \exists b_{L,s} \in b_s \end{aligned} \end{aligned}$$
(21)
(22) together with (23) indicates if the used outgoing traffic of storage s is smaller than the upper boundary \(b_{U,s}\) of a pricing step \(b_s \in B^{T_\mathrm{out}}_s=\{b_1,b_2,\ldots \}\).
$$\begin{aligned} \begin{aligned}&\sum _{F \in N} \sum _{f \in F} t^\mathrm{out}_{f}(\tau ) \cdot x_{s,f} \le b_{U,s} + M \cdot \left( 1 - v^{T_\mathrm{out}}_{s,b_s}\right) \\&\quad \forall s \in S;\;\forall b_s \in B^{T_\mathrm{out}}_s;\;\exists b_{U,s} \in b_s \end{aligned} \end{aligned}$$
(22)
$$\begin{aligned} \begin{aligned}&\sum _{F \in N} \sum _{f \in F} t^\mathrm{out}_{f}(\tau ) \cdot x_{s,f} > b_{U,s} - M \cdot v^{T_\mathrm{out}}_{s,b_s} \\&\quad \forall s \in S;\;\forall b_s \in B^{T_\mathrm{out}}_s;\;\exists b_{U,s} \in b_s \end{aligned} \end{aligned}$$
(23)
Finally, (24) defines if the outgoing traffic is between the lower and upper boundary of a pricing block \(b_s\). This is the case if the used outgoing traffic is bigger than the lower boundary \(b_{L,s}\), indicated by \(u^{T_\mathrm{out}}_{s,b_s}\), and smaller than the upper boundary \(b_{U,s}\), indicated by \(v^{T_\mathrm{out}}_{s,b_s}\).
$$\begin{aligned} 0 \le u^{T_\mathrm{out}}_{s,b_s} + v^{T_\mathrm{out}}_{s,b_s} - 2 \cdot o^{T_\mathrm{out}}_{s,b_s} \le 1 \;\; \forall s \in S;\forall b_s \in B^{T_\mathrm{out}}_s \end{aligned}$$
(24)
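To illustrate how the big-M constraints (20) to (24) pin down the indicator variables, the following Python sketch checks the constraints for fixed values instead of solving a MILP. The function names and the numbers are hypothetical. Evaluating the constraints as written shows that they admit exactly one combination of u, v, and o for a given traffic value, provided M is large enough and the upper boundary is finite.

```python
from itertools import product


def indicators(total: float, lower: float, upper: float) -> tuple:
    """Values of (u, v, o) implied by (20) to (24) for a fixed traffic total."""
    u = int(total >= lower)  # enforced by (20) and (21)
    v = int(total <= upper)  # enforced by (22) and (23)
    o = int(u == 1 and v == 1)  # enforced by (24)
    return u, v, o


def satisfies_big_m(total: float, lower: float, upper: float,
                    u: int, v: int, o: int, big_m: float) -> bool:
    """Check constraints (20) to (24) literally for given binary values."""
    c20 = lower <= total + big_m * (1 - u)
    c21 = lower > total - big_m * u
    c22 = total <= upper + big_m * (1 - v)
    c23 = total > upper - big_m * v
    c24 = 0 <= u + v - 2 * o <= 1
    return c20 and c21 and c22 and c23 and c24


def feasible_assignments(total: float, lower: float, upper: float, big_m: float) -> list:
    """Enumerate all (u, v, o) in {0,1}^3 that satisfy (20) to (24)."""
    return [uvo for uvo in product((0, 1), repeat=3)
            if satisfies_big_m(total, lower, upper, *uvo, big_m)]
```

For a pricing block from 10 to 50 and M = 1000, a traffic total of 30 leaves only (u, v, o) = (1, 1, 1) feasible, whereas a total of 60 leaves only (1, 0, 0), i.e., the block is not active.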
The information whether the used traffic of a storage s is within a pricing range, given by \(o^{T_\mathrm{out}}_{s,b_s}\), is then used to calculate the cost that is charged for the used traffic of a storage, denoted by \(c^{T_\mathrm{out}}_{s}(\tau )\). This is done by (25), where \(p_s\) defines the price of the pricing range \(b_s \in B^{T_\mathrm{out}}_s=\{b_1,b_2,\ldots \}\).
$$\begin{aligned} \begin{aligned}&\sum _{F \in N} \sum _{f \in F} t^\mathrm{out}_{f}(\tau ) \cdot p_s \cdot x_{s,f} - M \left( 1-o^{T_\mathrm{out}}_{s,b_s}\right) \\&\quad \le c^{T_\mathrm{out}}_{s}(\tau ) \le \sum _{F \in N} \sum _{f \in F} t^\mathrm{out}_{f}(\tau ) \cdot p_s \cdot x_{s,f} \\&\qquad +M \cdot (1-o^{T_\mathrm{out}}_{s,b_s}) \;\; \forall s \in S;\forall b_s \in B^{T_\mathrm{out}}_s;\exists p_s \in b_s \end{aligned} \end{aligned}$$
(25)
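The effect of (25) can be made concrete with a small Python sketch that computes, for a fixed traffic total, the interval that each pricing block allows for \(c^{T_\mathrm{out}}_{s}(\tau )\). For the active block the interval collapses to a single value, while for inactive blocks M relaxes both bounds. The function names, the blocks, and the value of M are hypothetical; the indicator o is derived directly from the block boundaries instead of being a solver variable.

```python
from typing import Sequence, Tuple


def cost_bounds(total: float, price: float, o: int, big_m: float) -> Tuple[float, float]:
    """Lower and upper bound that (25) imposes on the cost for one pricing block."""
    slack = big_m * (1 - o)  # zero for the active block, M otherwise
    return total * price - slack, total * price + slack


def traffic_cost_from_blocks(total: float,
                             blocks: Sequence[Tuple[float, float, float]],
                             big_m: float) -> float:
    """Smallest non-negative cost value that satisfies (25) for every block."""
    lo, hi = 0.0, float("inf")  # the cost variable is non-negative
    for lower, upper, price in blocks:
        o = int(lower <= total <= upper)  # block is active
        block_lo, block_hi = cost_bounds(total, price, o, big_m)
        lo, hi = max(lo, block_lo), min(hi, block_hi)
    if lo > hi:
        raise ValueError("no feasible cost value; check blocks and big_m")
    return lo


# Hypothetical blocks (lower, upper, price) for the outgoing traffic of one storage.
example_blocks = [(0.0, 100.0, 0.09), (100.0, 1000.0, 0.08)]
```

For a traffic total of 200, only the second block is active, so the cost is fixed to 200 · 0.08 = 16, and the relaxed bounds of the first block do not restrict this value.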
Analogous to (20) to (24), which define if the used outgoing traffic of a storage s is within a pricing range of a block rate pricing model step, (26) to (30) define if the size of the chunks stored on storage s is within a pricing range. Since the description of (20) to (24) is also applicable to (26) to (30), we will not repeat it here in detail. In the following, \(\sigma _{f}(\tau )\) is defined as in (2), and M is a sufficiently large constant that is larger than the largest possible value of \(\sum _{F \in N} \sum _{f \in F} \sigma _{f}(\tau ) \cdot x_{s,f}\).
$$\begin{aligned} \begin{aligned}&b_{L,s} \le \sum _{F \in N} \sum _{f \in F} \sigma _{f}(\tau ) \cdot x_{s,f} + M \cdot \left( 1 - u^\mathrm{sto}_{s,b_s}\right) \\&\quad \forall s \in S;\;\forall b_s \in B^\mathrm{sto}_s;\;\exists b_{L,s} \in b_s \end{aligned} \end{aligned}$$
(26)
$$\begin{aligned} \begin{aligned}&b_{L,s} > \sum _{F \in N} \sum _{f \in F} \sigma _{f}(\tau ) \cdot x_{s,f} - M \cdot u^\mathrm{sto}_{s,b_s} \\&\quad \forall s \in S;\;\forall b_s \in B^\mathrm{sto}_s;\;\exists b_{L,s} \in b_s \end{aligned} \end{aligned}$$
(27)
$$\begin{aligned} \begin{aligned}&\sum _{F \in N} \sum _{f \in F} \sigma _{f}(\tau ) \cdot x_{s,f} \le b_{U,s} + M \cdot \left( 1 - v^\mathrm{sto}_{s,b_s}\right) \\&\quad \forall s \in S;\;\forall b_s \in B^\mathrm{sto}_s;\;\exists b_{U,s} \in b_s \end{aligned} \end{aligned}$$
(28)
$$\begin{aligned} \begin{aligned}&\sum _{F \in N} \sum _{f \in F} \sigma _{f}(\tau ) \cdot x_{s,f} > b_{U,s} - M \cdot v^\mathrm{sto}_{s,b_s} \\&\quad \forall s \in S;\;\forall b_s \in B^\mathrm{sto}_s;\;\exists b_{U,s} \in b_s \end{aligned} \end{aligned}$$
(29)
$$\begin{aligned}&0 \le u^\mathrm{sto}_{s,b_s} + v^\mathrm{sto}_{s,b_s} - 2 \cdot o^\mathrm{sto}_{s,b_s} \le 1 \;\; \forall s \in S;\forall b_s \in B^\mathrm{sto}_s \end{aligned}$$
(30)
Furthermore, analogous to (25), which calculates the cost that occurs due to the used outgoing traffic of s, (31) calculates the cost that occurs due to the used storage of s.
$$\begin{aligned} \begin{aligned}&\sum _{F \in N} \sum _{f \in F} \sigma _{f}(\tau ) \cdot p_s \cdot x_{s,f} - M \left( 1-o^\mathrm{sto}_{s,b_s}\right) \\&\quad \le c^\mathrm{sto}_{s}(\tau ) \le \sum _{F \in N} \sum _{f \in F} \sigma _{f}(\tau ) \cdot p_s \cdot x_{s,f} \\&\qquad +M \cdot (1-o^\mathrm{sto}_{s,b_s}) \;\; \forall s \in S;\;\forall b_s \in B^\mathrm{sto}_s;\exists p_s \in b_s \end{aligned} \end{aligned}$$
(31)
Finally, (32) ensures that \(c^{T_\mathrm{out}}_{s}(\tau )\) and \(c^\mathrm{sto}_{s}(\tau )\) are non-negative for all \(s \in S\) and that the remaining decision variables are restricted to \(\{0,1\}\).
$$\begin{aligned} \begin{aligned}&c^{T_\mathrm{out}}_{s}(\tau ) \ge 0;&c^\mathrm{sto}_{s}(\tau )&\ge 0&\\&u^{T_\mathrm{out}}_{s,b_s} \in \{0,1\};&o^{T_\mathrm{out}}_{s,b_s}&\in \{0,1\};&v^{T_\mathrm{out}}_{s,b_s}&\in \{0,1\}\\&v^\mathrm{sto}_{s,b_s} \in \{0,1\};&u^\mathrm{sto}_{s,b_s}&\in \{0,1\};&o^\mathrm{sto}_{s,b_s}&\in \{0,1\} \end{aligned} \end{aligned}$$
(32)
Heuristic placement
As a third optimization approach, we developed a heuristic for the global optimization. This heuristic first applies a classification of all data objects by the used storage size and the used outgoing traffic. Subsequently, the best fitting storage set for each of those classes is selected by optimizing a representative data object. The result of this optimization is then applied to all data objects in a class. Therefore, a nearly optimal solution can be achieved by calculating the optimal placement for only a couple of data objects.
Based on an intensive analysis of the global and local optimization results, we identified that the main reasons for a chunk migration are the outgoing traffic of a chunk and its size. Therefore, the classification is based on these two aspects. To define the upper and lower boundaries of the storage size classes, we use a separation by quantiles. However, in contrast to the storage size of the chunks, where each chunk has a size > 0, the traffic of a chunk can be 0, i.e., for all unused chunks. To ensure that all of them are in the same class, we define the traffic class boundaries based on the outgoing traffic values instead of a separation by quantiles, which could distribute those chunks over several classes.
Algorithms 1 and 2 describe how our approach uses classification and the already discussed local optimization problem to find a cheap storage solution for all chunks of all data objects. In the following, we first discuss the classification in Algorithm 1 and then the optimization, described in Algorithm 2, which is based on the results of the classification. Depending on the requirements, this heuristic optimization can be performed for each data object access or only in predefined intervals, e.g., every 1,000\(^\text {th}\) data object access. This is possible because our proposed heuristic optimizes the placement of all chunks at a time and, thus, also considers the data objects that were accessed before the optimization starts.
As input, the classification algorithm, depicted in Algorithm 1, requires the boundaries for the traffic and storage classes. The first step of the algorithm is to sort all data objects according to their used traffic and storage size, which is done by the methods sortByTraffic() and sortByStorage() in lines 2 and 3. Subsequently, empty lists for all traffic classes, called trafficClasses, and for all storage classes, called storageClasses, are generated (lines 4 and 5) according to the classes defined by the boundaries trafficBoundaries and storageBoundaries. Those empty lists are then filled with the sorted data objects by the method fillClasses() (lines 6 and 7). Finally, the Cartesian product of the lists trafficClasses and storageClasses is created and stored in the list classes (lines 9–14). This results in a list of all combinations of the traffic and storage classes including their corresponding chunks. This list is then used in Algorithm 2 for the optimization.
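The classification steps described above can be sketched in Python as follows. This is an illustrative reading of the description, not a transcription of Algorithm 1: data objects are represented as dictionaries with the hypothetical keys `id`, `traffic`, and `size`, the boundary semantics (a value equal to a boundary belongs to the lower class) are an assumption, and a combined class is taken to contain the data objects that belong to both the traffic class and the storage class.

```python
from bisect import bisect_left
from itertools import product


def fill_classes(data_objects, key, boundaries):
    """Sort the data objects by `key` and distribute them over the classes.

    Class i holds all objects with boundaries[i-1] < key <= boundaries[i];
    the last class holds everything above the largest boundary.
    """
    classes = [[] for _ in range(len(boundaries) + 1)]
    for obj in sorted(data_objects, key=key):
        classes[bisect_left(boundaries, key(obj))].append(obj)
    return classes


def perform_classification(data_objects, traffic_boundaries, storage_boundaries):
    """Return one list of data objects per (traffic class, storage class) pair."""
    traffic_classes = fill_classes(data_objects, lambda o: o["traffic"], traffic_boundaries)
    storage_classes = fill_classes(data_objects, lambda o: o["size"], storage_boundaries)
    classes = []
    for traffic_class, storage_class in product(traffic_classes, storage_classes):
        ids_in_storage_class = {o["id"] for o in storage_class}
        classes.append([o for o in traffic_class if o["id"] in ids_in_storage_class])
    return classes


# Hypothetical data objects; a traffic boundary of 0 keeps all unused objects together.
example_objects = [
    {"id": "A", "traffic": 0.0, "size": 5.0},
    {"id": "B", "traffic": 0.0, "size": 50.0},
    {"id": "C", "traffic": 20.0, "size": 5.0},
    {"id": "D", "traffic": 500.0, "size": 50.0},
]
example_classes = perform_classification(example_objects, [0.0, 100.0], [10.0])
```

With three traffic classes and two storage classes, the example yields six combined classes, some of which are empty. Empty classes are kept in the list so that the subsequent optimization can simply skip them.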
Algorithm 2 takes as input all data objects that are subject to optimization, in our case all stored data objects, and all available storages. At the beginning of Algorithm 2, the performClassification() method is called (line 2), which is described by Algorithm 1. The result of this call is a list of all possible traffic and storage class combinations, including the corresponding chunks. Algorithm 2 then iterates through this list (lines 3–15) and, provided the class is not empty (line 4), selects a representative data object for the current class at the beginning of each iteration. This is done by the method getRepresentativeDataObj() (line 5). The concrete data object selection depends on the implementation of this method. Possible implementations are, e.g., a random selection, the first or last data object of the class, or a data object in the middle of the class. Subsequently, the method localOpt() (line 6) finds the best chunk placement for the representative data object by solving the local optimization problem from Sect. 3.2 and stores it in dataObj. Afterward, the result of the optimization is read by the method getStorages() (line 7), which returns the list of selected storages and stores it in selStorages.
This storage set is then applied to all data objects in the current class. For this, the algorithm iterates through all data objects in the class (lines 8–13). In each iteration, the first step is to map each chunk of a data object to one storage of the selStorages list. This is done by the method getMap(), and the result is stored in \(chunkStgMap \) (line 9). To avoid unnecessary migration steps, getMap() checks whether a chunk is already stored on one of the selected storages from the list selStorages. In this case, the chunk is mapped to the same storage again. In the final step, each of those chunk-to-storage mappings is added to the resulting list optimalPla (lines 10–12). At the end of the execution of Algorithm 2, the list optimalPla holds the final chunk-to-storage mapping for all chunks of all data objects.
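The loop described above can be sketched in Python as follows. The sketch is an illustration under several assumptions and not the original Algorithm 2: the local optimization is passed in as a callable that returns the selected storages for a data object, the representative is the data object in the middle of the class, data objects are dictionaries with a hypothetical `chunks` key, and the selected storage list is assumed to contain at least as many storages as each data object in the class has chunks.

```python
def get_map(chunks, sel_storages, current_placement):
    """Map each chunk to one selected storage, keeping existing placements."""
    mapping = {}
    free = list(sel_storages)
    # First pass: chunks that already reside on a selected storage stay there.
    for chunk in chunks:
        storage = current_placement.get(chunk)
        if storage in free:
            mapping[chunk] = storage
            free.remove(storage)
    # Second pass: the remaining chunks are assigned to the remaining storages.
    for chunk in chunks:
        if chunk not in mapping:
            mapping[chunk] = free.pop(0)
    return mapping


def heuristic_placement(classes, local_opt, current_placement):
    """Apply the storage set of one representative per class to the whole class."""
    placement = {}
    for data_objects in classes:
        if not data_objects:  # skip empty classes
            continue
        representative = data_objects[len(data_objects) // 2]
        sel_storages = local_opt(representative)
        for data_obj in data_objects:
            chunk_stg_map = get_map(data_obj["chunks"], sel_storages, current_placement)
            placement.update(chunk_stg_map)
    return placement


# Hypothetical input: one non-empty class with two data objects and one empty class.
example_classes = [
    [{"id": "A", "chunks": ["A1", "A2"]}, {"id": "B", "chunks": ["B1", "B2"]}],
    [],
]
example_current = {"A1": "s3", "A2": "s1", "B1": "s2", "B2": "s9"}
example_result = heuristic_placement(example_classes,
                                     lambda data_obj: ["s1", "s2"],
                                     example_current)
```

In the example, the chunks A2 and B1 already reside on selected storages and therefore stay where they are, so only A1 and B2 have to be migrated.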
Since there is no historical information available for the first upload of a new data object, we use a fixed storage set for the first upload. Ideally, this fixed storage set includes the cheapest storages with respect to the storage and traffic prices.
The complexity of the algorithm depends mainly on the complexity of the selected local optimization solution, i.e., line 6 in Algorithm 2, and on the sorting of the data objects, i.e., lines 2 and 3 in Algorithm 1. The complexity of Algorithm 1 is \(O(\mathrm{max}(n \cdot \log (n), m \cdot w))\), where n is the number of data objects, m is the number of traffic classes, and w is the number of storage classes. Algorithm 2 depends on the local optimization, which is an NP-hard problem. To reduce and limit the optimization duration, a deadline-based approach can be used, e.g., taking the best solution found after 2 min of searching in the local optimization.