General idea
The general idea of our approach is to decompose the main problem into simpler subproblems that are solved iteratively. Each subproblem obtains its input from the solution of the previous subproblem, as shown in Fig. 3, where the numbered blocks represent the subproblems we solve. The approach is repeated at regular intervals as a method of proactive self-adaptation, or in response to unexpected situations that cause a run-time SLO violation as a method of reactive self-adaptation. Each subproblem is outlined below, while details are given in the next subsections.
1. Choosing the minimum computational requirements for each application component. In this step we decide the minimum computational requirements, in terms of resource rates (e.g., Amazon’s elastic computing units, or simply ECUs), that each application component needs to satisfy the quality requirement. At this stage we do not consider the available resources; we only determine the ECU requirements of the application.
2. Choosing the resources to rent. In this step we calculate the bidding price that minimizes the cost for each unit of rate (e.g., 1 ECU) and, based on it, we decide which resources to rent. The sum of the ECUs of the rented resources should be large enough to fulfill the ECU requirements of the application decided in the previous step.
3. Choosing the allocation of the application components to the resources. In this step we decide how to allocate the different application components to the rented resources so as to minimize the negative effects of allocation (e.g., the reduction in performance due to load balancing, as happens in the third deployment example in Fig. 1).
4. Analyzing the overall system and possible scaling-up of bottlenecks. The performance of the overall deployed system is analyzed again, taking into account the overhead added by the presence of multiple CPUs and load balancing. This is also the step in which we consider the effects of the random environment, i.e., the possibility of losing spot instances and replacing them with on-demand instances when the chosen bid price is overbid. If this analysis shows that the chosen resources and allocation no longer fulfill the quality requirements, the ECU requirements of the bottleneck software servers are increased to compensate, and new resources and allocations are decided.
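The way the four steps feed each other can be pictured with a short Python sketch. This is only an illustration of the control flow described above, not our implementation: the five callables, the `max_rounds` safeguard, and the toy stand-ins used in the example are all hypothetical names and values introduced here.

```python
import math

def plan_deployment(find_rates, find_resources, find_allocation,
                    violation, scale_up, max_rounds=20):
    """Chain the four steps; every callable is a hypothetical placeholder."""
    mu = find_rates()                      # step 1: minimum rates
    for _ in range(max_rounds):            # max_rounds is an added safeguard
        t = find_resources(mu)             # step 2: resources to rent
        d = find_allocation(mu, t)         # step 3: allocation
        if violation(mu, t, d) <= 0:       # step 4: SLO still satisfied?
            return mu, t, d
        mu = scale_up(mu, t, d)            # raise bottleneck rates and retry
    raise RuntimeError("no feasible deployment found within max_rounds")

# Toy stand-ins: identical 2-ECU instances, and an analysis step that
# pretends the deployed system needs 5 ECUs once overheads are included.
UNIT = 2.0
mu, t, d = plan_deployment(
    find_rates=lambda: [2.0, 1.0],
    find_resources=lambda mu: math.ceil(sum(mu) / UNIT),
    find_allocation=lambda mu, t: {"instances": t, "rates": list(mu)},
    violation=lambda mu, t, d: 5.0 - t * UNIT,
    scale_up=lambda mu, t, d: [1.5 * x for x in mu],
)
print(mu, t)
```

In the toy run the first pass rents two instances, the analysis reports a violation, all rates are scaled by 1.5, and the second pass rents three instances and stops.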
Finding the optimal rate for each software server
In this step we want to find a first approximation of the solution of the global problem by assuming that each software server m is deployed on a dedicated hypothetical resource that provides the minimum rate \(\hat{\mu }_m\) to process requests such that the SLO constraints are satisfied. In this step we do not consider the characteristics of the real resources (e.g., number of processors, prices, and the random environment information), since the decision on which ones to rent is made in the next steps. The goal of this optimization problem is to find the minimal rates \(\hat{\mu }_m\) that fulfill the constraints on the mean response time and on the response time distribution.
$$\begin{aligned} \text {min}\quad&\sum \limits _{m=1,\ldots ,M} \hat{\mu }_m \\ \text {s.t.}\quad&\textit{MRT}_{k}(\hat{\mu }) \le \textit{maxMRT}_k, \forall k \\&\textit{RTP}_{u,k}(\hat{\mu }) \le \textit{maxRTP}_{u,k}, \forall u, \forall k \end{aligned}$$
To solve this subproblem we use a greedy algorithm that scales down the rates of all the resources as much as it can until one or more bottleneck resources are found for the class of jobs that is closest to the boundary of the constraints. At this point, the rates of the bottleneck resources are fixed, and the algorithm continues to scale down the remaining rates, until all of them have been fixed in the same way.
The pseudocode listing of the algorithm is shown in Fig. 4. The function receives as input an initial set of arbitrarily large feasible rates \(\hat{\mu }_{\textit{init}}\), and the system model S that contains all the parameters of the application and the resources described in Sect. 3. It returns the optimal rates for each software server as the vector \(\hat{\mu }\). The variable r is initialized as the set of all available resources that can be scaled. Then, all resources are scaled down using a bisection method until the constraints are violated: minimum rates are increased when the constraints are satisfied and the maximum rates are decreased when the constraints are violated. When the minimum and maximum rates are close enough, the current bottleneck resources are removed from r and the process continues until r is empty. At this point the rate calculated so far is returned as our optimal \(\hat{\mu }\). The auxiliary functions used in the algorithm (briefly described in Fig. 5) are directly derived from the evaluation of the queueing network and simple operational analysis laws.
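To make the greedy scheme more concrete, the following Python sketch reproduces its structure on a deliberately simplified model. It is not the listing of Fig. 4. It assumes that every software server behaves as an M/M/1 queue, that only the mean response time constraints are checked, and that the bottlenecks of a round are the free servers visited by the job class closest to its limit. The names `arrivals`, `visits`, `max_mrt`, and `tol` are hypothetical, and the bisection is carried out on a common scaling factor applied to the rates that are still free.

```python
import math

def mean_response_times(mu, arrivals, visits):
    """Toy model (an assumption): class k spends visits[k][m] / (mu[m] - load[m])
    at server m, where load[m] is the total arrival rate reaching server m."""
    M = len(mu)
    load = [sum(a * v[m] for a, v in zip(arrivals, visits)) for m in range(M)]
    mrt = []
    for v in visits:
        total = 0.0
        for m in range(M):
            if v[m] == 0:
                continue
            if mu[m] <= load[m]:          # unstable queue: unbounded response time
                total = math.inf
                break
            total += v[m] / (mu[m] - load[m])
        mrt.append(total)
    return mrt

def find_optimal_rates(mu_init, arrivals, visits, max_mrt, tol=1e-9):
    """Assumes mu_init is feasible and every server is visited by some class."""
    mu = list(mu_init)
    free = set(range(len(mu)))            # the set r of rates still scalable

    def slacks(rates):
        times = mean_response_times(rates, arrivals, visits)
        return [limit - t for limit, t in zip(max_mrt, times)]

    while free:
        lo, hi = 0.0, 1.0                 # scaling factors: hi feasible, lo not
        while hi - lo > tol:
            mid = (lo + hi) / 2
            trial = [mu[m] * mid if m in free else mu[m] for m in range(len(mu))]
            if min(slacks(trial)) >= 0:
                hi = mid                  # still feasible: scale down further
            else:
                lo = mid                  # violated: back off
        mu = [mu[m] * hi if m in free else mu[m] for m in range(len(mu))]
        s = slacks(mu)
        # Tightest class among those that still visit a free server.
        candidates = [k for k, v in enumerate(visits) if any(v[m] for m in free)]
        tight = min(candidates, key=lambda k: s[k])
        free -= {m for m in free if visits[tight][m]}   # fix its servers
    return mu

# Two classes, each visiting its own server.
rates = find_optimal_rates([100.0, 100.0], arrivals=[2.0, 1.0],
                           visits=[[1, 0], [0, 1]], max_mrt=[0.5, 1.0])
print(rates)
```

In this example the first round stops when the first class reaches its limit, which fixes the first rate close to 4. The second round then scales the remaining rate down to about 2.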
Finding the real resources to rent
In the previous step we calculated the computational needs in terms of rates of the virtual resources. In this step we decide which real resources to rent to provide these computational needs at minimal expense. To make this decision we consider for each real resource y a mean price \(\hat{c}_y\), which can be obtained from historical traces using the estimation method we discuss in Sect. 5. The goal is to minimize the sum of these costs while ensuring that the total rate of the rented resources is large enough to accommodate the rates found as the solution of the previous problem.
$$\begin{aligned} \text {min}\quad&\sum \limits _{y=1,\ldots ,Y} \hat{c}_y \\ \text {s.t.}\quad&\sum \limits _{y \in 1,\ldots ,Y} \hat{\lambda }_y \ge \sum \limits _{m \in 1,\ldots ,M} \hat{\mu }_m \end{aligned}$$
This subproblem is a classical integer linear programming (ILP) problem, since the decision variables are integers and both the constraints and the objective function are linear. ILP is a well-known NP-hard problem, for which an approximate solution can be found using any ILP solver. We implemented a function findResourcesToRent that interfaces with the MATLAB intlinprog solver; it accepts the rates of the software servers \(\hat{\mu }\) and the system parameters \(\textit{S}\) as inputs, and returns the resource assignment vector t.
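For readers without access to MATLAB, the structure of this subproblem can be illustrated with a small Python sketch. It does not call intlinprog or any other ILP solver. Instead it swaps in an exact dynamic program over integer rates, which is only practical for small instances. It assumes that each resource type offers a positive integer number of ECUs, that any number of instances of each type can be rented, and that the result is reported as a count per type. The function name and the `(rate, price)` pairs are hypothetical.

```python
import math

def find_resources_to_rent(required_rate, types):
    """types: list of (rate_in_ecu, mean_price) with positive integer rates.
    Returns (total_cost, counts), the cheapest rental covering required_rate."""
    need = math.ceil(required_rate)
    # best[r] = cheapest (cost, counts) that covers at least r ECUs
    best = [(0.0, [0] * len(types))] + [None] * need
    for r in range(1, need + 1):
        for j, (rate, price) in enumerate(types):
            prev_cost, prev_counts = best[max(0, r - rate)]
            cost = prev_cost + price
            if best[r] is None or cost < best[r][0]:
                counts = prev_counts.copy()
                counts[j] += 1
                best[r] = (cost, counts)
    return best[need]

# Illustrative prices only: 1-ECU, 2-ECU and 8-ECU resource types.
cost, counts = find_resources_to_rent(6.3, [(1, 0.010), (2, 0.018), (8, 0.080)])
print(cost, counts)
```

With these made-up prices, covering 6.3 ECUs is cheapest with one 1-ECU and three 2-ECU resources, rather than a single 8-ECU resource.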
Finding the allocation of the rate for each software server to the real resources
In this step we want to find a good allocation of the rates found so far for each software server to the rented resources. We can combine the allocation of multiple software servers to a single resource with the replication of a single software server across multiple resources, as in the last deployment example of Fig. 1. The allocation decision should minimize the overhead due to load balancing by minimizing the number of associations (\(a_{m,y}\)) between software servers and resources, while still ensuring (i) that each software server obtains at least its minimum rate \(\hat{\mu }_m\), and (ii) that each rented resource y does not provide more than its maximum rate \(\hat{\lambda }_y\).
$$\begin{aligned} \text {min}\quad&\sum \limits _{m=1,\ldots ,M} \sum \limits _{y=1,\ldots ,Y} a_{m,y} \\ \text {s.t.}\quad&a_{m,y} = {\left\{ \begin{array}{ll} 1 &{} \text {if }\;d_{m,y} \ne 0 \\ 0 &{} \text {if }\; d_{m,y}=0 \end{array}\right. }, \forall m, \forall y \\&\sum \limits _{y \in Y} d_{m,y} \ge \hat{\mu }_m, \forall m \\&\sum \limits _{m \in M} d_{m,y} \le \lambda _{t_y}, \forall y \end{aligned}$$
To solve this problem we propose an algorithm that finds an approximate allocation by allocating the rates of the software servers having the largest non-allocated rate to the real resources having the largest available capacity in an iterative process until the rates of all software servers have been allocated.
A listing of this algorithm is shown in Fig. 6 as the findRateAllocation function. This function takes as input the rates \(\hat{\mu }\) we have previously calculated using the findOptimalRates function, and the rented resource rates \(\hat{\lambda }_y\), which can be derived from the vector of types \(t_y\) calculated using the findResourcesToRent function with the relation \(\hat{\lambda } = \lambda (t_y)\). In each iteration of the algorithm we find the software server with the highest non-allocated rate, \(m_\textit{max}\), and the rented resource with the highest available rate, \(y_\textit{max}\). Then, we allocate as much rate as possible between the two, i.e., the smaller of the remaining rate of \(m_\textit{max}\) and the remaining rate of \(y_\textit{max}\), by increasing the corresponding value in the allocation matrix \(d_{m_\textit{max},y_\textit{max}}\). To avoid reallocating previously allocated rates, we decrement both the rate of \(m_\textit{max}\) and the rate of \(y_\textit{max}\) by the allocated value. The process is repeated until all the software servers have zero rate.
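The iteration just described can be sketched in a few lines of Python. This is an illustrative rendering of the procedure, not the listing of Fig. 6: the amount transferred in each iteration is taken to be the largest one that both sides can accommodate, that is, the smaller of the two remaining values, and the tolerance `eps` and the error raised when capacity runs out are additions made here.

```python
def find_rate_allocation(mu, lam, eps=1e-12):
    """Greedy allocation of server rates mu to rented resource rates lam.
    Returns the matrix d, where d[m][y] is the rate of server m placed on y."""
    need = list(mu)        # rate each software server still has to receive
    spare = list(lam)      # rate each rented resource can still provide
    d = [[0.0] * len(lam) for _ in mu]
    while need and max(need) > eps:
        m_max = max(range(len(need)), key=need.__getitem__)
        y_max = max(range(len(spare)), key=spare.__getitem__)
        amount = min(need[m_max], spare[y_max])
        if amount <= eps:
            raise ValueError("rented capacity is smaller than the required rate")
        d[m_max][y_max] += amount
        need[m_max] -= amount      # avoid reallocating what is already placed
        spare[y_max] -= amount
    return d

d = find_rate_allocation([5, 3, 2], [8, 4])
associations = sum(1 for row in d for x in row if x != 0)
print(d, associations)
```

In this example three transfers are enough: the first server goes to the first resource, the second server to the second resource, and the third server fills the capacity left on the first resource, giving three associations.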
System analysis and scaling-up of the bottleneck server
In this step we check whether the SLO constraints still hold for the system deployed according to the resource assignment vector t and the allocation matrix D found in the previous steps. In our implementation we use the LINE tool [18] to evaluate the mean response time and the response time percentiles. This evaluation also takes into account real resource parameters such as the number of processors, the load balancing, and the random environment model that describes the possibility for a spot resource to be lost and replaced with an on-demand one when its bid price is overbid.
If, after calculating the response times, the SLO constraints still hold, we can stop here and return the decision variables t and D calculated so far. These will be used to reconfigure the system and apply the resource rental and allocation decisions.
If the SLO constraints no longer hold, the real resource parameters of the proposed allocation had a negative effect on the performance. This can be corrected by identifying one bottleneck server \(m_*\) and increasing its rate by a scaling factor \(\alpha \), which is calculated proportionally to the amount of constraint violation. The bottleneck software server is identified as one of the servers that, when scaled up by \(\alpha \), have the best effect in reducing the SLO constraint violation. To calculate the SLO constraint violation we use the following method. Given a set of constraints indexed by i and rewritten in the form \(V \le 0\), where \(V=[v_i]\), we define the SLO constraint violation as the maximum value in V. A positive constraint violation means that at least one SLO constraint has been violated.
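As an illustration, one way to bring the constraints of the first subproblem into this form is to move each bound to the left-hand side. The block below is a sketch under that assumption; whether the entries are additionally normalized (for instance, divided by their bounds) is not specified here, and the symbol \(\textit{violation}\) is introduced only for this example.

$$\begin{aligned} v_{k}&= \textit{MRT}_{k}(\hat{\mu }) - \textit{maxMRT}_k, \forall k \\ v_{u,k}&= \textit{RTP}_{u,k}(\hat{\mu }) - \textit{maxRTP}_{u,k}, \forall u, \forall k \\ \textit{violation}&= \max _{i} v_i \end{aligned}$$

For example, with two job classes and only mean response time constraints, response times of 1.2 s and 0.7 s against bounds of 1.0 s each give \(V = [0.2, -0.3]\) and a violation of 0.2, so the first class violates its SLO.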
Finally, to determine the bottleneck software servers \(m_*\) we propose the findBottleneckM function, which is shown in Fig. 7. This function iterates over all the software servers, trying to scale each one up by \(\alpha \) and saving the software servers \(m_*\) that result in the largest reduction of the constraint violation. For each candidate, the algorithm recalculates the resource allocation that would be needed after scaling up the rate of that software server. Once the bottleneck software servers have been found, we scale their rates up by \(\alpha \) and go back to recalculating the real resources to rent.
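The search can be sketched in Python as follows. This is not the listing of Fig. 7. The sketch assumes that \(\alpha \) is a multiplicative factor greater than 1, and it hides the recalculation of the rented resources, the allocation, and the queueing network evaluation behind a single hypothetical callback named `violation`, which returns the SLO constraint violation for a given vector of rates. The toy callback in the example is a stand-in and not the LINE-based analysis.

```python
def find_bottleneck(mu, alpha, violation, tol=1e-12):
    """Try scaling each software server up by alpha and keep the servers
    that yield the lowest SLO constraint violation."""
    best, best_servers = float("inf"), []
    for m in range(len(mu)):
        trial = list(mu)
        trial[m] *= alpha              # scale up one server at a time
        v = violation(trial)           # one evaluation per server: O(M) in total
        if v < best - tol:
            best, best_servers = v, [m]
        elif abs(v - best) <= tol:
            best_servers.append(m)     # keep ties as joint bottlenecks
    return best_servers, best

# Toy violation: one job class visiting two M/M/1-like servers with loads
# 2 and 1, and a mean response time bound of 0.6 (all values are made up).
def toy_violation(mu):
    return 1 / (mu[0] - 2) + 1 / (mu[1] - 1) - 0.6

servers, best = find_bottleneck([4.0, 4.0], alpha=1.5, violation=toy_violation)
print(servers, best)
```

In the toy example the unscaled system violates the bound. Scaling the first server removes the violation, while scaling the second only reduces it, so the first server is reported as the bottleneck.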
Convergence of the approach
In this concluding section we give some final remarks on the convergence of each step of our approach.
The problem of finding the optimal rate (step 1) has guaranteed convergence, since it uses the bisection method to fix the rates of the M resources associated with the software servers. The maximum number of queueing network evaluations needed is \(O\big (M\times \textit{log}_2(\textit{max}(\hat{\mu }_{\textit{init}}))\big )\), where M is the number of software servers and \(\hat{\mu }_{\textit{init}}\) is the vector containing the arbitrarily large initial feasible rates that are given as input to the findOptimalRates function.
The problem of finding the real resources to rent (step 2) is NP-hard and is solved using an approximate ILP solver. The convergence and the complexity of this step therefore depend on the ILP solver used and on its parameters. In this step no queueing network evaluations are performed.
The problem of finding the allocation (step 3) has guaranteed convergence, since at each iteration some rate is transferred from the software server with the maximum unallocated rate to the rented resource with the maximum available rate. The maximum number of rate transfers occurs when each of the M software servers is allocated to each of the Y rented resources; therefore, the number of iterations of this step is \(O(M \times Y)\). Similarly to step 2, this step does not perform any queueing network evaluation during its iterations.
Finally, in the last step it is possible that the computed solution is not feasible (i.e., it violates the constraints). In this case we need to search for bottleneck servers and scale them up by a factor \(\alpha \). The algorithm that finds the bottlenecks tries to scale up all the software servers one by one, thus resulting in O(M) queueing network evaluations for each search. Each search guarantees that the speed of the bottleneck resources is increased, thus progressively reducing the violation of the constraints until an optimal solution is found. In some limit situations it is possible that an increase in the rate of a bottleneck resource does not reduce the violation of the constraints, which would prevent the convergence of our approach. These limit cases happen when the contribution to the response time added by the load balancing, the multiple processors, and the random environment is too large to be compensated by an increase in rate. Examples of these limit situations are cases with very low resource rates, or cases in which bid prices are continuously overbid and underbid. In our experiments based on real data we did not encounter any such limit cases, which leads us to believe that they are contrived examples.