1 Introduction

Current work on HPC cloud has focused on understanding the cost-benefits of cloud over on-premise clusters [1, 6–9, 13, 15–17, 20, 23]. However, there is still a gap between this understanding and helping users decide when to burst their jobs to the cloud. While applications may suffer from network overhead, the cloud becomes more attractive once we consider resource availability: users do not have to wait long periods in the job queues of cluster management systems.

This paper introduces an advisory service to support users in deciding how to distribute computing jobs between on-premise and cloud resources. Our main contributions are: (i) an advisory service for bursting jobs to the cloud, considering performance and cost differences between cloud and on-premise resources, as well as deadlines, the local job queue, and application characteristics (Sect. 2); (ii) a case study showing the advisory service applied to a seismic processing application from the oil & gas industry, in which we also measure the impact of unreliable execution time predictions on cloud bursting decisions (Sects. 3 and 4).

2 Advisory Service and Policies

The advisory service considers the user's deadline, incurred costs, the on-premise job queue length (local), the provisioning time (cloud), the price ratio between local and cloud resource allocations, the type of available hardware, and the estimated execution time in both environments under different configurations.

The main input parameters to the advisor are the application profiles and the cost models. The application profiler generates profiles that describe the behavior of a given application considering infrastructure, financial costs, performance, and number of required processors. Several approaches exist to produce application profiles [2, 3, 10, 18, 22, 24]. The cost model for a cloud infrastructure comes from price values offered by cloud providers, whereas the model for the on-premise cluster is a ratio based on the cloud costs [14].

The advisory service currently supports two policies: (i) maximum budget, for users who are more concerned about costs; (ii) maximum execution time, for users who must meet a deadline to deliver results and treat budget as a secondary concern. Both policies readjust the number of cores for the cloud environment to respect restrictions on the number of cores per machine.
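To make the policies concrete, below is a minimal sketch of how a deadline-aware decision could be computed; the budget-aware policy is symmetric (maximize what the budget buys, then minimize turnaround time). The function and parameter names (`cores_for_deadline`, `max_cores`) are our own illustration rather than the service's actual implementation, and the cost and scaling expressions anticipate Eqs. 1 and 2 of Sect. 3:

```python
import math

def cores_for_deadline(a, b, exec_time):
    """Invert Eq. 1 (t = a * P**b, defined in Sect. 3) to get the
    processor count needed to finish within exec_time; b < 0 for
    codes that scale."""
    return (exec_time / a) ** (1.0 / b)

def deadline_aware(profile, alpha, K, deadline, queue, setup, max_cores):
    """Pick the cheaper environment that still meets the deadline.
    profile maps 'local'/'cloud' to (a, b); queue is the local wait,
    setup the cloud provisioning time; K is the local/cloud price ratio."""
    costs = {}
    for env, wait in (("local", queue), ("cloud", setup)):
        a, b = profile[env]
        compute_time = deadline - wait           # time left for computing
        if compute_time <= 0:
            continue                             # waiting alone blows the deadline
        cores = min(max_cores[env],
                    math.ceil(cores_for_deadline(a, b, compute_time)))
        runtime = a * cores ** b                 # Eq. 1 with the adjusted cores
        if wait + runtime > deadline:
            continue
        rate = alpha * cores * (K if env == "local" else 1.0)   # Eq. 2
        costs[env] = rate * runtime              # cost = hourly rate x hours
    return min(costs, key=costs.get) if costs else None
```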

3 Application Case Study in Oil and Gas Industry

The case study of our advisory service relies on the application profile for Full Waveform Inversion (FWI) [21] and on cost models for the SoftLayer cloud provider and an on-premise cluster. FWI is a CPU-intensive application in the area of seismic analysis that shares components with other HPC applications, such as communication among multiple processes, linear system solvers, and matrix operations. We assume that the profile of such applications can be represented by a power-law function due to their inherent scale invariance: similar applications can be scaled to a finer or coarser grid resolution and will behave similarly, requiring only a retuning of the power-law coefficients [11, 12]:

$$\begin{aligned} t = a\, P^b \end{aligned}$$
(1)

where t is the execution time, P is the number of processors, and the coefficients a and b are determined empirically. Eq. 1 can also be inverted, giving \(P = (t/a)^{1/b}\), to solve for the number of processors when a time restriction t is the input parameter.

In order to develop a cost model, we collected prices charged by cloud providers; in our case, the SoftLayer cloud infrastructure. Similar findings could be obtained with other cloud providers, as they rely on similar prices and charging models (hourly based). Heterogeneity [5] will be explored in future work. We observed that the hourly rate for provisioning nodes has a linear relationship to the number of processors P, described by \(C_h = \alpha P + \beta \). For simplification purposes, we assume the offset coefficient \(\beta \) can be neglected, which gives the price per hour:

$$\begin{aligned} \frac{\varDelta C}{\varDelta t} = \alpha P \end{aligned}$$
(2)

where \(\frac{\varDelta C}{\varDelta t}\) is the hourly rate for node provisioning and \(\alpha \) is a linear coefficient determined empirically.

We can integrate Eq. 2 over time to quantify the total cost for a given turnaround time (the number of processors P does not change during a run), and we simplify the cost of on-premise HPC clusters by assuming it is proportional to the cloud cost [8, 14]: \(C_{cloud} = T \, (\alpha P)\) and \(C_{local} = T \, (K \, \alpha \, P)\), where T is the turnaround time and C is the total cost for a given number of processors P and turnaround time T. By coupling the application and cost models (Eqs. 1 and 2), we obtain a single equation that provides the total cost C for a given time T, a cost model coefficient \(\alpha \), and the application profile (coefficients a and b):

$$\begin{aligned} C = a \, \alpha \left[ \frac{\left( \frac{T}{a}\right) ^{(1+\frac{1}{b})}}{1+\frac{1}{b}} \right] . \end{aligned}$$
(3)
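Eq. 3 is straightforward to evaluate numerically. A minimal sketch, where \(\alpha \) is an arbitrary placeholder (the paper does not publish it) and the (a, b) pair is the cloud profile fitted in Sect. 4:

```python
def coupled_cost(T, a, b, alpha):
    """Eq. 3: total cost of a run with turnaround time T, obtained by
    integrating the hourly rate alpha * P(t), with P(t) = (t/a)**(1/b)."""
    return a * alpha * (T / a) ** (1 + 1 / b) / (1 + 1 / b)

# Placeholder alpha; (a, b) taken from the cloud profile in Sect. 4.
# The local environment uses the same formula with rate K * alpha and
# its own (a, b) coefficients.
print(coupled_cost(10.0, 7004.86, -2.06, 0.05))  # cloud cost for T = 10 h
```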

4 Evaluation

The goals of the evaluation are to understand: (i) the financial and time savings delivered by the advisor and (ii) how dependent the advisor is on the accuracy of the application profile. We compared the advisor against four policies:

  • Always-Local: submits jobs to the on-premise environment—it represents users who do not want or cannot move their jobs to the cloud. We used this policy as baseline for comparison because it still represents the most conservative and traditional behavior of HPC users;

  • Always-Cloud: submits jobs to the cloud—it represents users who do not have access to an on-premise cluster or are willing to test the cloud to avoid acquiring a new cluster in the future;

  • Random: randomly decides between cloud and local environments—it is an attempt to represent users who do not have any supporting mechanism or intuition to know where to run their jobs;

  • Worst-Case: chooses the opposite environment provided by the advisor—it represents hypothetical users who make extremely wrong decisions. This helps us understand how much a user can lose with such decisions.

Other policies [4, 19] could be studied; however, finding the optimal resource allocation policy is out of the scope of this paper.
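For reference, the four comparison policies reduce to a few lines; `advisor_choice` is the hypothetical output of the advisor for the same scenario:

```python
import random

def baseline_decision(policy, advisor_choice):
    """Reference policies the advisor is compared against."""
    if policy == "always-local":
        return "local"
    if policy == "always-cloud":
        return "cloud"
    if policy == "random":
        return random.choice(["local", "cloud"])
    if policy == "worst-case":                    # invert the advisor
        return "cloud" if advisor_choice == "local" else "local"
    raise ValueError(policy)
```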

The input data were: deadlines ranging from 1 to 100 h; budgets ranging from 10 to 100 USD; queue wait times ranging from 1 % to 50 % of the deadline; setup times ranging from 1 % to 50 % of the deadline; and the price ratio K between local and cloud environments, with the local environment costing from 70 % to 340 % of the cloud price; in total, 28,000 executions per policy. The ranges for the budget, deadline, and setup time are based on our experience with the FWI application. The price ratio is based on the HPC cloud literature [8]. The FWI profile was generated using a single input data set, which described the size of the domain and the precision of the output image, with the number of processors varying from 10 to 40.
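The sweep over these ranges can be expressed as a Cartesian product. The step sizes below are illustrative only, since the paper reports the ranges but not the granularity behind the 28,000 runs per policy:

```python
import itertools

deadlines   = range(1, 101)                      # hours
budgets     = range(10, 101, 10)                 # USD
queue_frac  = [0.01, 0.10, 0.25, 0.50]           # fraction of the deadline
setup_frac  = [0.01, 0.10, 0.25, 0.50]           # fraction of the deadline
price_ratio = [0.7, 1.0, 1.4, 1.8, 2.2, 2.6, 3.0, 3.4]   # K = local/cloud

for d, b, q, s, k in itertools.product(
        deadlines, budgets, queue_frac, setup_frac, price_ratio):
    scenario = dict(deadline=d, budget=b, queue=q * d, setup=s * d, ratio=k)
    # advisor.decide(scenario)  # hypothetical call, one per policy
```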

The advisor computes the costs and turnaround time for both environments for each set of input variables. For each result, the advisor calculates the relative difference between the costs of both environments in the following way:

$$\begin{aligned} \frac{min(Cost_{cloud},Cost_{local})-Cost_{local}}{Cost_{local}} \end{aligned}$$
(4)

where \(Cost_{local}>0\). When the budget-aware policy is executed, the advisor calculates the relative difference of the turnaround time in a similar manner. The results of the other decision policies used for comparison are also relative to the always-local decision policy.
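Eq. 4 transcribes directly to code; note the sign convention (negative values mean the cloud was cheaper than the always-local baseline, zero means local was the cheapest):

```python
def relative_cost_difference(cost_cloud, cost_local):
    """Eq. 4: relative difference against the always-local baseline."""
    assert cost_local > 0
    return (min(cost_cloud, cost_local) - cost_local) / cost_local
```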

We selected two environments on which to compare the target application. Cloud: 2.60 GHz processors, 4 cores per processor, 64 GB of memory per machine, Ethernet network, CentOS operating system. Cluster: 2.80 GHz processors, 10 cores per processor, 132 GB of memory per machine, Ethernet/InfiniBand network, RHEL operating system.

Application profiles describe the application's resource consumption. We derived the power-law scaling function (Eq. 1), in which the coefficients a and b were computed through non-linear least squares curve fitting:

\(t_{local} = 1013.50 \, P_{local}^{-1.58}\) and \(t_{cloud} = 7004.86 \, P_{cloud}^{-2.06}\).
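Such coefficients can be obtained with a standard non-linear least squares routine. A sketch using SciPy, with made-up timing data since the raw measurements are not published:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(P, a, b):
    return a * np.power(P, b)

# Hypothetical measurements for 10-40 processors; in practice these
# timings would come from profiling runs of the application.
procs = np.array([10, 15, 20, 25, 30, 35, 40], dtype=float)
times = 1013.50 * procs ** -1.58                 # stand-in for real runs

(a, b), _ = curve_fit(power_law, procs, times, p0=(1000.0, -1.5))
print(f"t_local ~= {a:.2f} * P^{b:.2f}")         # ~1013.50 * P^-1.58
```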

4.1 Results: Costs and Time Savings

Figure 1 shows the results for the deadline-aware policy. When the local environment is cheaper (\(K<1\)), and even when it is up to 80 % more expensive than the cloud (K = 1.8), it delivers the cheapest cost most of the time. When the price ratio is 1.8, the local environment provides the cheapest cost in around 71 % of the instances. The reason is that the local environment showed the best computing performance for executing the target application. Thus, even when the local environment is around 80 % more expensive than the cloud, its performance counterbalances its cost. When the price ratio reaches 2.2, the local environment is surpassed by the cloud in terms of cost, i.e., the cloud is the cheapest environment in around 56 % of the instances, and this percentage grows with the price ratio, as expected.

Figure 2 shows the results for the budget-aware policy. Similarly to the deadline-aware policy, always-cloud is comparable to worst-case and the advisor is comparable to always-local when the local price is equal to or below the cloud price. However, for higher price ratios, always-cloud surpasses always-local faster than under the deadline-aware policy. The turning point occurs when the price ratio is around 1.0: always-local is on average 30 % faster than always-cloud, but the median is \(-0.12\); that is, half the time always-cloud is at least 12 % faster than always-local. Although the local environment has better computing performance than the cloud, when the two have similar prices (i.e., price ratio around 1.0), the decision on where to run for a given budget is not obvious, because the other input parameters (queue length and setup time) affect the turnaround time.

Fig. 1. Boxplots of the deadline-aware policy: the lower the cost, the better the policy

Fig. 2. Boxplots of the budget-aware policy: the lower the time, the better the policy

When the advisor calculates an execution time to meet the budget, the coupled model yields a solution that is near-optimal for execution time. Searching for the exact optimal execution time is not worthwhile, since the adjustments required to map the solution onto the available infrastructure configurations override the intermediate values found in the optimal solution. For instance, an estimated number of processors \(NProcs=9.5\) must be adjusted to 10 or 9, according to the policy.
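A plausible reading of this adjustment, assuming (the paper does not spell it out) that the deadline-aware policy rounds up to keep the deadline safe, while the budget-aware policy rounds down to stay within budget:

```python
import math

def adjust_cores(n_est, policy):
    """Round a continuous core estimate to a feasible configuration."""
    return math.ceil(n_est) if policy == "deadline" else math.floor(n_est)

adjust_cores(9.5, "deadline")  # -> 10
adjust_cores(9.5, "budget")    # -> 9
```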

Results show the advisor selects the environment that best suits users' needs. The lack of such a supporting tool may cause unnecessary costs or wasted time. Besides, some input variables (e.g., queue wait time) can change frequently, making it impossible to manually calculate the best environment for execution.

4.2 Results: Accuracy of the Application Profile

We defined the range of inaccuracies from \(-90\) % (i.e., \(-0.9\)) to 100 % (i.e., 1.0) error, aiming to cover a wide spectrum of profiles. For each set of input data, the application profile specifies the infrastructure necessary to meet the deadline or budget constraint, depending on the policy. Say this infrastructure has 100 cores regardless of the policy; after injecting an error within the aforementioned range, it will have from 10 cores (i.e., \(100 \times (1.0-0.9)\)) to 200 cores (i.e., \(100 \times (1.0+1.0)\)).
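The injection itself amounts to scaling the predicted core count, as in this sketch (`advisor_decision` and `scenario` are hypothetical names):

```python
errors = [e / 10 for e in range(-9, 11)]         # -0.9 ... 1.0

def perturb_cores(cores, error):
    """Scale the profile-predicted core count by (1 + error)."""
    return max(1, round(cores * (1 + error)))

for err in errors:
    cores = perturb_cores(100, err)              # 10 ... 200 cores
    # decision = advisor_decision(cores, scenario)  # compare to err == 0
```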

For each estimate of the advisor, the input data varied in the same way as described in the previous evaluation (Sect. 4.1), which means that each policy was executed 28,000 times for each inaccurate profile. After the execution, the advisor provided either the same decision that was computed using the accurate profile or a different one. Even when the decision was the same, it was usually based on slightly different results. Thus, we measured both whether the decision was the same and the relative difference between the results obtained with the inaccurate and the accurate profile. When executing the deadline-aware policy, these relative differences were calculated as follows:

$$\begin{aligned} \frac{Cost_{Inaccurate}-Cost_{Accurate}}{Cost_{Accurate}} \end{aligned}$$
(5)

where \(Cost_{Inaccurate}\) and \(Cost_{Accurate}\) are the costs measured using the inaccurate and accurate profiles, respectively. The relative differences were calculated similarly when executing the budget-aware policy. Table 1 compares the results provided by the advisor using inaccurate and accurate profiles. The first column compares the decisions of the advisor using both profiles: \(=\) means the same decision and \(\ne \) a different one. For each policy, the table shows the average of the relative differences (avg), their standard deviation (std), and the total number of decisions (size) for each inaccurate profile. Thus, for each inaccurate profile, the sum of equal and different decisions is 28,000.

Table 1. Results from the advisor using inaccurate and accurate profiles

As the error approaches zero, the percentage of identical decisions increases. Even when the error is 0.9, the advisor computed the same decision around 93 % of the time for the deadline-aware policy and around 92 % of the time for the budget-aware policy. Only when the error is \(-0.9\) does the number of different decisions for the budget-aware policy exceed the number of identical decisions (53 % against 47 %, respectively). For this error, the advisor proposed the same decision for the deadline-aware policy around 62 % of the time.

For most of the inaccuracy range, the number of equal decisions is far greater than the number of different decisions. That is, even an inaccurate application profile has only a minor impact on the decision, which provides evidence that the advisor is resilient to profile inaccuracies. One reason is the wide spectrum of data exercised: in some situations, the difference between choosing cloud or local is so large that the inaccuracy has little or no impact on the job placement decision. These results also show that, as the inaccuracy approaches zero, the number of correct decisions increases, as expected, since an infrastructure computed from a nearly accurate profile deviates too little to change the final result.

5 Conclusions

The advisory service is composed of modules that can be extended or plugged in to obtain more refined application profiles and to suit other applications. Further investigation will be required to collect data from a wider range of applications. The main lessons from our study are: (i) in HPC cloud, apart from resource performance, it is important to weigh the time a user has to wait in the job queue of the on-premise environment against the provisioning overhead of cloud resources; what matters most is the total turnaround time, as also pointed out by Marathe et al. [14]; (ii) it is possible to build an advisory service for HPC hybrid clouds even without highly precise application/job profiles; however, very inaccurate profiles may have a negative impact on costs and turnaround times; (iii) the higher the cost difference between cloud and on-premise resources, the higher the savings brought by an advisory service for resource selection on hybrid clouds.