Introduction

Information technology (IT) has recently undergone a revolution. In this context, cloud computing has attracted great attention in both industry and the research community owing to its pervasiveness, elasticity, and economy of scale [1]. Cloud computing is also an attractive option for individuals and organizations that do not have a predictable resource usage pattern [2, 3]. For instance, when demand in the garment industry surges around Valentine's Day or Christmas, private cloud owners can use the public cloud to cover sporadic bursts of resource demand instead of procuring resources in advance. The cloud has a wide range of applications, from business to academic projects. One prominent academic application is scientific workflow scheduling. A workflow comprises a set of tasks with different sizes and characteristics, together with the data dependencies and control flow between sub-tasks [4]; it is a comprehensive and complex computational tool. Workflows such as LIGO [5, 6], SIPHT [5, 6], Epigenomics [5, 6], and CyberShake [5, 6], which are modeled as directed acyclic graphs (DAGs), are popular paradigms in both industry and science [7]. Suppose that a university with its own private datacenter intends to execute such scientific workflows and needs more storage and computing resources during execution. It can then engage the public cloud to form a hybrid cloud architecture. Note that the hybrid cloud is presented to users as a single entity. One of the most important issues in executing workflows on cloud infrastructure is to schedule tasks and allocate resources efficiently so that the finish time of the last task, the so-called makespan, is minimized [8]. In this regard, schedulers face several challenges, such as awareness of task inter-dependencies, heterogeneity of the underlying infrastructure, differences in VM speeds and pricing schemes, and data transfer time over network channels.

Several works have addressed workflow scheduling on cloud infrastructure from the viewpoints of total execution time reduction, energy efficiency, and reliability maximization, but less attention has been paid to monetary cost, which may conflict sharply with the user's budget. For instance, Lu et al. [9] proposed a load-aware heuristic strategy for dynamic workload and service scheduling in a cloud environment. The main objective of their algorithm was to improve execution performance, and it was validated by a series of experiments. A novel hybrid discrete particle swarm optimization (HDPSO) algorithm was proposed to reduce the maximum workflow execution time on heterogeneous cloud platforms [1]. To do so, the problem was formulated as a single-objective optimization problem. Although it achieved considerable improvement, it did not consider the monetary cost of VMs. An energy-aware workflow scheduling algorithm was presented in [10] with the aim of managing datacenter power while honoring users' service level agreements (SLAs). Utrera et al. [11] proposed an efficient algorithm that balances imbalanced parallelizable programs on spare nodes by maximizing resource utilization. Although it improves infrastructure resource utilization by packing tasks on the same processor, it does not consider the requirements of users, who are among the most important stakeholders in the system. A multi-objective workflow scheduling approach that optimizes execution time and energy efficiency was proposed by Durilo et al. [12]. They applied the HEFT method, a list scheduling algorithm with two phases: first, it constructs an ordered list of tasks that respects topological order and dependency constraints; second, it picks the task with the highest priority and maps it onto the processor that finishes its execution earliest. This approach does not appear suitable for users with tight budget constraints. Since workflows are modeled as DAGs and there are dependencies between tasks, data transfer between tasks increases execution time, network traffic, and monetary cost. The duplication technique may therefore decrease network bandwidth usage and improve the number of parallel paths and the degree of parallelism [13]. Specifically, it is a promising technique for communication-intensive DAGs, which have a high communication-to-computation ratio (CCR). Qi Tang et al. outlined task scheduling on a homogeneous platform using the duplication technique [14]. The outcome of their design was promising, but the duplication technique incurs higher monetary cost because the scheduler must rent several VMs instead of one. Moreover, their algorithm did not limit the number of allowable task duplications.

A review of published work on workflow scheduling on cloud platforms reveals a substantial gap in the literature: the user's monetary budget is rarely considered alongside makespan minimization. The main innovation of this paper is that it formulates the workflow scheduling problem on cloud platforms from both the makespan and the monetary cost viewpoints. This is a constrained bi-objective optimization problem, which is NP-hard. Since it is a discrete optimization problem, the simulated annealing (SA) algorithm, which is well suited to discrete search spaces, is utilized. However, canonical SA is a point-wise meta-heuristic and cannot explore the search space efficiently. For this reason, a new population-based version of SA is presented; it defines new operators and applies the crowding distance concept to obtain a bi-objective version of SA (BOSA). In addition, to reach concrete results, the proposed algorithm is combined with the HEFT approach. In this way, it can explore the search space efficiently and converge toward the Pareto set of the potentially conflicting objectives. The main contributions of the paper are as follows:

  1. To present a new duplication-based list scheduler

  2. To present a pricing model for VM deployment in a cloud computing environment

  3. To formalize the workflow scheduling problem on heterogeneous cloud platforms as a bi-objective optimization problem with makespan and monetary cost optimization viewpoints

  4. To present a bi-objective optimization based on simulated annealing task duplication scheduling algorithm (BOSA-TDA), along with new operators, to solve the stated discrete bi-objective optimization problem

The rest of the paper is organized as follows. “Related works” classifies task scheduling algorithms and reviews related work. “Problem background” presents the problem background. The proposed models are outlined in “System, application, scheduling, and pricing models”. “Problem statement” formulates the problem. The proposed BOSA-TDA algorithm is elaborated in “Proposed bi-objective optimization based on simulated annealing task duplication scheduling algorithm (BOSA-TDA)”. “Simulation and evaluation” is dedicated to validating this work through simulation and evaluation. Finally, “Conclusion and future work” concludes the article and outlines directions for future work.

Related works

Task scheduling is an important concept in all computational domains, especially under resource constraints, which is why it has a long history. Figure 1 shows the categories of task scheduling algorithms in the literature.

Fig. 1 Classification of task scheduling algorithms in the literature

A review of the literature reveals that traditional task scheduling research has focused on list scheduling algorithms, chief among which are heterogeneous earliest finish time (HEFT) [19] and critical path on a processor (CPOP) [19]. The basic idea of HEFT is to order a list of tasks by assigning a priority to each one. Tasks are selected according to their assigned priority: the ready task with the highest priority is removed from the task list and assigned to the available virtual machine that guarantees the earliest finish time (EFT). In the same category, CPOP maps each task on the critical path onto the fastest VM/processor of the heterogeneous parallel platform, whereas the other tasks are mapped onto VMs based on the EFT concept. Although these two list schedulers were promising in the early era of scheduling, several improvements have since been published to enhance scheduler performance. In this regard, extensions of list schedulers such as CCP, CEFT, RHEFT, DHEFT, and PEFT have been customized for cloud environments [54,55,56]. For instance, robust HEFT (RHEFT) and distributed HEFT (DHEFT) were developed to embed the user's quality of service (QoS) requirements in the model in addition to makespan [56]. A cost-effective fault-tolerance (CEFT) scheduling algorithm for real-time tasks in cloud environments was presented in [21]. This algorithm targets cloud environments with permanent or transient failures. The simulation results show that CEFT achieves a promising balance between low cost and deadline guarantees in real-time cloud systems. In addition, a novel list scheduling algorithm that combines machine learning techniques with HEFT (known as QL-HEFT) was presented in [22]. QL-HEFT uses the upward rank values from HEFT as rewards in a Q-learning process. Once the Q-learning process has produced an ordered list, QL-HEFT applies the earliest finish time procedure to schedule the highest-priority task in that list on the fastest VM that returns the optimal result. QL-HEFT was compared with three classic list scheduling approaches, namely upward ranking, downward ranking, and CPOP, which were presented in [19]. The simulation results showed the superiority of QL-HEFT over its counterparts in terms of average response time. Also, Arabnejad and Barbosa [54] presented a novel list scheduler, predict earliest finish time (PEFT), which introduces a look-ahead feature by computing an optimistic cost table. It retains the time complexity of existing approaches, but it does not consider the cost of rented VMs. Other well-known list schedulers include Highest Level First with Estimated Times (HLFET) [15], Modified Critical Path (MCP) [16], Dynamic Critical Path (DCP) [17], Dynamic Level Scheduling (DLS) [18], and Longest Dynamic Critical Path (LDCP) [20], which concentrate mostly on managing the critical path of the given DAG.

Recently, heuristic-based approaches have become popular alongside list schedulers. A heuristic-based algorithm normally finds a near-optimal solution in polynomial time. It searches for a path in the solution space at the expense of ignoring some possible trajectories [34]. Clustering and duplication are two prominent heuristic-based task scheduling techniques in parallel systems [13, 25, 28,29,30, 57,58,59,60,61,62]. Heuristic-based scheduling algorithms can be classified into cluster-based schedulers such as [28, 57, 58] and task duplication-based schedulers such as [13, 25, 29, 30, 59,60,61,62]. In the former, the scheduler reduces communication costs by grouping dependent tasks with heavy communication into a cluster and mapping the clustered tasks onto the same VM/processor, whereas in the latter, the scheduler increases the degree of parallelism by executing a key subtask on more than one processor. Lin et al. [13] and Mishra et al. [28] utilized clustering and duplication techniques in task scheduling problems. In [13], the authors derive new graphs from the input DAG using clustering, duplication, and replication methods. The main objective is to minimize makespan while keeping throughput and utilization at appropriate levels; the generated graph that optimizes the objective function and meets the problem constraints is then selected as the final solution. This work was validated on both real datasets and random task graphs, which showed its superiority over the compared algorithms. However, these heuristics are not appropriate for platforms with a limited number of parallel VMs/processors [1, 8]. In addition, the duplication method was applied to task scheduling on homogeneous platforms in [14]. The outcome was promising in terms of makespan reduction, but the algorithm did not limit the number of allowable task duplications, even though duplication imposes a higher monetary charge on users. Several heuristics were devised in [63] to schedule Bag-of-Tasks applications on hybrid clouds under due date constraints. That work aimed to optimize a total cost function comprising tardiness penalties and public cloud usage cost. Clustering and Scheduling System II (CASS II) was presented to improve scheduling performance. To do so, CASS II uses the tasks on the critical path to construct a cluster. It then assigns this cluster to the fastest available processor without applying any duplication technique [23]. Another duplication scheduling heuristic is discussed in an extended report published by Oregon State University [27].

Unlike list schedulers and heuristic-based approaches, which are deterministically biased, stochastic-search-based algorithms incorporate a combinatorial process in the search space to find optimal solutions. Stochastic-search-based algorithms, such as meta-heuristic or hybrid meta-heuristic approaches, typically require sufficient sampling of candidate solutions in the search space and have shown robust performance on a variety of scheduling problems. In this regard, genetic algorithms (GAs) [8, 31,32,33, 64, 65], particle swarm optimization (PSO) [40, 43], hybrid discrete PSO (HDPSO) [1], ant colony optimization (ACO) [47, 66], the artificial bee colony algorithm (ABC) [52, 67], simulated annealing (SA) [35, 36, 68], the cuckoo search algorithm (CS) [52, 93], the memetic discrete differential evolution algorithm [69], and tabu search (TS) [64, 70] have been successfully applied to different scheduling problems. Among them, GAs have been widely used to evolve solutions for many task scheduling problems [8, 31,32,33, 64, 65]. Table 1 presents the most cited related works in the literature in chronological order.

Table 1 Chronological comparison of emerging scheduling algorithms in the literature

For instance, a shuffled genetic-based algorithm was presented for task scheduling in [8]. In its initial population, two individuals are produced by the upward and downward ranking algorithms and a third by level ranking, all drawn from the HEFT approach; the rest of the population is created by shuffling these individuals to produce feasible chromosomes. A similar approach is followed by the hybrid discrete particle swarm optimization (HDPSO) algorithm, which produces its initial swarm based on two proven theorems [1]; it is then randomly combined with the Hill Climbing method to strike a good balance between exploration and exploitation of the search space. Both models formulate scheduling as a single-objective optimization problem aimed at reducing makespan. A multi-objective workflow scheduling approach oriented toward execution time and energy efficiency was proposed by Durilo et al. [12]. Although it improves makespan and power consumption at the same time, it is not suitable for users with tight budget constraints. Another bi-objective task scheduling approach, which maximizes reliability and minimizes energy, was proposed by Zhang et al. [77]. This bi-objective HEFT (BOHEFT) scheduler weights system reliability more heavily than performance metrics and maps tasks onto heterogeneous VMs until low energy consumption and high reliability are achieved simultaneously. This algorithm takes neither makespan nor the cost of the utilized VMs into consideration. Since task scheduling is a discrete optimization problem, the simulated annealing (SA) algorithm appears to be an efficient approach for reaching the global optimum in a discrete space [78]. Several versions of SA have been developed to solve different scheduling problems [72, 73, 79]. As SA is a point-wise optimization approach, it has two basic drawbacks. First, it cannot explore the search space as efficiently as population-based evolutionary algorithms such as GA, because it cannot generate multiple candidate solutions. Second, it is hard to customize point-wise SA for multi-objective optimization problems. In [37], a hybrid genetic and simulated annealing algorithm (GASA) was presented to solve the scheduling problem in a cloud environment. This work is based on a list scheduler, but to generate a set of promising lists it uses a GA with its strong crossover operator. To improve the solutions obtained, it uses SA operators. Also, in [38], another hybrid genetic and thermodynamic simulated annealing algorithm (GATSA) was proposed to solve workflow scheduling in a cloud environment from the makespan minimization viewpoint. GATSA uses thermodynamic laws to decrease the temperature gradually and by a variable amount in the cooling phase. To this end, it applies a variable cooling amount based on the fitness difference between each pair of consecutive solutions, which is neglected in canonical SA. Simulations conducted under different circumstances showed the dominance of GATSA over its counterparts in terms of scheduling evaluation metrics. Along the same line, a min-max ant colony optimization algorithm has been presented in the literature to solve job scheduling in grid computing systems [50].

An overall review of the literature on workflow scheduling reveals a clear lack of algorithms that optimize both makespan and monetary cost, which are equally important. In this context, an intrinsically discrete meta-heuristic such as SA can explore the discrete search space efficiently. To solve the discrete bi-objective workflow scheduling problem, this paper develops a hybrid population-based bi-objective optimization algorithm based on simulated annealing and task duplication scheduling techniques, designed to address the aforementioned shortcomings.

Problem background

This section briefly introduces a few concepts from multi-objective optimization theory and canonical SA that are needed to understand this work.

Multi-objective optimization and crowding distance concepts

A multi-objective optimization problem has several conflicting objectives that must be optimized simultaneously. Without loss of generality, Eq. (1) outlines a multi-objective optimization problem in minimization form [80].

$$\text{minimize } F(x)= {\left({F}_{1}\left(x\right), {F}_{2}\left(x\right),\dots, {F}_{k}\left(x\right)\right)}^{T}.$$
(1)

An element \({x}^{*}=\left({x}_{1},{x}_{2},\dots ,{x}_{N}\right)\in X\) is called an N-dimensional feasible solution or a feasible decision. A vector \({z}^{*}:=F\left({x}^{*}\right)\in {\mathbb{R}}^{k}\) for a feasible solution \({x}^{*}\) is called an objective vector or an outcome. In multi-objective optimization, there typically does not exist a feasible solution that minimizes all objective functions simultaneously. Therefore, attention is paid to non-dominated solutions (or Pareto optimal solutions), that is, solutions that cannot be improved in any objective without worsening at least one of the other objectives. In mathematical terms, a feasible solution \({X}_{1}=\left({x}_{11},{x}_{12},\dots ,{x}_{1N}\right)\in {X}^{N}\) is said to dominate another solution \({X}_{2}=\left({x}_{21},{x}_{22},\dots ,{x}_{2N}\right)\in {X}^{N}\) if the condition in Eq. (2) holds:

$$\forall i\in \left\{1,\dots ,k\right\}: {F}_{i}\left({X}_{1}\right)\le {F}_{i}\left({X}_{2}\right) \text{ and } \exists j\in \left\{1,\dots ,k\right\}: {F}_{j}\left({X}_{1}\right)< {F}_{j}\left({X}_{2}\right).$$
(2)
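The dominance relation of Eq. (2) can be illustrated with a minimal Python sketch. The function names `dominates` and `non_dominated` and the sample (makespan, cost) pairs are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
from typing import List, Sequence


def dominates(f1: Sequence[float], f2: Sequence[float]) -> bool:
    """Return True if objective vector f1 dominates f2 (minimization, Eq. (2))."""
    if len(f1) != len(f2):
        raise ValueError("objective vectors must have the same length")
    # f1 must be no worse in every objective ...
    no_worse = all(a <= b for a, b in zip(f1, f2))
    # ... and strictly better in at least one objective.
    strictly_better = any(a < b for a, b in zip(f1, f2))
    return no_worse and strictly_better


def non_dominated(points: Sequence[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the points that no other point dominates (a brute-force Pareto filter)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative (makespan, monetary cost) pairs, not data from the paper.
candidates = [(10.0, 5.0), (12.0, 5.0), (8.0, 7.0)]
front = non_dominated(candidates)  # (12.0, 5.0) is dominated by (10.0, 5.0)
```

The brute-force filter is quadratic in the number of points, which is adequate for small populations; the paper itself relies on non-dominated sorting as in NSGA-II.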

In the multi-objective domain, convergence and diversity are two key issues that differentiate multi-objective optimization algorithms in terms of performance. For instance, the well-known NSGA-II algorithm introduces two effective selection criteria, namely Pareto non-dominated sorting and crowding distance, to guide the search toward the optimal front [81]. Pareto non-dominated sorting divides the individuals into several ranked non-dominated fronts according to their dominance relations. The crowding distance estimates the density of individuals in a population; it is beneficial when the algorithm encounters memory size limitations. The multi-objective algorithm prefers two kinds of individuals: (1) individuals with a lower rank and (2) among individuals of the same rank, those with a larger crowding distance [81]. For the second criterion, the crowding distance defined in [82] is employed. Accordingly, our criterion is to prefer solutions with a lower rank and a higher crowding distance; a higher crowding distance means that the selected solutions are drawn from a broader area of the front. Finally, the non-dominated solutions are returned.
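As a rough illustration of the crowding distance idea, the sketch below follows the commonly used NSGA-II formulation: for each objective, the boundary solutions receive an infinite distance and every interior solution accumulates the normalized gap between its two neighbors. It is an assumption that the definition in [82] matches this formulation exactly; the function name `crowding_distance` and the sample front are hypothetical.

```python
import math
from typing import List, Sequence


def crowding_distance(front: Sequence[Sequence[float]]) -> List[float]:
    """Crowding distance of each solution in one non-dominated front."""
    n = len(front)
    if n == 0:
        return []
    distance = [0.0] * n
    num_objectives = len(front[0])
    for m in range(num_objectives):
        # Indices of the solutions sorted by the m-th objective.
        order = sorted(range(n), key=lambda i: front[i][m])
        f_min = front[order[0]][m]
        f_max = front[order[-1]][m]
        # Boundary solutions are always preferred.
        distance[order[0]] = math.inf
        distance[order[-1]] = math.inf
        if f_max == f_min:
            continue  # all values equal: this objective adds no information
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][m] - front[order[pos - 1]][m]
            distance[order[pos]] += gap / (f_max - f_min)
    return distance


# Illustrative front of (makespan, cost) pairs, not data from the paper.
sample_front = [(1.0, 10.0), (2.0, 6.0), (4.0, 5.0), (10.0, 1.0)]
distances = crowding_distance(sample_front)
```

In this sample, the third solution has a larger distance than the second because its neighbors are farther apart, so it would be preferred when both share the same rank.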

Simulated annealing (SA)

Simulated annealing is one of the most popular meta-heuristics inspired by the natural world. Its inspiration comes from the behavior of fluids subjected to controlled cooling, as in the production of large crystals. Simulated annealing for combinatorial optimization was introduced in [78] and, independently, in [83]. Unlike other meta-heuristic approaches, which follow an evolutionary trend, SA also examines worse solutions in addition to good ones, which allows it to escape from local optima. During the annealing process, a better solution is always accepted, whereas a worse solution is accepted with a certain probability [84, 85]. The algorithm can be tuned so that worse solutions are likely to be accepted in the early phase and the probability of accepting them approaches 0 toward the end. In other words, once SA has become familiar with the search space, it behaves like other evolutionary algorithms in the final epochs; namely, it accepts only better solutions. SA has one strength and one weakness. Its strength is its flexibility for discrete optimization problems such as task scheduling; its weakness is its point-wise nature. For this reason, a population-based version of SA is presented to obtain better results, and the final results support this idea.
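The acceptance behavior described above is usually realized with the Metropolis criterion, in which a worse solution is accepted with probability exp(-Δ/T). The sketch below assumes this standard criterion; the text does not state which acceptance rule the proposed algorithm uses, and the names `acceptance_probability` and `accept` are hypothetical.

```python
import math
import random


def acceptance_probability(delta: float, temperature: float) -> float:
    """Probability of accepting a move whose cost change is delta (minimization)."""
    if delta <= 0:
        return 1.0  # a better or equal solution is always accepted
    if temperature <= 0:
        return 0.0  # a frozen system accepts no worse solution
    return math.exp(-delta / temperature)


def accept(delta: float, temperature: float, rng: random.Random) -> bool:
    """Draw one acceptance decision using the given random generator."""
    return rng.random() < acceptance_probability(delta, temperature)


# The same worsening move is likely accepted when hot and rarely when cold.
p_hot = acceptance_probability(5.0, 100.0)
p_cold = acceptance_probability(5.0, 1.0)
```

Lowering the temperature over the epochs therefore moves the search from exploration toward accepting only improvements, which matches the behavior described for the final epochs.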

System, application, scheduling, and pricing models

Before the proposed algorithm is presented, several models are introduced to aid understanding, together with the mathematical optimization models used in this paper. The nomenclature used throughout the paper is tabulated in Table 2.

Table 2 Nomenclature of used notations

System model

This section introduces the system model under study, which executes DAG applications. The system contains a set of L heterogeneous VMs interconnected by high-speed networks; namely, VMset = {\({\text{VM}}_{1}^{\text{Pr}}\), \({\text{VM}}_{2}^{\text{Pr}},\dots ,{\text{VM}}_{k}^{\text{Pr}},{\text{VM}}_{k+1}^{\text{Pu}}\),…\({\text{VM}}_{k+m=L}^{\text{Pu}}\)}. In this model, k of the L VMs form a private cloud, whereas the remaining m VMs form a public cloud. There is no difference between the VMs of the private and public clouds, provided that the underlying network is high speed. For simplicity, the platform is taken to be a heterogeneous cloud with L integrated VMs that is presented to users as a single entity; the heterogeneity lies in processor architectures, speeds, and pricing schemes. In the pricing model, the private cloud is charged on demand, whereas the public cloud is charged on a charge-period basis (cf. Eqs. (17)–(21)). Moreover, the processing power of each VM, in MIPS, and its monetary cost, in $/hour, vary between VMs and are determined in advance. Such a system is depicted in Fig. 2.

Fig. 2 Three-layer system model to execute dependent tasks on heterogeneous virtual machines

In this model, the front-end layer is the user layer, in which users submit their requests. The scheduling layer contains the resource manager, job scheduler, DAG maker, and task scheduler, and it takes into account the user's QoS request and cost budget. The back-end comprises both private and public clouds, which together form a hybrid architecture. The main focus of this paper is task scheduling in the hybrid architecture. For simplicity, a uniform high-speed network is assumed, in which all VMs communicate with each other at the same bandwidth BW. Thus, for all \(i,j\in \left\{1,\dots ,L\right\}\) with \(i\ne j\), \(B\left({\text{VM}}_{i},{\text{VM}}_{j}\right)=\text{BW}\).

Application model

Each workflow application is modeled as a directed acyclic graph (DAG) \(G=\left(T, E\right)\). Each vertex \(v\in T=\left\{{t}_{1},{t}_{2},{\dots ,t}_{n}\right\}\) in the DAG represents a task. A DAG also has two special nodes, \({t}_{\text{entry}}\) and \({t}_{\text{exit}}\), which have no predecessor and no successor, respectively. Each edge in the set E, where \(\{\text{e}\left({t}_{i},{t}_{j}\right)\in E|{t}_{i}\ne {t}_{j}\}\), represents a dependency between tasks \({t}_{i}\text{ and }{t}_{j}\); that is, a precedence constraint indicating that task \({t}_{j}\) can start its execution only after task \({t}_{i}\) has completed. The set \(\left\{{t}_{j}\in T:\text{ e}\left({t}_{j},{t}_{i}\right)\in E\right\}\) of all immediate predecessors of \({t}_{i}\) is referred to as \(Pred\left({t}_{i}\right)\), and the set \(\left\{{t}_{j}\in T:\text{ e}\left({t}_{i},{t}_{j}\right)\in E\right\}\) of all immediate successors of \({t}_{i}\) is referred to as \(Succ\left({t}_{i}\right)\). A task without any predecessor is called an entry task, i.e., \(Pred\left({t}_{i}\right)={\varnothing }\), and a task without any successor is called an exit task, i.e., \(Succ\left({t}_{i}\right)={\varnothing }\). The computation size of task \({t}_{i}\) and the communication weight assigned to edge \(e\left({t}_{i},{t}_{j}\right)\) are represented by \(Size({t}_{i})\) and \(e\left({t}_{i},{t}_{j}\right)\), respectively. The execution time of task \({t}_{i}\) on \({\text{VM}}_{k}\) is calculated via Eq. (3), and the average execution time of task \({t}_{i}\) on this heterogeneous platform is obtained via Eq. (4), where ES (\({\text{VM}}_{{k}}\)) is the execution speed of \({\text{VM}}_{{k}}\) in MIPS and nP is the number of virtual processors in the system.

$$\text{ET}\left({t}_{i},{\text{VM}}_{k}\right)= \frac{Size\left({t}_{i}\right)}{\text{ES}\left({\text{VM}}_{k}\right)},$$
(3)
$$\overline{\text{ET }({t}_{i}) }=\frac{\sum_{\forall {\text{VM}}_{k} \in \text{VMset} }\text{ET}({t}_{i},{\text{VM}}_{k})}{nP}.$$
(4)
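Eqs. (3) and (4) can be checked with a short Python sketch. The task size is assumed here to be expressed in millions of instructions (MI) so that dividing by a speed in MIPS yields seconds; this unit, the function names, and the sample values are illustrative assumptions.

```python
from typing import Sequence


def execution_time(size_mi: float, speed_mips: float) -> float:
    """Eq. (3): execution time of one task on one VM."""
    if speed_mips <= 0:
        raise ValueError("VM speed must be positive")
    return size_mi / speed_mips


def average_execution_time(size_mi: float, speeds_mips: Sequence[float]) -> float:
    """Eq. (4): mean execution time of the task over all VMs in VMset."""
    if not speeds_mips:
        raise ValueError("at least one VM is required")
    total = sum(execution_time(size_mi, s) for s in speeds_mips)
    return total / len(speeds_mips)


# Three hypothetical VMs with speeds of 1000, 2000, and 4000 MIPS.
vm_speeds = [1000.0, 2000.0, 4000.0]
et_on_second_vm = execution_time(8000.0, vm_speeds[1])  # 4.0 seconds
avg_et = average_execution_time(8000.0, vm_speeds)      # (8 + 4 + 2) / 3 seconds
```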

Moreover, the data transfer time (TT) between two dependent tasks is calculated via Eq. (5); it depends on the amount of data transferred and on the common bandwidth [86].

$$\text{TT}\left({t}_{i},{t}_{j}\right)=\left\{\begin{array}{ll} 0 & \text{if } {t}_{i} \text{ and } {t}_{j} \text{ are scheduled on the same VM,}\\ \frac{e\left({t}_{i},{t}_{j}\right)}{\text{BW}} & \text{otherwise.} \end{array}\right.$$
(5)
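A minimal sketch of Eq. (5) follows. The function name `transfer_time`, the integer VM identifiers, and the units (megabytes and megabytes per second) are illustrative assumptions rather than details from the paper.

```python
def transfer_time(data_size: float, vm_of_ti: int, vm_of_tj: int,
                  bandwidth: float) -> float:
    """Eq. (5): data transfer time between two dependent tasks.

    data_size is the edge weight e(t_i, t_j); vm_of_ti and vm_of_tj identify
    the VMs on which the two tasks are scheduled; bandwidth is the common BW.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if vm_of_ti == vm_of_tj:
        return 0.0  # co-located tasks exchange data without using the network
    return data_size / bandwidth


same_vm = transfer_time(100.0, 1, 1, 50.0)       # 0.0
different_vms = transfer_time(100.0, 1, 2, 50.0)  # 2.0
```

The zero-cost branch is what makes clustering and duplication attractive: placing a producer (or a copy of it) on the consumer's VM removes the transfer term entirely.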

The proposed algorithm is evaluated using synthetic data from five real-world scientific workflow applications, namely Montage (generation of image mosaics of the sky), Epigenomics (mapping of the epigenetic state of human cells), SIPHT (a bioinformatics project that conducts a wide search for small untranslated RNAs in bacteria), CyberShake (generation of seismic hazard maps for earthquake detection), and LIGO (detection of gravitational waves in the universe) [5]. Figure 3a–e illustrates the DAG forms of these projects.

Fig. 3 Five real-world scientific workflow applications: a Montage [5], b Epigenomics [5], c SIPHT [5], d LIGO [5], and e CyberShake [5]

Scheduling model and duplication technique

The scheduling model determines which task is assigned to which type of VM with regard to the objective functions. The scheduling model applied in this paper is a list scheduler with two phases: in the first phase, it produces a list of tasks ordered by priority weight; in the second phase, it picks the highest-priority task and assigns it to the available VM that guarantees the earliest finish time, rather than the earliest start time. For the first phase, three ranking approaches of the HEFT algorithm, namely upward, downward, and level ranking, are applied, with the duplication method incorporated if necessary. For VM selection, two functions are used: Earliest Finish Time (EFT) and Earliest Start Time (EST). The first gives the earliest time at which a virtual machine \({\text{VM}}_{k}^{\text{pr}/\text{pu}}\) can finish the execution of subtask \({t}_{i}\), whereas the second gives the earliest time at which that execution can start. To order the tasks, two well-known functions, downward ranking Rankd(.) and upward ranking Ranku(.), are applied. The former traverses the DAG from the entry task to the exit task to assign each task a priority, whereas the latter traverses it from the exit task to the entry task. Both are recursive. The downward rank of a task \({t}_{i}\) is calculated recursively by Eq. (6), and its value is zero for the entry task, as Eq. (7) indicates.

$$Rankd\left({t}_{i}\right)= {\text{max}}_{\forall {t}_{j} \in pred({t}_{i})}\left(Rankd\left({t}_{j}\right)+\overline{\text{ET }({t}_{j}) }+\overline{e({t}_{j},{t}_{i})}\right),$$
(6)
$$Rankd\left({t}_{\text{entry}}\right)=0,$$
(7)

where \(pred({t}_{i})\) is the set of immediate predecessors of task \({t}_{i}\), \(\overline{e({t}_{j},{t}_{i})}\) is the average communication cost of edge \(e({t}_{j},{t}_{i})\), and \(\overline{\text{ET }({t}_{j})}\) is the average computation cost of task \({t}_{j}\). \(Rankd\left({t}_{i}\right)\) is the longest distance from the entry task to task \({t}_{i}\), excluding the computation cost of the task itself [19]. Similarly, to create a task sequence ordered by upward ranking, the upward rank of every task \({t}_{i}\) except the exit task is calculated recursively via Eq. (8) [19], whereas the upward rank of the exit task is set to its average computation cost, as Eq. (9) shows.

$$Ranku\left({t}_{i}\right)= \overline{\text{ET }({t}_{i}) }+{\text{max}}_{\forall {t}_{j} \in \text{Succ}({t}_{i})}\left(\overline{e({t}_{i},{t}_{j})}+ Ranku\left({t}_{j}\right)\right),$$
(8)
$$Ranku\left({t}_{\text{exit}}\right)=\overline{\text{ET }\left({t}_{\text{exit}}\right) }.$$
(9)
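The recursions in Eqs. (6)–(9) can be traced on a small example. The sketch below is only an illustration under stated assumptions: the four-task diamond DAG, its average execution times, its average communication costs, and the function name `compute_ranks` are all hypothetical.

```python
from functools import lru_cache
from typing import Dict, List, Tuple


def compute_ranks(avg_et: Dict[str, float],
                  avg_comm: Dict[Tuple[str, str], float]):
    """Return (upward ranks, downward ranks) for every task of a DAG.

    avg_et maps each task to its average execution time (Eq. (4));
    avg_comm maps each edge (t_i, t_j) to its average communication cost.
    """
    succ: Dict[str, List[str]] = {t: [] for t in avg_et}
    pred: Dict[str, List[str]] = {t: [] for t in avg_et}
    for (u, v) in avg_comm:
        succ[u].append(v)
        pred[v].append(u)

    @lru_cache(maxsize=None)
    def rank_u(t: str) -> float:
        if not succ[t]:                       # Eq. (9): exit task
            return avg_et[t]
        return avg_et[t] + max(avg_comm[(t, s)] + rank_u(s) for s in succ[t])  # Eq. (8)

    @lru_cache(maxsize=None)
    def rank_d(t: str) -> float:
        if not pred[t]:                       # Eq. (7): entry task
            return 0.0
        return max(rank_d(p) + avg_et[p] + avg_comm[(p, t)] for p in pred[t])  # Eq. (6)

    return {t: rank_u(t) for t in avg_et}, {t: rank_d(t) for t in avg_et}


# Hypothetical diamond-shaped DAG: t1 -> {t2, t3} -> t4.
avg_et = {"t1": 2.0, "t2": 3.0, "t3": 4.0, "t4": 1.0}
avg_comm = {("t1", "t2"): 1.0, ("t1", "t3"): 2.0,
            ("t2", "t4"): 3.0, ("t3", "t4"): 1.0}
upward, downward = compute_ranks(avg_et, avg_comm)
# Sorting by decreasing upward rank yields a valid topological order.
upward_order = sorted(avg_et, key=lambda t: upward[t], reverse=True)
```

For this example, the upward rank of the entry task equals the critical path length of the whole DAG, and each task's upward rank exceeds that of all its successors, which is why the decreasing order respects the precedence constraints.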

The term \(Succ({t}_{i})\) is the set of immediate successors of task \({t}_{i}\). In addition, \(Ranku\left({t}_{i}\right)\) is the length of the critical path from task \({t}_{i}\) to the exit task \({t}_{exit}\), including the computation cost of task \({t}_{i}\). The third heuristic is level ranking, which assigns a level number to each task. The entry task has level 0; for every other task, the level is calculated recursively by Eq. (10). The tasks are then sorted in increasing order of level.

$$\text{Level}\left({t}_{i}\right)= {\text{max}}_{\forall {t}_{j} \in pred({t}_{i})}\left(\text{Level}\left({t}_{j}\right)\right)+1.$$
(10)
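The three ranking heuristics of Eqs. (6) to (10) can be illustrated with a minimal Python sketch. The four-task DAG and its average costs below are hypothetical values chosen only for illustration, and sorting the upward ranks in decreasing order follows the usual HEFT convention, which is an assumption here.

```python
from functools import lru_cache

# Hypothetical 4-task DAG: (t_j, t_i) -> average communication cost of the edge
edges = {(1, 2): 4.0, (1, 3): 2.0, (2, 4): 3.0, (3, 4): 1.0}
avg_et = {1: 5.0, 2: 6.0, 3: 4.0, 4: 3.0}  # average computation cost per task

succ = {t: [v for (u, v) in edges if u == t] for t in avg_et}
pred = {t: [u for (u, v) in edges if v == t] for t in avg_et}

@lru_cache(maxsize=None)
def rank_d(t):
    # Eqs. (6) and (7): zero for the entry task, which has no predecessors
    return max((rank_d(p) + avg_et[p] + edges[(p, t)] for p in pred[t]), default=0.0)

@lru_cache(maxsize=None)
def rank_u(t):
    # Eqs. (8) and (9): the exit task keeps only its own average computation cost
    return avg_et[t] + max((edges[(t, s)] + rank_u(s) for s in succ[t]), default=0.0)

@lru_cache(maxsize=None)
def level(t):
    # Eq. (10): the entry task has level 0
    return max((level(p) + 1 for p in pred[t]), default=0)

upward_order = sorted(avg_et, key=rank_u, reverse=True)  # assumed HEFT convention
downward_order = sorted(avg_et, key=rank_d)
level_order = sorted(avg_et, key=level)
```

On this small graph all three orders respect the dependency constraints, which is the property the list scheduler relies on.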

In addition, in this paper, task scheduling performance is improved by the duplication technique, which replicates critical tasks on different VMs to enhance parallelism. Duplication may also shorten the time interval during which VMs are in service; therefore, it can potentially reduce monetary cost. Note, however, that duplication intrinsically means paying rent for several VMs instead of one, so there is a clear conflict between makespan reduction and monetary cost. For this reason, the issue is formulated as a bi-objective optimization problem that applies the dominance concept to find solutions that compromise between the objectives. To apply the duplication technique to an ordered list, a random number of copies is assigned to each task \({t}_{i}\), with minimum value \(min=1\) and maximum value \(max=\text{min}\left\{\text{count}\left\{\text{VMs}\right\},\text{ max}\left\{\text{count}\left(Pred\left\{{t}_{i}\right\}\right),\text{count}\left(Succ\left\{{t}_{i}\right\}\right)\right\}\right\}\); these bounds depend on the number of VMs and on the numbers of predecessors and successors of each task. This number is subsequently tuned by the proposed enhanced simulated annealing process. The duplication technique is applied after an ordered list has been prepared by any of the ranking heuristics. Thereafter, each duplicated task is treated the same as the original task in the list. For the VM selection phase, a new EST \(({t}_{i},{\text{VM}}_{k})\) is defined, which gives the earliest time at which task \({t}_{i}\), whether an original or a duplicated task, can start executing on \({\text{VM}}_{k}\).
If \({t}_{i}\) is an entry task in the DAG, or if it is the first task to be assigned to the task list of \({\text{VM}}_{k}\) (the case \({VM}_{k}.\text{List}={\varnothing }\)), then EST \(({t}_{i},{\text{VM}}_{k})\) equals the boot-up time of \({\text{VM}}_{k}\), denoted \(\text{BUT}\left({\text{VM}}_{k}\right)\); otherwise, EST \(({t}_{i},{\text{VM}}_{k})\) is calculated with Eq. (11). In this equation, \({t}_{j}^{\gamma }\) is the γ-th duplicate of a task \({t}_{j}\) that is a predecessor of \({t}_{i}\), whereas \({t}_{j}\) without a superscript is an original predecessor of \({t}_{i}\). Importantly, the fastest duplicated task and the slowest original task among the predecessors of \({t}_{i}\) should be taken into consideration. In addition, the function Avail (\({\text{VM}}_{k}\)) gives the time at which the last task on this VM finishes and the VM becomes available for a new task.

$$\begin{aligned} & \text{EST}\left({t}_{i},{\text{VM}}_{k}\right)\\ &\quad =\text{max}\left\{ \text{Avail}\vphantom{\left({t}_{j}^{\gamma }\right)} \left({\text{VM}}_{k}\right), \left\{\vphantom{\left({t}_{j}^{\gamma }\right)}{\text{max}}_{\forall {t}_{j} \in \text{pred}\left({t}_{i}\right)}\left\{\text{AFT}\left({t}_{j}\right)+\text{TT}\left({t}_{j},{t}_{i}\right)\right\},\right.\right.\\ &\quad\left.\left. {\text{min}}_{\forall {t}_{j}^{\gamma } \in \text{duplicat}\left({t}_{j}\right)}\left\{\text{AFT}\left({t}_{j}^{\gamma }\right)+\text{TT}\left({t}_{j}^{\gamma },{t}_{i}\right)\right\} \right\}\right\}.\end{aligned} $$
(11)

Before tasks can be scheduled on a new virtual machine \({\text{VM}}_{k}\), the machine must be booted; the function \(\text{BUT}\left({\text{VM}}_{k}\right)\) measures this boot-up time. This overhead is negligible for long-term scheduling, but it becomes a problem when a virtual machine is started unnecessarily. A scheduling algorithm may terminate a running virtual machine to save cost and later have to restart it to serve a waiting task. The overhead caused by launching a new virtual machine may not justify the cost. The time given by \(\text{BUT}\left({\text{VM}}_{k}\right)\) applies at the initial run of the virtual machine, and it can also affect the monetary cost [87].

The value of the Earliest Finish Time function \(\text{EFT}\left({t}_{i},{\text{VM}}_{k}\right)\) for each task \({t}_{i}\), whether duplicated or original, on the virtual machine \({\text{VM}}_{k}\) is calculated by adding \(\text{EST}\left({t}_{i},{\text{VM}}_{k}\right)\) and \(\text{ET}\left({t}_{i},{\text{VM}}_{k}\right)\), as Eq. (12) shows.

$$\text{EFT}\left({t}_{i},{\text{VM}}_{k}\right)= \text{EST}\left({t}_{i},{\text{VM}}_{k}\right)+\text{ET}\left({t}_{i},{\text{VM}}_{k}\right).$$
(12)

The function \(\text{AFT}({t}_{i})\) in Eq. (11) denotes the actual finish time of task \({t}_{i}\), which can be measured via Eq. (13) or Eq. (15).

$$\text{AFT}\left({t}_{i},{\text{VM}}_{\text{index}}\right)={\text{min }}_{\forall {\text{VM}}_{k} \in \text{ VMs}}\left\{\text{EFT}\left({t}_{i},{\text{VM}}_{k}\right)\right\}.$$
(13)

The term \({\text{VM}}_{\text{index}}\) denotes the virtual machine on which task \({t}_{i}\) is started; it is used to calculate \(\text{AST}({t}_{i},{\text{VM}}_{\text{index}})\) according to Eq. (14). The term AFT(.) can also be measured via Eq. (15).

$$\text{AST}\left({t}_{i}\right)=\text{EST}\left({t}_{i},{\text{VM}}_{\text{index}}\right),$$
(14)
$$\text{AFT}\left({t}_{i},{\text{VM}}_{\text{index}}\right)= AST\left({t}_{i}\right)+ ET\left({t}_{i},{\text{VM}}_{\text{index}}\right).$$
(15)
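The VM selection phase of Eqs. (11) to (16) can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: duplicated tasks are ignored, the transfer time is assumed to be zero when both tasks share a VM, and the boot-up time is modeled as the initial availability of each VM. All task, VM, and timing values are hypothetical.

```python
def est(task, vm, avail, preds, aft, placed_on, tt):
    # Eq. (11) without duplicates: wait for the VM and for all predecessor data
    ready = max(
        (aft[p] + (0.0 if placed_on[p] == vm else tt[(p, task)]) for p in preds[task]),
        default=0.0,
    )
    return max(avail[vm], ready)

def schedule(order, vms, et, preds, tt, boot_up):
    avail = dict(boot_up)  # assumption: a VM becomes available after its boot-up time
    ast, aft, placed_on = {}, {}, {}
    for t in order:
        # Eqs. (12) and (13): choose the VM with the minimum EFT = EST + ET
        best = min(vms, key=lambda vm: est(t, vm, avail, preds, aft, placed_on, tt) + et[(t, vm)])
        ast[t] = est(t, best, avail, preds, aft, placed_on, tt)  # Eq. (14)
        aft[t] = ast[t] + et[(t, best)]                          # Eq. (15)
        placed_on[t] = best
        avail[best] = aft[t]
    return ast, aft, placed_on, max(aft.values())  # last value: makespan of this schedule

# Hypothetical data: three tasks, two VMs
preds = {1: [], 2: [1], 3: [1]}
vms = ["vm1", "vm2"]
et = {(1, "vm1"): 2.0, (1, "vm2"): 3.0, (2, "vm1"): 3.0,
      (2, "vm2"): 4.0, (3, "vm1"): 2.0, (3, "vm2"): 2.0}
tt = {(1, 2): 1.0, (1, 3): 1.0}
boot_up = {"vm1": 0.0, "vm2": 0.0}

ast, aft, placed_on, makespan = schedule([1, 2, 3], vms, et, preds, tt, boot_up)
```

In this example task 3 is placed on the second VM even though it must wait for the data transfer, because that placement still finishes earlier than queuing behind task 2.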

In this paper, the first objective function is to minimize the makespan parameter which is determined by Eq. (16).

$$makespan=\text{min}\left\{ {\text{max}}_{\forall {t}_{i} \in T}\left(\text{AFT}\left({t}_{i}\right)\right)\right\}.$$
(16)

Pricing model

Cloud computing follows the pay-per-use pricing model, under which users are charged for a whole billing interval even if they use only a fraction of it. Thus, in the proposed model, each leased instance \({\text{VM}}_{k}\) is charged per hourly time interval [88,89,90,91,92,93]. Infrastructure as a service (IaaS) providers offer their clients virtual machine instances from the set of available \(VMs\). Each \(VM\) has a different configuration, such as memory size, CPU type, and cost per time unit; faster \(VMs\) are costlier than slower ones. The function \(\text{EUnit}\left({\text{VM}}_{k}\right)\) gives the monetary price to be paid per unit of execution time on \({\text{VM}}_{k}\). The function \(\text{EC}({t}_{i},{\text{VM}}_{k})\) gives the fee to be paid for executing \({t}_{i}\) on \({\text{VM}}_{k}\) according to Eq. (17), similar to the literature [88, 94, 95]. In Eq. (17), the term \(\text{ET}\left({t}_{i},{\text{VM}}_{k}\right)\), which also appears in Eq. (12), is the time interval needed to execute task \({t}_{i}\) on \({\text{VM}}_{k}\). The ceiling function rounds the execution time up because VM payment in a cloud environment is based on whole units of time. For instance, an a1.medium instance from Amazon EC2, with attributes a1.medium = (vCPU = 1, Mem = 2 GiB), has an on-demand hourly rate of $0.0255/h [89], so the \(\text{EUnit}({\text{VM}}_{k})\) value is $0.0255/h. If one deploys such an on-demand VM for 20 h and 35 min, one must pay for 21 h; consequently, the bill is 21 × $0.0255 = $0.5355.

$$\text{EC}\left({t}_{i},{{\text{VM}}}_{k}\right)=\lceil\text{ET}\left({t}_{i},{\text{VM}}_{k}\right)\rceil*\text{EUnit}\left({\text{VM}}_{k}\right).$$
(17)
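The billing rule of Eq. (17) can be checked with a few lines of Python. The function name `execution_cost` is hypothetical; the figures are those of the a1.medium example in the text.

```python
import math

def execution_cost(hours_used, price_per_hour):
    # Eq. (17): round the usage up to whole charge units, then multiply by EUnit
    return math.ceil(hours_used) * price_per_hour

billed_hours = math.ceil(20 + 35 / 60)       # 20 h 35 min is billed as 21 h
bill = execution_cost(20 + 35 / 60, 0.0255)  # about $0.5355, up to floating-point rounding
```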

The function \(\text{Tunit}\left({\text{VM}}_{k},{\text{VM}}_{l}\right)\) gives the monetary price to be paid per unit of data communicated between \({\text{VM}}_{k}\) and \({\text{VM}}_{l}\). The term \(\text{TC}\left({t}_{i},{t}_{j}\right)\) gives the total fee for the communication from task \({t}_{i}\) to task \({t}_{j}\), provided the two tasks are assigned to different virtual machines \({\text{VM}}_{k}\) and \({\text{VM}}_{l}\) with \(k\ne l\), according to Eq. (18).

$$\text{TC}\left({t}_{i},{t}_{j}\right)= \lceil\text{TT}\left({t}_{i},{t}_{j}\right)\rceil*\text{Tunit}\left({\text{VM}}_{k},{\text{VM}}_{l}\right).$$
(18)

Since this article considers a hybrid cloud architecture as the infrastructure for executing workflows, both the private and the public cloud must be taken into consideration. Note that the pricing procedures of the two clouds differ. The \(\text{private VM}\) scheduler uses an on-demand provisioning approach. On-demand is the standard model considered by most existing scheduling techniques [88, 96]. Metered services, also called pay-per-use models, are payment structures in which customers have access to potentially unlimited resources but pay only for what they use. VM instances can be launched and terminated at any time. Because of this, scheduling algorithms need to estimate an optimal number of instances to allocate before execution begins, unless the number is specified by users, and must provision additional instances if they are required during execution. Likewise, instances that no longer contribute to the workflow execution should be switched off to save cost. On the other hand, the \(\text{public VM}\) scheduler uses the charge period provisioning approach. Under this assumption, scheduling algorithms try to exploit all leftover resource capacity of the charged periods and decide whether to terminate machine instances at the end of their charge periods to minimize cost. With this fine-grained assumption, the scheduler must fit tasks within the charged periods to remain cost-effective, which complicates the scheduling process [96].

Therefore, the total monetary cost (ToC) comprises both the total execution cost and the total transfer cost, as given in Eq. (19). The second objective function is to minimize the ToC cost function.

$$\text{ToC}=\text{TEC}+\text{TTC}.$$
(19)

The total execution cost (TEC) variable represents the monetary cost to be paid for executing all tasks in a workload. The TEC value depends on whether a public virtual machine \(\begin{array}{c}{\text{VM}}_{k}^{\text{pu}}\in \{\text{public VMs}\}\end{array}\) or a private virtual machine \(\begin{array}{c}{\text{VM}}_{k}^{\text{pr}}\in \{\text{private VMs}\}\end{array}\) is used, because the \(\text{private VM}\) scheduler applies on-demand provisioning and the \(\text{public VM}\) scheduler applies charge period provisioning. Eq. (20) measures the TEC value in both cases. For \(\text{private VM}\) usage, the cost is the sum of the costs of the tasks executed on \(\left(\forall \begin{array}{c}{\text{VM}}_{k}^{\beta } | \beta \in \text{private VM}\end{array}\right)\); for \(\text{public VM}\) usage, the cost is calculated over the period between the start and the end of use of \((\forall \begin{array}{c}{\text{VM}}_{k}^{\beta } | \beta \in \text{ public VM}\end{array})\). The decision variable \({x}_{k}\) indicates whether \({\text{VM}}_{k}\) is utilized or not.

$$ \begin{aligned} &{\text{TEC}}=\sum_{{\text{VM}}_{k}\in{\text{ private VM}}}\\ & \quad \sum_{{t}_{i}\in {\text{VM}}_{k}.{\text{List}}}\lceil{\text{EC}}\left({t}_{i},{\text{VM}}_{k}\right)\rceil.{x}_{k}\\ &\quad +\sum_{{\text{VM}}_{k}\in{\text{ public VM}}}\lceil \left({\text{EFT}}({t}_{{\text{Last}}},{\text{VM}}_{k})\right.\\ & \quad \left.-{\text{EST}}\left({t}_{{\text{First}}},{\text{VM}}_{k}\right)\right)*{\text{EUnit}}({\text{VM}}_{k}).{x}_{k}\rceil.\end{aligned} $$
(20)

The terms \({t}_{\text{Last}}\) and \({t}_{\text{First}}\) denote the last and the first task executed on \({\text{VM}}_{k}\), so the difference in Eq. (20) spans the interval from the start of the first task to the end of the last one. Like TEC, the \(\text{TTC}\) variable in a hybrid cloud environment takes different values depending on which type of underlying virtual machine, public or private, is used. Only when communication stays inside, or starts from, public VMs \(({\text{VM}}_{l}^{\beta } | \beta \in \text{public VM})\) can the monetary transfer cost be ignored. Equation (21) shows how to calculate the TTC variable.

$${\text{TTC}}=\sum_{\begin{array}{c}{\text{VM}}_{k}\in \text{private VMs},{\text{ VM}}_{l}\in \text{public VMs}\\ {\text{VM}}_{k}\ne {\text{VM}}_{l }\end{array}}\sum_{\begin{array}{c}e\left({t}_{i},{t}_{j}\right)\\ {t}_{i}\in {\text{VM}}_{k}.\text{List}\\ {t}_{j}\in {\text{VM}}_{l}.\text{List}\end{array}}\text{TC}\left({t}_{i},{t}_{j}\right).$$
(21)

Note that the proposed pricing model is general: it can be used both by private cloud owners and by individuals without any infrastructure who are on a tight budget and intend to use the public cloud. For the sake of simplicity, only public cloud adoption is considered in the simulation. To this end, the part of the pricing model associated with the private cloud is omitted, which has no side effect on the proposed model.

An illustrative example

To demonstrate the effectiveness of the proposed model, the sample DAG depicted in Fig. 4 is considered, and the results of executing different approaches on it are reported. This sample graph consists of ten tasks, \({t}_{1}\) to \({t}_{10}\). The data transfer time between tasks is shown by the number above each arc. To execute the tasks of this workflow, a set \(VMs=\{{\text{VM}}_{1},{\text{VM}}_{2},{\text{VM}}_{3}\}\) of three heterogeneous VMs is considered. The values of the functions \(\overline{\text{ET }\left({t}_{i}\right) }, \text{AST}\left({t}_{i}^{\gamma }\right), \text{AFT}\left({t}_{i}^{\gamma }\right), \text{AUV}\left({t}_{i}^{\gamma }\right), Ranku({t}_{i})\), and \(Rankd({t}_{i})\) for each task \({t}_{i}\) are given in Table 3. Note that the lease time interval and boot-up time of the \(VMs\) are taken to be zero in this illustrative example. Table 3 also lists the execution time and monetary cost of each task \({t}_{i}\) on each \({\text{VM}}_{j}\).

Fig. 4 An example of a DAG application with 10 tasks [97]

Table 3 Available VMs, their execution times, and Monetary cost for the workflow depicted in Fig. 4

A candidate solution is encoded in two parts: the first part is the task list and the second part is the number of duplicates of each corresponding task. Figure 5 depicts a valid chromosome for the DAG drawn in Fig. 4. In this figure, tasks \({t}_{1}\) and \({t}_{4}\) are each duplicated twice. Note that the notation Cmax denotes the workflow’s maximum completion time, or makespan.

Fig. 5 Task encoding model chromosome

Figure 6 compares the novel BOSA-TDA algorithm with several algorithms proposed in the literature.

Fig. 6 Output of illustrative example with a NSGAII, b HEFT-TD, c Lookahead HEFT-TD, d SOGA, and e BOSA-TDA algorithm

As Fig. 6 demonstrates, BOSA-TDA (Fig. 6e) outperforms the others in terms of makespan. After BOSA-TDA, NSGAII (Fig. 6a), single-objective GA (SOGA) (Fig. 6d), HEFT-TD (Fig. 6b), and Lookahead HEFT-TD (Fig. 6c) follow in that order. Note that SOGA only minimizes the makespan and neglects cost reduction; the cost of the VMs it uses is calculated from its solution and reported. It usually achieves a good result for the first objective function. In this regard, Fig. 7 depicts the effectiveness of BOSA-TDA in comparison with the other approaches in terms of the evaluation metrics: monetary cost, SLR, speedup, and efficiency.

Fig. 7 Comparison of proposed BOSA-TDA against other approaches in terms of evaluation metrics: a makespan and monetary cost, b SLR, c speedup, and d efficiency for an illustrative example

Regarding Fig. 7, it can be concluded that BOSA-TDA outperforms other approaches in terms of evaluation metrics.

Problem statement

The problem addressed in this paper is to map the tasks of a given workload onto a set of VM instances in such a way that the total monetary cost (ToC) and the \(makespan\) of the workflow schedule are minimized simultaneously while some constraints are satisfied. The problem is formulated as a bi-objective optimization problem that minimizes service cost and service time; this formulation is given in Eqs. (22–26). Note that the first and second objectives have been elaborated in Eq. (16) and Eq. (19), respectively.

$$\text{minimize }F\left(\overrightarrow{\text{solution}}\right)=\left[{f}_{Makespan}\left(\overrightarrow{\text{solution}}\right), {f}_{\text{ToC}}\left(\overrightarrow{\text{solution}}\right)\right],$$
(22)
$$ {f}_{{Makespan}}\left(\overrightarrow{\text{solution}}\right) :\text{min}\left(\{ {\text{max}}_{\forall {t}_{i} \in T}\left(\text{AFT}\left({t}_{i}\right)\right)\} \right),$$
(23)
$${f}_{\text{ToC}}\left(\overrightarrow{\text{solution}}\right) : \text{min}\left(\text{TEC}\left(\overrightarrow{\text{solution}}\right)+\text{TTC}\left(\overrightarrow{\text{solution}}\right)\right),$$
(24)
$$ s.t: ToC \le \text{Budget},$$
(25)
$$makespan\le \text{Deadline}.$$
(26)

In this formal presentation, two constraints (25, 26) can be adjusted by users depending on their monetary budget and time sensitivity. Since the objectives in a bi-objective optimization problem usually conflict with each other, the Pareto dominance concept is commonly used to compare generated solutions [87].
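The Pareto dominance comparison for this minimization problem can be sketched in Python as follows. Each solution is represented here as a hypothetical `(makespan, ToC)` tuple, and the helper names, deadline, and budget values are illustrative assumptions.

```python
def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly better in one
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(solutions):
    # Keep the solutions that no other solution dominates (the first front)
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]

def feasible(solution, deadline, budget):
    # Constraints (25) and (26): ToC <= Budget and makespan <= Deadline
    makespan, toc = solution
    return makespan <= deadline and toc <= budget

candidates = [(10, 5.0), (12, 4.0), (11, 6.0), (14, 4.5)]  # hypothetical (makespan, ToC)
front = non_dominated(candidates)
within_limits = [s for s in front if feasible(s, deadline=11, budget=6.0)]
```

Here `(11, 6.0)` is dominated by `(10, 5.0)` and `(14, 4.5)` by `(12, 4.0)`, so only two candidates remain on the front, and neither of them is better than the other in both objectives.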

Proposed bi-objective optimization based on simulated annealing task duplication scheduling algorithm (BOSA-TDA)

The problem statement in the previous section reveals some important points. First, the problem is NP-hard with a discrete search space. In addition, it is a multi-objective optimization problem, which requires a thorough exploration of the search space to find many Pareto solutions. Canonical SA struggles to provide more than a handful of candidate solutions because it is a point-wise algorithm. Therefore, a new population-based BOSA-TDA algorithm is devised that benefits from the HEFT approaches, i.e., upward, downward, and level rankings, and from duplicated task lists in its initial population. Algorithm 1 illustrates the proposed BOSA-TDA.

figure a

The pseudo-code in Algorithm 1 is the main algorithm; it receives a DAG and the specifications of the underlying hybrid platform, and it returns a set of non-dominated solutions, PS. Since diverse solutions are preferable to dense ones in multi-objective problems, before the main algorithm returns the final non-dominated solutions obtained from the first front ranking, it calls the CrowdingDistance procedure to find diverse solutions, as in [81]. Similar to other population-based algorithms, the proposed algorithm starts with an initial population. Each individual in the population is a record of three fields, Chrom, Obj1, and Obj2, which hold the solution encoding and the first and second objectives, respectively. The first three individuals of the population are created according to the upward, downward, and level ranking approaches; for the rest of the population, the CreateRandomSolution algorithm, depicted in Algorithm 2, is called. After the individuals are prepared, the objective functions are calculated via Eq. (16) and Eq. (19), respectively. Then, the algorithm enters the main loop of Algorithm 1, lines (17–56), which repeats MaxIteration times; MaxIteration is set to 50 in this paper. In each iteration, SA is run for every individual of the population to explore the neighborhood of that candidate solution. To do so, a neighborhood operation is defined that randomly calls one of four algorithms, Algorithm 3 through Algorithm 6; in this way, the search space is explored efficiently. If the new solution dominates the current one, it replaces it in line #43; otherwise, the worse solution is accepted probabilistically in line #46.
The SA uses the temperature concept, temp, to control the algorithm. The temperature is set to a large value at the outset and gradually decreases until it reaches the freeze value. The exponential function \(\text{exp}(-\Delta newPop/temp)\) gives the probability of accepting a worse solution; when temp is near the freeze value, the chance of acceptance is near 0. The parameter \(\Delta newPop\) brings the normalized objective functions into the decision. The coefficient α indicates the relative importance of the objective functions; this article uses 0.5, giving both the same importance. After that, a small cooling stage takes place to reach a stable point. Since the ranges of the two objective functions differ, the individuals’ objective values are normalized in each round based on the minimum and maximum of each objective over the whole population.
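The acceptance rule can be sketched in Python as follows. The exact definition of \(\Delta newPop\) is not given in the text, so the weighted difference of normalized objectives used below is an assumption, as are all function names and the numeric values.

```python
import math

def normalize(value, lo, hi):
    # Min-max normalization over the population; a constant objective maps to 0
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def delta_new_pop(current, candidate, bounds, alpha=0.5):
    # Assumed form: alpha-weighted difference of normalized (makespan, cost);
    # a positive value means the candidate is worse than the current solution
    (m_lo, m_hi), (c_lo, c_hi) = bounds
    d_makespan = normalize(candidate[0], m_lo, m_hi) - normalize(current[0], m_lo, m_hi)
    d_cost = normalize(candidate[1], c_lo, c_hi) - normalize(current[1], c_lo, c_hi)
    return alpha * d_makespan + (1 - alpha) * d_cost

def acceptance_probability(delta, temp):
    # exp(-delta/temp) for a worse candidate; temp must be positive
    if delta <= 0:
        return 1.0
    return math.exp(-delta / temp)

bounds = ((10.0, 20.0), (1.0, 5.0))  # hypothetical population min/max per objective
d = delta_new_pop((12.0, 3.0), (14.0, 2.0), bounds)
p_hot = acceptance_probability(0.2, 100.0)  # early stage: high temperature
p_cold = acceptance_probability(0.2, 0.01)  # near the freeze value
```

In a full run, a candidate would be accepted when a uniform random draw falls below this probability, so the same worsening of 0.2 is almost always accepted at the start and almost never near the freeze value.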

Initial population

Similar to other meta-heuristics, the proposed BOSA-TDA starts with an initial population. As the main algorithm shows, the first three individuals of the population are produced by the upward, downward, and level ranking algorithms, which are HEFT-based approaches. For the remaining individuals, Algorithm 2 is called, which behaves randomly; this ensures a random traversal of the search space and populates it with a greater variety of solutions. It generates a new chromosome in two steps. The first step creates a random order of tasks for the first part of the chromosome, \(<Task List>\), and the second creates a random number of duplications in \(<Duplication List>\) for each task \({t}_{r}\). The initial value of \(<Task List>\) is \(\text{null}\). In this pseudo-code, the tasks that have no predecessor, namely those with \(Pred\left({t}_{i}\right)=\varnothing \) (the input nodes), are put in the set \(\{AvailSet\}\). The set \(\{AvailSet\}\) holds the tasks that are candidates for insertion into \(<Task List>\) at any time. In the CreateRandomSolution algorithm, the while loop in lines (4–14) continues as long as the set \(\{AvailSet\}\) has any member. The variable Visited is an array that holds the numbers of the tasks visited so far. In line 5, a random task \({t}_{r}\) is selected from the set \(\{AvailSet\}\). If all predecessors of \({t}_{r}\) have already been visited, \({t}_{r}\) is removed from \(\{AvailSet\}\), the members of \(Succ\left({t}_{r}\right)\) are added to \(\{AvailSet\}\), and \({t}_{r}\) is appended to \(<Task List>\); otherwise, \({t}_{r}\) is only removed from \(\{AvailSet\}\), because it will be added again in a forthcoming round.

After all tasks of set \(T\) have been added to \(<Task List>\), the second step, in lines (15–19), begins. In this step, each task is assigned a random number that represents its number of duplicates. The assigned value must be at least one and at most \(MaxD\), because a larger value only causes additional duplication and incurs redundant monetary cost without any performance improvement.
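The two steps of CreateRandomSolution can be sketched in Python as follows. This is an illustration written from the prose description rather than a transcription of Algorithm 2: the upper bound \(MaxD\) is assumed to be the per-task maximum defined for the duplication technique, clamped to at least one, and the duplication list is stored as a dictionary keyed by task, which is a representation choice made here.

```python
import random

def create_random_solution(tasks, pred, succ, n_vms, rng):
    visited = set()
    task_list = []
    avail = [t for t in tasks if not pred[t]]  # AvailSet starts with the entry tasks
    while avail:
        t = rng.choice(avail)
        avail.remove(t)
        if all(p in visited for p in pred[t]):
            visited.add(t)
            task_list.append(t)
            for s in succ[t]:
                if s not in avail:
                    avail.append(s)
        # Otherwise t is dropped for now; a later predecessor adds it again
    duplication = {}
    for t in task_list:
        # Assumed MaxD: min(#VMs, max(#pred, #succ)), never below one
        max_d = max(1, min(n_vms, max(len(pred[t]), len(succ[t]))))
        duplication[t] = rng.randint(1, max_d)
    return task_list, duplication

# Hypothetical 4-task DAG
pred = {1: [], 2: [1], 3: [1], 4: [2, 3]}
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}
task_list, duplication = create_random_solution([1, 2, 3, 4], pred, succ, 3, random.Random(0))
position = {t: k for k, t in enumerate(task_list)}
```

Because a task is appended only after all of its predecessors have been visited, every generated task list is a valid topological order of the DAG.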

figure b

Simulated annealing process

Since the main body of the algorithm calls an SA-like procedure, it needs operators that permute the discrete search space efficiently. For this purpose, four different neighborhood algorithms are applied to the current state; one of them is called at random in every round. Because a candidate solution's chromosome has two parts, a task list and a duplication list, the new operators target both parts. The four neighborhood algorithms are named CrossoverTask, CrossoverDuplication, MutationTask, and MutationDuplication.

CrossoverTask algorithm

The CrossoverTask procedure receives two chromosomes from the population in the form of \(<Task List>\) as input and crosses them over to generate a new chromosome. The first is Pop[i] and the second is a random Pop[rand]; the procedure returns the better child. Here, \(x1 \text{ and } x2\) are the inputs and \(Y1 \text{ and } Y2\) are the new children. First, the algorithm generates a random number R and acts as the single-point crossover of the genetic task scheduling algorithm in [8], i.e., it copies \(x1[1..R]\) to the corresponding elements of \(Y1\) and copies \(x2[1..R]\) to \(Y2\). Then, all tasks in \(x2\) that do not exist in \(Y1\) are appended to \(Y1\) in the same order, and all tasks in \(x1\) that do not exist in \(Y2\) are appended to \(Y2\) in the same order; this procedure preserves the dependency constraints. Finally, it returns the dominating child. Algorithm 3 presents this type of crossover, and Fig. 8 illustrates its functionality.

figure c
Fig. 8 Example of crossover task procedure
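The order-preserving single-point crossover described for CrossoverTask can be sketched in Python as follows. The crossover point `r` is passed in explicitly instead of being drawn at random, the final choice of the dominating child is omitted, and the five-task DAG is hypothetical.

```python
def crossover_task(x1, x2, r):
    # Copy the first r tasks of one parent, then append the missing tasks
    # in the order in which they appear in the other parent
    head1, head2 = x1[:r], x2[:r]
    y1 = head1 + [t for t in x2 if t not in head1]
    y2 = head2 + [t for t in x1 if t not in head2]
    return y1, y2

def is_topological(order, pred):
    # True if every task appears after all of its predecessors
    position = {t: k for k, t in enumerate(order)}
    return all(position[p] < position[t] for t in order for p in pred[t])

# Hypothetical DAG: 1 -> 2 -> 4 and 1 -> 3 -> 5
pred = {1: [], 2: [1], 3: [1], 4: [2], 5: [3]}
x1 = [1, 2, 3, 4, 5]
x2 = [1, 3, 2, 5, 4]
y1, y2 = crossover_task(x1, x2, 2)
```

Since both parents respect the dependencies and each child keeps the relative order of one parent within each of its two segments, the children are valid task lists as well.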

CrossoverDuplication algorithm

The CrossoverDuplication procedure receives two chromosomes from the population in the form of \(<Duplication List>\) as input and crosses them over to generate a new chromosome. The first is Pop[i] and the second is a random Pop[random]; the procedure returns the better child. First, it generates a random number R; then, it puts the values \(\left[chD1[1..R],chD2[R+1..nT]\right]\) in \(Y1\) and the values \(\left[chD2[1..R],chD1[R+1..nT]\right]\) in \(Y2\). Algorithm 4 depicts the CrossoverDuplication procedure, and Fig. 9 illustrates its functionality.

figure d
Fig. 9 Example of crossover duplication procedure
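A minimal Python sketch of this single-point exchange is given below. It assumes that position k of both duplication lists refers to the same task, takes the crossover point `r` as an argument, and leaves out the selection of the better child; the list values are hypothetical.

```python
def crossover_duplication(ch_d1, ch_d2, r):
    # Y1 = [chD1[1..R], chD2[R+1..nT]] and Y2 = [chD2[1..R], chD1[R+1..nT]]
    y1 = ch_d1[:r] + ch_d2[r:]
    y2 = ch_d2[:r] + ch_d1[r:]
    return y1, y2

ch_d1 = [1, 2, 1, 3]  # hypothetical duplication counts of parent 1
ch_d2 = [2, 1, 2, 1]  # hypothetical duplication counts of parent 2
y1, y2 = crossover_duplication(ch_d1, ch_d2, 2)
```

Under the stated alignment assumption, every value in a child comes from a parent at the same position, so the per-task minimum and maximum bounds remain satisfied.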

MutationTask procedure

This procedure receives a chromosome \(chT\) from \(<Task List>\) as input and mutates it to generate a new \(<Task List>\) chromosome. First, it generates a random index \(i\); then, it scans the elements after \(chT[i]\) until it reaches the end of the list or a member \({t}_{j}\) that is a successor of \(chT[i]\). Then, in line #10 of Algorithm 5, it selects a random index \(h\) within \(chT[i+1..j-1]\); if all predecessors of \(chT[h]\) are in \(chT[1..i-1]\) (call it \(HeadTasks\)), it swaps the values \(chT\left[h\right]\) and \(chT[i]\). Algorithm 5 depicts the MutationTask procedure, and Fig. 10 illustrates its functionality.

figure e
Fig. 10 Example of mutate task procedure
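The swap rule of MutationTask can be sketched in Python with zero-based indices. The sketch follows the prose description rather than the listing of Algorithm 5, and the DAG and helper names are hypothetical.

```python
import random

def mutation_task(ch_t, pred, succ, rng):
    child = list(ch_t)
    n = len(child)
    i = rng.randrange(n)
    # Scan forward to the first successor of child[i] or to the end of the list
    j = i + 1
    while j < n and child[j] not in succ[child[i]]:
        j += 1
    if j - i < 2:
        return child  # no candidate position strictly between i and j
    h = rng.randrange(i + 1, j)
    head_tasks = set(child[:i])
    # Swap only if every predecessor of child[h] already lies before position i
    if all(p in head_tasks for p in pred[child[h]]):
        child[i], child[h] = child[h], child[i]
    return child

def is_topological(order, pred):
    position = {t: k for k, t in enumerate(order)}
    return all(position[p] < position[t] for t in order for p in pred[t])

# Hypothetical DAG: 1 -> 2 -> 4 and 1 -> 3 -> 5
pred = {1: [], 2: [1], 3: [1], 4: [2], 5: [3]}
succ = {1: [2, 3], 2: [4], 3: [5], 4: [], 5: []}
parent = [1, 2, 3, 4, 5]
children = [mutation_task(parent, pred, succ, random.Random(seed)) for seed in range(50)]
```

The two conditions keep the mutated list valid: the moved-back task stays ahead of its first successor, and the moved-forward task has all of its predecessors in the head of the list.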

MutationDuplication procedure

As in all evolutionary algorithms, the mutation operation may help to avoid getting stuck in local optima; for this reason, mutation is customized for both the task and duplication lists. Given a chromosome, this function mutates its \(<Duplication List>\) part by randomly changing one or more members of the list. The new values do not violate the minimum and maximum conditions. Algorithm 6 depicts the MutationDuplication procedure, and Fig. 11 illustrates its functionality.

figure f
Fig. 11 Example of mutate duplication procedure
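A minimal Python sketch of the duplication mutation follows. The per-position mutation rate `rate` and the list `max_d` of per-task upper bounds are hypothetical parameters introduced for illustration; the text only states that one or more members are changed within the allowed range.

```python
import random

def mutation_duplication(ch_d, max_d, rng, rate=0.25):
    child = list(ch_d)
    # Pick each position with probability `rate`, but always mutate at least one
    positions = [k for k in range(len(child)) if rng.random() < rate]
    if not positions:
        positions = [rng.randrange(len(child))]
    for k in positions:
        child[k] = rng.randint(1, max_d[k])  # stays within [1, MaxD] for that task
    return child

ch_d = [1, 2, 1, 3]    # hypothetical duplication list
max_d = [2, 3, 2, 3]   # hypothetical per-task upper bounds
mutants = [mutation_duplication(ch_d, max_d, random.Random(seed)) for seed in range(30)]
```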

Simulation and evaluation

This section is dedicated to the simulation and evaluation of the novel BOSA-TDA. To reach robust results, a variety of scenarios and datasets are generated. The forthcoming subsections cover scheduling metrics, scenarios and datasets, experiments, and data analysis. Each scenario has been executed independently 20 times, each with MaxIteration = 50 in its main loop; the average of the execution results is reported.

Scheduling metrics

For better evaluation and comparison, several metrics that are widely used in the literature are applied. These metrics are listed in this section.

Communication to computation ratio (CCR)

Network delay is taken into account in this model whenever virtual machines need to communicate. Network delay has a drastic impact on execution time, so the scheduler must be aware of the nature of the workload. The edges of the DAG, which represent the dependencies, can be weighted to indicate the data transfer requirements. In this regard, the communication to computation ratio (CCR) of a graph indicates to what extent the workload is communication-intensive or computation-intensive and how long data transfer over the network will take relative to computation. The CCR parameter is calculated via Eq. (27) as the ratio of the average communication cost to the average computation cost. If a DAG's CCR value is very low, it can be considered a computation-intensive application [1, 19].

$$\text{CCR}=\frac{\frac{1}{|E|}\sum_{\forall e\left({t}_{i}.{t}_{j}\right) \in \left\{E\right\}}\overline{e\left({t}_{i}.{t}_{j}\right)}}{\frac{1}{nT}\sum_{\forall {t}_{i} \in \left\{T\right\}}{\overline{ET\left({t}_{i}\right)}}}$$
(27)
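Eq. (27) amounts to a ratio of two averages, as the following Python sketch shows. The edge and task costs are hypothetical.

```python
def ccr(avg_comm, avg_et):
    # Eq. (27): mean edge communication cost over mean task computation cost
    mean_comm = sum(avg_comm.values()) / len(avg_comm)
    mean_comp = sum(avg_et.values()) / len(avg_et)
    return mean_comm / mean_comp

avg_comm = {(1, 2): 4.0, (1, 3): 2.0, (2, 4): 3.0, (3, 4): 1.0}  # hypothetical edge costs
avg_et = {1: 5.0, 2: 6.0, 3: 4.0, 4: 3.0}                        # hypothetical task costs
value = ccr(avg_comm, avg_et)  # 2.5 / 4.5, i.e. a rather computation-intensive graph
```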

Schedule length ratio (SLR)

Since each graph has its own characteristics, the raw makespan obtained by an algorithm is not meaningful for comparison across graphs and needs to be normalized. A lower bound on execution time serves this purpose. To this end, the critical path (CP) concept is leveraged to define a new parameter, the schedule length ratio (SLR). The graph’s CP is the longest path from the entry node to the exit node, which cannot be parallelized owing to dependencies. Even if the scheduler executes the nodes on the CP on the fastest available processors/VMs, the makespan cannot be less than the CP’s length. The SLR, a normalized metric that is independent of the studied graph, is then measured via Eq. (28). This metric is defined relative to the critical path rather than the total execution time because the shortest execution time of a job on a highly parallel platform is determined by the length of its critical path. Note that the SLR value of a schedule cannot be less than one. The closer this value is to one, the better the scheduler performs.

$$\text{SLR}=\frac{Makespan}{\sum_{{t}_{i}\in \text{Critical Path}}\text{min}(\text{ET}\left({t}_{i},{\text{VM}}_{k}\right))\forall {\text{VM}}_{k}\in \{\text{VMs}\}}$$
(28)

Speed up

This metric indicates how many times faster the schedule runs in comparison with serial execution on a single processor or VM, preferably the fastest one [1]. It is obtained via Eq. (29).

$$\text{Speed up}=\frac{\text{serial execution on the fastest} VM}{Makespan}=\frac{{\text{min}}_{\forall {\text{VM}}_{k}\in \left\{VMs\right\}}\left({\sum }_{\forall {t}_{i}\in \left\{T\right\}}\text{ET}\left({t}_{i},{\text{VM}}_{k}\right)\right)}{Makespan}$$
(29)

Efficiency

Efficiency is the complementary metric, because speed up alone does not reveal how many processors were spent to gain that level of speed up [1]. It is calculated via Eq. (30).

$$\text{Efficiency}=\frac{\text{speed up}}{\text{number of used VMs}}*100\%.$$
(30)
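The three normalized metrics of Eqs. (28) to (30) can be computed as in the following Python sketch. The critical path is assumed to be known in advance, and the tasks, VMs, execution times, and makespan are hypothetical values.

```python
def slr(makespan, critical_path, et, vms):
    # Eq. (28): makespan over the CP length measured on each task's fastest VM
    lower_bound = sum(min(et[(t, vm)] for vm in vms) for t in critical_path)
    return makespan / lower_bound

def speed_up(makespan, tasks, et, vms):
    # Eq. (29): serial execution time on the fastest single VM over the makespan
    serial = min(sum(et[(t, vm)] for t in tasks) for vm in vms)
    return serial / makespan

def efficiency(speed_up_value, n_used_vms):
    # Eq. (30): speed up per used VM, as a percentage
    return speed_up_value / n_used_vms * 100.0

tasks = [1, 2, 3]
vms = ["vm1", "vm2"]
et = {(1, "vm1"): 2.0, (1, "vm2"): 3.0, (2, "vm1"): 3.0,
      (2, "vm2"): 4.0, (3, "vm1"): 2.0, (3, "vm2"): 2.0}
makespan = 5.0              # hypothetical schedule length using both VMs
critical_path = [1, 2]      # assumed to be given

slr_value = slr(makespan, critical_path, et, vms)
speed_up_value = speed_up(makespan, tasks, et, vms)
efficiency_value = efficiency(speed_up_value, 2)
```

In this example the schedule reaches the critical path lower bound, so its SLR equals one, while the speed up of 1.4 on two VMs corresponds to an efficiency of about 70%.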

Scenarios and datasets

To assess the performance of the proposed BOSA-TDA, synthetic data from five well-known real-world scientific workflow applications are used [47]: Montage (generation of image mosaics of the sky), Epigenomics (mapping of the epigenetic state of human cells), CyberShake (generating seismic hazard maps for earthquake detection), LIGO (detection of gravitational waves in the universe), and SIPHT (used in biology) [5]. The evaluation compares the proposed algorithm against other state-of-the-art algorithms in terms of prominent metrics of the scheduling domain, namely makespan, cost, SLR, speedup, and efficiency. Moreover, to demonstrate the effectiveness of BOSA-TDA over a wide workload spectrum, ranging from computation-intensive to communication-intensive graphs, new extensions of these graphs are produced. Since different VM configurations have different performance and pricing schemes, different correlations between the VM speed and VM price variables are utilized. The VMs' processing power is modeled after Amazon EC2 instance families such as A, T, and M [93]. In addition, hourly pricing is considered, ranging from $0.011/h to $0.27/h ($94/year to $2367/year) [92]. Since some clouds do not charge for inbound/outbound data transfer, and for the sake of simplicity, the data transfer cost and the related model constraints are omitted. In this dataset, the correlation coefficient between the ES and EC vectors is positive and close to 1, which indicates that as execution speed increases, the execution cost also increases. Also, the datasets are constructed so that the workloads are diverse in terms of CCR, which makes a robust evaluation possible. To do so, four CCR values, 0.1, 0.5, 1.0, and 5.0, are considered for computation-intensive, rather computation-intensive, moderate, and communication-intensive graphs, respectively.
Since extensive simulations on communication-intensive graphs with CCR equal to or greater than 5 show no meaningful discrepancy between the performance of the compared algorithms, the last condition is ignored. Note that, under this condition, all algorithms behave like a serial algorithm [1]. Overall, 15 scenarios are considered to examine the efficiency of the proposed work: five well-known workloads, each with three different CCR values.
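The scenario grid described above can be enumerated with a short sketch. The workload names and CCR values come from the text; the variable and function names are hypothetical, and the dictionary layout is only one possible way to represent a scenario.

```python
from itertools import product

# The five workflow applications named in the text.
WORKLOADS = ["Montage", "Epigenomics", "CyberShake", "LIGO", "SIPHT"]

# The three retained CCR values and their workload classes (CCR = 5.0 is dropped).
CCR_CLASSES = {
    0.1: "computation-intensive",
    0.5: "rather computation-intensive",
    1.0: "moderate",
}


def build_scenarios():
    """Return one record per (workload, CCR) pair."""
    return [
        {"workload": w, "ccr": ccr, "class": CCR_CLASSES[ccr]}
        for w, ccr in product(WORKLOADS, CCR_CLASSES)
    ]


scenarios = build_scenarios()
print(len(scenarios))  # 15
```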

Experiments and data analysis

Experiments

In this section, experimental results drawn from extensive simulations of the proposed BOSA-TDA in comparison with the state-of-the-art NSGAII [81], HEFT-TD [30], Lookahead [30], and SOGA [85] algorithms are reported. Note that the average result of 20 independent executions is reported for the makespan, total monetary cost, SLR, speedup, and efficiency metrics. Figures 12 through 16 are dedicated to these comparisons. Figs. 12 and 13 respectively depict the comparison of the algorithms in terms of makespan and monetary cost for the computation-intensive (CCR = 0.1), rather computation-intensive (CCR = 0.5), and moderate (CCR = 1.0) variants of the workloads under study. Figure 12 is condensed to show the overall behavior of the compared algorithms in terms of makespan across all scenarios.

Fig. 12 Comparison of all algorithms in terms of makespan in all scenarios for different workloads

Fig. 13 Comparison of all algorithms in terms of monetary cost in all scenarios for different workloads

Likewise, Fig. 13 is condensed to illustrate the overall behavior of the compared algorithms in terms of monetary cost across all scenarios.

For computation-intensive workloads (CCR = 0.1), Fig. 12 demonstrates that BOSA-TDA outperforms the other algorithms in terms of makespan, except in the LIGO and Montage workloads, where it behaves the same as NSGAII, and in the Epigenomics dataset, where it behaves the same as HEFT-TD and Lookahead; in the other cases, BOSA-TDA significantly dominates its counterparts. For monetary cost, Fig. 13 shows the dominance of BOSA-TDA over the other state-of-the-art algorithms in all cases.

In addition, for rather computation-intensive graphs (CCR = 0.5), Fig. 12 shows that BOSA-TDA beats the other algorithms in terms of makespan, except in the Montage and Epigenomics workloads, where it behaves the same as NSGAII. Also, BOSA-TDA fails to dominate SOGA in terms of makespan only in the Montage workload. In terms of monetary cost, as Fig. 13 shows, BOSA-TDA reduces cost relative to the others in the majority of workloads, except in the Montage and SIPHT workloads, where the Lookahead algorithm dominates BOSA-TDA. In the aforementioned workloads, only SOGA and BOSA-TDA have the same results.

In the moderate workloads (CCR = 1.0), Fig. 12 demonstrates that BOSA-TDA beats the other approaches in terms of makespan in all circumstances, except against SOGA in the CyberShake graph, where it has no dominance. Also, BOSA-TDA is superior to the other algorithms in terms of monetary cost in the LIGO and Montage workloads, but it fails to outperform the HEFT-TD, Lookahead, and NSGAII algorithms in the Epigenomics workload and the SOGA and Lookahead algorithms in the SIPHT workload. In addition, BOSA-TDA and SOGA have the same result in the CyberShake graph. In total, for CCR = 1.0, BOSA-TDA is superior in 24 cases out of 25 in terms of makespan and in 19 cases out of 25 in terms of cost reduction. Figure 14 is dedicated to the analysis of the compared algorithms in terms of SLR.

Fig. 14 SLR comparison of all algorithms in all scenarios

In terms of SLR, BOSA-TDA beats the other approaches in the majority of scenarios and is not dominated by them in the rest. In Fig. 14, for computation-intensive graphs, BOSA-TDA has the same result as its competitors in a few cases, whereas in most cases it dominates the other approaches. In rather computation-intensive graphs, BOSA-TDA has the same output as NSGAII and SOGA in two cases; in the other cases it beats the others. In the moderate graphs, BOSA-TDA beats the others in all cases.

Accordingly, Fig. 15 compares the algorithms in terms of speedup; it shows that BOSA-TDA is superior to the other approaches in all scenarios, except in a limited number of cases where BOSA-TDA and NSGAII have the same result.

Fig. 15 The speedup comparison of all algorithms in all scenarios

One of the most important metrics, which guards against misleading conclusions, is the efficiency metric. To this end, Fig. 16 compares the algorithms in terms of efficiency.

Fig. 16 The efficiency comparison of all algorithms in all scenarios

Evaluation of the different simulations points out that BOSA-TDA dominates the other algorithms in terms of efficiency, except in the moderate CyberShake graph (CCR = 1.0), where the three algorithms NSGAII, HEFT-TD, and Lookahead dominate BOSA-TDA. In other words, BOSA-TDA dominates the other approaches in 22 cases out of 25. As Fig. 16 depicts, in the CCR = 1.0 scenarios, NSGAII competes with BOSA-TDA in terms of the evaluation metrics only in some datasets, whereas in the other cases BOSA-TDA significantly outperforms the other state-of-the-art algorithms.

Data analysis

This section presents a data analysis for a better understanding of the proposed algorithm's performance in contrast with other approaches in terms of prominent metrics derived from the literature. In addition, the relative percentage deviation (RPD) metric is applied to quantify the enhancement gained by the proposed BOSA-TDA approach [1]. Tables 4, 5, and 6 are dedicated to this purpose. Table 4 presents a makespan comparison of the algorithms. Ranked from best to worst in terms of makespan, NSGAII, SOGA, HEFT-TD, and Lookahead follow BOSA-TDA, which is in first place. A negative cell value means deterioration, whereas a zero value means no improvement was gained.
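The sign convention just described (positive for improvement, zero for no improvement, negative for deterioration) can be sketched as follows. The exact RPD formula is not restated in this section, so the normalization by the competitor's value is an assumption, as are the function name, its parameters, and the sample numbers.

```python
def rpd_improvement(proposed: float, competitor: float, minimize: bool = True) -> float:
    """Percentage improvement of the proposed algorithm over a competitor.

    ASSUMPTION: the difference is normalized by the competitor's value; the
    source does not restate the exact RPD formula in this section.
    minimize=True  -> lower is better (makespan, monetary cost, SLR)
    minimize=False -> higher is better (speedup, efficiency)
    """
    if competitor == 0:
        raise ValueError("competitor value must be non-zero")
    diff = (competitor - proposed) if minimize else (proposed - competitor)
    return diff * 100.0 / competitor


# Hypothetical values for illustration only (not taken from Tables 4 to 8).
print(rpd_improvement(80.0, 100.0))                # 20.0  (improvement)
print(rpd_improvement(100.0, 100.0))               # 0.0   (no improvement)
print(rpd_improvement(120.0, 100.0))               # -20.0 (deterioration)
print(rpd_improvement(5.0, 4.0, minimize=False))   # 25.0  (higher speedup is better)
```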

Table 4 The comparison of BOSA-TDA versus state-of-the-art algorithms in terms of RPD of makespan
Table 5 The comparison of BOSA-TDA versus state-of-the-art algorithms in terms of RPD of monetary cost
Table 6 The comparison of BOSA-TDA versus state-of-the-art algorithms in terms of RPD of SLR

Table 5 presents the cost reduction comparison of the algorithms. For the cost reduction metric, the Lookahead, HEFT-TD, NSGAII, and SOGA algorithms are ranked from best to worst after BOSA-TDA, which is in first place.

Similarly, Table 6 presents the SLR comparison of the algorithms. For the SLR metric, the NSGAII, SOGA, HEFT-TD, and Lookahead algorithms are ranked from best to worst after the proposed BOSA-TDA.

Table 7 presents the speedup comparison of the algorithms. For the speedup metric, the NSGAII, SOGA, HEFT-TD, and Lookahead algorithms are again ranked from best to worst after BOSA-TDA.

Table 7 The comparison of BOSA-TDA versus state-of-the-art algorithms in terms of RPD of speedup

Table 8 presents the efficiency comparison of the algorithms. For the efficiency metric as well, the NSGAII, SOGA, HEFT-TD, and Lookahead algorithms are ranked from best to worst after BOSA-TDA.

Table 8 The comparison of BOSA-TDA versus state-of-the-art algorithms in terms of RPD of efficiency

Conclusion and future work

The majority of existing workload scheduling algorithms for the cloud environment aim only to minimize makespan and neglect the user's service bill. Since cloud providers offer different levels of processing power through different VM configurations, subscribers incur variable charges. For this reason, this article formulated workload scheduling in a hybrid cloud architecture as a bi-objective optimization problem. To deal with this combinatorial problem, a novel hybrid population-based simulated annealing algorithm that applies the task duplication technique, named BOSA-TDA, has been presented. To evaluate the performance of BOSA-TDA in terms of prominent metrics derived from the literature, extensive scenarios on a set of well-known workloads have been conducted. The simulation results showed the superiority of the proposed algorithm over other state-of-the-art approaches in terms of the metrics used in this domain. For future work, we intend to model cloud reliability for mission-critical workloads; the scheduling problem can then be formulated as a new bi-objective optimization problem from the perspectives of total execution time and reliability.