Introduction

Workflows are commonly used to describe data processing applications from diverse fields, such as the Internet of Things and bioinformatics [1,2,3]. These workflows often comprise large numbers of data-dependent tasks that are both computation and data intensive, so executing them calls for powerful high-performance infrastructures. With substantial advantages, such as economies of scale, on-demand supply of resources, high elasticity, and reliability, cloud computing is attracting more and more enterprises and individuals to deploy their big data processing workflows [4, 5].

Workflow scheduling in cloud computing is a key technology for reducing both execution cost and makespan, thereby gaining more profits for cloud providers and ensuring the quality of service for cloud consumers [6]. The workflow scheduling problem involves determining the mapping from tasks to resources and the task order on each resource, and is a classic NP-complete problem [7, 8]. Moreover, the execution cost and makespan of workflow scheduling are two conflicting optimization objectives [9]. Multi-objective evolutionary algorithms have therefore become a popular means of searching for a set of compromise solutions within an acceptable time [10,11,12]. To solve the workflow scheduling problem in cloud computing, some studies design new evolution and selection operators to improve classical multi-objective evolutionary algorithms.

Over the past decade or so, designing efficient evolution operators to reproduce new solutions for the multi-objective workflow scheduling problem has attracted considerable research interest [13]. First, popular list-based workflow scheduling methods were embedded into the multi-objective evolutionary optimization framework as evolution operators [14,15,16]. Second, bio-inspired techniques, such as artificial neural networks [17,18,19], ant colony optimization [20], the firefly algorithm [21], particle swarm optimization [22,23,24], and grey wolf optimization [25], were adapted as evolution operators to solve the multi-objective workflow scheduling problem. Third, integrating heuristic rules with bio-inspired optimization techniques to reproduce offspring populations has become a popular approach. For instance, Choudhary et al. [26] combined the gravitational search method and the list-based workflow scheduling method to solve the bi-objective workflow scheduling problem in cloud computing. Hosseini et al. [27] merged simulated annealing and a task duplication strategy to optimize the makespan and execution cost of workflows. Mohammadzadeh et al. [28] integrated the antlion and grasshopper optimization algorithms to balance throughput, makespan, cost, and energy consumption of executing workflows in cloud platforms. Zhang et al. [29] enhanced the list-based workflow scheduling method with a local search mechanism to balance the makespan and energy consumption of workflow execution.

At the same time, some studies focused on designing selection operators to balance multiple conflicting objectives of workflow execution in cloud computing. For example, Zhou et al. [30] merged a fuzzy-dominance-based environmental selection with a list-based workflow scheduling method to minimize the execution cost and makespan of workflows in cloud computing. Kumar et al. [31] integrated the entropy weight mechanism into a multi-criteria decision-making framework to balance makespan, execution cost, reliability, and energy consumption. Ye et al. [32] improved a knee-point-driven evolutionary method to balance makespan, reliability, execution cost, and the mean duration of all workflow tasks. Pham et al. [33] focused on the volatility of spot cloud resources and improved the multi-objective evolutionary algorithm to trade off makespan and execution cost for workflows in cloud computing.

In the evolutionary optimization community, a multi-objective optimization problem is generally considered large-scale if it has at least one hundred decision variables [34]. Multi-objective cloud workflow scheduling involves hundreds or even thousands of decision variables and is therefore a typical large-scale multi-objective optimization problem. However, existing studies evolve all decision variables as a whole and allocate evolution opportunities to each variable equally, which results in low search efficiency.

Recent results in the evolutionary computation community demonstrate that cooperative coevolution [34, 35] has become a crucial and effective way to solve large-scale multi-objective optimization problems. In cooperative coevolution approaches, the decision variables are classified into multiple groups, and the groups are evolved in a round-robin manner [36, 37]. These static classification techniques work well when a problem's decision variables are fully or partially separable. However, this is not the case for multi-objective cloud workflow scheduling, whose decision variables are nonseparable because of the data dependencies among tasks.

Besides, multi-objective cloud workflow scheduling exhibits an imbalance among decision variables regarding their contributions to the optimization objectives. For instance, delaying the completion of a workflow task on a critical path [38] often successively delays the completion of many tasks, including its successors and other tasks executed after the delayed tasks and their successors, whereas slightly delaying a task on a non-critical path may not cause this chain reaction. This imbalance means that different decision variables should be given different evolution opportunities, which motivates us to design a decision-variable-contribution-based adaptive mechanism to dynamically adjust the variable grouping and allocate evolution opportunities during the evolution process. Our main contributions in this paper are as follows.

  • We define the contribution of a decision variable as the fitness improvement of the solution generated by perturbing this decision variable. Then, we try to dynamically measure the contribution of each decision variable and classify them according to their contributions.

  • We design an adaptive mechanism to dynamically allocate more evolution opportunities for the variable groups with more contributions to generate offspring solutions efficiently.

  • In the context of fifteen real-world workflows and the Amazon Elastic Compute Cloud, we compare the proposal with four state-of-the-art multi-objective cloud workflow scheduling algorithms. The results demonstrate the competitive performance of the proposal in simultaneously optimizing execution cost and makespan.

The remainder of this paper is organized as follows. The second section formulates the multi-objective workflow scheduling problem. The third section presents the proposed VCAES, followed by experimental verification in the fourth section. The final section concludes this paper.

Problem formulation

This section first describes the models of workflows and cloud resources, and then formulates the model for multi-objective workflow scheduling in cloud computing.

Workflow model

Without loss of generality, a workflow application is described by a Directed Acyclic Graph (DAG) whose vertices and directed edges represent workflow tasks and data dependencies, respectively. Formally, we construct the directed acyclic graph for a workflow application as \(\varPsi = \{T, D\}\), where \(T=\{t_1,t_2,\ldots ,t_n\}\) denotes the vertex set corresponding to the task set, and \(D\subseteq T\times T\) denotes the edge set corresponding to the data dependencies among tasks. The existence of an edge \(d_{i,j}\in D\) means that \(t_j\)’s start demands \(t_i\)’s output results. Task \(t_i\) is referred to as a direct predecessor of task \(t_j\), and \(t_j\) as a direct successor of \(t_i\). For a task \(t_i\), all its direct predecessors form the set \(P(t_i)\), and all its direct successors form the set \(S(t_i)\).

Figure 1 gives a visual example of a directed acyclic graph for a workflow with seven tasks, i.e., \(T=\{t_1,t_2,\ldots ,t_7\}\). The edge \(d_{1,2}\) denotes the data dependency from \(t_1\) to \(t_2\), meaning that \(t_2\)’s start has to wait for \(t_1\)’s output results. In Fig. 1, task \(t_6\) has the direct predecessor set \(P(t_6)=\{t_3,t_4\}\) and the direct successor set \(S(t_6)=\{t_7\}\).

Fig. 1 An example workflow of seven tasks
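To make the predecessor/successor notation concrete, the sketch below builds the two sets from an edge list in Python. The edge list is only a hypothetical fragment consistent with the relations stated above (\(d_{1,2}\), \(P(t_6)=\{t_3,t_4\}\), \(S(t_6)=\{t_7\}\)); it is not the complete structure of Fig. 1.

```python
from collections import defaultdict

# Hypothetical edge fragment consistent with the relations mentioned in the text;
# the full edge set of Fig. 1 is not reproduced here.
edges = [(1, 2), (3, 6), (4, 6), (6, 7)]

# Build P(t_i) (direct predecessors) and S(t_i) (direct successors).
P = defaultdict(set)
S = defaultdict(set)
for i, j in edges:          # edge d_{i,j}: t_j needs the output of t_i
    P[j].add(i)
    S[i].add(j)

print(P[6])  # {3, 4} -> P(t_6) = {t_3, t_4}
print(S[6])  # {7}    -> S(t_6) = {t_7}
```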

Cloud resource model

This paper targets the popular cloud paradigm of Infrastructure as a Service (IaaS), in which cloud providers offer multiple types of cloud resources on demand [39, 40]. Different resource types differ mainly in their charging prices and performance configurations, such as the number of CPU cores, memory size, and network bandwidth. Assuming that cloud platforms offer m resource types, we model them as \(\varGamma =\{1,2,\ldots ,m\}\), where \(\tau \in \varGamma \) denotes the \(\tau \)-th resource type. For a type \(\tau \), we employ \(\textrm{pr}(\tau )\) and \(\textrm{con}(\tau )\) to represent its price and configuration. A cloud resource of type \(\tau \) is then modeled as \(r_k^\tau =\{k, \textrm{pr}(\tau ), \textrm{con}(\tau )\}\), where k denotes the index of resource \(r_k^\tau \).

Following well-known cloud providers (e.g., Amazon EC2 and Alibaba Cloud ECS), this study adopts the pay-as-you-use charging basis. Under this rule, any consumer can rent any number of resources on demand and is charged according to the actual usage time. In general, cloud resources are charged by the number of billing periods, and a partial period is rounded up to a full one. For example, if the period length is 60.0 min, a usage of 60.01 min is charged as two billing periods.
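As a minimal illustration of this rounding rule (using the example values above), the number of billing periods can be computed with a ceiling operation:

```python
import math

def billing_periods(usage_minutes: float, period_minutes: float = 60.0) -> int:
    """Partial periods are rounded up to a full billing period."""
    return math.ceil(usage_minutes / period_minutes)

print(billing_periods(60.01))  # 2 billing periods for 60.01 min
```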

Multi-objective scheduling of cloud workflows

Since cloud resources are available on demand, we build a resource pool based on the maximum resource requirements for running a workflow. Assuming the maximum parallelism of the workflow is p, the resource pool includes p resources of each type. We describe the resource pool as \(R=\left\{ r_1^1,r_2^1,\ldots ,r_p^1,r_{p+1}^2,r_{p+2}^2,\ldots ,r_{2\cdot p}^2,\ldots ,r_{m\cdot p}^m \right\} \).

The decision vector \({\textbf {x}}=\{x_1,x_2, \ldots ,x_n\}\) represents the mapping from workflow tasks to cloud resources, where the value of decision variable \(x_i\) is decoded as the index of the cloud resource mapped to the i-th task. The value range of each decision variable is thus an integer from 1 to \(m\cdot p\).
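A small sketch of this encoding under the pool layout R defined above: the resource index \(x_i\) determines both the concrete resource and its type, since resources 1 to p are of type 1, p+1 to 2p of type 2, and so on. The values of m, p, and the vector x below are illustrative assumptions.

```python
# Decoding a decision vector under the pool layout R defined above.
m, p = 3, 4                 # assumed: 3 resource types, maximum parallelism 4
x = [5, 1, 12, 5]           # x_i = index of the resource mapped to task t_i

def resource_type(k: int, p: int) -> int:
    """Resources 1..p are type 1, p+1..2p type 2, ..., i.e. type = ceil(k / p)."""
    return (k - 1) // p + 1

for i, k in enumerate(x, start=1):
    print(f"t_{i} -> resource r_{k} of type {resource_type(k, p)}")
# t_1 and t_4 share resource r_5 (type 2), so they are serialized on it.
```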

Given a decision vector, assume that task \(t_i\) is mapped to resource \(r_k^\tau \). The start time \(\textrm{s}t_{i,k}\) of task \(t_i\) is the later of the time at which all its input data have been received and the time at which the mapped resource becomes available.

On resource \(r_k^\tau \), we denote the set of tasks executed before task \(t_i\) as follows:

$$\begin{aligned} B_i = \left\{ t_p|I(t_p) < I(t_i)\right\} , \end{aligned}$$
(1)

where \(I(t_p)\) indicates \(t_p\)’s order number on resource \(r_k^\tau \).

Then, the start time \(\textrm{s}t_{i,k}\) of task \(t_i\) on cloud resource \(r_k^\tau \) can be described as follows:

$$\begin{aligned} \textrm{s}t_{i,k} = \max \left\{ \max _{t_b\in B_i}\textrm{f}t_{b,k}, \max _{t_p\in P(t_i)}\left\{ \textrm{f}t_{p,*}+\textrm{d}t_{p,i}\right\} \right\} , \end{aligned}$$
(2)

where \(\textrm{f}t_{b,k}\) indicates \(t_b\)’s finish time on resource \(r_k^\tau \), \(\textrm{f}t_{p,*}\) indicates the finish time of task \(t_p\), and \(\textrm{d}t_{p,i}\) indicates the data transfer time from \(t_p\) to \(t_i\).

Before scheduling, task \(t_i\)’s execution time \(\textrm{e}t_{i,k}\) on cloud resource \(r_k^\tau \) can be predicted by the computation length of task \(t_i\) and the CPU frequency of the mapped resource \(r_k^\tau \). The relationship among \(\textrm{s}t_{i,k}\), \(\textrm{e}t_{i,k}\), and \(\textrm{f}t_{i,k}\) is described as follows:

$$\begin{aligned} \textrm{f}t_{i,k} = \textrm{s}t_{i,k} + \textrm{e}t_{i,k}. \end{aligned}$$
(3)

Given a decision vector, the set of all tasks mapped to cloud resource \(r_k^\tau \) can be formulated as:

$$\begin{aligned} T_k = \{t_i|x_i = k, i\in \{1,2,\ldots ,n\}\}. \end{aligned}$$
(4)

With the task set \(T_k\), the start time \(\textrm{u}t_k\) and end time \(\textrm{n}t_k\) of renting resource \(r_k^\tau \) can be computed as follows:

$$\begin{aligned} {\begin{matrix} &{} \textrm{u}t_k = \min _{t_i\in T_k}\left\{ \textrm{s}t_{i,k} - \max _{t_p\in P(t_i)}\textrm{d}t_{p,i}\right\} , \\ &{} \textrm{n}t_k = \max _{t_i\in T_k}\left\{ \textrm{f}t_{i,k} + \max _{t_s\in S(t_i)}\textrm{d}t_{i,s}\right\} . \end{matrix}} \end{aligned}$$
(5)

Based on the above analysis, the first optimization objective, i.e., minimizing the execution cost, can be formulated as follows:

$$\begin{aligned} \text{ Min } f_1(\varvec{x}) = \sum _{k:\, T_k\ne \emptyset } \textrm{pr}(\tau )\cdot \left\lceil \frac{\textrm{n}t_k - \textrm{u}t_k}{C} \right\rceil , \end{aligned}$$
(6)

where the summation ranges over all resources \(r_k^\tau \) with at least one mapped task, and C indicates the length of a billing period for cloud resources.

The second optimization objective of this paper is to minimize the workflow’s makespan, which corresponds to the maximum finish time over all tasks in the workflow. It can be formulated as follows:

$$\begin{aligned} \text{ Min } f_2(\varvec{x}) = \max _{t_i\in T} \textrm{f}t_{i,*}. \end{aligned}$$
(7)
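To make the two objectives concrete, the following sketch evaluates the execution cost and makespan for a schedule whose per-task finish times and per-resource rental intervals \((\textrm{u}t_k, \textrm{n}t_k)\) are assumed to have been computed already via Eqs. (2)–(5); all concrete values are illustrative.

```python
import math

def execution_cost(rentals, price_of, C):
    """Execution cost: sum over rented resources of price x number of billing periods.
    rentals: {k: (ut_k, nt_k)} for resources with at least one mapped task;
    price_of: {k: pr(tau)} price of resource k's type; C: billing period length."""
    return sum(price_of[k] * math.ceil((nt - ut) / C)
               for k, (ut, nt) in rentals.items())

def makespan(finish_times):
    """Eq. (7): maximum finish time over all workflow tasks."""
    return max(finish_times.values())

# Toy data (assumed values, for illustration only).
rentals = {1: (0.0, 130.0), 5: (20.0, 80.0)}
price_of = {1: 0.05, 5: 0.10}
finish_times = {1: 40.0, 2: 90.0, 3: 130.0}
print(round(execution_cost(rentals, price_of, C=60.0), 4))  # 0.05*3 + 0.10*1 = 0.25
print(makespan(finish_times))                                # 130.0
```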

Thus, the model of the multi-objective workflow scheduling problem in cloud computing can be summarized as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} \text{ Min } &{} \varvec{f}(\varvec{x})= \left[ f_1(\varvec{x}), f_2(\varvec{x})\right] ,\\ \text{ S.t. } \\ &{} \varvec{x} \in \{1,2,\ldots ,m\cdot p\}^n. \\ \end{array} \right. \end{aligned}$$
(8)

Pareto-dominance has been widely employed to compare solutions in the multi-objective optimization field.

Pareto-dominance: Assume \({\textbf {x}}_1\) and \({\textbf {x}}_2\) are two feasible solutions. \({\textbf {x}}_1\) is said to dominate \({\textbf {x}}_2\) (denoted as \({\textbf {x}}_1\prec {\textbf {x}}_2\)) if and only if \({\textbf {x}}_1\) is not inferior to \({\textbf {x}}_2\) on both objectives (i.e., \(f_j({\textbf {x}}_1)\le f_j({\textbf {x}}_2), \forall j\in \{1,2\}\)) and \({\textbf {x}}_1\) is better than \({\textbf {x}}_2\) on at least one objective (i.e., \(f_j({\textbf {x}}_1)< f_j({\textbf {x}}_2), \exists j\in \{1,2\}\)).

Pareto-optimal solution: A solution \({\textbf {x}}^*\in \{1,2,\ldots ,m\cdot p\}^n\) is Pareto-optimal if there exists no feasible solution dominating it.

Pareto Set/Front: All the Pareto-optimal solutions constitute the Pareto Set (PS) in the decision space, and their objective vectors constitute the Pareto Front (PF) in the objective space.
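For the bi-objective model in Eq. (8), Pareto-dominance reduces to a component-wise comparison of the two objective values; a minimal sketch:

```python
def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (minimization):
    no worse in every objective and strictly better in at least one."""
    return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

print(dominates((0.7, 0.8), (1.3, 1.0)))  # True: better cost and makespan
print(dominates((0.7, 1.2), (1.3, 1.0)))  # False: the two solutions are incomparable
```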

Algorithm design

Given a workflow scheduling solution, the importance of each workflow task varies greatly. For instance, adjusting the mapping of a critical task to a resource often successively affects the execution of many tasks, including its successors and other tasks executed after these tasks and their successors, whereas adjusting the mapping of a non-critical task may have no impact on the execution cost and makespan of the workflow. Moreover, the importance of each workflow task varies from solution to solution. Consequently, the decision variables corresponding to different workflow tasks contribute unevenly to the optimization objectives. To deal with the large-scale decision variables in cloud workflow scheduling, the VCAES incorporates a novel cooperative coevolution (CC) mechanism that dynamically measures the contributions of decision variables and adaptively allocates evolution opportunities to each group of decision variables based on their contributions. The proposed VCAES follows the framework of traditional multi-objective evolutionary optimization, including initialization, a reproduction operator, and a selection operator, as shown in Algorithm 1.

Algorithm 1 The overall framework of VCAES

As illustrated in Algorithm 1, the inputs of the proposed VCAES are the multi-objective cloud workflow scheduling problem, the population size, the memory length for recording the variable contributions, and the number of decision variables in one group. Once the VCAES reaches the termination condition, it outputs the up-to-date population.

In the initialization stage, a population is generated randomly (Line 1). Next, an \(l\times n\) matrix \({\textbf {M}}\) is initialized to collect the contribution of each variable over the past l iterations (Line 2); the element in row j and column i represents the contribution of the i-th variable in the previous j-th iteration. Also, a set of uniformly distributed reference vectors is initialized to assist in calculating variable contributions (Line 3). In addition, the number of iterations for cooperative coevolution during each generation is initialized (Line 4). These iterations will be allocated to each group of variables in proportion to the overall contribution of the variables in the corresponding group.

After the initialization stage, the VCAES enters the main loop. During each generation, the proposed adaptive cooperative coevolution mechanism is triggered to distribute the decision variables into groups and allocate evolution opportunities to each group according to the variable contributions (Line 6). Next, the memory matrix of variable contributions is updated (Lines 7–14). It is worth noting that the decision variables in cloud workflow scheduling are related to each other; therefore, the VCAES also generates a new population by evolving all variables in each generation (Line 15). After that, the non-dominated sorting and elitist-preserving method of NSGA-II [41] is employed to select an offspring population P from the combined population \(P\bigcup Q\) (Line 16).
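The contribution memory can be pictured as a sliding window over the last l generations. The sketch below is one plausible reading of the memory update (prepend the newest per-variable contributions and drop the oldest row), not the authors' exact implementation; the dimensions are illustrative.

```python
import numpy as np

l, n = 5, 200                      # memory length, number of decision variables
M = np.zeros((l, n))               # M[j, i]: contribution of variable i, j generations ago

def update_memory(M, latest_contrib):
    """Prepend the newest contributions as row 0 and drop the oldest row."""
    return np.vstack([latest_contrib, M[:-1]])

M = update_memory(M, np.random.rand(n))   # contributions measured in the current generation
total_contrib = M.sum(axis=0)             # H(i): total contribution over the last l generations
```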

Before introducing the proposed adaptive cooperative co-evolution mechanism, we define and illustrate the variable contributions.

Suppose Q is a population generated by evolving the decision variables in group G(i) while fixing the other decision variables; the contribution of each variable in G(i) is defined as:

$$\begin{aligned} K(j) = \sum _{\varvec{q}\in Q} \textrm{FI}(\varvec{q}), \forall j \in G(i), \end{aligned}$$
(9)

where \(\textrm{FI}(\varvec{q})\) denotes the fitness improvement of solution \(\varvec{q}\).

For a set of reference vectors V, the one associated with solution \(\varvec{q}\) is defined as \(\varvec{v}^* = \arg \min _{\varvec{v}\in V}\langle \varvec{f}(\varvec{q}),\varvec{v} \rangle \), where \(\langle \varvec{f}(\varvec{q}),\varvec{v} \rangle \) represents the acute angle between the two vectors. Suppose \(\varvec{p}\) is the solution associated with the reference vector \(\varvec{v}^*\); the fitness improvement of solution \(\varvec{q}\) is then calculated as follows:

$$\begin{aligned} \textrm{FI}(\varvec{q}) = \max \{0, \textrm{Fit}(\varvec{p},\varvec{v}^*)-\textrm{Fit}(\varvec{q},\varvec{v}^*)\}, \end{aligned}$$
(10)

where \(\textrm{Fit}(\varvec{q},\varvec{v})\) denotes the fitness value of solution \(\varvec{q}\) with respect to the reference vector \(\varvec{v}\), which can be calculated as

$$\begin{aligned} \textrm{Fit}(\varvec{q},\varvec{v}) = \Vert \varvec{f}'\Vert \cdot \left( \cos \langle \varvec{f}', \varvec{v}\rangle + \sin \langle \varvec{f}', \varvec{v}\rangle \right) , \end{aligned}$$
(11)

where \(\varvec{f}' = \varvec{f}(\varvec{q}) - \varvec{z}\) denotes the translated objective vector, and \(\varvec{z}\) denotes the ideal point.

Fig. 2 An illustrative example of calculating the contribution

An intuitive example of calculating the contribution is given in Fig. 2. Suppose the previous solution \(\varvec{p}\) and the new solution \(\varvec{q}\) are associated with the reference vector \(\varvec{v}=(0.5,0.5)\), and their objective vectors are \(\varvec{f}(\varvec{p})=(1.3,1.0)\) and \(\varvec{f}(\varvec{q})=(0.7,0.8)\), respectively. The ideal point is assumed to be \({\textbf {z}} =(0,0)\). Based on the definition in (11), the fitness of these two solutions is \(\textrm{Fit}(\varvec{p},\varvec{v})=1.8385\) and \(\textrm{Fit}(\varvec{q},\varvec{v})=1.1314\). Then, the fitness improvement of solution \(\varvec{q}\) is \(\textrm{FI}(\varvec{q})=\max \{0, 1.8385-1.1314\}=0.7071\). If solution \(\varvec{q}\) is generated by evolving variables \(\{1,3,5\}\), then the contributions of these variables are \(K(1)=K(3)=K(5)=0.7071\).
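The sketch below reproduces this numerical example, computing Eq. (11) for both solutions and the fitness improvement of Eq. (10):

```python
import numpy as np

def fitness(f, v, z):
    """Eq. (11): ||f'|| * (cos<f', v> + sin<f', v>), with f' = f - z."""
    fp = np.asarray(f, dtype=float) - np.asarray(z, dtype=float)
    cos = fp @ v / (np.linalg.norm(fp) * np.linalg.norm(v))
    sin = np.sqrt(max(0.0, 1.0 - cos ** 2))
    return np.linalg.norm(fp) * (cos + sin)

v = np.array([0.5, 0.5])
z = (0.0, 0.0)
fit_p = fitness((1.3, 1.0), v, z)              # ~1.8385
fit_q = fitness((0.7, 0.8), v, z)              # ~1.1314
FI_q = max(0.0, fit_p - fit_q)                 # ~0.7071, Eq. (10)
print(round(fit_p, 4), round(fit_q, 4), round(FI_q, 4))
```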

Algorithm 2 gives the pseudo-code of the proposed adaptive cooperative coevolution mechanism, which dynamically places the variables with higher contributions into the same groups and assigns them more evolution opportunities to accelerate population convergence.

Algorithm 2 Function AdaptiveCoEvolution()

As illustrated in Algorithm 2, the inputs of the function AdaptiveCoEvolution() are the current population, the contribution of each variable during the previous l generations, the set of reference vectors, the population size, and the number of generations for cooperative coevolution search. Then, the outputs of this function are the updated population and the contribution of each variable.

First, function AdaptiveCoEvolution() calculates the total contribution of each variable over the previous l generations (Lines 1–4), where the notation H(i) represents the total contribution of the i-th variable. Then, it sorts the variables in non-ascending order of their total contributions (Line 5) and distributes them into a series of groups (Line 6). In this way, variables with similar contributions fall into the same group, so variables with larger contributions are grouped together and can be given more evolution opportunities.
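A minimal sketch of this grouping step, assuming the total contributions H(i) have already been computed from the memory matrix: variables are sorted by contribution in non-ascending order and cut into groups of a fixed size, so high-contribution variables end up together. The group size and contribution values are illustrative.

```python
import numpy as np

def group_variables(H, group_size):
    """Sort variable indices by total contribution (descending) and split into groups."""
    order = np.argsort(-np.asarray(H))             # non-ascending contributions
    return [order[i:i + group_size].tolist()
            for i in range(0, len(order), group_size)]

H = [0.9, 0.1, 0.7, 0.05, 0.3, 0.0]                # toy contributions of 6 variables
print(group_variables(H, group_size=2))            # [[0, 2], [4, 1], [3, 5]]
```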

After that, function AdaptiveCoEvolution() successively evolves each group of variables and measures their contributions. For a group of variables, the number of evolutionary iterations is calculated based on the sum of their contributions (Line 9). During each iteration, the function performs the reproduction operator on the corresponding variables to generate a new population (Line 11) and performs the selection operator to select the offspring population (Line 12). After evolving a group of variables, the function measures their contributions (Line 14) and updates the contributions of the corresponding variables (Lines 15–17), where the notation K(i) represents the contribution of the i-th variable.
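One plausible reading of the allocation in Line 9 is a proportional split of the per-generation iteration budget, as sketched below; the exact allocation rule and the rounding are assumptions.

```python
def allocate_iterations(groups, H, total_iterations):
    """Give each variable group a share of iterations proportional to its total contribution."""
    group_contrib = [sum(H[i] for i in g) for g in groups]
    total = sum(group_contrib) or 1.0              # avoid division by zero early on
    return [max(1, round(total_iterations * c / total)) for c in group_contrib]

groups = [[0, 2], [4, 1], [3, 5]]
H = [0.9, 0.1, 0.7, 0.05, 0.3, 0.0]
print(allocate_iterations(groups, H, total_iterations=20))  # e.g. [16, 4, 1]
```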

Regarding function AdaptiveCoEvolution(), it costs \(O(n\cdot l)\) to calculate the contribution of each decision variable (Lines 1–4, Algorithm 2). The time complexity of sorting the decision variables is \(O(n\log n)\) (Line 5, Algorithm 2). The function takes \(O(N\cdot n)\) to reproduce an offspring population (Line 11, Algorithm 2). According to the analysis in [41], the time complexity of the environmental selection is \(O(N^2)\) (Line 12, Algorithm 2). It therefore takes \(O(n\cdot (N\cdot n + N^2))=O(n\cdot N\cdot (n + N))=O(n^2\cdot N)\) to adaptively evolve each group of decision variables (Lines 8–18, Algorithm 2). Thus, the time complexity of function AdaptiveCoEvolution() is \(O \left( n\cdot l + n\log n + n^2\cdot N \right) = O(n^2\cdot N)\), since l is often less than n.

Regarding the algorithm VCAES, the time complexity of updating the memory matrix is \(O(n\cdot l)\) (Lines 7–11, Algorithm 1). The time complexities of reproducing an offspring population and selecting elitist solutions are \(O(N\cdot n)\) and \(O(N^2)\) (Lines 15-16, Algorithm 1), respectively. Thus, the time complexity of the VCAES during one generation is \(O \left( n^2\cdot N + n\cdot l + N\cdot n + N^2\right) = O(n^2\cdot N)\).

Numerical experiments

To investigate the performance of the proposed VCAES, we compare it with four representative multi-objective cloud workflow scheduling algorithms: MOELS [42], EMS-C [9], WOF [43], and LSMOF [44]. MOELS and EMS-C follow the framework of NSGA-II [41] and incorporate new reproduction operators that generate offspring populations by evolving all the variables. WOF and LSMOF are representative large-scale multi-objective evolutionary algorithms based on problem transformation.

Experimental setting

Eight types of workflows from different application domains, i.e., Inspiral (gravitational physics), CyberShake (earthquake science), Epigenomics (biology), Montage (astronomy), Sipht (bioinformatics), BLAST (bioinformatics), Cycles (agroecosystems), and Seismology (seismology), have been widely used in evaluating cloud workflow scheduling algorithms. We select multiple task sizes of each workflow type in the experiments. Besides, the DAG diagrams of some workflow instances with around 30 tasks are illustrated in Fig. 3. These workflow instances cover various complicated structures, including in-tree, out-tree, fork-join, pipeline, and mixtures thereof. For more details on these workflows, please refer to the Pegasus repository.

Fig. 3 DAG diagrams of workflows with about 30 tasks

Five types of resource instances from Amazon EC2, i.e., t3.nano, t3.micro, t3.small, t3.medium, and t3.large, are employed to simulate the cloud environments. We summarize the parameters of the five resource types in Table 1. Besides, we set the length of a billing period to 60 s and the bandwidth among resource instances to 5.0 Gbps.

Table 1 Parameters for five types of cloud resource

The hypervolume [45] metric measures the quality of a population in terms of both convergence and diversity and has been frequently used to evaluate the performance of multi-objective evolutionary algorithms. Assume \(\varvec{r} =(r_1,r_2)\) is a reference point. The hypervolume value of a population P, corresponding to the volume between the reference point and the objective vectors of the solutions in P, can be calculated as follows.

$$\begin{aligned} \begin{array}{cc} HV(P) = L \left( \bigcup _{p\in P}[f_1(p), r_1]\times [f_2(p), r_2]\right) , \end{array} \end{aligned}$$
(12)

where \(L(\cdot )\) represents the Lebesgue measure.
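For the bi-objective case, the hypervolume of Eq. (12) can be computed exactly by sorting the non-dominated objective vectors along one axis and summing rectangle areas; a minimal sketch (minimization, assuming every point is dominated by the reference point):

```python
def hypervolume_2d(points, ref):
    """2-D hypervolume (minimization): area dominated by `points` and bounded by `ref`.
    Assumes every point p satisfies p[0] <= ref[0] and p[1] <= ref[1]."""
    # Keep only non-dominated points, sorted by the first objective.
    front, best_f2 = [], float("inf")
    for f1, f2 in sorted(points):
        if f2 < best_f2:
            front.append((f1, f2))
            best_f2 = f2
    hv, prev_f1 = 0.0, ref[0]
    for f1, f2 in reversed(front):                 # sweep from large f1 to small f1
        hv += (prev_f1 - f1) * (ref[1] - f2)
        prev_f1 = f1
    return hv

print(hypervolume_2d([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)], ref=(4.0, 4.0)))  # 6.0
```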

Referring to the settings in MOELS and EMS-C, we set the population size of all five algorithms to 100. The stopping condition for all five algorithms is a maximum of 500,000 fitness evaluations.

Each of the five algorithms is independently repeated 31 times on each workflow instance to mitigate random effects. All the experiments are run on a PC with an Intel Core i7-6500U CPU @ 2.50 GHz (two cores), 8.00 GB RAM, and Windows 10.

Comparison results

Tables 2 and 3 report the average and standard deviation (in brackets) of the hypervolume values obtained by MOELS, EMS-C, WOF, LSMOF, and VCAES when scheduling the 27 workflow instances to cloud resources. For each workflow instance, the largest hypervolume value among the five algorithms is highlighted in bold. Besides, the Wilcoxon rank-sum test with \(\alpha =0.05\) is used to examine the significance of the differences between each comparison algorithm and the proposed VCAES in terms of the hypervolume metric. The marks −, \(+\), and \(\approx \) indicate that the comparison algorithm is significantly worse than, significantly better than, and statistically similar to VCAES, respectively.
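The significance marks can be reproduced with a standard rank-sum test; the sketch below uses scipy, with placeholder hypervolume samples standing in for the 31 runs per algorithm.

```python
from scipy.stats import ranksums

def compare(hv_baseline, hv_vcaes, alpha=0.05):
    """Return '-', '+', or '≈' for a baseline vs. VCAES on one workflow instance."""
    _, p = ranksums(hv_baseline, hv_vcaes)
    if p >= alpha:
        return "≈"                                  # no significant difference
    mean_b = sum(hv_baseline) / len(hv_baseline)
    mean_v = sum(hv_vcaes) / len(hv_vcaes)
    return "+" if mean_b > mean_v else "-"

# 31 independent runs per algorithm (placeholder values).
print(compare([0.71, 0.69, 0.70] * 10 + [0.70], [0.75, 0.74, 0.76] * 10 + [0.75]))  # '-'
```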

Table 2 Comparison results for the five algorithms on 15 workflows in terms of the hypervolume metric
Table 3 Comparison results for the five algorithms on 15 workflows in terms of the hypervolume metric

The comparison results in Tables 2 and 3 show that the proposed VCAES performs significantly better than the four baseline algorithms. Specifically, the proposal generates larger hypervolume values than MOELS, EMS-C, WOF, and LSMOF on 20, 22, 26, and 25 out of the 27 workflow instances, respectively.

An interesting phenomenon is that the more decision variables a problem has, the more pronounced the advantages of the proposed VCAES. For example, on the workflow instances with about 30 tasks, the VCAES obtains larger hypervolume values than LSMOF on only 3 out of 5 instances, whereas on the workflow instances with 100 and 1000 tasks, the proposed algorithm significantly outperforms all four comparison algorithms. Similar to the baselines MOELS and EMS-C, the proposed VCAES follows the framework of NSGA-II; different from them, the VCAES integrates a new adaptive strategy based on variable contributions. Although the two problem-transformation-based large-scale multi-objective evolutionary algorithms, WOF and LSMOF, exhibit competitive performance on continuous problems, their performance on multi-objective workflow scheduling is far inferior to that of the proposed VCAES. These comparison results demonstrate that the proposed adaptive mechanism can effectively improve the performance of evolutionary algorithms in solving multi-objective cloud workflow scheduling problems with large-scale decision variables.

Fig. 4 Population distributions of the five algorithms on solving different workflow scheduling problems

To intuitively compare the convergence and diversity of the five multi-objective workflow scheduling algorithms, Fig. 4 illustrates the distributions of their output populations on the Inspiral, Montage, Sipht, and Cycles instances with large-scale tasks.

As illustrated in Fig. 4a, on Inspiral with 100 tasks, the distribution of VCAES’s output population is better than those of three baseline algorithms, i.e., MOELS, EMS-C, and WOF; more specifically, the output solutions of VCAES dominate those of the three algorithms. Although the diversity of the solutions obtained by LSMOF is superior to that of the VCAES when the cost is less than 10, it is far inferior to the VCAES over the rest of the range. In sum, the proposed VCAES is superior to each baseline algorithm in terms of convergence and diversity. On Montage with 100 tasks, the proposed VCAES has similar advantages over the comparison algorithms, as shown in Fig. 4b.

As can be observed from Fig. 4c, on Sipht with 100 tasks, the solutions obtained by the VCAES dominate most solutions obtained by the comparison algorithms, which means that the VCAES outperforms all four baselines in both convergence and diversity. On Cycles with 657 tasks, the proposed VCAES has slight advantages over the comparison algorithms in terms of convergence and diversity, as shown in Fig. 4d.

Performance influence of different mechanisms

The VCAES mainly contains two new mechanisms: one dynamically classifies the decision variables, and the other adaptively allocates evolution opportunities to each constructed group of decision variables. To measure the respective contributions of the two mechanisms to the overall performance, we construct three variants of the VCAES for comparison. The first variant, denoted as Variant 1, replaces the decision variable classification mechanism with a random one and iterates over the groups of decision variables in a round-robin manner. The second variant, denoted as Variant 2, replaces the decision variable classification mechanism with a random one but still adaptively allocates evolution opportunities to each group. The third variant, denoted as Variant 3, removes the adaptive evolution opportunity allocation mechanism.

Fig. 5 Performance influence of the two proposed components

The main difference between the VCAES and Variant 3 is that Variant 3 lacks the adaptive evolution opportunity allocation mechanism, so the improvement in the hypervolume value of the VCAES relative to Variant 3 can be attributed to this allocation mechanism. Similarly, the improvement of the VCAES relative to Variant 2 can be attributed to the decision variable classification mechanism. The comparison results in Fig. 5 illustrate that both proposed mechanisms contribute to the overall performance of the VCAES, with the decision variable classification mechanism contributing more. The improvement of the VCAES relative to Variant 1 can be attributed to combining the two mechanisms. As shown in Fig. 5, in most workflow instances, the performance contribution of combining the two mechanisms is larger than that of either one alone. An exception is shown in Fig. 5b, where the overall contribution of combining the two mechanisms is not as large as that of the decision variable classification mechanism alone. This is because no mechanism can be efficient in every scenario; a mechanism that has advantages in some scenarios inevitably has disadvantages in others.

Conclusions

This paper focuses on two challenges in multi-objective cloud workflow scheduling: (1) large-scale decision variables and (2) the imbalance among variables regarding their contributions to the objectives. To deal with these two challenges, this paper proposes a variable-contribution-based adaptive evolutionary cloud workflow scheduling approach that dynamically classifies the variables and adaptively allocates evolution opportunities to each constructed group of variables. Finally, in the context of real-world workflows and cloud platforms, this paper conducts comparison experiments to verify the effectiveness of the proposed adaptive mechanism in helping the population approximate the Pareto fronts of multi-objective cloud workflow scheduling problems.

Cloud workflow scheduling is a representative grey-box problem, and it is interesting to mine knowledge about the workflows and cloud resources to derive efficient scheduling algorithms. Another potential direction is to design a parallel evolutionary framework that shortens the time overhead of evolutionary optimization so as to support cloud workflow scheduling in real-time and uncertain situations.