1 Introduction

The Cloud has enabled a paradigm shift for researchers by allowing access to a fully scalable, on-demand infrastructure. Providers offer several service options, but the best known are Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). The SaaS model is a fully managed solution offered by the provider; Office 365 is one of the main examples, in which the client receives software and is concerned only with the usage of the application. In the PaaS model, the client pays for an environment ready for the deployment of applications managed by the client, for example a Web application, where the cloud provider offers a web server, a database server and a backend. The third model is IaaS, where the product is the ability to rent virtual machines (VMs) with different capabilities and to scale massively both horizontally (number of VMs) and vertically (the capabilities of a single server). The customer pays only for the computation resources used (CPU and RAM), with the option to pay per unit of time. Most providers also charge for network and storage usage on a pay-per-use basis. Examples include Microsoft Azure, Google Compute Engine and Amazon EC2. This model gives the user more control by allowing them to choose a specific operating system (OS), CPU, memory, bandwidth and data storage.

This research will focus on the use of the IaaS model as it allows the researcher to create an infrastructure that can be matched to the requirements of the workflow being executed. Furthermore, it allows the researcher to have a different infrastructure for each specific execution of the different workflows.

Scientific workflows represent a complex scenario with up to thousands of interdependent tasks that need to be mapped to a myriad of possible hosts. Workflows are typically represented using a Directed Acyclic Graph (DAG). Each task takes time to execute depending on the VM selected. The storage unit and network connection between VMs can also have a significant impact on the final execution time. The schedule is conditioned by a number of quality of service (QoS) constraints. Typically, the researcher will want to consider the completion time of the DAG (makespan) or the cost, among other constraints. This mapping of workflow tasks to VMs is known as the Workflow Scheduling Problem and is considered an NP-complete problem (Madni et al. 2017).

In the last decade, numerous methods that try to solve this problem have appeared in the literature. This is the case of heuristic methods, such as the Heterogeneous Earliest Finish Time (HEFT) heuristic proposed in Topcuoglu et al. (2002), which provide a good balance between execution time and performance, or MOHEFT, proposed by Durillo and Prodan (2014), which is a multi-objective version. However, for more complex workflows, metaheuristic approximations are commonly used. Different bio-inspired algorithms have been proposed, some of them based on physical processes: Yuan et al. (2021) apply Simulated Annealing in a multi-objective approach, Biswas et al. (2019) propose the use of a Gravitational Search Algorithm (GSA), and Adhikari and Amgoth (2019) apply an Intelligent Water Drop Algorithm that optimises the makespan and the infrastructure involved. There are also hybrid approaches: for example, Elaziz et al. (2019) propose a hybridisation of the Moth Search Algorithm (MSA) and Differential Evolution (DE), and Ye et al. (2019) combine a heuristic and a genetic algorithm to minimise the execution time of the workflow.

A significant number of approaches seek to minimise makespan because it is one of the most important goals for scientists, but we have found that their makespan estimates often do not receive the attention they deserve. Inaccurate makespan values can have a negative impact if the scientist takes them as a source of truth. An algorithm may generate a good schedule in real execution, but it is essential that the execution does not deviate from the predicted makespan; a failure to meet time constraints in a cloud computing scenario can generate unexpected costs if the algorithm is over-optimistic. To reduce this gap, the computational model must avoid some simplifications; a common one is the omission of disk communications, so that only network transfers are considered. Including disk communications can have a large impact, making the model more accurate in workflows where communication represents a significant part of the execution time.

Traditionally, the network has been the bottleneck in any intensive data transfer. However, as technology has improved (e.g. self-balancing link aggregation of multiple network adapters or fibre connections), this is no longer a technical limitation but a budgetary issue. This is not the case for public cloud providers such as Google, Amazon and Microsoft, which offer high-tech data centres that are constantly upgrading their processors, networking and storage devices due to the competitiveness of the market.

For instance, Fig. 1 illustrates the comparison of network bandwidth and maximum disk speeds related to the Google Cloud C2 machine series, a set of virtual machine models designed for compute-intensive workloads, and the available range of virtual storage devices and services in this Cloud provider (Google 2023). As we can see, in almost all configurations the bottleneck is storage speed and not network bandwidth.

The model considered in this paper is the Disk-Network-Computation (DNC) evaluation model previously introduced in Barredo and Puente (2022), which is designed to take into account storage communication times in workflow tasks.

Fig. 1 Read and write transfer speeds of the different storage options in the Google Cloud infrastructure, compared against the network bandwidth of its C2 machine series

In the current paper we develop an extended study of this model, aiming to evaluate its accuracy improvements in both data-intensive and compute-intensive workflow problems, and its contribution to makespan optimisation. We propose some hybrid metaheuristics combining elements of the HEFT heuristic and a Genetic Algorithm, and contrast their results with those obtained with the previous standard Network-Computing (NC) evaluation model.

The main contributions of the current paper are to:

  • Reformulate the HEFT heuristic and a Lamarckian GA in terms of the new DNC evaluation model.

  • Propose different hybrid algorithms that combine the GA with both the components and the heuristic solutions of the well-known HEFT heuristic, in order to minimize makespan.

  • Introduce an accuracy metric to study the similarity between the estimated makespan and real execution times, based on simulations using the Wrench framework (Casanova et al. 2020).

  • Design two different cloud computing scenarios by varying infrastructure sizes and specs, and solve instances of seven real scientific workflow problems from the WFCommons repository (Coleman et al. 2022).

  • Carry out an experimental study covering seven different real scientific workflow applications. All workflow execution instances were generated from real Pegasus WMS (workflow management system) executions and are available in the WFCommons repository. The study covers the accuracy of the estimated makespans of the DNC vs the NC model, and the real makespan improvements of the hybrid proposals with respect to the standard HEFT heuristic and the proposed Lamarckian genetic algorithm.

The remainder of the paper is organised as follows. In the next section, we give the formulation of the Scientific Workflow Scheduling model. The proposed solving methods are described in Sect. 3. In Sect. 4, we report the results of the experimental study. Finally, in Sect. 5, we summarise the main conclusions and outline some ideas for future work.

2 The scientific workflow scheduling model

Computational applications in distributed systems are workflows modelled as a set of tasks interconnected by precedence constraints. The execution of a task can be initiated as soon as the necessary data are available, i.e. after all predecessors have been completed. After execution, every task generates an output dataset. This dataset is required by its successor tasks in the workflow before their execution. Most workflow applications can be represented in the form of a directed acyclic graph where the nodes are the tasks and the arcs are the precedence constraints. This is the case for multiple scientific applications in very different research fields such as bioinformatics (1000genome, Epigenomics, SoyKB, SRA Search), agroecosystems (Cycles), seismology (Seismic Cross Correlation) or astronomy (Montage) (Juve et al. 2013).

2.1 Definition of the workflow scheduling problem

Scientific workflows are represented as a directed acyclic graph \(G=(T,A)\), where \(T=\{t_1, t_2,..., t_n\}\) is the set of nodes representing the n tasks of the problem, and \(A=\{(t_i,t_j) | 1\le i \le n, 1 \le j \le n, i \ne j\}\) represents the set of arcs or dependency constraints among the tasks. The nodes are labelled with their corresponding task sizes in MFLOPs (Million Floating Point Operations), and the arcs are labelled as \(data(i,j)\), i.e., the dataset size in MB to transfer from \(t_i\) to \(t_j\). Moreover, \(t_{entry}\) and \(t_{exit}\) are fictitious tasks with null computation and communication, representing the entry and exit points of the workflow, respectively.

The resource model consists of a cloud service provider that offers an IaaS platform as a set of mixed types of hosts, or VMs, to its clients. Let \(M=\{vm_1, vm_2,..., vm_m\}\) be the set of VMs, each one modelled as a tuple \(\langle pc, nb, ds \rangle\), where pc, nb and ds are the processing capacity in GFLOPS (GFLOPs per second), the network bandwidth in MB/s and the disk reading/writing speed in MB/s, respectively.
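As an illustration, the workflow graph and the resource model can be held in two small Python structures. This is a minimal sketch and not the authors' implementation: the class names `VM` and `Workflow`, their field names and the four-task instance are assumptions introduced here, and task sizes are assumed to be expressed in a unit compatible with `pc`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VM:
    """One host, modelled as the tuple <pc, nb, ds>."""
    pc: float  # processing capacity (GFLOPS)
    nb: float  # network bandwidth (MB/s)
    ds: float  # disk read/write speed (MB/s)


@dataclass
class Workflow:
    """DAG G = (T, A): node labels are task sizes, arc labels are dataset sizes."""
    size: Dict[str, float]              # task -> task size
    data: Dict[Tuple[str, str], float]  # (t_i, t_j) -> data(i, j) in MB

    def pred(self, task: str) -> List[str]:
        """Direct predecessors of a task."""
        return [i for (i, j) in self.data if j == task]

    def succ(self, task: str) -> List[str]:
        """Direct successors of a task."""
        return [j for (i, j) in self.data if i == task]


# Hypothetical four-task instance (values are illustrative only).
wf = Workflow(
    size={"t1": 10.0, "t2": 20.0, "t3": 5.0, "t4": 8.0},
    data={("t1", "t2"): 100.0, ("t1", "t3"): 50.0,
          ("t2", "t4"): 30.0, ("t3", "t4"): 30.0},
)
# Two hosts that differ only in disk speed.
vms = [VM(pc=441.0, nb=125.0, ds=115.0), VM(pc=441.0, nb=125.0, ds=20.0)]
```

Keeping the arcs in a single dictionary keyed by task pairs makes \(data(i,j)\) a direct lookup; for large workflows, precomputed predecessor and successor lists would be a natural optimisation.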

Now, given a workflow \(G=(T,A)\) and an IaaS infrastructure as a set of VMs M, the goal of the Workflow Scheduling Problem, defined by the tuple \((G,M)\), is twofold. Firstly, we need to find a feasible solution \(S=(Hosts,Order)\), where Hosts is a mapping from tasks to VMs and Order is a topological order of G. Secondly, we want this schedule to be optimal in the sense that its makespan is minimal.

Specifically, the goal is:

$$\begin{aligned} minimize\;EFT(t_{exit}) \end{aligned}$$
(1)

where \(EFT(t_{exit})\) is the estimated finish time of the task \(t_{exit}\).

The schedule of the tasks and the corresponding estimated value of makespan are defined in accordance with the applied evaluation model: the standard NC model or the new DNC model. In the next subsections both models are formally introduced.

2.2 Schedule evaluation models

In the context of metaheuristic optimization, thousands of schedules are generated and subsequently evaluated. Standard evaluation models tend to simplify the real infrastructure factors involved in computational executions in Cloud or on-premise infrastructures. The main advantage is that their low computational cost allows them to be used as a decoding model in metaheuristic optimization methods. However, the main drawback is the underestimation of the real computational cost in an expensive pay-per-use scenario.

The most common evaluation model for workflow schedules is the one used in Ghorbannia Delavar and Aryan (2014), Durillo and Prodan (2014), Zhu et al. (2016) and Chakravarthi et al. (2022), which only considers CPU processing times and network communications of data sets between different hosts. This Network-Computation (NC) model ignores all disk accesses made by tasks when acquiring or generating their data sets. Another common simplification in the NC model is to consider only the latest network communication time from the predecessor tasks when estimating the starting time of a successor task. For compute-intensive workflows, the NC model may generate quite accurate makespan approximations.

However, in a cloud computing environment (usually a pay-per-use infrastructure environment) not only computing times and network communications are relevant, but also disk input/output transfer times. This extra time should be considered at least in data-intensive workflows, where data transfer operations are not negligible with respect to computing times. In this paper we focus on an extended evaluation model called the Disk-Network-Computation (DNC) model, proposed in Barredo and Puente (2022), which considers network communications and all disk operations.

To illustrate, Fig. 2 shows a workflow with 3 tasks of 1 TU (time unit) of computation each, where each task has to transfer 1 DU (data unit) of information. Figure 3 illustrates the execution of the workflow on two hosts: Host A has a disk with a speed of 1 DU per TU and Host B a disk with 0.5 DU per TU. For model a) we only have to consider the network, with 1 DU per TU (block C\(_{2,3}\)), so the makespan is 3 TU. Model b) introduces some new concepts. All tasks have to write their data to disk (W_Disk\(_A\) and W_Disk\(_B\)), and host B takes twice the time. Task T3 on host A cannot start reading until all the files have been written; then it can start reading the first file (\({R\_Disk}_A\)) and, because the files are read sequentially, the data from T2 has to wait. Another key difference is that the transfer speed is determined by the slowest medium: the disk of host B is slower than the network bandwidth, so the transfer (C\(_{2,3}\)) takes 2 TU. The final important and often omitted factor is that the current task may need to write its output data to disk; in this case it takes 1 TU, giving a final makespan of 8 TU.

Fig. 2 Workflow with 3 tasks and one fictitious entry task, so that it has a single entry point and a single end point

Fig. 3 Processing phases of task execution in the NC and DNC evaluation models for the workflow in Fig. 2. In (a), the NC model, only the input data from network transmissions of the last predecessor task to finish are considered; in (b), the DNC model, all input data transmissions from both network and local disk are considered, and only after the last predecessor task has written its results to local disk. Finally, the current task is not considered finished until its output data have been written to local disk

When we apply an evaluation model to a workflow problem solution, we get the schedule and its corresponding a priori or estimated makespan, but it is only after execution of the workflow (which in this work is done via IaaS simulations) that we get the actual makespan. In Barredo and Puente (2022) a robustness measure was introduced to quantify the difference between the a priori makespan estimation and the real makespan. In this work we introduce the concept of accuracy of the makespan estimation of a schedule s as its degree of similarity to the real makespan, which is defined as:

$$\begin{aligned} accuracy(s) = \frac{makespan_{est}(s)}{makespan_{sim}(s)} \end{aligned}$$
(2)

where \(makespan_{sim}(s)\) is the actual makespan from the simulation and \(makespan_{est}(s)\) is the a priori makespan of the solution s estimated by the scheduler. The surrogate model used by the scheduler generates optimistic makespan estimations; consequently, the accuracy value must be \(\le 1\).
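Eq. 2 can be evaluated directly once both makespans are known. The following is a minimal sketch; the function name and the numeric values are hypothetical and are not results from the paper.

```python
def accuracy(makespan_est: float, makespan_sim: float) -> float:
    """Eq. 2: ratio between the estimated and the simulated makespan."""
    if makespan_sim <= 0:
        raise ValueError("the simulated makespan must be positive")
    return makespan_est / makespan_sim


# Hypothetical values: the scheduler predicts 90 s and the simulation takes 120 s.
acc = accuracy(90.0, 120.0)
print(acc)  # 0.75
```

A value close to 1 means that the a priori estimation is a reliable predictor of the simulated execution time; lower values indicate increasingly optimistic estimations.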

The main contribution of the DNC evaluation model is the improved accuracy of the estimated makespan. An accurate a priori makespan offers scientists relevant information for deciding which computing infrastructure to rent, depending on the budget and the urgency of the results.

2.3 Disk-network-computing vs network-computing processing model

In this section, the DNC model estimations are defined and their differences from the NC model are highlighted. The DNC model differs from the standard NC model in communication and processing times, because only the DNC model considers disk accesses. The computation of tasks is similar, but it is necessary to reformulate the main concepts.

The makespan estimation using DNC model is calculated applying the following definitions:

Definition 1

The computation time of task \(t_i\) on a machine \(vm_k\), denoted \(ct_i^k\), is defined as:

$$\begin{aligned} ct_i^k=size(t_i)/pc_k \end{aligned}$$
(3)

where \(size(t_i)\) is the size of the task \(t_i\) measured in GFLOPs and \(pc_k\) is the processing capacity of the virtual machine \(vm_k\) in GFLOPS.

Definition 2

The data transfer time between tasks \(t_i\) and \(t_j\) mapped on \(vm_k\) and \(vm_l\) respectively is:

$$\begin{aligned} dt_{i,j}^{k,l} = {\left\{ \begin{array}{ll} \frac{data(i,j)}{ds_k} &{}: k = l \\ \\ \frac{data(i,j)}{min(ds_k, nb_k, nb_l)} &{}: k \ne l \\ \end{array}\right. } \end{aligned}$$
(4)

where \(data(i,j)\) is the size of the output data sent from \(t_i\) to \(t_j\), and \(vm_k\) and \(vm_l\) are the VMs where the tasks are scheduled, respectively. In this DNC model, when tasks \(t_i\) and \(t_j\) are scheduled on the same VM, the input files are read from local disk; \(ds_k\) is the disk speed of \(vm_k\), and \(nb_k\) and \(nb_l\) are the network bandwidths of \(vm_k\) and \(vm_l\), respectively. In the NC model all data transfers from/to disk are ignored, so in the \(k=l\) case of Eq. 4 the communication time is zero.
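Definitions 1 and 2 translate into a few lines of code. The sketch below uses hypothetical function names and takes disk speeds and bandwidths as dictionaries indexed by host. The NC variant for \(k \ne l\), which divides by the slower of the two network bandwidths, is an assumption, since the text only states the \(k=l\) case explicitly. The example values are the ones of the two hosts in Fig. 3.

```python
def computation_time(size: float, pc: float) -> float:
    """Eq. 3: task size divided by the processing capacity of the VM."""
    return size / pc


def transfer_time_dnc(mb: float, k: str, l: str, ds: dict, nb: dict) -> float:
    """Eq. 4: data produced on vm_k and consumed on vm_l."""
    if k == l:
        return mb / ds[k]                    # read from the local disk
    return mb / min(ds[k], nb[k], nb[l])     # the slowest medium dominates


def transfer_time_nc(mb: float, k: str, l: str, nb: dict) -> float:
    """NC counterpart (assumed form): disk accesses are ignored."""
    if k == l:
        return 0.0
    return mb / min(nb[k], nb[l])


# Hosts of the illustrative example: speeds in DU per TU.
ds = {"A": 1.0, "B": 0.5}
nb = {"A": 1.0, "B": 1.0}

print(transfer_time_dnc(1.0, "B", "A", ds, nb))  # 2.0 (limited by the disk of B)
print(transfer_time_nc(1.0, "B", "A", nb))       # 1.0 (network only)
print(transfer_time_dnc(1.0, "A", "A", ds, nb))  # 1.0 (local disk read)
```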

Definition 3

The estimated finish time of task \(t_i\) executed on machine \(vm_k\) involves not only the processing time but also complete input and output data transfer operations. It is defined as:

$$\begin{aligned} EFT(t_i,vm_k) = EST(t_i, vm_k) +input_{i,k}+ ct_i^{k}+ output_{i,k} \end{aligned}$$
(5)

where \(input_{i,k}\) is the communication time for input data of \(t_i\) on \(vm_k\) from all its predecessors. It is defined as:

$$\begin{aligned} input_{i,k}=\sum _{t_j \in pred(t_i)}dt_{j,i}^{l,k} \end{aligned}$$
(6)

where each predecessor task \(t_j\) is executed on its corresponding machine \(vm_l\). The corresponding \(output_{i,k}\) is the writing time for all output data of \(t_i\) in the \(vm_k\) local disk, that is:

$$\begin{aligned} output_{i,k}=\frac{\sum _{t_j \in succ(t_i)}data(i,j)}{ds_k} \end{aligned}$$
(7)

The NC model, in contrast to the DNC model, considers only the computation time \(ct_i^{k}\), while all input/output data transfers are ignored.
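Eqs. 5 to 7 can be checked against the three-task example of Fig. 3. The sketch below is illustrative only: the function names are hypothetical, the mapping (T1 and T3 on host A, T2 on host B) is our reading of the example in the text, and the 1 DU written by T3 is modelled as an arc to a placeholder `exit` node.

```python
def transfer_time(mb, src, dst, ds, nb):
    """Eq. 4 for data produced on host src and consumed on host dst."""
    if src == dst:
        return mb / ds[src]
    return mb / min(ds[src], nb[src], nb[dst])


def input_time(task, host_of, data, ds, nb):
    """Eq. 6: inputs are read one after another, so their times add up."""
    k = host_of[task]
    return sum(transfer_time(mb, host_of[j], k, ds, nb)
               for (j, i), mb in data.items() if i == task)


def output_time(task, host_of, data, ds):
    """Eq. 7: all output data are written to the local disk of the task's host."""
    total = sum(mb for (i, _), mb in data.items() if i == task)
    return total / ds[host_of[task]]


def eft(task, est, size, pc, host_of, data, ds, nb):
    """Eq. 5: EST + input + computation + output."""
    k = host_of[task]
    return (est + input_time(task, host_of, data, ds, nb)
            + size[task] / pc[k] + output_time(task, host_of, data, ds))


# Assumed reading of the example: 1 TU of computation and 1 DU of output per task.
size = {"T1": 1.0, "T2": 1.0, "T3": 1.0}
data = {("T1", "T3"): 1.0, ("T2", "T3"): 1.0, ("T3", "exit"): 1.0}
host_of = {"T1": "A", "T2": "B", "T3": "A"}
pc = {"A": 1.0, "B": 1.0}
ds = {"A": 1.0, "B": 0.5}
nb = {"A": 1.0, "B": 1.0}

eft_t1 = eft("T1", 0.0, size, pc, host_of, data, ds, nb)  # 0 + 0 + 1 + 1
eft_t2 = eft("T2", 0.0, size, pc, host_of, data, ds, nb)  # 0 + 0 + 1 + 2
# T3 can start only when both predecessors have written their results.
eft_t3 = eft("T3", max(eft_t1, eft_t2), size, pc, host_of, data, ds, nb)
print(eft_t1, eft_t2, eft_t3)  # 2.0 3.0 8.0
```

Under these assumptions the finish time of T3 is 8 TU, which matches the DNC makespan given for the example, whereas the NC estimation for the same schedule is 3 TU.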

Definition 4

The estimated starting time of task \(t_i\) on \(vm_k\) is defined as:

$$\begin{aligned} EST(t_i,vm_k) = avail\left( i,k,\max _{t_j \in pred(t_i)}EFT(t_j,vm_l)\right) \end{aligned}$$
(8)

where each predecessor task \(t_j\) is executed on its corresponding machine \(vm_l\), and \(avail(i,k,m)\) is the earliest available time slot of \(vm_k\) after time m in which \(t_i\) can be computed.

The DNC model uses an insertion policy which assigns the earliest idle time slot between two already-scheduled tasks on the assigned VM. The time slot must be long enough to cover not only the computation but also the data transfer times of the considered task. Additionally, scheduling in this idle time slot must preserve the precedence constraints.
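The insertion policy can be sketched as a scan over the busy intervals of the VM. This is a minimal illustration rather than the authors' code: the function takes the busy intervals of \(vm_k\), the ready time m and the total duration of the task (input, computation and output in the DNC model) as explicit arguments, which is an assumed signature.

```python
def avail(busy, ready, duration):
    """Earliest start >= ready of an idle slot of at least `duration`.

    busy: list of (start, end) intervals already scheduled on the VM,
          sorted by start time and non-overlapping.
    """
    start = ready
    for slot_start, slot_end in busy:
        if start + duration <= slot_start:
            return start              # the task fits before this busy interval
        start = max(start, slot_end)  # otherwise try after it
    return start                      # append after the last scheduled task


busy_vm = [(0.0, 2.0), (5.0, 8.0)]
print(avail(busy_vm, 1.0, 3.0))  # 2.0: fits in the gap between 2 and 5
print(avail(busy_vm, 1.0, 4.0))  # 8.0: the gap is too short
```

Because the ready time is the latest finish time of the predecessors, any slot returned by this scan preserves the precedence constraints.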

3 Solving methods

In this work we propose different solving methods. First, a reformulated version of HEFT, one of the most widely used scheduling heuristics for makespan optimization in scientific workflows. Second, an efficient genetic algorithm based on Barredo and Puente (2022). Finally, we combine both methods to generate several hybrid genetic algorithms that exploit HEFT’s components to improve makespan quality with no loss of accuracy.

3.1 HEFT heuristic with DNC model

The original HEFT was proposed in Topcuoglu et al. (2002). Its main idea is to schedule tasks so that the earliest finish time (EFT) is minimized for all the tasks. The two phases of HEFT with the classic NC model (\(HEFT_{NC}\)) and of the HEFT heuristic using the new DNC model (\(HEFT_{DNC}\)) are described as follows:

  • Phase 1: Calculating priority of tasks

    In this phase, the priority of each task is calculated using the average execution time and the average communication time. The priorities are calculated bottom-up, from the exit task towards the entry task. The sequence of tasks is generated from highest to lowest priority, satisfying all workflow precedence constraints. In \(HEFT_{NC}\) the priority of task \(t_i\) is given by

    $$\begin{aligned} prioNC(t_i) = \overline{ct_i} + max_{t_j \in succ(t_i)} ( \overline{dt_{i,j}} + prioNC(t_j) ) \end{aligned}$$
    (9)

    where \(\overline{ct_i}\) is the average execution time of task \(t_i\) and \(\overline{dt_{i,j}}\) is the average communication time between tasks \(t_i\) and \(t_j\). The main difference in \(HEFT_{DNC}\) is that the priority of a task includes the average communication not only with the highest priority successor task, but with all its successors and predecessors, following Eqs. 6 and 7. The priority is defined as follows:

    $$\begin{aligned} prioDNC(t_i)= & {} \overline{input_{i}} + \overline{ct_i} + \overline{output_{i}}\nonumber \\{} & {} + max_{t_j \in succ(t_i)} ( prioDNC(t_j) ) \end{aligned}$$
    (10)

    where \(\overline{input_{i}}\) and \(\overline{output_{i}}\) are the average communication times between task \(t_i\) and all its predecessors and successors respectively.

  • Phase 2: Mapping tasks to VMs

    The actual mapping of tasks to VMs is performed in this phase according to their priority. In \(HEFT_{NC}\) the task with the highest priority is scheduled first, by calculating its earliest finish time on all available VMs considering only the computation time. In contrast, \(HEFT_{DNC}\) uses Eq. 5 and considers not only the computation time but also all input/output data transfers, for a more realistic EFT estimation.
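The first phase of \(HEFT_{DNC}\) (Eq. 10) can be sketched as a memoised recursion over the successors. The function name and the three-task instance are hypothetical, and the averages \(\overline{input_{i}}\), \(\overline{ct_i}\) and \(\overline{output_{i}}\) are assumed to be precomputed, since the exact averaging procedure is not detailed here.

```python
def prio_dnc(succ, avg_input, avg_ct, avg_output):
    """Eq. 10: bottom-up priority of every task under the DNC model."""
    memo = {}

    def rank(task):
        if task not in memo:
            # Highest priority among the successors (0 for exit tasks).
            tail = max((rank(s) for s in succ.get(task, ())), default=0.0)
            memo[task] = (avg_input[task] + avg_ct[task]
                          + avg_output[task] + tail)
        return memo[task]

    for task in avg_ct:
        rank(task)
    return memo


# Hypothetical averages for a workflow with arcs T1 -> T3 and T2 -> T3.
succ = {"T1": ["T3"], "T2": ["T3"], "T3": []}
avg_input = {"T1": 0.0, "T2": 0.0, "T3": 2.0}
avg_ct = {"T1": 1.0, "T2": 1.0, "T3": 1.0}
avg_output = {"T1": 1.0, "T2": 2.0, "T3": 1.0}

prio = prio_dnc(succ, avg_input, avg_ct, avg_output)
order = sorted(prio, key=prio.get, reverse=True)
print(prio)   # {'T3': 4.0, 'T1': 6.0, 'T2': 7.0}
print(order)  # ['T2', 'T1', 'T3']
```

As long as every task has a positive average computation time, a task always has a strictly higher priority than its successors, so sorting by decreasing priority yields a valid topological order for the second phase.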

3.2 Genetic algorithm

In this section, we introduce the main components of the genetic algorithm proposed to solve the Workflow Scheduling Problem and to study the accuracy of the solutions using both the NC and DNC evaluation models. This evolutionary algorithm combines previous algorithms from Barredo and Puente (2022) and Palacios et al. (2015). Algorithm 1 shows the pseudocode of the GA: it is a generational genetic algorithm with random mating selection and replacement by tournament between parents and offspring, which gives the GA an implicit form of elitism. The GA uses one of the two evaluation models, NC or DNC. As a result we have two different genetic algorithms: NC-GA and DNC-GA, respectively. Both algorithms require the following parameters: population size (\(pop_{size}\)), number of generations (\(max_{gens}\)), and crossover and mutation probabilities (\(p_c\) and \(p_m\)).

Algorithm 1 Pseudocode of the genetic algorithm
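The overall scheme described above can be sketched as follows. This is an illustrative reading and not a transcription of Algorithm 1: the function `run_ga`, its signature and the interpretation of the tournament (the best two individuals of each family of two parents and two offspring survive) are assumptions, and the Lamarckian recoding step is omitted. The toy operators at the end only serve to make the sketch runnable.

```python
import random


def run_ga(init_pop, evaluate, crossover, mutate, max_gens, p_c, p_m, rng):
    """Generational GA with random mating and family tournament replacement."""
    pop = [(evaluate(c), c) for c in init_pop]
    for _ in range(max_gens):
        rng.shuffle(pop)  # random mating selection: consecutive pairs mate
        next_pop = []
        for a, b in zip(pop[0::2], pop[1::2]):
            if rng.random() < p_c:
                kids = crossover(a[1], b[1], rng)
            else:
                kids = (a[1], b[1])
            kids = [mutate(k, rng) if rng.random() < p_m else k for k in kids]
            # Tournament between parents and offspring (minimisation).
            family = [a, b] + [(evaluate(k), k) for k in kids]
            family.sort(key=lambda fc: fc[0])
            next_pop.extend(family[:2])
        if len(pop) % 2:  # an unpaired individual survives unchanged
            next_pop.append(pop[-1])
        pop = next_pop
    return min(pop, key=lambda fc: fc[0])


# Toy problem (not a scheduling instance): minimise the sum of a list of digits.
def toy_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def toy_mutate(c, rng):
    c = list(c)
    c[rng.randrange(len(c))] = 0
    return c


rng = random.Random(0)
init = [[rng.randint(0, 9) for _ in range(5)] for _ in range(7)]
best_fit, best = run_ga(init, sum, toy_crossover, toy_mutate, 20, 1.0, 0.1, rng)
```

Since the best individual of each family always survives, the best fitness in the population never gets worse from one generation to the next, which is the implicit elitism mentioned in the text.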

Coding Schema The coding schema is based on permutations of tasks (Ye et al. 2019; Zhu et al. 2016), each one with a specific VM assignment. So, a gene is a pair (i,k), 1 \(\le i \le |T|\) and 1 \(\le k \le |M|\), and a chromosome includes one such gene for every task. For example, given an instance with 4 tasks and 2 VMs, a feasible chromosome is the following: \(chr_1\): ((1 2) (4 1) (2 1) (3 2)), which represents the task ordering (\(t_1\), \(t_4\), \(t_2\), \(t_3\)) with VM assignments (\(vm_2\), \(vm_1\), \(vm_1\), \(vm_2\)) respectively. We only consider task orders that codify a topological order, so every task must be located in a gene after its last predecessor and before its first successor in the chromosome. Therefore, the individuals generated in the initial population and by the genetic operators must be consistent with the task dependency constraints.

Decoding Schema The schedule represented by a chromosome is calculated using the selected evaluation model as a decoder. The genes are processed from left to right in the chromosome sequence. For each gene (i,k), the task \(t_i\) is scheduled at the earliest free gap of \(vm_k\), after the latest finish time of its predecessors in the workflow, where the processing time of task \(t_i\) (computation and communications, depending on the evaluation model) fits. The makespan of the built schedule is the latest finish time of all the workflow tasks. In order to accelerate the convergence to optimal solutions, Lamarckian learning (Houck et al. 1996) is applied as the last stage of the decoding and evaluation phase. As a result, the gene order of the chromosome is recoded according to the resulting topological order of the generated schedule.

Crossover The mating operator must establish the order and VM assignment of the tasks in the generated offspring. A feasible schedule permutation must respect all the dependencies that exist among tasks. For chromosome mating, we follow Zhu et al. (2016) and the so-called CrossoverOrder algorithm. First, the operator randomly chooses a crossover position, which splits each parent sequence into two subsequences. Then, the first subsequence of each parent is taken as the initial sequence of one offspring, and the remaining positions are filled with the genes representing the remaining tasks, taken from the other parent while keeping their relative order. The resulting task orders will not cause any dependency conflict, since the relative order of any two tasks is already present in at least one parent.
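The crossover just described can be sketched in a few lines. The function name `crossover_order` and the second parent are hypothetical; the first parent is the chromosome \(chr_1\) used in the coding example, and the dependencies 1 → 2 → 3 and 1 → 4 mentioned in the comments are assumed for illustration.

```python
import random


def crossover_order(p1, p2, rng):
    """Order-preserving crossover on chromosomes of (task, vm) genes.

    Requires chromosomes with at least two genes.
    """
    cut = rng.randrange(1, len(p1))  # random crossover position

    def child(head_parent, tail_parent):
        head = head_parent[:cut]                  # first subsequence kept as is
        used = {task for task, _ in head}
        # Remaining tasks keep the relative order (and VM) of the other parent.
        return head + [g for g in tail_parent if g[0] not in used]

    return child(p1, p2), child(p2, p1)


# Both parents are topological orders of the assumed DAG 1 -> 2 -> 3, 1 -> 4.
p1 = [(1, 2), (4, 1), (2, 1), (3, 2)]
p2 = [(1, 1), (2, 2), (3, 1), (4, 2)]
child1, child2 = crossover_order(p1, p2, random.Random(0))
```

The head of each child is a prefix of a topological order, so it already contains every predecessor of its own tasks, and the tail inherits a valid relative order from the other parent; this is why no repair step is needed.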

Mutation The mutation operator cannot break the task order dependencies. First, we select a random task \(T_i\). Next, we identify all predecessors and successors of \(T_i\). Then, the operator locates the longest subsequence of genes containing \(T_i\) that does not include any predecessor or successor of \(T_i\), and \(T_i\) is moved to a randomly chosen location inside this subsequence. Finally, the assigned VM is mutated to a random index in the set of VMs.
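A possible implementation of this operator is sketched below. The function name, its arguments and the four-task instance are hypothetical. The sketch uses direct predecessors and successors only, which is sufficient when the chromosome is already a topological order, because the closest predecessor to the left of a task is always a direct one (and likewise for successors to the right).

```python
import random


def mutate(chrom, pred, succ, n_vms, rng):
    """Move one task inside its feasible window and reassign its VM."""
    pos = rng.randrange(len(chrom))
    task, _ = chrom[pos]
    lo = pos  # extend the window to the left until a predecessor is found
    while lo > 0 and chrom[lo - 1][0] not in pred[task]:
        lo -= 1
    hi = pos  # extend the window to the right until a successor is found
    while hi < len(chrom) - 1 and chrom[hi + 1][0] not in succ[task]:
        hi += 1
    new_pos = rng.randint(lo, hi)
    rest = chrom[:pos] + chrom[pos + 1:]
    rest.insert(new_pos, (task, rng.randint(1, n_vms)))  # random VM index
    return rest


# Assumed dependencies: 1 -> 2 -> 3 and 1 -> 4.
pred = {1: {}, 2: {1}, 3: {2}, 4: {1}}
succ = {1: {2, 4}, 2: {3}, 3: {}, 4: {}}
chrom = [(1, 2), (4, 1), (2, 1), (3, 2)]
mutant = mutate(chrom, pred, succ, 2, random.Random(3))
```

The window never crosses a predecessor or a successor of the selected task, so the mutated chromosome remains a topological order without any repair.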

Initial Population The \(pop_{size}\) initial individuals of the population are generated at random but following a topological order of the tasks, and with valid VM index assignments for all tasks.

We start with an empty chromosome, and then we identify as candidate tasks those that have \(t_{entry}\) as their only predecessor in the workflow. At every step we extract a task T at random from the candidate tasks and append it to the end of the partial chromosome. Then, we update the candidate tasks by adding all the successors of T which have all their predecessors in the chromosome. The process is repeated until the set of candidate tasks is empty. The resulting chromosome task sequence follows a topological order.

For every task \(T_i\), the initial VM index k is selected at random in the range \(1 \le k \le |M|\).
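The generation of one initial individual can be sketched as a randomised topological sort. The function name and the instance are hypothetical, and the fictitious entry task is left implicit: the tasks without predecessors play the role of those having \(t_{entry}\) as their only predecessor.

```python
import random


def random_chromosome(tasks, pred, succ, n_vms, rng):
    """Random topological order of the tasks with random VM assignments."""
    placed = set()
    candidates = [t for t in tasks if not pred[t]]  # entry-level tasks
    chrom = []
    while candidates:
        task = candidates.pop(rng.randrange(len(candidates)))
        chrom.append((task, rng.randint(1, n_vms)))
        placed.add(task)
        # A successor becomes a candidate once all its predecessors are placed.
        for s in succ[task]:
            if all(p in placed for p in pred[s]):
                candidates.append(s)
    return chrom


# Assumed dependencies: 1 -> 2 -> 3 and 1 -> 4.
tasks = [1, 2, 3, 4]
pred = {1: set(), 2: {1}, 3: {2}, 4: {1}}
succ = {1: [2, 4], 2: [3], 3: [], 4: []}
rng = random.Random(42)
population = [random_chromosome(tasks, pred, succ, 2, rng) for _ in range(10)]
```

Each successor is appended to the candidate list exactly once, at the moment its last predecessor is placed, so every task appears once in the chromosome.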

3.3 Hybrid genetic algorithms

Although the \(HEFT_{DNC}\) heuristic and the previous genetic algorithm provide solutions of reasonable quality to the workflow scheduling problem, hybridisation can be applied to improve the quality of their results. The idea is to inject the knowledge of the two different phases of \(HEFT_{DNC}\) into the DNC genetic algorithm (\(GA_{DNC}\)). As a result, two new hybrid genetic algorithms are available:

  • \(HGA_{Ph1}\): in this algorithm the \(HEFT_{DNC}\) task ranking is used in all schedule generations. Therefore, the task order information is omitted from the chromosomes; in the decoding operation the task order is fixed beforehand and only the information about VM assignment comes from the chromosome.

  • \(HGA_{Ph2}\): this version gets the task order from the chromosome and assigns each task to the VM with the lowest estimated finish time (EFT in Eq. 5) of all available VMs. Since the VM assignment information in the chromosome is unnecessary, it is omitted.

A priori one of the advantages of the proposed hybrid algorithms is the reduction of the solution space, so that the same number of evaluated individuals represents a larger explored subspace of solutions. On the other hand, the risk of the injected heuristic information is the convergence bias towards local minima.

Additionally, the heuristic individual generated by \(HEFT_{DNC}\) is included in the initial population of the proposed genetic and hybrid algorithms: \(GA^{H}_{DNC}\), \(HGA^{H}_{Ph1}\) and \(HGA^{H}_{Ph2}\). They are all elitist algorithms, so we only need to add one copy to the initial population to guarantee that the quality of the final solution is at least as high as that of the \(HEFT_{DNC}\) solution.

4 Experimental study

This section evaluates the accuracy and quality of solutions from the different solving methods presented in Sect. 3. First, the workflow problems used, the evaluation metrics and the simulation platform are described. Then, the experimental study and its results, using a cloud simulator built on top of the Wrench library (Casanova et al. 2020), are analysed.

4.1 Workflow instances

The scientific workflow instances considered in this study correspond to seven different problems from the WFCommons repository (Coleman et al. 2022). All of them are workflow execution instances generated using the Pegasus workflow management system (Deelman et al. 2019). The problems present diverse characteristics but can be classified into two main types: compute-intensive and data-intensive workflows. Among the data-intensive workflows there is a sub-type that can be described as “collision prone”, in which the tasks have a huge number of dependencies and can saturate the drives and network interfaces. An example of a workflow of this sub-type can be found in Fig. 4.

Fig. 4 Diagram of the Montage scientific workflow (source: wfcommons.org)

The workflow problems used are presented below, each with a brief description:

  • 1000Genome This workflow uses data from the 1000 Genomes Project and finds mutational overlaps in order to provide data for the evaluation of health conditions caused by mutations.

  • Cycles This is an agroecosystem model used to simulate the perturbations of biogeochemical processes caused by different agricultural practices.

  • Epigenomics This workflow is related to genome sequencing operations.

  • Montage This workflow performs the reprojection and background correction needed to compose a mosaic from Flexible Image Transport System (FITS) telescope images.

  • Seismology This workflow represents the process of taking the acceleration measurements of multiple seismic stations and cross-correlating them.

  • SoyKB This is a genomics pipeline for re-sequencing soybean germplasm to change its traits.

  • SRASearch This workflow is the process of searching the Sequence Read Archive (SRA) and transforming the data to obtain aligned sequencing reads.

To evaluate the impact of the workflow dimensions, we have selected four instances of each problem, one per size: extra-small (50–100 tasks), small (100–200), medium (200–500) and large (500–1000), when they are available in the repository.

Table 1 lists all the instances used in the experimentation, together with their number of tasks and their communication to computation ratio (CCR), a metric that indicates how much of the total execution is spent in communications with respect to computation time (Xu et al. 2014), here adapted to the DNC model:

$$\begin{aligned} CCR = \sum _{i \in T}{\frac{\left( \overline{input_{i}} + \overline{output_{i}}\right) }{\overline{ct_i}}} \end{aligned}$$
(11)

where \(\overline{input_{i}}\) and \(\overline{output_{i}}\) are the average communication times between task \(t_i\) and all its predecessors and successors respectively, and \(\overline{ct_i}\) is the average execution time of task \(t_i\).
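Eq. 11 can be computed with a single pass over the tasks. The sketch below uses a hypothetical function name and hypothetical averages; the fictitious entry and exit tasks are assumed to be excluded, since their computation time is zero.

```python
def ccr(avg_input, avg_output, avg_ct):
    """Eq. 11: sum over the tasks of (input + output) / computation time."""
    return sum((avg_input[t] + avg_output[t]) / avg_ct[t] for t in avg_ct)


# Hypothetical averages for a three-task workflow (times in the same unit).
avg_input = {"T1": 0.0, "T2": 0.0, "T3": 3.0}
avg_output = {"T1": 1.0, "T2": 2.0, "T3": 1.0}
avg_ct = {"T1": 1.0, "T2": 1.0, "T3": 1.0}

print(ccr(avg_input, avg_output, avg_ct))  # 7.0
```

Higher values correspond to data-intensive instances, where the disk and network terms dominate the computation time.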

Table 1 Analysis of CCR for every instance, grouped by problem

4.2 Benchmark platform and experimental configuration

Wrench is a simulator library that can calculate the real makespan of a given workflow on a specific infrastructure. A simulator can provide a close estimation of metrics such as makespan, energy or cost without the need to buy or rent real hardware. The selected software is a C++ library that can be used to build a custom simulator.

We have simulated a processing and communication system inspired by HTCondor (Thain et al. 2005), the High-Throughput Computing environment underlying the well-known Pegasus WMS, where each computing host has direct access to a local disk, and the remote disks of the other hosts are accessible through their network interface connections, as with the NFS service in Linux. Each host has a network interface connected to a virtual switch interconnecting all hosts. Tasks read files from local or remote disks, but always store their output files on the local disk.

The hardware available in a scientific cluster can be quite diverse, whether it is an on-premise cluster or one available in the public cloud, and therefore the proposed experimentation uses different clusters with diverse configurations. For this reason, we have considered both homogeneous and heterogeneous infrastructure configurations.

To study the impact of the new DNC evaluation model we have designed two scenarios: (1) ScFast, where every host has a 441 GFLOPS CPU, a 115 MB/s disk and a 125 MB/s network; and (2) ScMixed, where half of the hosts have a 200 MB/s disk and the other half a 20 MB/s disk, with the same CPU and network specifications as in ScFast. For each scenario, the number of hosts ranges from 1 to 16 in powers of 2. The simulation was run on a Linux computer with the following specifications: Intel Core i7-10700K @ 3.8 GHz, 32 GB RAM at 3200 MHz, and a 1 TB SSD at 2400 MB/s.
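
The two scenarios can be written down as plain data, as in the Python sketch below. The `Host` record and the helper names are hypothetical. Two points are assumptions that the text does not specify: how "half of the hosts" is resolved for a single host (here the fast disk is rounded up), and the `remote_read_seconds` helper, which is only a contention-free bound limited by the slowest of the source disk and the network link. It is not the DNC model itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Host:
    gflops: float     # CPU speed (GFLOPS)
    disk_mbps: float  # local disk bandwidth (MB/s)
    net_mbps: float   # network interface bandwidth (MB/s)


# Number of hosts per configuration: 1, 2, 4, 8, 16.
HOST_COUNTS = [2 ** k for k in range(5)]


def sc_fast(n_hosts: int) -> List[Host]:
    """Homogeneous scenario: identical CPU, disk and network on every host."""
    return [Host(441.0, 115.0, 125.0) for _ in range(n_hosts)]


def sc_mixed(n_hosts: int) -> List[Host]:
    """Heterogeneous scenario: half fast disks (200 MB/s), half slow (20 MB/s).

    Assumption: with an odd number of hosts the extra host gets the fast disk.
    """
    n_fast = (n_hosts + 1) // 2
    return [
        Host(441.0, 200.0 if i < n_fast else 20.0, 125.0)
        for i in range(n_hosts)
    ]


def remote_read_seconds(size_mb: float, source: Host) -> float:
    """Contention-free bound for reading a file stored on `source` remotely.

    Assumption: the transfer is limited by the slowest of the source disk
    and the network link; concurrency is ignored.
    """
    return size_mb / min(source.disk_mbps, source.net_mbps)
```

Under these assumptions, a 1000 MB file read remotely from a slow-disk ScMixed host is bounded by the 20 MB/s disk, whereas from a fast-disk host it is bounded by the 125 MB/s network, which illustrates why disk heterogeneity matters for the evaluation model.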

We developed a C++ application based on the GA implemented in Barredo and Puente (2022) that contains several methods for calculating the makespan of a given workflow on a user-defined host infrastructure.

In all the experiments, the genetic and hybrid algorithms are configured identically, with an initial population of 100 individuals and 1000 generations. The crossover probability is 1.0 and the mutation probability is 0.1.

Each experiment is run 10 times per workflow instance and number of hosts in both scenarios. Each run is validated with the simulator, which provides the real makespan of that individual run.
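
The experimental grid described above can be enumerated as in the following Python sketch. The dictionary keys, the function name and the instance labels are hypothetical placeholders; only the numeric settings come from the text.

```python
from itertools import product
from typing import Dict, Iterable, Iterator

# Settings stated in the text; key names are illustrative.
GA_CONFIG = {
    "population_size": 100,
    "generations": 1000,
    "crossover_probability": 1.0,
    "mutation_probability": 0.1,
}
RUNS_PER_CONFIGURATION = 10
SCENARIOS = ("ScFast", "ScMixed")
HOST_COUNTS = (1, 2, 4, 8, 16)


def experiment_grid(instances: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one record per (instance, scenario, hosts, run) combination."""
    for instance, scenario, hosts, run in product(
        instances, SCENARIOS, HOST_COUNTS, range(RUNS_PER_CONFIGURATION)
    ):
        yield {
            "instance": instance,
            "scenario": scenario,
            "hosts": hosts,
            "run": run,
        }


# Placeholder instance labels, not names from Table 1.
runs = list(experiment_grid(["instance-a", "instance-b"]))
```

Each workflow instance therefore produces 2 scenarios × 5 host counts × 10 repetitions per algorithm, and every one of those runs is checked against the simulator.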

4.3 Accuracy study

In this section the accuracy of both evaluation models is studied. Table 2 shows the behaviour of the NC and DNC models using heuristic (HEFT) and metaheuristic (GA) optimization algorithms in both scenarios (ScFast and ScMixed). Problems are presented in ascending order of the average communication to computation ratio (\(\overline{ccr}\)) of their instances. The makespan accuracy values presented are the worst case obtained for the solutions over all instance sizes of the problem and over the diverse set of hosts considered.

Under the NC model, the problems identified as compute-intensive in Coleman et al. (2022) (Cycles and Montage) should, a priori, achieve higher accuracy than the data-intensive ones (1000genome, Seismology, Epigenomics, SRASearch and SoyKB). However, the actual results show that as \(\overline{ccr}\) increases, meaning a higher communication load in the problem, accuracy decreases notably. In the ScFast scenario, accuracy drops below \(82\%\) (HEFT) and \(81\%\) (GA) for 1000genome, the problem with the worst results in the benchmark. This reduction is even more dramatic in the ScMixed scenario, where accuracy drops below \(48\%\) (HEFT) and \(43\%\) (GA), again for the 1000genome problem.

In contrast, the DNC model achieves remarkably high accuracy (\(\ge 97\%\)) in all scenarios and for almost all problems, independently of their \(\overline{ccr}\). The only exception is the Montage problem, where accuracy drops below \(95\%/91\%\) (HEFT) and \(96\%/93\%\) (GA) in the ScFast/ScMixed scenarios, respectively. The highly concurrent communications of Montage could explain these results, because neither of the two models is designed to handle this issue.

These results reinforce the decision to adopt DNC as the evaluation model for scheduling scientific workflow graphs in an IaaS environment, independently of the hardware configuration.

Table 2 Accuracy of the NC and DNC models. DNC gets superior accuracy in all configurations, as we can observe

Now, using only DNC Model, Tables 3 and 4 summarise the worst accuracy levels of the different proposed genetic and hybrid algorithms evolving from random and heuristic initial population. The accuracy levels do not vary with respect to the pattern shown in the simple heuristic and genetic algorithms. All values are above \(97\%\), again with the sole exception of Montage, the benchmark’s high concurrency problem, where accuracy drops below \(91\%\) and \(95\%\) in the worst cases of the ScMixed and ScFast scenarios respectively. In conclusion, the different algorithms presented seem to solve all instances in each scenario studied with high accuracy when using the DNC model, as opposed to the NC model, which gives significantly worse results in heterogeneous infrastructures.

Table 3 Makespan accuracy, worst value found for each problem is shown (in bold best value for each problem), using DNC model, of genetic and hybrid algorithms with random and heuristic initial population in ScFast infrastructure
Table 4 Makespan accuracy, worst value found for each problem is shown (in bold best value for each problem), using DNC model, of genetic and hybrid algorithms with random and heuristic initial population in ScMixed infrastructure

4.4 Makespan optimization study

In this section we study the efficiency of the different proposed algorithms when using the new evaluation model to optimise the workflow completion time. Instead of directly comparing makespan, which varies widely with the dimensions of each workflow, we use the makespan percentage error (MPE) as a performance metric:

$$\begin{aligned} MPE = \frac{ makespan_{HEFT_{NC}} - makespan }{makespan_{HEFT_{NC}}} \end{aligned}$$
(12)

Here, makespan refers to the completion time of the solution being evaluated, and the denominator is the makespan of the solution obtained by the HEFT heuristic using the previous NC model as a quality baseline, which is the computationally least expensive, but not necessarily the least accurate, of the methods studied in this work. Positive MPE values mean better-quality solutions and negative values mean worse results, both with respect to HEFT. Tables 5, 6, 7, 8 and 9 show average MPE results for all instances of each problem and infrastructure scenario.
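
Eq. (12) translates directly into code. The following Python sketch uses hypothetical function names, and the makespans in the usage lines are invented for illustration. It computes the MPE of one solution and the average over several instances.

```python
from statistics import mean
from typing import Iterable, Tuple


def mpe(makespan: float, makespan_heft_nc: float) -> float:
    """Makespan percentage error of Eq. (12), relative to the HEFT/NC baseline.

    Positive values mean the evaluated solution is faster than the baseline.
    """
    if makespan_heft_nc <= 0:
        raise ValueError("baseline makespan must be positive")
    return (makespan_heft_nc - makespan) / makespan_heft_nc


def average_mpe(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average MPE over (makespan, baseline makespan) pairs, one per instance."""
    return mean(mpe(m, base) for m, base in pairs)


# Invented makespans in seconds, for illustration only.
better = mpe(90.0, 100.0)    # solution 10% faster than the baseline
worse = mpe(120.0, 100.0)    # solution 20% slower than the baseline
overall = average_mpe([(90.0, 100.0), (120.0, 100.0)])
```

Because the metric is normalised by the baseline makespan, it can be averaged across workflows whose absolute completion times differ by orders of magnitude.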

Table 5 Summary of results from HEFT and GA methods using both evaluation models in ScFast scenario

Firstly, we study the behaviour of the plain GA using the NC model and of the corresponding heuristic and genetic methods using the DNC model. The average MPE results of each problem in the ScFast scenario are summarised in Table 5. \(GA_{NC}\) is more efficient than \(HEFT_{NC}\) in Montage, Seismology and Epigenomics, except in the configuration with the maximum number of hosts in the last two problems, but it is worse on the remaining four problems, mainly in configurations with a high number of hosts. Using the DNC model, \(HEFT_{DNC}\) globally shows a marginal improvement in all problems, the exception being Montage with the highest number of hosts, where it obtains a substantial improvement. \(GA_{DNC}\) performs better on 6/7 problems, but fails on the configurations with the maximum number of hosts.

In the ScMixed scenario (Table 6), \(GA_{NC}\) again improves MPE on the same 3 problems. However, \(HEFT_{DNC}\) and \(GA_{DNC}\) obtain substantial improvements in 7 and 6 of the 7 problems, respectively, mainly in the most complex host configurations (16 hosts).

Table 6 shows that, in general, the algorithms using the DNC model improve on the previous model and that, even in the more complex ScMixed scenario with a higher number of hosts, these differences are maintained (1000Genome, Cycles and Seismology) or even increased (Epigenomics, Montage and SRASearch). However, in the instances of the SoyKB problem the differences between the two models decrease as the number of hosts increases. This is due to the combination of a graph topology with a significant number of high in-degree nodes (i.e. tasks with data dependencies on a high number of predecessors) and a huge volume of data to be transferred. This combination generates high concurrency with a fast and long-lasting saturation of the communication channels, both disk and network. In these cases, the new model has less room for improvement. Due to the structural nature of these differences, these conclusions also apply to the results obtained with the hybrid versions of the algorithms, as can be seen in Table 9.

Table 6 Summary of results from HEFT and GA methods using both evaluation models in ScMixed scenario
Table 7 Summary of the statistical study comparing the makespan quality of the heuristic and genetic proposed methods using both evaluation models
Table 8 Summary of results from Hybridization methods and inclusion of HEFT solution in the initial population using both evaluation models in ScFast scenario
Table 9 Summary of results from Hybridization methods and inclusion of HEFT solution in the initial population using both evaluation models in ScMixed scenario
Table 10 Summary of the statistical study comparing the makespan quality of the DNC versions of heuristic, genetic and hybrid proposed methods using random and heuristic initial population

A statistical analysis using the non-parametric Friedman test for paired samples (at the standard 0.05 significance level) indicates significant differences among the studied methods (p-value \(< 2.2 \times 10^{-16}\)). A Bonferroni post-hoc analysis reveals that both the genetic and the heuristic DNC versions are statistically significantly better than the NC versions; Table 7 summarises the statistical results.
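
For readers unfamiliar with the test, the following Python sketch computes the Friedman statistic from a table whose rows are paired samples (for instance, one workflow configuration each) and whose columns are the compared methods. It is a textbook formulation written for illustration, not the code used to produce Table 7: it omits the tie correction and the p-value, which is obtained by referring the statistic to a chi-square distribution with \(k - 1\) degrees of freedom. The Bonferroni helper assumes that all pairs of methods are compared. All names and data are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upwards, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        tied_rank = (pos + end) / 2 + 1  # mean of positions pos..end, 1-based
        for i in order[pos:end + 1]:
            ranks[i] = tied_rank
        pos = end + 1
    return ranks


def friedman_statistic(blocks: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square statistic for n blocks (rows) and k methods (columns)."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        for j, rank in enumerate(average_ranks(block)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-comparison significance level when all k*(k-1)/2 pairs are tested."""
    return alpha / (k * (k - 1) / 2)


# Invented MPE values: rows are configurations, columns are three methods.
sample = [
    [0.10, 0.20, 0.30],
    [0.00, 0.15, 0.40],
    [0.05, 0.10, 0.20],
    [-0.10, 0.00, 0.30],
]
q = friedman_statistic(sample)
```

In this toy table the third method ranks best in every row, so the statistic reaches its maximum value \(n(k-1)\); identical columns would give a statistic of zero.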

Regarding the hybrid proposals, now using only the DNC model, Tables 8 and 9 show MPE results for the ScFast and ScMixed scenarios, respectively. Globally, while \(HGA_{Ph1}\) is unable to achieve better results than the genetic and heuristic methods, \(HGA_{Ph2}\) outperforms both previous methods in all problems, with the only exception of the compute-intensive problem Cycles, where the hybrid method is still unable to improve on the HEFT heuristic.

Finally, in order to exploit the synergy between the genetic and heuristic approaches, we have introduced the HEFT solution into the initial population of the proposed genetic (\(GA^H\)) and hybrid algorithms (\(HGA_{Ph1}^H\) and \(HGA_{Ph2}^H\)). As a result, \(HGA_{Ph2}^H\) is the method with the best solutions in all problems and scenarios. It is only outperformed by the genetic algorithm with heuristic population (\(GA^H\)) on the SRASearch and Montage problems in the ScFast scenario with the two-host configuration. A new Friedman test shows that there are significant differences among the different methods using the DNC model (p-value \(< 2.2 \times 10^{-16}\)). Table 10 summarises the results of the Bonferroni post-hoc tests, which show the significant superiority of hybridising the genetic algorithm with the scheduling phase of HEFT and using a heuristic initial population (\(HGA_{Ph2}^{H}\)).

5 Conclusions and future work

In this paper, we have presented the challenges of calculating the makespan of a scientific workflow and how simplifications made for the sake of computational speed can affect the precision of the estimate. In the pay-per-use model of cloud computing, an inaccurate makespan estimate leads to misleading decisions about the cost and time of the hired infrastructure. We have analysed the benefits of using a more appropriate computational model (DNC) to improve the accuracy of the makespan estimates generated by the optimisation algorithms. We have introduced several improvements: (1) in this context, Lamarckian evolution has improved the makespan quality of the solutions of the hybrid evolutionary algorithms; (2) different hybridisations of a well-known heuristic, HEFT, with the proposed genetic algorithm have been developed to improve the makespan, with \(HGA^H_{Ph2}\) being the one with the best solutions; (3) an accuracy measure has been introduced and applied to study its correlation with the workflow problem typology, with respect to the communication to computation ratio (CCR), in different cloud IaaS scenarios and using current real-world scientific workflow problems. We have observed that cloud infrastructure simulators include additional features, such as concurrency and saturation levels of disk and network transfer channels, which depend on the underlying hardware being simulated, but their complexity and computational cost prevent their effective use in scheduler evaluation models. For this reason, we want to use these simulator features to extend the DNC model and improve the performance of our solutions on mixed-feature workflows, such as those in the Montage problem.

In testing different hybridisation methods, we found that each of the proposed algorithms works best on some kind of workflow problem. For this reason, in future work we want to explore the development of new population-based evolutionary algorithms that apply different heuristic decoding methods to each solution and let the best estimate guide the evolution in a multi-decoding approach. In addition, we aim to experiment in a multi-objective context to study the impact of this new evaluation model, as well as to design new memetic proposals (Tang and Pan 2015; Zuo et al. 2017; Mencía et al. 2022) which combine intensive search with an appropriate exploration-exploitation balance (Guo et al. 2020; Lou et al. 2021; Osuna-Enciso et al. 2022) to minimise other valuable objectives in parallel, such as cost or energy consumption.