Introduction

Cloud computing evolved from the concept of utility computing, which is defined as the provision of computational and storage resources as a metered service, similar to traditional public utility companies [92]. This concept reflects the need of modern information technology environments to dynamically increase capacity or add capabilities while minimizing the money and time invested in purchasing new infrastructure.

Another key characteristic of cloud computing is multitenancy, which enables resource and cost sharing among a large pool of users [91]. This leads to the centralization of the infrastructure and consequent reduction of costs due to economies of scale [123]. Moreover, the consolidation of resources leads to an increased peak-load capacity as each customer has access to a much larger pool of resources (although shared) compared to a local cluster of machines. Resources are more efficiently used, especially considering that in a local setup they often are underutilized [45]. In addition, multitenancy enables dynamic allocation of these resources which are monitored by the service provider.

Characteristics such as multitenancy and elasticity perfectly fit the requirements of modern data-intensive research and scientific endeavors [28]. These requirements are associated with the continuously increasing power of computing and storage resources, which in many cases are required on demand for specific phases of an experiment, therefore demanding elastic scaling. This motivates the utilization of clouds by scientific researchers as an alternative to using in-house resources [22].

In parallel, as science becomes more complex and relies on the analysis of very large data sets, data management and processing must be performed in a scalable and automated way. Workflows have emerged as a way to formalize and structure data analysis, execute the required computations using distributed resources, collect information about the derived data products, and repeat the analysis if necessary [115]. Workflows enable the definition and sharing of analysis and results within scientific collaborations. In this sense, scientific workflows have become an increasingly popular paradigm for scientists to handle complex scientific processes [150], enabling and accelerating scientific progress and discoveries.

Scientific workflows, like other computer applications, can benefit from virtually unlimited resources with minimal investment. With such advantages, workflow scheduling research has thus shifted to workflow execution in the cloud [111], providing a paradigm-shifting utility-oriented computing environment with unprecedented size of data center resource pools and on-demand resource provisioning [150], enabling scientific workflow solutions to address petascale problems.

One of the key enablers of this conjunction of cloud computing and scientific workflows is resource management [6], which includes resource provisioning, allocation, and scheduling [72]. Even small provisioning inefficiencies, such as failure to meet workflow dependencies on time or selecting the wrong resources for a task, can result in significant monetary costs [22, 135]. Provisioning the right amount of storage and compute resources leads to decisive cost reduction with no substantial impact on application performance.

Consequently, cloud resource management for workflow execution is a topic of broad and current interest [127]. Moreover, there is little research on scheduling workflows in real cloud environments, and even fewer cloud workflow management systems, both requiring further academic study and industrial practice [127]. Workflow scheduling for commercial multicloud environments, for instance, is still an open issue to be addressed [32]. In addition, data transfer between tasks is not directly considered in most existing studies; it is assumed to be part of task execution. However, this is not the case for data-intensive applications [127], especially those of the big data era, in which data movement can dominate both execution time and cost.

Objectives and contributions

This paper surveys over 110 publications on cloud resource management solutions, including resource provisioning and task scheduling. The publications were selected from conferences and journals using a systematic search methodology. Our contributions include the definition of a taxonomy used to classify and analyze the publications. The taxonomy was created based on the typical aspects covered by cloud resource management solutions, such as makespan and cost, as well as on aspects pointed out by existing works as future challenges for the area, such as reliability and data-intensive loads. Our analysis shows that little to no work is found for specific areas, such as security and dynamic allocation of resources, especially when combined with other aspects such as complex infrastructures and workflow execution. Finally, by applying the proposed taxonomy to the selected publications, we provide a quantitative assessment of existing solutions, highlighting the future challenges for the execution of large-scale applications on cloud infrastructures.

Document organization

This paper is organized into five main sections. The first section, Concepts and definitions, presents the concepts related to cloud resource management, including several definitions and their consolidation. The second section, Resource management taxonomy, presents the taxonomy created to analyze the references selected for the survey. The third section, Survey, presents the results of the survey, including further analysis of specific works to identify gaps and challenges in the field. The fourth section, Gaps and challenges, presents the gaps and challenges to be addressed by future research. The fifth and last section, Conclusion, presents the conclusion of this work, focusing on the four main problems to be solved in cloud computing resource management.

Concepts and definitions

Cloud computing is a model for enabling on-demand self-service network access to a shared pool of elastic configurable computing resources [76]. The model is driven by economies of scale to reduce costs for users [36] and to allow offering resources in a pay-as-you-go manner, thus embodying the concept of utility computing [7, 8].

In its inception, cloud computing revolved around virtualization as the main resource compartmentalization and consolidation strategy [63, 85], both to support application isolation and platform customization to suit user needs [17, 18] and to enable pooling and dynamically assigning computing resources from clusters of servers [147]. The significant performance improvements and overhead reductions of virtualization technology [81] propelled its adoption as a key delivery technology in cloud computing [24]. Nevertheless, developments in Linux Containers and associated technologies [34, 77] led to the implementation of cloud platforms using lightweight containers [44] such as Docker [66, 110], which impose a smaller overhead than virtual machines because containers only replicate the libraries and binaries of the virtualized application [53].

Resource management in a cloud environment is a challenging problem due to the scale of modern data centers, the heterogeneity of resource types, the interdependency between these resources, the variability and unpredictability of the load, and the range of objectives of the different actors in a cloud ecosystem [52]. Moreover, resource management comprises different stages for both resources and workloads. Due to its importance as a fundamental building block for cloud computing, several definitions and concepts are found in the literature. The next subsections explore these definitions and provide a consolidated view of cloud resource management.

Singh and Chana

For [108], resource management in the cloud comprises three functions: resource provisioning, resource scheduling, and resource monitoring.

Resource provisioning is defined by the authors as the stage that identifies the adequate resources for a particular workload based on quality of service (QoS) requirements defined by cloud consumers. This stage includes the discovery of resources and also their selection for executing a workload. The provisioning of appropriate resources to cloud workloads depends on the QoS requirements of cloud applications [21]. In this sense, the cloud consumer interacts with the cloud via a cloud portal and submits the QoS requirements of the workload after authentication. The Resource Information Centre (RIC) contains the information about all the resources in the resource pool and returns candidate resources based on the workload requirements specified by the user. The user requirements and the information provided by the RIC are used by the Resource Provisioning Agent (RPA) to check the available resources. After the provisioning of resources, the workloads are submitted to the resource scheduler. Finally, the Workload Resource Manager (WRM) sends the provisioning results (resource information) to the RPA, which forwards these results to the cloud user.
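A minimal sketch of this provisioning flow is given below. The class and method names only mirror the roles described above (RIC, RPA) and are illustrative, not an actual API; the selection criterion (cheapest candidate that satisfies the QoS) is an assumption made for the example.

```python
# Illustrative sketch of the provisioning flow described by Singh and Chana.
# All names and signatures are hypothetical; they only mirror the roles above.

class ResourceInformationCentre:
    def __init__(self, resource_pool):
        self.resource_pool = resource_pool

    def candidates_for(self, qos):
        # Discovery: resources from the pool that satisfy the workload's QoS needs.
        return [r for r in self.resource_pool if r["capacity"] >= qos["min_capacity"]]


class ResourceProvisioningAgent:
    def __init__(self, ric):
        self.ric = ric

    def provision(self, workload):
        # Selection: pick the cheapest candidate that meets the QoS requirements.
        candidates = self.ric.candidates_for(workload["qos"])
        if not candidates:
            raise RuntimeError("no resource satisfies the QoS requirements")
        return min(candidates, key=lambda r: r["cost_per_hour"])

# After provisioning, the workload is handed to the resource scheduler, and the
# Workload Resource Manager reports the provisioning result back to the cloud user.
```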

Resource scheduling is defined as the mapping, allocation, and execution of workloads based on the resources selected in the resource provisioning phase [109]. Mapping workloads refers to selecting the appropriate resources based on the QoS requirements specified by the user in terms of SLAs, for instance to minimize cost and execution time. The process of finding the list of available resources is referred to as resource detection, while resource selection is the process of choosing the best resource from the list generated by resource detection, based on the SLA.

Resource monitoring is a complementary phase to achieve better performance optimization. In terms of service level agreements (SLA) both parties (provider and consumer) must specify the possible deviations to achieve appropriate quality attributes. For successful execution of a workload the observed deviation must be less than the defined thresholds. In this sense, resource monitoring is used to take care of important QoS requirements like security, availability, and performance. The monitoring steps include checking the workload status and verifying if the amount of required resources (RR) is larger than the amount of provided resources (PR). Depending on the result more resources are demanded by the scheduler. On the other hand, based on this result the resources can also be released, freeing them for other allocations. Consequently, the monitoring phase also controls the rescheduling activities.
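The monitoring decision described above reduces to comparing the required (RR) and provided (PR) resource amounts; a minimal sketch of one monitoring iteration, with an illustrative scheduler interface, is shown below.

```python
def monitor_step(required_resources, provided_resources, scheduler):
    """One monitoring iteration as described above: compare the required (RR)
    and provided (PR) resource amounts and either demand or release resources.
    The scheduler object and its demand/release methods are illustrative."""
    if required_resources > provided_resources:
        # Deviation beyond the acceptable threshold: ask the scheduler for more.
        scheduler.demand(required_resources - provided_resources)
    elif required_resources < provided_resources:
        # Surplus resources can be released, freeing them for other allocations.
        scheduler.release(provided_resources - required_resources)
    # Equal amounts: no rescheduling action is needed.
```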

Jennings and Stadler

For [52], resource management is the process of allocating computing, storage, networking, and energy resources to a set of applications in order to meet the performance objectives and requirements of the infrastructure providers and the cloud users. On one hand, the objectives of the providers are related to efficient and effective resource utilization within the constraints of SLAs. The authors claim that efficient resource use is typically achieved through virtualization technologies, which facilitate the multiplexing of resources across customers. On the other hand, the objectives of cloud users tend to focus on application performance and availability, as well as on the cost-effective scaling of available resources based on application demands.

The cloud provider is responsible for monitoring the utilization of compute, networking, storage, and power resources, as well as for controlling this utilization via global and local scheduling processes. In parallel, the cloud user monitors and controls the deployment of its applications on the virtual infrastructure. Cloud providers can dynamically alter the prices charged for leasing the infrastructure while cloud users can alter the costs by changing application parameters and usage levels. However, the cloud user has limited responsibility for resource management, being constrained to generating workload requests and controlling where and when workloads are placed.

The authors distinguish the role of the cloud user from that of the end user. The end user generates the workloads that are processed using cloud resources. The cloud user actively interacts with the cloud infrastructure to host applications for end users. In this sense, the cloud user acts as a broker, thus being responsible for meeting the SLAs specified by the end user. Moreover, the cloud user is mostly interested in meeting these requirements in a manner that minimizes its own costs of leasing the cloud infrastructure (from the cloud provider) while maximizing its profits.

From a functional perspective, the end user initiates the process by providing one or more workload requests to the workload scheduling component. The requests are relayed to the workload management component provided by the cloud user (broker). The application is submitted to a profiling process that dynamically defines the pricing characteristics, also defining the metrics to be monitored during execution and the objectives (SLAs) to be observed. The cloud user defines the provisioning to be obtained from the cloud provider. The provider receives the requests via a global provisioning and scheduling component that also profiles the requests in order to determine the pricing attributes (this time from cloud provider to cloud user). Moreover, the application is characterized in order to obtain monitoring metrics and objectives from the cloud provider's point of view. Finally, the global provisioning and scheduling element submits requests to the local handler, which estimates the resource utilization and executes the workloads.

Manvi and Shyam

For [72] resource management comprises nine components:

  • Provisioning: Assignment of resources to a workload.

  • Allocation: Distribution of resources among competing workloads.

  • Adaptation: Ability to dynamically adjust resources to fulfill workload requirements.

  • Mapping: Correspondence between resources required by the workload and resources provided by the cloud infrastructure.

  • Modeling: Framework that helps to predict the resource requirements of a workload by representing the most important attributes of resource management, such as states, transitions, inputs, and outputs within a given environment.

  • Estimation: Prediction of the actual resources required for executing a workload.

  • Discovery: Identification of a list of resources that are available for workload execution.

  • Brokering: Negotiation of resources through an agent to ensure their availability at the right time to execute the workload.

  • Scheduling: A timetable of events and resources, determining when a workload should start or end depending on its duration, predecessor activities, predecessor relationships, and resources allocated.

The authors did not explicitly define the roles or actors related to cloud management activities. The implicit roles in this sense are the cloud provider (responsible for managing the cloud infrastructure) and the cloud user (interested in executing one or more workloads on the cloud infrastructure). QoS is regarded as a fundamental part of the resource management premises. In contrast, SLAs are not explicitly defined as a building block for resource management tasks.

Other definitions

For [80], resource management is a process that deals with the procurement and release of resources. Moreover, resource management provides performance isolation and efficient use of the underlying hardware. The authors state that the main research challenges and metrics of resource management are energy efficiency, SLA violations, load balancing, network load, profit maximization, hybrid clouds, and mobile cloud computing. No specific remarks about cloud roles or quality of service are made, although the solutions covered by the survey might present QoS-related aspects.

For [75], resource management is a core function of cloud computing that affects three aspects: performance, functionality, and cost. In this sense, cloud resource management requires complex policies and decisions for multi-objective optimization. These policies are organized in five classes: admission control, capacity allocation, load balancing, energy optimization, and quality of service guarantees. The admission policies prevent the system from accepting workloads in violation of high-level system policies (e.g., a workload that might prevent others from completing). Capacity allocation comprises the allocation of resources for individual instances. Load balancing and energy optimization can be done either locally or globally, and both are correlated to cost. Finally, quality of service is related to addressing requirements and objectives concerning users and providers. SLA aspects are not explicitly considered in this set of policies.

For [125], resource management is related to predicting the amount of resources that best suits each workload, enabling cloud providers to consolidate workloads while maintaining SLAs.

For [69], resource management comprises two main activities: matching, which is the process of assigning a job to a particular resource; and scheduling, which is the process of determining the order in which jobs assigned to a particular resource should be executed.

Intercorrelation and consolidation

Table 1 presents a summary of the resource management definitions. The table lists the works analyzed for the study of resource management definitions, a summary of the viewpoint of each work, the actors identified in each work, and whether aspects related to Quality of Service and Service Level Agreements are mentioned and considered. Identifying these aspects allows analyzing the similarities and disparities among the works, leading to a better understanding of the definitions.

Table 1 Summary of resource management definitions, actors, and QoS/SLA aspects considered in each definition

Some works treat resource management and resource scheduling as the same concept. For instance, [127] present a survey focusing on resource scheduling that also comprises several of the components proposed by [72], such as provisioning, allocation, and modeling.

Three definitions were selected due to their clear definition of steps and components of resource management. Table 2 provides a summary of the phases or steps proposed by each definition.

Table 2 Explicit phases or steps proposed in each definition

While the definition from [72] proposes more steps than the others, there is a natural correlation between the phases proposed by each definition. Table 3 presents the correlation between the phases from [108] and the other two. The objective of this table is to fit the steps proposed by [52] and by [72] into the steps from [108], which represents a simpler classification of resource management tasks.

Table 3 Correlation between steps defined in [52] and [72] compared to [108]

Comparing [52] to [108], the workload profiling (to assess the resource demands), pricing, and provisioning steps defined by [52] fit the provisioning step from [108], which is essentially the phase to identify the resources for a particular workload based on its characteristics and on the QoS. This includes the selection of resources to execute the workload. These aspects fit the steps of discovery, modeling, brokering, and provisioning from [72]. Note that the brokering aspect is also implicitly included in the definition from Jennings and Stadler, as they define a specific role for the brokering activity (the cloud user; the end user is the actor that has a workload to be executed in the cloud).

The scheduling phase from [108] is organized into estimation and scheduling by [52]. Manvi and Shyam [72] add an allocation step to these two. In summary, these steps represent the mapping, allocation, and execution of the workload based on the resources selected in the provisioning phase.

Finally, the monitoring phase is present in [108] and in [52]. For [72], the monitoring tasks are implicitly included in the adaptation step, which is related to dynamically adjusting resources to fulfill workload requirements. Because providing this feature requires observing both resource availability and workload conditions, this step directly relies on some form of monitoring.
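The correlation summarized in Table 3 can also be captured as a simple mapping from the three phases of [108] to the steps of the other two definitions. The sketch below only restates the correspondence described in the text; the key names are illustrative.

```python
# Sketch of the phase correlation described above (cf. Table 3).
phase_correlation = {
    "provisioning": {
        "jennings_stadler": ["workload profiling", "pricing", "provisioning"],
        "manvi_shyam": ["discovery", "modeling", "brokering", "provisioning"],
    },
    "scheduling": {
        "jennings_stadler": ["estimation", "scheduling"],
        "manvi_shyam": ["estimation", "allocation", "mapping", "scheduling"],
    },
    "monitoring": {
        "jennings_stadler": ["monitoring"],
        # Adaptation implicitly relies on monitoring resources and workloads.
        "manvi_shyam": ["adaptation"],
    },
}
```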

In terms of consolidation, the common point of all definitions is the aspect of managing the life cycle of resources and their association to the execution of tasks. This is the central governing point of cloud resource management which is independent of a specific phase of this life cycle. While it is fundamental to distinguish each phase, they all contribute to two ultimate purposes:

  • Enable task execution; and

  • Optimize infrastructural efficiency based on a set of specified objectives.

These are the key points of interest of this work, therefore comprising not only the specific task of scheduling resources (i.e., associating them with a task), but also managing resources from their initial preparation (e.g., discovery) to their utilization and distribution.

Resource management taxonomy

Because of its relevance, cloud computing resource management is a topic that has attracted not only a large body of research, but also several surveys and taxonomies. This section presents an analysis of existing taxonomies used to classify resource management solutions. Finally, we present the taxonomy proposed for classifying the works analyzed in this survey.

Relevant work

Bala and Chana [9] define nine categories to classify resource management and scheduling solutions: time, cost, scalability, scheduling success rate, makespan, speed, resource utilization, reliability, and availability. Among these categories, time, speed, and makespan are directly correlated. Resource utilization is related to the efficiency of resource usage, which is a fundamental aspect of any algorithm. Reliability and availability aspects, although defined as categories, were not identified in any of the solutions analyzed by the authors.

Sotiriadis et al. [112] classify the solutions in terms of flexibility, scalability, interoperability, heterogeneity, local autonomy, load balancing, information exposing, real-time data, scheduling history records, unpredictability management, geographical distribution, SLA compatibility, rescheduling, and intercloud compatibility. Several properties are relevant for heterogeneous environments, such as local autonomy and geographical distribution. Others are correlated, such as scalability, unpredictability management, and rescheduling.

Wu et al. [127] use nine categories to classify their references:

  • Best-effort: Optimize one objective while ignoring other factors such as QoS requirements.

  • Deadline-constrained: Scheduling based on the trade-off between execution time and monetary cost under a deadline constraint.

  • Budget-constrained: The objective is to finish a workflow as fast as possible within a given budget.

  • Multi-criteria: Several objectives are taken into account.

  • Workflow-as-a-service: Multiple workflow instances submitted to the resource manager.

  • Robust scheduling: Able to absorb uncertainties such as performance fluctuation and failure.

  • Hybrid environment: Able to address requirements of hybrid clouds.

  • Data-intensive: Data-aware workflow scheduling.

  • Energy-aware: Able to save energy while optimizing execution.

The authors also mention other properties such as makespan (which fits the Best-Effort category). Moreover, the multi-criteria category represents the convergence of several objective functions, such as cost and performance. Workflow-as-a-Service (WaaS) is the scheduling of multiple workflows onto a cloud infrastructure. Robust scheduling refers both to reliability and to performance fluctuations, both factors that can affect the performance and consequently the effectiveness of a schedule. Finally, hybrid environments, data-intensive workflows, and energy-aware scheduling represent the novel challenges in terms of cloud scheduling resource management according to the authors.

Singh and Chana [108] define a taxonomy based on twelve properties:

  • Cost-based: Organized in multi-QoS, virtualization-based, application-based, and scalability-based.

  • Time-based: Organized in deadline-based and combination of deadline and budget.

  • Compromised Cost-Time: Based either on workflows or workloads.

  • Bargaining-based: Organized in market-oriented, auction, and negotiation.

  • QoS-based: Based on several QoS aspects, including security and resource utilization.

  • SLA-based: Based on several SLA types, including workload and autonomic aspects.

  • Energy-based: Combined with deadlines and SLAs.

  • Optimization-based: Optimization of several combinations of parameters.

  • Nature Inspired and Bio-Inspired: Including genetic algorithms and ant colony approaches.

  • Dynamic: Several combinations of aspects with dynamic management.

  • Rule-based: Special cases for failures and hybrid clouds.

  • Adaptive-based: Prediction-based and Bin-Packing strategies.

Several of the categories have direct correlations, and some are used to combine the aspects covered in other categories, such as optimization-based and the dynamic category.

Proposed taxonomy

The consolidated taxonomy focuses on addressing the requirements of heterogeneous environments composed of multiple infrastructures (e.g., hybrid clouds and multicloud scenarios), with data-intensive workflows and a high degree of dynamic behavior. Properties from prior work were selected by identifying the commonalities between the works analyzed and also based on the future challenges for large-scale execution of applications and workflows, such as data-intensive workflows, hybrid and multicloud scenarios, performance fluctuation, and reliability. The resulting categories are the following:

  • Makespan/Time: encompasses all aspects related to run time and time-based optimization.

  • Deadline: encompasses aspects also related to time but associated to predefined limits to finish a workflow – the central idea is not to finish the execution of a workflow as fast as possible, but simply to address a specific deadline and possibly save resources (i.e., reduce resource allocation) as long as the deadline is met.

  • Cost/Budget: encompasses all aspects related to financial cost and benefits, such as cost minimization and budget limitation.

  • Data-Intensive: works that effectively encompass one or more aspects inherent to data-intensive workflows.

  • Dynamic: works that employ some form of dynamic mechanism to continuously adjust the scheduling decision. This is a typical method to address issues related to unpredictability, such as performance fluctuation.

  • Reliability: works that encompass some form of reliability-related aspect, such as selecting nodes in a way to minimize the chances of failure, or providing mechanisms to circumvent failures.

  • Security: works that consider any aspect of security (in the sense of confidentiality).

  • Energy: energy-aware scheduling mechanisms.

  • Hybrid/Multicloud: works that address requirements of hybrid clouds and multicloud scenarios.

  • Workload/Workflow: works that address requirements for scheduling workflows on clouds.

Compared to the other taxonomies, the proposed one encompasses some of the fundamental properties connected to the QoS components that govern the scheduling decisions, such as makespan, cost, deadline, energy, etc. These properties are fully or at least partially covered by the other taxonomies, such as [9], with cost, makespan, and reliability; [112], with unpredictability management (closely related to dynamic properties and reliability) and rescheduling; [127], with deadline, budget, reliability, and energy; and [108], with cost, time, and energy. In addition, the proposed taxonomy encompasses some of the attributes of interest to this work, such as hybrid and multicloud aspects, and workflow resource management.
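For the survey that follows, each work is classified against these ten categories. A minimal sketch of such a classification record is shown below; the field names mirror the taxonomy, and the three coverage levels anticipate those defined in the next section. This is an illustrative structure, not part of any cited work.

```python
from dataclasses import dataclass
from enum import Enum


class Coverage(Enum):
    NOT_ADDRESSED = 0
    PARTIALLY_ADDRESSED = 1
    FULLY_ADDRESSED = 2


@dataclass
class WorkClassification:
    """One row of the survey table: a publication scored against the taxonomy."""
    reference: str
    makespan_time: Coverage = Coverage.NOT_ADDRESSED
    deadline: Coverage = Coverage.NOT_ADDRESSED
    cost_budget: Coverage = Coverage.NOT_ADDRESSED
    data_intensive: Coverage = Coverage.NOT_ADDRESSED
    dynamic: Coverage = Coverage.NOT_ADDRESSED
    reliability: Coverage = Coverage.NOT_ADDRESSED
    security: Coverage = Coverage.NOT_ADDRESSED
    energy: Coverage = Coverage.NOT_ADDRESSED
    hybrid_multicloud: Coverage = Coverage.NOT_ADDRESSED
    workload_workflow: Coverage = Coverage.NOT_ADDRESSED
```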

Survey

The method used to identify the surveys and other related work is based on searches performed in the following engines: IEEE Xplore, ACM Digital Library, ScienceDirect, Scopus, and Google Scholar. Moreover, two main search queries were used: “cloud scheduling survey” and “cloud resource management survey” (both without quotes). Some results were immediately discarded, such as those addressing mobile cloud computing or other specific scenarios, such as Internet of Things and sensor networks. The focus of this analysis is to identify the surveys and taxonomies for cloud computing resource management covering five aspects: data-intensive loads, dynamic management, reliability, hybrid/multicloud scenarios, and workflow management. Works that do not cover at least one of these topics were not further analyzed, unless they represent solutions that led to the creation of others that do cover these aspects, such as DCP [57] and HEFT [117]. This led to the selection of 113 works related to resource management and task scheduling, with the majority focusing on cloud computing and a few works on distributed systems, such as [51] and [105]. Table 4 shows the works, their highlights (a very brief summary of contributions or main aspects addressed), and whether each category of the taxonomy was addressed or not. For each category three levels were considered:

  • Fully addressed: The work provides a solution that focuses on addressing the specific aspect, with clear mechanisms to cover it and potentially with experiments showing the effectiveness. For instance, [15] explicitly defines mechanisms to address the requirements of hybrid clouds.

  • Partially addressed: The work provides mechanisms that could be used to address the specific aspect, even if not explicitly mentioned in the work. For instance, [42] does not directly address deadline and cost aspects, but the solution proposed could be used to cover them with slight operational modifications.

  • Not addressed: The work does not address the aspect.

Table 4 Summary of identified related work classified using the consolidated taxonomy

The majority of the works focus on aspects related to cost and time, such as makespan- and deadline-based solutions. Among them, makespan is addressed by 44 works (39%), deadlines are addressed by 31 works (27%), and cost is addressed by 43 works (38%). In contrast, none of the solutions address security aspects related to confidentiality, such as safe zones to execute code and to store sensitive data.

Regarding support for workflows and workloads, 64 works (57%) provide some level of support for executing workflows using the proposed resource management solution. However, when combined with aspects related to dynamic placement and replacement of resources and tasks, only 19 (17%) provide support for both aspects (dynamic execution of workflows). Combining workflow support with data-intensive workflows leads to only 8 works (7%). Finally, combining workflow support with hybrid and multicloud scenarios, only 2 works (2%) address both aspects. None of the works combine workflow support, data-intensive loads, hybrid and multicloud scenarios, dynamic scheduling and rescheduling, and reliability aspects.

Data-intensive loads are explicitly supported by only 9 works (8%). Hybrid and multicloud scenarios are supported by 7 works (6%). This analysis reveals that while there are works addressing these aspects separately, none provides explicit support for all the aspects of interest regarded as challenges for future deployments.
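The percentages reported in this section follow directly from the counts over the 113 selected works; a trivial sketch of the computation is included below for reproducibility (the counts are the ones stated in the text).

```python
def coverage_share(addressed_count, total_works=113):
    """Share of surveyed works addressing a taxonomy category."""
    return 100.0 * addressed_count / total_works

# Figures reported above (rounded to the nearest percent):
# coverage_share(44)  -> ~39%  (makespan)
# coverage_share(31)  -> ~27%  (deadline)
# coverage_share(43)  -> ~38%  (cost)
# coverage_share(9)   -> ~8%   (data-intensive)
# coverage_share(7)   -> ~6%   (hybrid/multicloud)
```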

Further analysis

This subsection presents the works that were selected for further analysis to identify gaps and future challenges for cloud resource management regarding the execution of large-scale applications and workflows. The analysis of these works is summarized by Table 5.

Table 5 Summary of further analysis

Pandey et al. [82] propose a heuristic based on PSO that considers both computation and data transmission costs. The workflow is modeled as a DAG. Transfer cost is calculated according to the bandwidth between sites. The average cost of communication between two resources is applied only when two tasks have a file dependency between them. For two or more tasks executing on the same resource, the communication cost is assumed to be zero. This implies no cost for sequential accesses to a file (e.g., the input file) and a rather uniform distribution of content among nodes. On the other hand, for a data-intensive workflow with large inputs and several I/O-heavy intermediary phases, even the cost of accessing resources on the same node cannot be overlooked. In terms of dynamic scheduling, the authors claim that when it is not possible to assign tasks to resources due to resource unavailability, the recomputation phase of PSO dynamically balances other tasks’ mappings. However, there is no explicit mention of dynamic (re)scheduling based on other aspects, such as performance fluctuations and reliability issues. Workflow support is limited to the usual DAG-based description, wherein the computation cost of a task on a compute host is assumed to be known and edges represent the communication among phases. This representation omits relevant information about the workflow, such as performance fluctuation due to branches and other logic, requirements related to memory and local storage, and the actual performance observed when executing one of the phases on a node.
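A sketch of the cost model implied by this description is shown below; it illustrates the assumptions discussed, not the authors' code, and the parameter names are illustrative.

```python
def transfer_cost(data_size, src_resource, dst_resource, avg_bandwidth):
    """Communication cost as described for the PSO heuristic of Pandey et al.:
    applied only between dependent tasks, computed from the average bandwidth
    between the two resources, and zero when both tasks share a resource."""
    if src_resource == dst_resource:
        # Intra-resource transfers are assumed free, which the text argues is
        # unrealistic for I/O-heavy, data-intensive workflow phases.
        return 0.0
    return data_size / avg_bandwidth[(src_resource, dst_resource)]
```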

Lin and Lu [64] propose an algorithm named SHEFT, Scalable HEFT (Heterogeneous Earliest Finish Time). The authors claim that resources within one cluster usually share the same network infrastructure, so they have the same data transfer rate with each other. While there might be network utilization fluctuations during the execution of a workflow (and even in the idle state) that invalidate this assumption, the fact is that even locally (in the same node) there is data access imbalance due to contention – concurrent access to the same resources, in this case I/O. For example, if two containers (or virtual machines) located in the same node attempt to access a file or a network stream, they will naturally compete for resources. There is no clear support for dynamic scheduling to address reliability-related issues or performance fluctuations. The solution supports workflows, but there are no details on how workflows are modeled or mapped into the execution space.
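The uniform-transfer-rate assumption can be made explicit with a simplified HEFT-style earliest-finish-time computation. The sketch below is not the SHEFT algorithm; it only illustrates how a single shared rate removes data location and contention from the communication term. The workflow and schedule interfaces are hypothetical.

```python
def earliest_finish_time(task, host, schedule, workflow, shared_rate):
    """Simplified HEFT-style earliest finish time under the assumption
    criticized above: every pair of hosts shares the same transfer rate, so
    data location and I/O contention never change the communication term."""
    ready = 0.0
    for parent, data_size in workflow.parents(task):
        comm = 0.0 if schedule.host_of(parent) == host else data_size / shared_rate
        ready = max(ready, schedule.finish_time(parent) + comm)
    start = max(ready, schedule.host_available(host))
    return start + workflow.execution_cost(task, host)
```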

Xu et al. [133] propose MQMW, a Multiple QoS constrained scheduling strategy for Multi-Workflows. Four factors that affect makespan and cost are selected: available service number, time and cost covariance, time quota, and cost quota. Workflows are modeled as DAGs, but no specific information about the modeling is provided. The approach adopted by the authors to support multiple workflows is based on the creation of composite DAGs representing multiple workflows. DAG nodes with no predecessors (e.g., input nodes) are connected to a common entry node shared by multiple workflows. In this sense, new workflows to be executed are joined via a single merging point. Finally, there is no explicit support for dynamic scheduling or heterogeneous environments.

Weissman and Grimshaw [126] propose a scheduling solution for heterogeneous environments (wide-area systems) that encompasses data-intensive and dynamic scheduling properties. The solution also maintains local autonomy for scheduling decisions – remote resources are explored only when appropriate. Moreover, according to the authors, the unpredictability of resource sharing in large distributed areas requires scheduling to be deferred until runtime. For data-intensive properties, it is assumed that the system infrastructure is able to access data and files independently of location. If data needs to be transported (e.g., jobs scheduled in a site that does not have direct access to the needed data), the scheduling system assumes that the data transport cost can be amortized over the course of job execution. This is not always possible, as even local transfers can be expensive, especially if multiple local workers share the same resources – a common scenario for cloud environments, with a high density of worker elements per physical node.

Chen and Zhang [23] use the Ant Colony Optimization (ACO) metaheuristic, which simulates the pheromone-depositing and -following behavior of ants and has been applied to numerous intractable combinatorial optimization problems. QoS parameters are based on reliability, makespan, and cost. Reliability is defined as the minimum reliability of all selected service instances in the workflow. The actual reliability aspects or metrics used in the calculations, however, are not disclosed. Data communication and transfers are not explicitly addressed in the paper.

Rodriguez and Buyya [95] propose a resource provisioning and scheduling solution for the execution of scientific workflows on cloud infrastructures. The solution is based on particle swarm optimization and aims at minimizing execution cost while meeting deadline constraints. The general approach adopted by the authors is similar to the one from [82]. Virtual machines are assumed to have a fixed compute capacity (measured in FLOPS), although some degree of performance variation due to degradation is considered in their model. In addition, the authors assume that workflows are executed on a single data center or region, and as a consequence the bandwidth between virtual machines should be roughly the same. However, this might not be true even for a set of nodes connected to the same switch, especially during phases in which several demanding data transfers are executed among nodes – for example, when inputs are distributed to all worker nodes. Finally, the transfer cost between two tasks executed on the same virtual machine is assumed to be zero, while the actual communication can be much more expensive than that, especially if it is done via file I/O. The workflow modeling is based on a DAG with fixed transfer costs (edges). Task costs are calculated based on the size of the task measured in floating-point operations. The cost of a task, consequently, depends on the computational complexity of the task instead of on the input data. Of course, the number of operations could be derived from the size of the input data, but no remarks are made in that sense. No other properties are defined, such as performance variation due to branching and input sizes.
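The assumptions discussed above translate into a very simple cost model; a sketch is shown below (illustrative parameter names, not the authors' implementation).

```python
def estimated_runtime(task_size_flop, vm_capacity_flops, degradation=0.0):
    """Task runtime as implied by the model of Rodriguez and Buyya: task size
    in floating-point operations divided by the VM's nominal capacity,
    optionally reduced by a fixed performance-degradation factor."""
    return task_size_flop / (vm_capacity_flops * (1.0 - degradation))


def estimated_transfer_time(data_size, same_vm, datacenter_bandwidth):
    """Transfer time: zero when producer and consumer share a VM, otherwise
    size divided by a single data-center-wide bandwidth -- exactly the
    uniformity assumption questioned in the text."""
    return 0.0 if same_vm else data_size / datacenter_bandwidth
```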

Fard et al. [33] propose a multi-objective scheduling solution and present a case study comprising makespan, cost, energy, and reliability. The workflow is modeled as a very simple DAG with fixed-size data dependencies among tasks. Nodes are modeled as a mesh network in which each point-to-point connection has a different bandwidth. Cost is modeled as the sum of computation, storage, and transfer costs. Energy consumption is modeled only for the compute phases of the workflow. The authors state that their focus is on computation-intensive applications, thus only the computation part of the activities is considered in the energy consumption calculation, while “data transfers and storage time are ignored”. Finally, reliability is modeled using an exponential distribution representing the probability of successful completion of a task.

Malawski et al. [70] investigate the management of workflow ensembles under budget and deadline constraints on clouds. The authors state that although workflows are often data-intensive, “the algorithms described do not consider the size of input and output data when scheduling tasks”. In other words, the scheduling cost is based solely on computation time. The authors complement this by stating that data is stored in a shared cloud storage system and that intermediate data transfer times are included in task run times – transfer time is modeled as part of computation time. It is also assumed that data transfer times between the shared storage and the VMs are equal for all VMs, so that task placement decisions do not impact the runtime of the tasks. It is clear, then, that any issues related to contention, performance variation due to network and I/O bandwidth being shared among several worker nodes and virtual machines, and the impact of sequentially distributing inputs among workers are partially or entirely overlooked, depending on the case.

Sakellariou and Zhao [97] propose a scheduling mechanism that executes carefully selected rescheduling operations to achieve better performance without imposing a large overhead compared to solutions that attempt to reschedule before the execution of every task. While the proposal is designed for grid computing, the ideas related to selecting points of interest at which to execute the rescheduling operation are also relevant for cloud environments. The resource and workflow models adopted by the authors imply a fundamental simplification of how computation and transfer costs are calculated. Each task has a different cost for each machine, expressed as time per data unit. Although this attempts to model performance differences between nodes, it implies that the computation cost of each task varies linearly with the amount of input data. In contrast, if the assumption is that the costs are expressed as a fixed amount, then they are simply fixed to a value assuming a certain amount of input. Neither case considers a more sophisticated workflow model in which computation and communication costs vary with the size of the input data not linearly, but according to a general function that can be either predefined or dynamically obtained. This modeling affects both the initial static schedule and subsequent rescheduling operations.

Wang and Chen [124] propose a cost function that considers the robustness of a schedule in terms of the probability of successful execution. Based on the paper, a failure is considered to be any event that leads to the abnormal termination of a task and the consequent loss of all workflow progress thus far. The cost function is then used in conjunction with a genetic algorithm to find an optimized schedule that maximizes robustness. However, in the definition of the cost-of-failure function, the authors assume that the potential loss in the execution cost of each task is independent of the other workflow tasks. In other words, a failure always has a local scope, without the possibility of a chained impact on the rest of the workflow. Moreover, there is no workflow characterization in terms of data transfers and task costs. Robustness or failure rates are not specified or tied to a specific property such as the MTBF (Mean Time Between Failures).
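Under the independence assumption criticized above, schedule-level robustness reduces to a product of per-task success probabilities. The sketch below only illustrates this consequence; it is not the authors' cost function.

```python
import math


def schedule_success_probability(task_success_probs):
    """Robustness under the independence assumption: the probability that the
    whole workflow completes is the product of the per-task success
    probabilities, i.e. every failure is local and never propagates."""
    return math.prod(task_success_probs)


# Example: five tasks, each with a 99% chance of completing on its host.
# schedule_success_probability([0.99] * 5) is roughly 0.951.
```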

Poola et al. [86] propose a fault-tolerant workflow scheduling approach that uses spot and on-demand cloud instances to reduce execution cost and meet workflow deadlines. The workflow model is based on a DAG. Data transfer times are accounted for with a model based on the data size and the cloud data center's internal bandwidth (assumed to be fixed for all nodes). Task execution time is estimated based on the number of instructions of the task. For fault tolerance, the authors adopt checkpointing, which consists of creating snapshots of the data being manipulated by the workflow and, if necessary, of runtime structures. The core idea is to store enough information to restart computation in case of an error. One of the issues with the approach adopted is how checkpointing is considered in the model. In the worst case, checkpointing requires a full memory dump, meaning that 100% of the memory contents have to be written to persistent storage (e.g., spinning disks). Depending on the memory footprint of the workflow phase, this amount can surpass the order of gigabytes. However, in the model proposed in the paper the checkpointing cost is not considered, “as the price of storage service is negligible compared to the cost of VMs”. Moreover, while checkpointing time was considered in their model, the actual checkpointing time on spinning disks, especially for cloud systems that are not specialized for parallel I/O, can represent much more than 10% of overhead, which is the value expected for very large-scale machines such as APEX and EXASCALE. Thus, either the checkpointing size adopted is much smaller than what is observed for real scientific workflows or the checkpointing mechanism is creating partial checkpoints. Nevertheless, the results obtained by the authors show that having checkpoints actually reduces the final cost. Yet, the fault tolerance provided by the method only covers the repair part, not the fault avoidance part. There is no (explicit) logic to predict the probability of occurrence of failures due to some hardware or software property, for instance.
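The argument about checkpointing cost can be made concrete with a back-of-the-envelope estimate; all numbers below are illustrative and not taken from the paper.

```python
def checkpoint_overhead(memory_footprint_gb, write_bandwidth_gb_per_s,
                        checkpoint_interval_s):
    """Rough overhead of a full-memory-dump checkpoint: the time to write the
    footprint to persistent storage, as a fraction of the interval between
    checkpoints. Ignores compression and incremental/partial checkpoints."""
    write_time_s = memory_footprint_gb / write_bandwidth_gb_per_s
    return write_time_s / checkpoint_interval_s


# Illustrative example: dumping 64 GB at 1 GB/s every 600 s takes 64 s per
# checkpoint, i.e. roughly 10% overhead -- far from negligible.
# checkpoint_overhead(64, 1.0, 600)  # ~0.107
```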

Bittencourt and Madeira [15] propose HCOC, the Hybrid Cloud Optimized Cost scheduling algorithm, which selects the resources to be leased from a public cloud to complement the resources of a private cloud. The objective of HCOC is to reduce the makespan to fit a desired execution time or deadline while maintaining a reasonable cost. This cost constraint is introduced to limit the amount of resources leased from the public cloud; otherwise, the public cloud would always be over-used to meet the time constraints. Intra-node communication is considered to be free, in the sense that the costs of local communication are ignored. Communication cost is calculated by dividing the amount of data by the link bandwidth, which is modeled as a constant value. Computation cost is based on the number of instructions and the processing capacity of a node, which is measured in instructions per unit of time. There are several implicit assumptions in this model, such as fixed capacities for transferring and computing. There is no function that varies the amount of computation with the size of the input.
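A sketch of the hybrid-cloud policy described above is shown below. It illustrates the general idea (lease public resources only when the private cloud alone misses the deadline, preferring the cheapest configuration that fits), not the HCOC implementation; all argument names are hypothetical.

```python
def lease_public_resources(private_makespan, deadline, public_options,
                           estimated_makespan, cost_of):
    """Hybrid-cloud decision sketch: public resources are leased only when the
    private cloud alone cannot meet the deadline, choosing the cheapest public
    configuration whose estimated makespan fits within the deadline."""
    if private_makespan <= deadline:
        return None  # the private part handles the workflow on its own
    feasible = [option for option in public_options
                if estimated_makespan(option) <= deadline]
    return min(feasible, key=cost_of) if feasible else None
```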

Vecchiola et al. [120] claim that scientific applications require a large computing power that typically exceeds the resources of a single institution. In this sense, their solution aims at providing a deadline-based provisioning mechanism for hybrid clouds, allowing the combination of local resources with the ones obtained from a public cloud service. However, there are no specific details on how workflows are internally handled by their solution, nor on how resources are mapped to workflow phases or how costs are calculated. Moreover, their solution (named Aneka) focuses on meeting a specific deadline, thus not addressing issues related to total execution time (makespan) or reliability.

Gaps and challenges

This section discusses the gaps and challenges identified in the investigation of related work.

Data-intensive loads

Regarding data-intensive loads, [82] state that they represent a special class of applications where the size and/or quantity of data is large. As a direct result, transfer costs are significantly higher and more prominent. While the authors do address data transfers in their resource model, several aspects of data access are not acknowledged. For instance, accesses to the same resource lead to a communication cost of zero. Transfer costs are calculated based on the average bandwidth between nodes, without regard to I/O contention, multiple accesses to the same resource, containers and VMs co-located on the same node sharing network and I/O resources, among other factors. This is also observed in other works such as [15, 33, 64, 95]. Other models consider transfers as part of computation time, such as [70]. This is depicted as a fundamental challenge by [127], who state that “in most studies, data transfer between tasks is not directly considered, data uploading and downloading are assumed as part of task execution”. Wu et al. [127] complement this by stating that this may not be the case for modern applications and workflows – in fact, data movement activities might dominate both execution time and cost. For the authors, it is essential to design data placement strategies for resource provisioning decision-making. Moreover, employing VMs deployed in different regions intensifies the data transfer costs, leading to an even more complicated issue. This is correlated with having more complex cloud environments in terms of resource distribution, such as hybrid and multicloud scenarios.

Hybrid and multicloud scenarios

Regarding hybrid and multicloud scenarios, [127] state that support for hybrid environments, heterogeneous resources, and multicloud environments is necessary. Singh and Chana [109] also highlight the importance of hybrid and multicloud scenarios for future deployments of large-scale cloud environments that reach performance comparable to large-scale scientific clusters. On the other hand, most scheduling solutions still address neither hybrid clouds nor multiclouds. The few that do implement mechanisms that use the public part of a hybrid cloud to lease additional resources only when necessary – the hybrid component of the setup is treated as a supporting element, not as the protagonist. For example, [15] and [120] propose solutions that only allocate resources from the hybrid cloud (its public part) if the private part is not able to handle the workflow execution. Multicloud support is even scarcer or not explicit. Several of the proposed solutions could be adopted or adapted for multicloud environments, but there is still a lack of experimental results to match the predicted importance of such large-scale setups.

The motivation for multicloud environments varies from obtaining more raw performance, to match other large-scale deployments, to having more options in terms of available services. Simarro et al. [107], for instance, state that placing resources across several cloud offers is useful to obtain resources at the best cost ratio. The same approach is adopted by [37] and [101]. Regarding the execution of large-scale applications on systems of similar scale, [68] suggest a multi-site workflow scheduling technique to enhance the range of available resources to execute workflows. While their approach does consider data transfers and the costs of sending data over expensive (slower) links that connect geographically distributed sites, it does not consider 1) performance fluctuations during the execution of the workflow, which would suggest the implementation of rescheduling and rebalancing mechanisms; 2) reliability mechanisms to cope with performance fluctuations due to failures; and 3) the influence of contention on general I/O operations, such as sequential accesses to the same data inputs.

Rescheduling and performance fluctuations

Performance fluctuations caused by multi-tenant resource sharing are one of the major components that must be included in the definition of the uncertainties associated with scheduling operations [127]. The authors complement: “The most important problem when implementing algorithms in real environment is the uncertainties of task execution and data transmission time”. Moreover, most works assume that a workflow has a definite DAG structure, while actual workflows have loops and conditional branches. For instance, the execution control in several scientific workflows is based on conditions that are evaluated every iteration, meaning that branches are essential to determine whether the pipelines must be stopped or not. In this sense, rescheduling techniques are usually adopted to correct potential deviations from an original estimate of the performance of a workflow on a system [61, 127].

Reliability

Several authors highlight the challenges and potential gaps of cloud resource management in terms of reliability. Bala and Chana [9] state that workflow scheduling is one of the key issues in the management of workflow execution in cloud environments and that existing scheduling algorithms (at least at that time) did not consider reliability and availability aspects of the cloud environment. Singh and Chana [109] directly addressed this issue by stating that the hardware layer must be reliable before resources are allocated. While several subsequent works addressed these aspects, there are still gaps in the methodology. For instance, [23] implement a solution that considers a reliability factor, but there is no explicit model of how to calculate this factor based on actual hardware- and software-reliability metrics, such as hardware failure and software interruption rates.

Fard et al. [33] define a reliability factor by assuming a statistically independent constant failure rate, but this rate only reflects the probability of successful completion of a task – there is no clear connection between this concept and a factual, measurable metric from the hardware and software point of view. Hakem and Butelle [43] also propose a reliability-based resource allocation solution by defining a reliability model divided into processor, link, and system components. The model is based on exponential distributions, which could be related to metrics such as the mean time between failures (MTBF) and failures in time (FIT).
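Connecting such exponential models to a measurable metric is straightforward; the sketch below derives a per-task success probability from a node's MTBF. It illustrates the kind of grounding argued for in this section and is not a model taken from the cited works.

```python
import math


def task_success_probability(task_runtime_h, node_mtbf_h):
    """Probability that a node survives a task of the given duration, assuming
    failures follow an exponential distribution with rate 1 / MTBF. This ties
    an abstract reliability factor to a measurable hardware metric."""
    failure_rate = 1.0 / node_mtbf_h
    return math.exp(-failure_rate * task_runtime_h)


# Illustrative example: a 2-hour task on a node with an MTBF of 1000 hours
# completes without failure with probability exp(-2/1000), roughly 0.998.
```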

Other solutions, such as the one from [87], use reliability-related methods such as checkpointing to decrease application failures, but in this particular case, for instance, the performance implications of these mechanisms are not fully appreciated. The I/O costs, in terms of storage and time, of implementing checkpointing are far from negligible. Still on reliability, [124] state that the two main strategies to calculate reliability factors are to either establish a reputation threshold or to treat nodes independently and multiply their probabilities of success. Still, the reliability approach proposed by the authors does not rely on measurable metrics to calculate these factors. Moreover, on one side there are solutions that only address failures after their occurrence, not before. For instance, [86] uses checkpointing to recover from failures, but there is no mechanism in place to calculate the probability of failures and attempt to avoid nodes with a higher probability of failure, or at least assign a smaller portion of tasks to such nodes. On the other side, solutions calculate reliability factors based on theoretical metrics that might not reflect the specificities of each node, and there is no clear mechanism to combine prevention and recovery. In that sense, [49] provides a deeper analysis of fault-tolerance techniques for grid computing that could be applied to cloud computing. The authors clearly state that the requirements for implementing failure recovery mechanisms on grids comprise support for diverse failure handling strategies, separation of failure handling policies from application code, and user-defined exception handling. In terms of task-level failure handling techniques, the authors consider retrying (straightforward and potentially the least efficient of the listed techniques), replication (replicas running on different resources), and checkpointing. Checkpointing is actively used in real scientific scenarios, while replication usually leads to prohibitive costs, as in several cases running a single replica is already expensive in terms of resource demand. In addition, in terms of workflow-level failure handling, the authors propose mechanisms such as alternative tasks (trying a different implementation when available), workflow-level redundancy, and user-defined exceptions that are able to fall back to reliable failure handling. In terms of evaluation, the authors propose parameters such as failure-free execution time, failure rates, downtime, recovery time, and checkpointing overhead, among others. These are measurable metrics that can be used to model and represent the failure behavior of systems and workflows.

Conclusion

This paper provided an extensive investigation of existing works on cloud resource management. The investigation started by presenting several definitions and associated concepts on the subject, covering the rationale presented by several authors and publications from academia. Three main works were selected in this sense, reflecting the works that provided a clear definition of distinct steps of cloud resource management. Among these works, the common point is the association of management components with each phase of the resource lifecycle, such as resource discovery, allocation, scheduling, and monitoring. Moreover, the ultimate objective in all cases is to enable task execution while optimizing infrastructural efficiency. These are the two main points related to cloud resource management.

The next step in this investigation was to identify relevant works in the area, focusing on recent publications and others not so recent but still important, for instance covering a specific aspect of cloud resource management. The results of this analysis led to the identification of over 110 works on cloud resource management. A taxonomy was created based on the consolidation of characteristics and properties used to classify the selected works. Further analysis was provided to enhance the identification of gaps and challenges for future research on cloud resource management focusing on large-scale applications and workflows. The final step of this investigation was the formalization of these gaps and challenges obtained during the research. The challenges were organized in four topics: a) challenges related to data-intensive workflows, including lack of proper modeling of transfers, or modeling of transfers as part of computation; b) hybrid and multicloud scenarios, comprising large-scale deployments and more complex setups in terms of resource distribution; c) rescheduling and performance fluctuations, essentially addressing the lack of mechanisms to adequately cope with the inherent performance fluctuation of large scale cloud deployments, and the effects of multi-tenancy and resource sharing; and d) reliability, highlighting the lack of proper factors based on actual and measurable metrics such as failure rates. Based on these topics, four clear gaps are identified to be addressed by future research:

  • Lack of mechanisms to address the particularities of data-intensive workflows, especially considering that future trends point to the direction of I/O workflows with intensive data movement and with reliability-related mechanisms highly dependent on I/O as well.

  • Lack of mechanisms to address the particularities of large-scale cloud setups with more complex environments in terms of resource heterogeneity and distribution, such as hybrid and multicloud scenarios, which are expected to be the main drivers for large-scale utilization of cloud – scientific workflows being one important instance.

  • Lack of mechanisms to address the fluctuations in workflow progress due to performance variation and reliability, both phenomena that can be partially or even fully addressed by implementing controlled rescheduling policies.

  • Lack of reliability mechanisms based on actual and measurable metrics that can be derived from documentation and from collecting information of the system.

The results of this analysis, combined with the requirements identified for future workloads, lead to the conclusion that modern solutions aiming at providing resource management for large-scale deployments and at executing large-scale problems must provide mechanisms to address data movement at massive scale while adequately distributing resources to tasks, adjusting this distribution depending on the fluctuations observed in the system. Existing solutions can and should be adapted to address the specific requirements related to the challenges identified, but further research and development are necessary to cope with these requirements in a more comprehensive and decisive way.