5.1 Introduction

Cloud simulation tools have been used extensively for the analysis of cloud data centres, since experimenting with different scenarios in simulation is inexpensive. Many aspects of cloud environments can be studied through simulation, including resource allocation strategies, live migration of running applications to more efficient data centre resources, energy consumption, and hardware resource utilisation. Several cloud simulation tools have been developed in recent years, each focusing on different aspects of cloud environments. These tools can be categorised into:

  • Discrete Event Simulators (DES): These examine macro-scale phenomena, such as application events that take place at specific moments in time, while disregarding micro-scale phenomena such as network packet communication. DES are used for large-scale simulations that study, among other things, the behaviour of cloud environments in terms of service delivery, Virtual Machine (VM) allocation policies, resource utilisation, and the energy consumption of data centres.

  • Packet-Level Simulators (PLS): These examine micro-scale phenomena related to cloud environments, including packet loss and network communication protocols. PLS offer high accuracy at the cost of performance: the fine resolution of the simulation makes the study of large-scale data centres impractical.

Cloud infrastructures continue to grow in both size and diversity to meet demand in terms of user numbers, data volumes, and the variety of hardware resources. As a result, existing cloud simulation tools cannot efficiently simulate these heterogeneous environments at scales several orders of magnitude greater than traditional data centres. By 2020, hyperscale data centres will account for a substantial portion of all cloud workloads and data (Cisco 2016). Furthermore, as hyperscale data centres consist of servers in distinct geographical locations, such infrastructures are more difficult to manage efficiently, which results in network congestion and underutilisation of resources. Resource heterogeneity further exacerbates these challenges. While hyperscale data centre operators increasingly offer specialised hardware, such as Graphical Processing Units (GPUs), Many Integrated Cores (MICs), and Field-Programmable Gate Arrays (FPGAs), existing cloud simulation tools do not support them. The efficient exploitation of the hardware infrastructure of heterogeneous hyperscale cloud environments has been a topic of great importance in recent years; thus, cloud simulation tools that can study heterogeneous cloud environments at hyperscale need to be developed.

The remainder of this chapter is organised as follows. Section 5.2 provides a summary review of common cloud simulation frameworks used by the scientific community and their limitations. A new simulation framework, the CloudLightning Simulator, designed to simulate hyperscale cloud environments composed of heterogeneous resources, is presented in Sect. 5.3. This is followed, in Sect. 5.4, by a discussion of initial experimentation using the CloudLightning Simulator to compare service delivery for three application scenarios (oil and gas exploration, ray tracing, and genomics) using (i) conventional cloud service delivery and (ii) cloud service delivery using a self-organising self-managing (SOSM) approach.

5.2 Cloud Simulation Frameworks

During the last decade, various cloud simulation frameworks have been proposed, such as CloudSim (Calheiros et al. 2011), DCSim (Tighe et al. 2012), GDCSim (Gupta et al. 2011), GreenCloud (Kliazovich et al. 2012), iCanCloud (Nunez et al. 2012), and CloudSched (Tian et al. 2015). However, no existing cloud simulation framework is designed for hyperscale simulations.

One of the main limitations of existing cloud simulation tools is the lack of scalability. Most existing cloud simulation tools do not support parallelism; thus, the simulation of very large data centres is not possible (Byrne et al. 2017). Parallelism is of great importance for the simulation of hyperscale cloud environments since both computational work and memory requirements can be distributed among multiple nodes, reducing the execution time significantly and enabling the simulation of large-scale data centres.

An important factor influencing the scalability of extant simulation tools is their memory requirements. In DES, a large number of events must be created and retained. The number of these events is closely related to the number of simulated resources and the number of input tasks. Discrete event-based simulators initialise the list of tasks that will be executed during the whole simulation and gradually augment it with new events as time advances. This process requires retaining a very large list in memory, augmenting it with new events, and sorting it so that events are performed in the correct order. Thus, memory requirements increase significantly with the number of resources or the simulation time. Memory restrictions also arise from the high level of detail of the simulated components, as in the iCanCloud and GreenCloud frameworks, which becomes prohibitive in very large-scale executions.

The effective management of resources becomes a significant challenge as their number increases. More specifically, strategies that require the detection of specific hardware either cannot be applied or incur significant computational cost when hyperscale systems are considered. Also, status information about the underlying hardware resources quickly becomes outdated, which makes efficient management of the system more challenging. Specialised strategies are required in hyperscale cloud environments for the efficient and up-to-date management of the system. Such strategies are not supported in existing simulation frameworks, and thus the simulation of hyperscale systems is difficult to perform.

Finally, the inclusion of heterogeneous resources is not supported by existing cloud simulation tools. Simple generic models are required for the simulation of heterogeneous resources in order to be integrated in cloud simulation environments (Makaratzis et al. 2017; Giannoutakis et al. 2017).

5.3 CloudLightning Simulator

Unlike existing frameworks, the CloudLightning Simulator has been designed from the ground up as a massively scalable solution, able to simulate hyperscale data centres consisting of millions of cloud nodes/servers. The framework is written in C++ and is parallelised using the Message Passing Interface (MPI) (Gropp et al. 1996) and OpenMP (Dagum and Menon 1998) to enable the efficient handling of hyperscale simulations. CloudLightning supports the simulation of heterogeneous infrastructures (including GPUs, MICs, and FPGAs/DFEs) that are commonly used for the acceleration of High Performance Computing applications. One important characteristic of the framework is the use of a time-advancing loop, a technique that removes the need to pre-compute and store future events, resulting in a significant reduction in memory requirements. This allows the integration of dynamic resource allocation policies, such as SOSM, enabling the efficient management of computing resources when simulating hyperscale environments. Moreover, the CloudLightning Simulator places an emphasis on the simplicity of the models it uses, focusing on models that require a reduced number of computations to produce simulation results without loss of accuracy. Finally, all inputs and outputs of the simulator are represented graphically.

The remainder of this section presents the generalised and extensible CloudLightning simulation framework for simulating heterogeneous resources using an SOSM approach.

5.3.1 Architecture and Basic Characteristics of the Parallel CloudLightning Simulation Framework

The CloudLightning Simulator was designed to simulate clouds relying on the Warehouse Scale Computer (WSC) architecture (Barroso et al. 2013). WSC has been adopted by a multitude of companies including Google, Amazon, Yahoo, Microsoft, and Apple, and has been widely used in the design of cloud environments (Mars 2012). In the WSC architecture, interconnected cloud computing nodes are grouped into cells that are centrally managed (Fig. 5.1).

Fig. 5.1 Warehouse Scale Computer abstract architecture

In this architecture, the Gateway service is the entry point of the system: a cloud entity that receives resource requests from end users and redirects them to the appropriate Cells. A conceptual cloud architecture with multiple Cells is presented in Fig. 5.2. Within each Cell, the resources are organised and monitored by the Cell manager’s broker, which is responsible for provisioning appropriate resources for end user requests and for deploying incoming tasks on the available resources. The broker is composed of multiple services, including orchestration, telemetry, and identity services. Hyperscale cloud environments consist of a very large number of Cells.

Fig. 5.2 Abstract cloud architecture with multiple cells

In the CloudLightning simulation framework, each Cell is hosted on a different computing node of a distributed system, while the Gateway service is hosted on the master computing node. The communication between the Gateway service and the Cells is performed using the MPI framework. The following operations are performed by each Cell (Filelis-Papadopoulos et al. 2017, b):

  • Receiving simulation parameters

  • Initialisation of different components, including hardware resources, the broker, network, telemetry, and the SOSM engine

  • Receiving the task queue in each time-step

  • Searching for available resources for the execution of the tasks, using the SOSM engine

  • Updating the state of the resources and controlling the execution of the tasks

  • Communicating status information to the Gateway Service

The operations performed by the Gateway service are the following (Filelis-Papadopoulos et al. 2017, b):

  • Retaining simulation inputs and communicating data to the Cells for the initialisation of the simulation components

  • Creation of the task queue at each time-step, fragmentation of the task queue into subqueues, and communication of the subqueues to the Cells, while maintaining load balance across all Cells

  • Receiving status information from the Cells

  • Processing and storing historical statistics and metrics
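The fragmentation of the per-time-step task queue can be illustrated with a minimal sketch. It assumes a simple round-robin split, which keeps subqueue sizes within one task of each other; the balancing scheme actually used by the simulator is not specified here, and the function name `split_task_queue` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_task_queue(tasks: Sequence[T], num_cells: int) -> List[List[T]]:
    """Fragment one time-step's task queue into one subqueue per Cell.

    Assumption: round-robin assignment is used as a stand-in for the
    Gateway's load-balancing policy, which the text does not detail.
    """
    if num_cells <= 0:
        raise ValueError("num_cells must be positive")
    subqueues: List[List[T]] = [[] for _ in range(num_cells)]
    for index, task in enumerate(tasks):
        # Task i goes to Cell i mod num_cells, so sizes differ by at most one.
        subqueues[index % num_cells].append(task)
    return subqueues
```

In a distributed run, each subqueue would then be sent to the computing node hosting the corresponding Cell; the sketch covers only the splitting step.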

The parallelisation of the CloudLightning Simulator on distributed systems is of great importance, since simulating hyperscale infrastructures is a computationally and memory-intensive process. In addition, various components of the CloudLightning Simulator use the OpenMP framework to accelerate their computations on shared memory multiprocessors. The Gateway Service processes statistics in parallel, while the Cells perform resource discovery, task deployment, and the update of the resources’ state on different processor cores. The SOSM techniques are also performed in parallel.

Figure 5.3 presents the software architecture of the CloudLightning Simulator (Filelis-Papadopoulos et al. 2017):

Fig. 5.3 Software architecture of the parallel CloudLightning simulation framework

5.3.2 SOSM Engine

One of the most important characteristics of the CloudLightning Simulator is the use of SOSM techniques to control the underlying resources of the Cells in a more efficient manner (Filelis-Papadopoulos et al. 2017).

In traditional cloud architectures, the resources are managed by the broker, a central entity that is responsible for the search and deployment of the available resources with respect to incoming task requests, the collection of data on the state of all underlying resources, and the management of all underlying resources of the data centre. This centralised approach has limitations due to the computational complexity involved in locating specific hardware, especially as the number of resources increases. Locating the most appropriate server for the execution of a task is a computationally expensive operation in large data centres, and it is generally avoided in favour of strategies such as the “first-fit approach,” where a task is deployed on the first available server or coalition of servers. However, this type of strategy is effective in terms of neither computational nor energy efficiency, and it largely results in the underutilisation of the available resources (Filelis-Papadopoulos et al. 2017). More effective strategies, such as SOSM, need to be applied to achieve high levels of resource utilisation and thus computational and energy efficiency.
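The first-fit strategy described above can be sketched in a few lines. The example is a simplification that tracks only free processing units per server; the function name and the single-resource model are assumptions made for illustration.

```python
from typing import List, Optional


def first_fit(free_cores: List[int], requested_cores: int) -> Optional[int]:
    """Deploy a task on the first server with enough free cores.

    Returns the index of the chosen server, or None if the task is rejected.
    The list is updated in place to reflect the allocation.
    """
    for index, free in enumerate(free_cores):
        if free >= requested_cores:
            free_cores[index] -= requested_cores
            return index
    return None
```

The scan stops at the first match, which is cheap, but it makes no attempt to pick the server that would leave the data centre in the best overall state.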

In the CloudLightning architecture, each Cell is organised in a hierarchical tree structure. As discussed earlier, the tree contains different entities, including prescription Routers (pRouters), prescription Switches (pSwitches), and virtual Rack Managers (vRMs). Figure 5.4 presents an example of the CloudLightning tree structure. In this structure, the resources are locally managed by the vRMs, which in turn are locally managed by the pSwitches, while the pSwitches are locally managed by the pRouters. The local management of the architectural components allows the efficient collection and analysis of data, which can lead to an improved decision-making process. Each component can describe the state of its underlying resources, since metrics describing the state of the resources are collected over an interval and averaged by each component to form its own state. Also, weights describing the desired state of the system are communicated from the Gateway Service to the underlying components. Using these metrics and weights, each component’s Suitability Index is computed. The Suitability Index expresses how appropriate a component is to receive an incoming task. Using the Suitability Index, each incoming task can subsequently be directed to the most efficient resources.

Fig. 5.4 Hierarchical structure of the SOSM engine
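The interplay of averaged metrics, weights, and the Suitability Index can be illustrated with a minimal sketch. The exact assessment functions are not given in this section, so the example assumes that the index is a weighted sum of averaged metrics and that a higher value means a more suitable component; the metric names and all function names are hypothetical.

```python
from typing import Dict, List


def average_state(samples: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the metric samples collected over one update interval."""
    names = samples[0].keys()
    return {name: sum(s[name] for s in samples) / len(samples) for name in names}


def suitability_index(state: Dict[str, float], weights: Dict[str, float]) -> float:
    """Assumed form: weighted sum of the component's averaged metrics."""
    return sum(weights[name] * state[name] for name in weights)


def most_suitable(components: Dict[str, Dict[str, float]],
                  weights: Dict[str, float]) -> str:
    """Pick the component (e.g. a vRM) with the highest Suitability Index."""
    return max(components, key=lambda name: suitability_index(components[name], weights))
```

Applied level by level (pRouter, then pSwitch, then vRM), a selection of this kind inspects only the children of one node at a time rather than every resource in the Cell.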

The exchange of metrics and weights between the components is part of the Self-Management actions and is performed by all the components of the SOSM engine. The Self-Organisation techniques, on the other hand, are solely performed by the vRMs and the pSwitches. In the case of vRMs, there can be an exchange of resources between vRMs that are hosted by the same pSwitch, in order to maximise the efficiency of the system and to host tasks that require more resources than available on a vRM. New vRMs can also be created, while vRMs that do not contain any resources to manage can be destroyed. Similarly, pSwitches that are hosted by the same pRouter can exchange vRMs; new pSwitches can be created, while existing pSwitches can be dismissed when they have no vRMs to manage.

Each pRouter of a Cell is homogeneous, as it contains resources of the same type. In order to maintain the homogeneity, Self-Organising actions are not performed at the pRouter level; thus, pSwitches cannot be exchanged between pRouters. For this reason, pRouters are the entry point for the selection of a specific type of resource inside a Cell (Filelis-Papadopoulos et al. 2017).

The SOSM system significantly improves the scalability of cloud environments, since the most appropriate hardware for the execution of a task can be located quickly and with low computational cost, even in data centres with a very large number of resources. In the CloudLightning Simulator, the SOSM engine is implemented in parallel using the OpenMP framework.

Power Consumption Modelling

To estimate the power consumption of large-scale heterogeneous data centres, a number of different power models for both Central Processing Unit (CPU) servers and combined CPU-accelerator pairs were developed. The power models are generic and have low computational cost (Filelis-Papadopoulos et al. 2017; Giannoutakis et al. 2017). For this reason, the CloudLightning Simulator is capable of computing the power consumption of very large heterogeneous data centres without a significant impact on its scalability. The following subsections give a detailed presentation of the integrated power consumption models.

CPU Power Models

Piecewise interpolation methods between recorded CPU power consumption levels, and generic models that estimate the trend of the power-utilisation diagram of CPUs by using the idle and maximum power consumption of the CPU servers, have been integrated.

The interpolation methods are performed between recorded CPU power consumption levels that are available mainly as part of the Standard Performance Evaluation Corporation (SPEC) benchmark (SPEC 2008). Existing simulators, such as CloudSim, use linear interpolation between power measurements on rounded utilisation intervals (i.e. 0%, 10%, 20%, etc.) (Beloglazov and Buyya 2012). In order to achieve improved accuracy, the interpolation methods in the CloudLightning Simulator are applied on the exact utilisation intervals of the power measurements (i.e. 0%, 10.2%, 19.7%, etc.) as the error of the rounded interpolation intervals increases when simulating very large data centres (Giannoutakis et al. 2017). Two different interpolation methods were used, the linear and the “not-a-knot” cubic spline interpolation.
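Linear interpolation on the exact (uneven) utilisation points can be sketched as follows. The measurement values used in any call are up to the caller; the function name `interpolate_power` is hypothetical, and the sketch covers only the linear method, not the cubic spline.

```python
from bisect import bisect_right
from typing import Sequence


def interpolate_power(utilisation: float,
                      points: Sequence[float],
                      watts: Sequence[float]) -> float:
    """Linearly interpolate power between measured utilisation points.

    `points` holds the exact measured utilisations in increasing order
    (for example 0.0, 0.102, 0.197, ...), and `watts` the matching power
    readings. At least two points are required.
    """
    if not points[0] <= utilisation <= points[-1]:
        raise ValueError("utilisation outside the measured range")
    # Index of the first measured point strictly above the query,
    # clamped so that the last point maps onto the final segment.
    upper = min(bisect_right(points, utilisation), len(points) - 1)
    lower = upper - 1
    fraction = (utilisation - points[lower]) / (points[upper] - points[lower])
    return watts[lower] + fraction * (watts[upper] - watts[lower])
```

Using the measured utilisation values as breakpoints, instead of rounding them to multiples of 10%, is what distinguishes this variant from interpolation on rounded intervals.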

Generic models were also integrated, since they require lower computational cost and fewer power measurements than the interpolation methods. The models estimate the power consumption of CPU servers from the utilisation of the CPU server and its power consumption in the idle and maximum states. The linear, square, cubic, and square root models that have been used in existing cloud simulators (e.g. CloudSim) were integrated (Beloglazov and Buyya 2012). For the CloudLightning Simulator, a generic CPU power model based on a third-degree polynomial was used, which estimates the trend of the power-utilisation diagram of CPU servers more accurately (Filelis-Papadopoulos et al. 2017). The trend of the generic models compared with the actual CPU measurements provided by SPEC (SPEC 2008) for an HP Proliant DL560 Gen 9 is presented in Fig. 5.5.

Fig. 5.5 Generic CPU power models compared to the power-utilisation diagram of an HP Proliant DL560 Gen 9 server
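The linear, square, cubic, and square root models can be sketched under one common formulation, assumed here: the power is the idle power plus the idle-to-maximum range scaled by a shape function of the utilisation. The third-degree polynomial model of the CloudLightning Simulator is not reproduced, since its coefficients are not given in this section. All names below are hypothetical.

```python
import math
from typing import Callable, Dict

# Shape functions mapping utilisation in [0, 1] to a scaling factor in [0, 1].
SHAPES: Dict[str, Callable[[float], float]] = {
    "linear": lambda u: u,
    "square": lambda u: u ** 2,
    "cubic": lambda u: u ** 3,
    "sqrt": math.sqrt,
}


def generic_power(utilisation: float, p_idle: float, p_max: float,
                  model: str = "linear") -> float:
    """Estimate server power (watts) from utilisation, idle and max power."""
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must lie in [0, 1]")
    return p_idle + (p_max - p_idle) * SHAPES[model](utilisation)
```

Every variant needs only two measurements per server type (idle and maximum power), which is why these models are cheaper to set up than interpolation over a full measurement table.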

Existing cloud simulators (e.g. GreenCloud and CloudSim) support the use of real application traces in order to compute the power consumption of the simulated applications in each time-step. This approach would negatively affect the scalability of the simulator in large-scale simulations, and for this reason, mean values of real application traces were computed and integrated instead. More specifically, the mean value of the CPU utilisation for each application is used to compute the mean power consumption of the application. The energy consumption of the application is then computed by multiplying its mean power consumption by its execution time. This approach has a lower computational cost, while the energy consumption of the application is computed with approximately the same accuracy that would have been obtained if all the power traces were used. This methodology has been tested, achieving high levels of accuracy in the estimation of the energy consumption of applications (Makaratzis et al. 2017).

Combined CPU-Accelerator Power Models

A generic power consumption model was used for the estimation of the power consumption of accelerators such as GPUs, MICs, and DFEs (Giannoutakis et al. 2017). This model was built around the idea that an accelerator draws its maximum power when an application is executing on it and its idle power when the application is executing only on the CPU. This binary model provides simplicity and increased accuracy (Makaratzis et al. 2017). The model for the power consumption of hardware accelerators is described as follows:

$$ P_{acc}(\rho) = (1-\rho)\,P_{acc-\min} + \rho\,P_{acc-\max} $$

where $P_{acc-\min}$ and $P_{acc-\max}$ are the minimum and maximum power consumption values, respectively, that the application can consume on the accelerator, while $\rho$ is the fraction of the application that is parallelised on the accelerator at each time moment. As with the utilisation parameters of the CPU power model, the mean value of the parameter $\rho$ is computed based on real application traces; thus, the mean value of the power consumed on the accelerator is computed for the total execution time of the application. The combined CPU-accelerator mean power consumption of the application is computed as the sum of the mean power consumption of the CPU server and the mean power consumption of the accelerator. The energy consumption of an application that is executed on a heterogeneous node is computed by multiplying the combined CPU-accelerator mean power consumption by the execution time of the application.
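The accelerator model and the resulting energy estimate can be written down directly. The sketch below assumes power in watts and time in seconds, so the energy is in joules; the function names are hypothetical, and the mean CPU power is taken as an input computed by whichever CPU model is in use.

```python
def accelerator_power(rho: float, p_acc_min: float, p_acc_max: float) -> float:
    """P_acc(rho) = (1 - rho) * P_acc-min + rho * P_acc-max, in watts."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return (1.0 - rho) * p_acc_min + rho * p_acc_max


def task_energy(mean_cpu_power: float, mean_rho: float,
                p_acc_min: float, p_acc_max: float,
                execution_time_s: float) -> float:
    """Energy (joules) of a task on a heterogeneous node.

    Combined mean power = mean CPU power + mean accelerator power,
    multiplied by the execution time of the application.
    """
    combined = mean_cpu_power + accelerator_power(mean_rho, p_acc_min, p_acc_max)
    return combined * execution_time_s
```

For instance, with a mean CPU power of 150 W, a mean ρ of 0.25, and accelerator limits of 20 W and 220 W (illustrative values only), the accelerator contributes 70 W and a one-hour task consumes 220 W × 3600 s = 792,000 J.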

To conclude, in order to keep the computational cost low, generic CPU and accelerator power models were integrated in the CloudLightning Simulator. The simplicity of the models is of great importance, since models that are based on architectural details of the hardware resources require a substantial number of computations, considering the heterogeneity and the very large number of resources in the simulations. These models were validated on heterogeneous testbeds and a good accuracy level was achieved (Makaratzis et al. 2017).

Memory, Storage, and Network Modelling

Detailed modelling of memory would negatively affect the scalability of the simulator, especially in large-scale simulations, since it would require an increased number of computations. Memory was therefore implemented as a resource, measured in GBytes, that is used in the allocation of VMs to physical servers. Memory overcommitment was also implemented; thus, the total available memory was computed as the product of the total physical memory and the overcommitment ratio. The power consumption of memory was included in the power consumption of the CPU servers, eliminating the need for a separate memory power consumption calculation.
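The memory model reduces to a capacity check against the overcommitted total. A minimal sketch follows; the function names and the ratio of 1.5 used in the usage note are assumptions for illustration.

```python
def available_memory_gb(physical_gb: float, overcommit_ratio: float) -> float:
    """Total memory that may be promised to VMs on one server."""
    return physical_gb * overcommit_ratio


def can_host(requested_gb: float, allocated_gb: float,
             physical_gb: float, overcommit_ratio: float) -> bool:
    """True if a VM requesting `requested_gb` still fits on the server."""
    limit = available_memory_gb(physical_gb, overcommit_ratio)
    return allocated_gb + requested_gb <= limit
```

With 128 GBytes of physical memory and a hypothetical ratio of 1.5, a server could promise up to 192 GBytes, so a 64 GByte request on top of 100 GBytes already allocated would be accepted, whereas it would be rejected with a ratio of 1.0.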

The modelling of storage was also kept simple in order to keep the computational cost low. Storage was implemented as a resource measured in TBytes. Global storage was not implemented, though its impact can be added directly to the time span of tasks. Detailed modelling of the power consumption of storage was not implemented, since it would require a substantially larger number of computations, which would negatively affect the scalability of the simulator. The energy consumption of storage is considered to be included in the energy consumption of the CPU servers, as with memory.

The network was implemented as a global component, visible from all the underlying resources, with the network bandwidth being shared among the arriving tasks of the system. When the requested network bandwidth exceeds the available capacity, the execution of applications is affected negatively (in terms of the execution time). It should be noted that the network model of the CloudLightning Simulator was implemented through a catalogue of tasks, retaining all tasks executing at a given time-step. A linear model for computing the time required to transfer initial data and output data was implemented with a function of the following form:

$$ NT(t)= fileSize/ bandwidth $$

where fileSize is the size of the file to be transferred and bandwidth is the available physical bandwidth.

Application Models

In the design of the CloudLightning Simulator, the execution of VMs is part of a given task and their life cycle is directly connected to it. Each task is defined based on the following characteristics (Filelis-Papadopoulos et al. 2017):

  • Type of application (Genomics, Oil and Gas, Ray Tracing)

  • Available implementations (CPU-only, CPU+GPU, CPU+DFE, CPU+MIC)

  • Number of instructions (in Millions of Instructions [MIs])

  • Required number of VMs

  • Required number of processing units per VM

  • Required memory per VM (in GBytes)

  • Required storage per VM (in TBytes)

  • Required accelerators per VM

  • Required network bandwidth

Minimum and maximum values are defined for the actual utilisation of the CPU, the memory, and the network. The actual resources used by an application (utilisation) are computed from application traces as a percentage of the requested resources over a number of predefined intervals. These utilisation parameters are treated as mean values with respect to the total execution time of the application. This approach keeps the computational cost low, while the desired metrics are obtained with approximately the same accuracy that would have been obtained if all the application traces were used.

All task parameters, including the number of instructions, the required number of VMs, and memory size, are randomly generated using a uniform random number generator with respect to predefined intervals. The intervals are computed based on real application characteristics.
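Task generation from predefined intervals can be sketched as follows. Only a subset of the task characteristics is modelled, and the interval bounds shown are placeholders rather than the values derived from real applications; the `Task` class, `EXAMPLE_INTERVALS`, and `generate_task` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class Task:
    application: str
    implementation: str
    instructions_mi: float
    num_vms: int
    cores_per_vm: int
    memory_gb_per_vm: float


# Placeholder intervals for illustration only, not taken from the chapter.
EXAMPLE_INTERVALS: Dict[str, Tuple[float, float]] = {
    "instructions_mi": (1.0e6, 5.0e6),
    "num_vms": (1, 4),
    "cores_per_vm": (2, 8),
    "memory_gb_per_vm": (4.0, 16.0),
}


def generate_task(rng: random.Random, application: str,
                  implementations: Sequence[str],
                  intervals: Dict[str, Tuple[float, float]] = EXAMPLE_INTERVALS) -> Task:
    """Draw every task parameter uniformly from its predefined interval."""
    lo_vms, hi_vms = intervals["num_vms"]
    lo_cores, hi_cores = intervals["cores_per_vm"]
    return Task(
        application=application,
        implementation=rng.choice(list(implementations)),
        instructions_mi=rng.uniform(*intervals["instructions_mi"]),
        num_vms=rng.randint(int(lo_vms), int(hi_vms)),
        cores_per_vm=rng.randint(int(lo_cores), int(hi_cores)),
        memory_gb_per_vm=rng.uniform(*intervals["memory_gb_per_vm"]),
    )
```

Passing an explicitly seeded `random.Random` instance keeps runs reproducible, which is convenient when comparing allocation strategies on the same workload.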

This approach to application modelling reduces computational cost, allowing for large-scale simulations, while also providing realistic results during the simulations.

Execution Models

Existing cloud simulators generally create an a priori task list for the whole duration of the simulation, and then augment and sort that list as events are triggered by inputs. The disadvantage is that simulation data must be stored not only for the current event but also for future ones, which restricts the execution of large-scale simulations over long time periods. In contrast, the CloudLightning Simulator is based on a time-advancing loop, where incoming tasks are created dynamically in each time-step and where each time-step is independent from any previous or future ones (Filelis-Papadopoulos et al. 2017). A task list is created at the beginning of each time-step, removing the need to store future tasks of the simulation. Creating task lists per time-step significantly reduces the memory requirements of the simulation and makes it possible to simulate dynamic components that change their state according to dynamic strategies, including pRouters, pSwitches, and vRMs, while also allowing simulation over extended time periods.
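The memory behaviour of a time-advancing loop can be illustrated with a toy sketch. It assumes that every task runs for a fixed number of time-steps and that a random number of tasks arrives in each step; both are simplifications, and `run_simulation` is a hypothetical name. The point is that only tasks currently in flight are held in memory, regardless of how long the simulation runs.

```python
import random
from typing import List


def run_simulation(total_steps: int, max_new_tasks_per_step: int,
                   task_duration_steps: int, seed: int = 0) -> int:
    """Run a toy time-advancing loop and return the peak number of live tasks."""
    rng = random.Random(seed)
    active: List[int] = []  # remaining time-steps of each task in flight
    peak = 0
    for _ in range(total_steps):
        # Tasks are created at the start of the step; nothing is pre-computed.
        new_tasks = rng.randint(0, max_new_tasks_per_step)
        active.extend([task_duration_steps] * new_tasks)
        peak = max(peak, len(active))
        # Advance every task by one step and drop those that have finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return peak
```

In this toy model the peak is bounded by `max_new_tasks_per_step * task_duration_steps`, independently of `total_steps`, whereas an a priori event list would grow with the length of the simulated period.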

In the execution of tasks, the time-step is used as the control mechanism of the execution. The performance of applications is measured in MIs while the computational capability of the physical servers is measured in Millions of Instructions per Second (MIPS). In each time-step, the number of instructions that can be executed by the available resources is subtracted from the total number of instructions of the application. This time-step-controlled execution model offers significant capabilities since the impact of various phenomena can be modelled by applying penalties on the execution of tasks. For example, phenomena such as performance degradation due to cache sharing or “noisy-neighbours” can be modelled by reducing the computational capability, meaning that fewer of the application’s instructions will be executed on the current time-step. Similarly, the usage of hardware with a higher computational capability, that is, accelerators, can be modelled by increasing the computational capability of the current time-step. Service-level Agreement violations concerning memory, storage, or network limitations can be modelled by applying similar penalties in the execution of tasks.
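The time-step-controlled execution described above can be sketched with a single multiplicative factor standing in for penalties and gains. The factor-based formulation and the function names are assumptions; a factor below 1 models degradation such as noisy neighbours, and a factor above 1 models acceleration.

```python
def advance_task(remaining_mi: float, capacity_mips: float,
                 time_step_s: float = 1.0, factor: float = 1.0) -> float:
    """Subtract the instructions executed in one time-step.

    `factor` scales the computational capability for this step:
    below 1.0 for penalties, above 1.0 for gains (assumed formulation).
    """
    executed_mi = capacity_mips * time_step_s * factor
    return max(0.0, remaining_mi - executed_mi)


def steps_to_finish(total_mi: float, capacity_mips: float,
                    factor: float = 1.0, time_step_s: float = 1.0) -> int:
    """Count the time-steps needed to execute all instructions of a task."""
    if capacity_mips * time_step_s * factor <= 0.0:
        raise ValueError("effective capability must be positive")
    remaining = total_mi
    steps = 0
    while remaining > 0.0:
        remaining = advance_task(remaining, capacity_mips, time_step_s, factor)
        steps += 1
    return steps
```

For example, a task of 1000 MIs on a 100 MIPS resource takes 10 one-second steps; halving the capability doubles that to 20 steps, while doubling it reduces the count to 5.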

This approach of execution modelling allows the integration of possible extensions on the simulator, since any phenomenon can be modelled during a simulation by applying penalties or gains in the execution of the applications. Also, this execution model allows the simulation of very large time periods and millions of cloud servers, since the memory requirements of the execution model are very low.

5.4 Experimental Results

This section presents the experimentation framework and the numerical results obtained by simulating the traditional cloud delivery system and the SOSM framework.

The experiments were performed on a cluster consisting of four Dell PowerEdge C4130 nodes, each containing two 10-core Intel Xeon E5-2630 v4 CPUs running at 2.20 GHz (3.10 GHz Max Turbo frequency) with 128 GB of Random Access Memory (RAM), and a Dell PowerEdge R730 node containing two 8-core Intel Xeon E5-2609 v4 CPUs running at 1.70 GHz. During the simulation, the Dell PowerEdge R730 node was used to host the Gateway service, while the four Dell PowerEdge C4130 nodes were used to host the Cells.

The time period of the simulation was set to one week (604,800 seconds), with a time-step of 1 second. The update interval of the Gateway Service was chosen to be 200 seconds, while the update interval of the pRouters, pSwitches, and vRMs was 20 seconds. The cloud nodes of the simulated data centre were selected to use an Intel Xeon E5-2699 v4 2.20 GHz-based node with 44 cores and 385,063.42 MIPS, 128 GBytes of RAM, and 40 TBytes of storage.

Each Cell consisted of four different types of hardware, that is, CPUs+GPUs, CPUs+MICs, CPUs+DFEs, and CPU servers with no accelerators. Each heterogeneous node consisted of a CPU and four accelerators. The characteristics of the CPUs and the accelerators are presented in Table 5.1. The linear interpolation method on uneven utilisation intervals was used to estimate the power consumption of the CPU servers, with the power values for the various utilisation intervals obtained from SPEC (SPEC 2008).

Table 5.1 Selected simulation framework characteristics (adapted from Byrne et al. 2017)

During the simulations, three different types of applications were considered. The characteristics of the applications are presented in Tables 5.2 and 5.3.

Table 5.2 Hardware characteristics
Table 5.3 Application characteristics

The CloudLightning Simulator was executed for different numbers of resources, Cells, and submitted tasks. Each Cell was hosted on a Dell PowerEdge C4130 node, while in the experiments with eight Cells, each computing node hosted two Cells. Three different configurations were tested. In the first configuration, 11,000 resources per Cell were utilised, and the experiment was performed for different numbers of Cells. Similarly, in the second configuration, 110,000 resources per Cell were used, and in the third configuration, 1,100,000 resources per Cell were considered. The maximum number of submitted tasks was set to four per second when one Cell was used, and this number was multiplied by the number of Cells when additional Cells were used. The VM allocation policy used was the “first-fit approach,” according to which tasks are placed on the first available server found.

Table 5.4 presents the outputs, in terms of the number of accepted tasks, the average processor and accelerator utilisation, the average network utilisation, the energy consumption of the data centre, and the execution time of the CloudLightning Simulator, simulating a traditional centralised cloud service delivery system.

Table 5.4 Execution of the CloudLightning simulator for different number of resources, different number of Cells, and different number of tasks for the traditional centralised cloud service delivery system

For all configurations, it can be observed that the total number of rejected tasks was high, with a task rejection rate of ~86% on average. The task rejection was caused mainly by network congestion appearing early in the simulated cloud (Fig. 5.6). Although applications and their corresponding implementations (Table 5.3) were selected randomly using a uniform random generator, accelerator implementations began to be rejected after a period of simulated time, since the network resources are shared between the resources hosted across a Cell. This leads to the acceptance of additional CPU tasks, which in general require more computational time for execution and consequently overload the network.

Fig. 5.6 Energy consumption, processor, accelerator, and network utilisation of the cloud over time for the traditional cloud delivery simulation

The estimated energy consumption of the cloud infrastructure increased with the number of resources per Cell and the number of Cells. This is expected: apart from idle servers, which consume the minimum power, the energy consumption increases proportionally as the utilisation of the cloud increases.

The CloudLightning Simulator was also tested using the SOSM resource allocation framework, with 100 resources per vRM, 10 vRMs per pSwitch, and 5 pSwitches per pRouter. The VM allocation policy was “Task Compaction,” in which the system provisions as many VMs as possible on each physical server. Table 5.5 presents the outputs of the CloudLightning Simulator when using the SOSM engine: the number of accepted tasks, the average processor and accelerator utilisation, the average network utilisation, the energy consumption of the data centre, and the execution time of the simulator.

Table 5.5 Execution of the CloudLightning Simulator for different numbers of resources, Cells, and tasks when using the SOSM engine
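The SOSM hierarchy sizes and the idea behind the “Task Compaction” policy can be illustrated with a short Python sketch. The hierarchy constants come from the configuration stated above; the placement function is only one possible reading of "as many VMs as possible on each physical server" (choose the feasible server with the least free capacity), and its name and the single-resource vCPU model are assumptions, not the SOSM implementation.

```python
from typing import List, Optional

# Hierarchy sizes stated for the SOSM experiments.
RESOURCES_PER_VRM = 100
VRMS_PER_PSWITCH = 10
PSWITCHES_PER_PROUTER = 5


def resources_per_pswitch() -> int:
    """Resources managed beneath a single pSwitch."""
    return RESOURCES_PER_VRM * VRMS_PER_PSWITCH


def resources_per_prouter() -> int:
    """Resources managed beneath a single pRouter."""
    return resources_per_pswitch() * PSWITCHES_PER_PROUTER


def compact_placement(free_vcpus: List[int], requested: int) -> Optional[int]:
    """Hypothetical compaction rule: pick the feasible server with the fewest free vCPUs.

    Filling the most loaded server first leaves other servers untouched for longer.
    Returns the chosen server index, or None if no server can host the request.
    """
    best: Optional[int] = None
    for index, free in enumerate(free_vcpus):
        if free >= requested and (best is None or free < free_vcpus[best]):
            best = index
    if best is not None:
        free_vcpus[best] -= requested
    return best


# Illustrative values only; they are not taken from the experiments.
free = [8, 4, 8]
compacted = [compact_placement(free, 2) for _ in range(4)]
print(resources_per_pswitch(), resources_per_prouter())
print(compacted, free)  # the third server stays completely free
```

In this sketch the partially used server is filled first, and a second server is touched only once the first is full, which keeps the remaining servers idle for as long as possible.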

During the SOSM resource allocation simulation, the utilisation of CPUs and accelerators was more balanced. More specifically, accelerators tended to be utilised at the same levels as CPUs, and in many cases their utilisation percentages exceeded the corresponding CPU ones. This is because the system (the SOSM framework) decides which resources (and types of implementations) to allocate to a task according to predefined assessment functions, which target (a) improved service delivery, (b) computational efficiency, (c) improved energy consumption, and (d) efficient management of the underlying resources. Since accelerators are more efficient in terms of both computation and energy consumption, the system favours them.

It can also be seen that the total number of rejected tasks was very low (~0.05%), while the total estimated energy consumption of the cloud was close to that estimated for the traditional delivery system, owing to the use of the energy-efficient accelerators. Thus, the SOSM-based cloud environment was able to execute more tasks while consuming almost the same energy. This was expected, since SOSM selects the most efficient resources, which execute tasks faster and are therefore freed sooner, leading to more tasks being accepted.

To examine the energy efficiency of the two resource allocation techniques in more detail, the ratio of the total energy consumption of the data centre to the number of accepted tasks was computed for all experiments. Table 5.6 presents the energy consumed per task, in Wh, for all configurations. The energy per task is substantially smaller when the SOSM engine is used. This is because, without the SOSM engine, the resources to be utilised are selected randomly, whereas with the SOSM engine the system selects the resources according to the predefined strategies; thus, the most energy-efficient solution is always chosen.

Table 5.6 Ratio of the total energy consumption of the cloud over the number of accepted tasks for all configurations
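The ratio reported in Table 5.6 is the total energy consumption divided by the number of accepted tasks. A minimal Python sketch of this calculation follows; the function names are hypothetical, and the numbers in the example are placeholders chosen for illustration, not values from Tables 5.4 to 5.6.

```python
def wh_per_task(total_energy_wh: float, accepted_tasks: int) -> float:
    """Energy consumed per accepted task, in Wh."""
    if accepted_tasks <= 0:
        raise ValueError("at least one accepted task is required")
    return total_energy_wh / accepted_tasks


def rejection_rate(submitted_tasks: int, accepted_tasks: int) -> float:
    """Fraction of submitted tasks that were rejected."""
    if submitted_tasks <= 0:
        raise ValueError("at least one submitted task is required")
    return (submitted_tasks - accepted_tasks) / submitted_tasks


# Placeholder numbers: two runs with equal total energy but different acceptance.
energy_wh = 1000.0
few_accepted = wh_per_task(energy_wh, 100)
many_accepted = wh_per_task(energy_wh, 800)
print(few_accepted, many_accepted)
print(rejection_rate(1000, 100), rejection_rate(1000, 800))
```

With the total energy held roughly constant, the energy per task falls in proportion to the number of accepted tasks, which is the effect described for the SOSM engine.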

Figs. 5.6 and 5.7 present time-dependent charts for the last experiment of the third configuration (eight Cells, 1,100,000 servers per Cell). Fig. 5.6 shows the energy consumption, the processor utilisation, the accelerator utilisation, and the network utilisation of the cloud with respect to the simulated time for the traditional centralised cloud service delivery. Fig. 5.7 shows the same quantities with respect to the simulated time when using the SOSM engine.

Fig. 5.7 Energy consumption, processor, accelerator, and network utilisation of the cloud over time for the SOSM simulation

5.5 Conclusion

This chapter presented the work towards demonstrating the scalability of the CloudLightning simulation framework. Cloud simulation tools were examined, since demonstrating scalability directly on hyperscale clouds is infeasible. The design and implementation of the CloudLightning simulation framework, which overcomes the limitations of existing simulation platforms, were presented. The main innovations of the framework are that it is implemented for parallel computing systems (using MPI and OpenMP), it is based on a time-advancing loop instead of a discrete sequence of events, it allows the integration of dynamic resource allocation systems such as SOSM, and it supports hybrid CPU-accelerator resources. Finally, the CloudLightning Simulator was developed to be easily extensible, since the time-advancing execution model allows the integration of any strategies or phenomena observed in cloud environments.

The experiments showed that the CloudLightning Simulator is capable of simulating clouds with large numbers of resources. Executions were performed with the traditional cloud delivery system and with the SOSM framework, for various numbers of resources and Cells. Both the simulation platform and the SOSM framework were found to be scalable; simulations of up to 8,800,000 hardware resources grouped into eight Cells were performed, limited only by the hardware available for experimentation. SOSM was found to provide a more balanced distribution of tasks across the available hardware resources, with a much lower total number of rejected tasks. The energy consumption was found to be equivalent to that of the simulated traditional cloud delivery system; however, the SOSM system was able to service a significantly larger number of tasks. Thus, the energy consumed per task in the SOSM system was substantially lower than in the traditional approach.

The CloudLightning Simulator and Simulator Visualization Tool are available for download under the Apache 2 open source licence at https://bitbucket.org/cloudlightning/cloudlightning-simulator and https://bitbucket.org/cloudlightning/cl-simulatorvisualization, respectively.

5.6 Chapter 5 Related CloudLightning Readings

  1. Byrne, J., Svorobej, S., Giannoutakis, K., Tzovaras, D., Byrne, P. J., Östberg, P. O., et al. (2017). A review of cloud computing simulation platforms and related environments. In Proceedings of the 7th International Conference on Cloud Computing and Services Science (CLOSER 2017) (pp. 679–691). SCITEPRESS-Science and Technology Publications, Lda.

  2. Filelis-Papadopoulos, C. K., Gravvanis, G. A., & Kyziropoulos, P. E. (2017). A framework for simulating large scale cloud infrastructures. Future Generation Computer Systems. https://doi.org/10.1016/j.future.2017.06.017

  3. Filelis-Papadopoulos, C. K., Gravvanis, G. A., & Morrison, J. P. (2017). CloudLightning simulation and evaluation roadmap. In Proceedings of the 1st International Workshop on Next Generation of Cloud Architectures, Vol. 2. ACM.

  4. Filelis-Papadopoulos, C. K., Grylonakis, E. N. G., Kyziropoulos, P. E., Gravvanis, G. A., & Morrison, J. P. (2016). Characterization of hardware in self-managing self-organizing Cloud environment. In Proceedings of the 20th Pan-Hellenic Conference on Informatics, Vol. 56. ACM.

  5. Filelis-Papadopoulos, C. K., Giannoutakis, K. M., & Gravvanis, G. A. (2017). Large-scale simulation of a self-organizing self-management cloud computing framework. The Journal of Supercomputing. https://doi.org/10.1007/s11227-017-2143-2

  6. Giannoutakis, K. M., Makaratzis, A. T., Tzovaras, D., Filelis-Papadopoulos, C. K., & Gravvanis, G. A. (2017, April). On the power consumption modeling for the simulation of Heterogeneous HPC Clouds. In Proceedings of the 1st International Workshop on Next Generation of Cloud Architectures, Vol. 1. ACM.

  7. Lynn, T., Gourinovitch, A., Byrne, J., Byrne, P. J., Svorobej, S., Giannoutakis, K., et al. (2017). A preliminary systematic review of computer science literature on cloud computing research using Open Source simulation platforms. In Proceedings of the 7th International Conference on Cloud Computing and Services Science (CLOSER 2017) (pp. 537–545). SCITEPRESS-Science and Technology Publications, Lda.

  8. Makaratzis, A. T., Giannoutakis, K. M., & Tzovaras, D. (2017). Energy modeling in cloud simulation frameworks. Future Generation Computer Systems. https://doi.org/10.1016/j.future.2017.06.016