6.1 A Facial Recognition Surveillance System

tags: open, two classes, Source/Queue/Class-Switch/Sink, JSIMg.

We consider a surveillance system based on the facial identification of passengers flowing through an airport, implemented with an Edge computing architecture. Similar systems can be applied in several environments such as railway stations, shopping malls, roads, airways, banks, public buildings, museums, hospitals, etc. It is a simple model that represents a first step towards the solution of the complex problems of security control.

6.1.1 Problem Description

Currently available Internet of Things (IoT) devices are equipped with powerful processors, large storage, and actuators, and generate huge amounts of data that must be transmitted through the network. Cloud computing, with its large availability of highly scalable servers, is also very appropriate for IoT-based architectures. However, since the distance between the IoT devices and the cloud servers is typically large, the resulting latency is not negligible and exhibits unpredictable fluctuations.

This characteristic is very negative for most IoT applications, which are delay-sensitive because they are based on decision/reaction cycles (see, e.g., virtual reality, smart buildings, video surveillance, facial recognition, e-health, monitoring, automotive and traffic control). Minimizing the time required to process the data generated by the IoT devices is essential for the correct execution of these applications.

To approach this problem, massively distributed architectures that allow the implementation of the Edge computing paradigm have been introduced. In these environments, the components, referred to as Edge nodes, that process the data are placed as close as possible to IoT devices, i.e., at the edge of the network.

Typically, the Edge nodes have sufficient processing power and storage capacity to execute efficiently most of the tasks of the applications. Coordinator servers that perform application management tasks (when they are needed) are placed near the Edge nodes and are connected to them with fast links. Only the heaviest tasks are sent to the Cloud servers.

The latency reduction is achieved in two ways: on the one hand, most of the tasks are executed locally by the Edge nodes in close proximity to the IoT devices; on the other hand, only the heaviest tasks are sent to powerful cloud servers. As a consequence, the data transmitted over the network and the Response times can be minimized.

In this case study we describe the Edge computing environment (see Fig. 6.1) which is used in an airport to implement a surveillance system based on facial recognition.

Fig. 6.1 The surveillance system based on the identification of facial scans of passengers

The identification system detects the faces of passengers passing by the scanners: those that go up and down the escalators, those in line at check-in desks, and those flowing through the various areas of the airport (e.g., waiting rooms, shops, bars, restaurants). Five categories of persons are considered, corresponding to five types of scans: poor-quality image, regular, suspect, dangerous, and unknown person.

To identify the category to which they belong, the faces detected by a scanner are first compared with those of an in-memory database stored in the directly connected Edge node. The reaction actions that must be taken after a scan is identified vary greatly depending on its category. The scans belonging to the poor-quality and regular (safe people) categories only require accesses to the local databases and no further actions. The scans of the suspect and dangerous categories require, among other things, very quick actions to synchronize the scanners along the path followed by the person to be tracked, and messages must be transmitted to the interconnected security agencies.

The algorithms for matching the detected faces with the images stored in the local in-memory database are executed on the Edge nodes. The Edge nodes also interact with each other to synchronize the in-memory databases and to coordinate reaction actions.

The scans of the unknown category (i.e., those not present in the local Edge node databases) are sent to the cloud for more in-depth analysis. At the cloud layer, very large NoSQL distributed databases for Big Data (such as Apache HBase, Hive, Cassandra, MongoDB), with documents, social media profiles, biometric data, and voice traces, are used for extensive identification analysis with the most advanced face detection algorithms. Machine learning algorithms are implemented to train the system to minimize false identifications. The results of this additional processing are sent back to the Edge nodes to update the local databases and then to the System Coordinators to implement the reaction actions. To keep the presentation simple, in the implemented model we have not explicitly considered the System Coordinator servers, as very often they are not present or are powerful servers that do not cause performance problems.

The capacity planning study is structured in two main phases, referred to as initial sizing phase and performance forecast phase, respectively.

The objective of the initial sizing phase is the calculation of the number of Edge nodes that guarantees the achievement of the performance targets with the planned workload (referred to as the original workload). The most important performance constraint is the time required to analyze a scan by the Edge nodes, i.e., their mean Response time, which must be less than 3 s for all scan categories excluding the unknown one. This constraint is important because most of the reaction actions, to be effective, must be activated within 3 s of the image detection. The configuration of each Edge node initially consists of one server mounted in a rack located in a dedicated room. The computed number of Edge nodes coincides with the number of server rooms. To ensure the highest level of physical safety of the equipment, the locations of the rooms are kept secret. To increase the availability of the global system, each room has independent equipment for fire detection, flooding protection, cooling, and power continuity (UPS). For several important reasons, the number of server rooms cannot change over a long period of time. Initially, the scan flows arriving at the Edge nodes are considered balanced across all nodes. This first phase of the study is also important for its connections with the construction of the airport buildings.

The second phase of the study is devoted to the assessment of the impact of different workload growth patterns on the Response times of the Edge nodes. Several factors, such as the commercial policies of the airlines or the success of new destinations served, make the forecast of workload trends very uncertain. In short periods of time, large differences can occur between the scan streams arriving at the various Edge nodes. Therefore, the implemented model must be able to simulate very different workloads in terms of arrival patterns and mix of scan categories. Two types of workload growth are considered: an increase in traffic intensity with the fractions of scan categories of the original workload kept fixed, and workloads with significant differences in the mix of categories in execution. The impact of these types of workload changes on the performance of the global system is studied. This knowledge is fundamental for the implementation of the scalability feature of the Edge nodes with respect to workload growth.

The main results of this case study are:

  • computation of the number of Edge nodes required to meet the performance target of their Response times with the original workload

  • identification of the number of scanners connected to each Edge node that provides a balanced load of scans across the nodes, as a function of the layout of the airport and of the differences (in intensity and mix of scan categories) of the passenger flows. This result allows the identification of the physical locations of the Edge nodes in the various buildings

  • computation of the number of servers for each Edge node required to meet their performance target as a function of the workload growth. The computed performance metric can be used to drive the horizontal scaling component (that can be implemented in the system) of each Edge node separately

  • show how a complex model can be decomposed into several simpler models that can be solved separately; the results thus obtained can be combined to provide the solution of the global model (see, e.g., the incremental approach in Sect. 1.1 and Fig. 1.2).

6.1.2 Model Implementation

The scanners generate the face scans sent in input to the model and represent the Source of the identification requests arriving at the Edge nodes (Fig. 6.2). Depending on the processing time and the path between the resources, the requests can be divided into two groups, i.e., two classes. The first class (class-E) comprises the scans belonging to the poor-quality, regular, suspect, and dangerous categories, which are completely processed by the Edge nodes. The second class (class-C) consists of the scans of the unknown category, which require additional processing by the Cloud servers.

Figure 6.2 shows the model of the global system. The solid lines represent the path between the resources of class-E requests while the dashed lines represent the path of class-C requests.

Fig. 6.2 Model of the global facial recognition system. Solid lines represent the path of class-E requests while the dashed lines that of class-C requests

All the requests sent by the Source stations to the Edge nodes are initially of class-E type. When their identification process is completed, the requests that belong to the poor-quality, regular, suspect, and dangerous categories leave the model through the Sink class-E stations. They will subsequently be processed by the System Coordinator servers, not considered in the model presented here, to implement the reaction actions. The requests of the unknown category are instead routed to the Cloud servers for a more extensive analysis. The class of these requests is changed from class-E to class-C in the Class-Switch station before joining the Cloud servers.

Fig. 6.3 Parameter settings of class-E and class-C requests (a); probability that a request changes class in the Class-Switch station (b)

Once their processing is completed, these class-C requests are sent back to the Edge nodes to update the local in-memory databases and other data structures before leaving the model through the Sink class-C stations. In Fig. 6.2, \(p_c\) represents the fraction of the requests generated by the scanners connected to an Edge node that is sent to the Cloud servers, i.e., those that belong to the unknown category.

An example of the parameter settings of the two classes is shown in Fig. 6.3a. Note that only class-E requests are generated by the Source (in the figure, their arrival rate is set to \(\lambda =0.2\;\)req/s with exponential distribution) because the class-C requests are generated in the Class-Switch station from the arriving class-E requests (thus no parameters need to be set for them). As a consequence, the Reference stations of the two classes are the Source and the Class-Switch, respectively. The selection of different Reference stations is important for the computation of the correct values of the per-class performance indexes.

The window for the definition of the parameters of the Class-Switch station is shown in Fig. 6.3b. In the considered problem it consists of a 2x2 matrix, whose entry i-j represents the probability that a class-i request entering the station is changed to class-j when it exits. In our model this matrix is simple, as the requests arriving at the Class-Switch station are only of class-E and are all changed to class-C. Indeed, the class-C requests arriving at the Edge node after being processed by the Cloud servers are sent directly to the Sink class-C station by the routing algorithm.

For each Edge node an instance of a Virtual Machine (VM) is launched in a Cloud server to process all the requests sent by that node.

The global processing time required by the face recognition algorithm to solve the pair matching problem (i.e., to find which person, if any, among those stored in the local in-memory database the scan represents) on an Edge node is \(D_{E,E}=500\,\)ms. The time required by a scanner to detect, pre-process and transmit an image is negligible compared to the time required for its analysis. We account for it with a small increase in the service demand \(D_{E,E}\).

Scans of the unknown category are sent to the Cloud servers as class-C requests, and require \(D_{C,C}=800\,\)ms for their processing. The results of this analysis are sent back (still as class-C requests) from the Cloud servers to the Edge nodes, which require an additional \(D_{E,C}=100\,\)ms for their analysis (to update several data structures and the local in-memory database) before sending them to the Sink class-C. The Network is not represented in the model as a separate component since the transmission time of data to and from the Cloud servers is negligible with respect to their processing demands.

Table 6.1 summarizes the mean values of the Service demands (exponentially distributed) of the two classes of requests. The weight \(p_c\) of the class-C demands is introduced to take into account that only the fraction \(p_c\) of the requests generated by the scanners is sent to the Cloud servers as class-C requests. The difference between the processing times of the two classes of requests is very large: class-C requests require 900 ms while class-E requests require 500 ms! Clearly, the value of \(p_c\) deeply influences the performance of the system. Therefore, by changing the value of \(p_c\) from 0 to 1, we can model all the possible configurations of the workload.

Table 6.1 Service demands [s] of class-E and class-C requests. \(p_c\) is the fraction of class-E requests that belong to the unknown category and are sent to the Cloud servers as class-C requests
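The per-resource load implied by these demands can be computed directly with the Utilization Law. A minimal sketch in plain Python (not part of the JSIMg model; the variable names are ours):

```python
# Sketch: aggregate service demands per resource as a function of p_c,
# using the values of Table 6.1.
D_EE = 0.5   # class-E demand at the Edge node [s]
D_EC = 0.1   # class-C demand at the Edge node (post-cloud update) [s]
D_CC = 0.8   # class-C demand at the Cloud server [s]

def demands(p_c):
    """Global demands per arriving scan, weighted by the fraction p_c
    of scans that continue to the Cloud as class-C requests."""
    D_edge = D_EE + p_c * D_EC    # Edge node: every scan, plus p_c of the updates
    D_cloud = p_c * D_CC          # Cloud server: only the unknown scans
    return D_edge, D_cloud

def utilizations(lam, p_c):
    """Utilization Law U_i = lambda * D_i for the open model."""
    D_edge, D_cloud = demands(p_c)
    return lam * D_edge, lam * D_cloud

D_edge, D_cloud = demands(0.4)
print(round(D_edge, 2), round(D_cloud, 2))    # 0.54 0.32
U_edge, U_cloud = utilizations(1.4, 0.4)
print(round(U_edge, 3), round(U_cloud, 3))    # 0.756 0.448
```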

Important simplifications of the global system model can be obtained by applying the assumptions introduced in the project description. These made it possible to adopt the incremental approach technique (see Fig. 1.2). Among them, the most important are: once computed in the initial phase, the number of Edge nodes must be kept constant, while the number of servers of each node can increase as a function of the performance requirements; the scanners cannot change the Edge node to which they are directly connected, but their number can change according to several parameters (e.g., high or low traffic of the locations served, bursts of arriving people, layout of the building); the fraction \(p_c\) of unknown scans received by the Edge nodes is initially considered the same for all nodes; the load of each Edge node can vary according to several parameters that depend only on the node itself; there is no interference among the VMs instantiated in the cloud by the various nodes.

As a result of these assumptions, Edge nodes can be considered independent from each other and the global system model can be subdivided into as many simple models, referred to as elementary components, as there are Edge nodes. The model of an elementary component is shown in Fig. 6.4. Therefore, we can approach the capacity planning problem of the overall system by investigating the performance behavior of each Edge node separately.

As required by the application, the mean time needed to analyze a scan at an Edge node must be less than 3 s, i.e., for class-E requests it must be \(R_{Edge}^E\le 3\,\)s. Each Edge node must meet this performance constraint while processing scans of all categories except the unknown one.

Fig. 6.4 JSIMg model of an elementary component consisting of one Edge node. Solid lines represent the path of class-E requests while the dashed lines that of class-C requests

In the initial sizing phase, the overall intensity of the original workload has been subdivided evenly among all the nodes. The load of each node is assumed to be the same, both in intensity and in composition (i.e., mix of scan categories), for all the nodes. The number of Edge nodes computed in this phase, initially configured with one server each, is the minimum required to satisfy the performance constraint with the original workload. The corresponding arrival rates, referred to as guard values, are considered as thresholds that cannot be exceeded. When a node guard value is reached (or rather approached), a new server will be allocated in its rack (or switched to on-line status if it is already mounted) and the incoming requests to the node will be balanced among all the servers in the rack. This scaling policy is applied to all Edge nodes separately.

The JSIMg tool has been used to implement the simulation models.

6.1.3 Results

The objectives of the study were many. In what follows we will describe the activities regarding the following two:

— Obj.1: Initial sizing and Dynamic Scalability of the Edge nodes

— Obj.2: Investigate the behavior of the Response times of the Edge nodes as a function of the mix of scan categories in execution

— Obj.1: Initial sizing and Dynamic Scalability of the Edge nodes.

In the design phase of the project, the arrival rate of scans for the entire airport, referred to as the original workload, is set to \(\lambda _0 = 42\) scan/s. The fraction of the detected scans that belong to the unknown category is 40% (\(p_c=0.4\)) and is assumed to be the same for all Edge nodes. The service demands of the two classes of requests are shown in Table 6.1. According to the hypotheses, the original workload of rate \(\lambda _0\) is subdivided evenly among all the \(N_{EN}\) nodes. We considered the model of an individual Edge node (see Fig. 6.4) subject to arrival rates of scans ranging from 0.2 to 1.75 scan/s. Let us remark that the saturation load of an Edge node is \(\lambda ^{sat}= 1/[0.5+(0.1\times 0.4)] = 1.85\) scan/s. The Response times of an Edge node in the initial configuration with one server, obtained with a What-if analysis, are shown in Fig. 6.5. We assume exponential Interarrival times of the scans, and the PS (Processor Sharing) scheduling policy in the queue stations, which is typically used to simulate multiclass workloads parametrized with service demands. This policy captures reality better than FCFS since the Service demands are obtained by summing the Service times of all the visits of a request. So, with PS, the executions of all the requests are seen to progress concurrently also at the demand level. Furthermore, when this policy is adopted, several analytical solvers (see JMVA) provide exact solutions of models with multiclass workloads (see the BCMP theorem in Sect. 3.1).
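Under these product-form assumptions, the per-class Response time of a single-server Edge node can also be checked analytically with \(R_c = D_c/(1-U)\), where \(U\) is the total utilization of the node. A minimal sketch (an analytical approximation of the simulation results of Fig. 6.5, not a replacement for them):

```python
# Analytical cross-check (open PS queue, product-form): R_c = D_c / (1 - U).
p_c    = 0.4                    # fraction of unknown scans
D_edge = 0.5 + 0.1 * p_c        # total demand per arriving scan at the Edge node [s]

lam_sat = 1 / D_edge            # saturation load: 1/0.54 ~= 1.85 scan/s
print(f"saturation load: {lam_sat:.2f} scan/s")

for lam in [0.2, 0.6, 1.0, 1.4, 1.75]:
    U = lam * D_edge            # Utilization Law
    R_E = 0.5 / (1 - U)         # class-E Response time at the Edge node [s]
    print(f"lambda={lam:.2f}  U={U:.2f}  R_E={R_E:.2f} s")
```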

Fig. 6.5 Response times [s] of an Edge node with a single server for arrival rates of scans \(\lambda = 0.2 \div 1.75\) scan/s; 40% of them belong to the unknown category (\(p_c=0.4\))

The threshold value of the average processing time (i.e., the Response time) of class-E scans of the Edge nodes is 3 s. According to the hypothesis that the original workload is initially equally divided among all the Edge nodes, we computed their minimum number \(N_{EN}\) needed to satisfy the constraint. With \(N_{EN} = 30\) nodes, the arrival rate of scans at each node is 1.4 scan/s (\(\lambda _0/N_{EN}\)), and the corresponding mean Response time of an Edge node with one server is \(\simeq 2\;\)s (see Fig. 6.5).

Higher arrival rates, e.g., 1.55 scan/s, could also have been considered. However, the adoption of \(\lambda = 1.4\,\)scan/s as guard value for the scalability monitor is motivated by the need to tolerate unexpected fluctuations in the flow of passengers without seriously violating the Response time constraint (e.g., a 10% increase in load corresponds to a Response time of \(\simeq 3\;\)s, which still satisfies its limit value).
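The same analytical approximation provides a quick check of these figures. A sketch, assuming the per-node guard value of 1.4 scan/s:

```python
import math

# Sizing sketch: number of Edge nodes from the airport-wide arrival rate and
# the per-node guard value, with the PS approximation R_E = D_E,E / (1 - U).
lam_airport = 42.0              # original workload [scan/s]
lam_guard   = 1.4               # per-node guard value [scan/s]
D_edge      = 0.5 + 0.1 * 0.4   # total demand per scan at a node [s]

# rounding before ceil() guards against floating point noise in the division
N_EN     = math.ceil(round(lam_airport / lam_guard, 9))   # 30 Edge nodes
R_guard  = 0.5 / (1 - lam_guard * D_edge)                 # ~2.0 s at the guard value
R_plus10 = 0.5 / (1 - 1.1 * lam_guard * D_edge)           # ~3.0 s with a 10% load increase
print(N_EN, round(R_guard, 2), round(R_plus10, 2))
```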

The global number of scanners \(N_{Scan}\) to be installed in the airport has been computed considering the technical characteristics of the scanners, the processing capacity of the Edge servers, and the intensity of the flow of passengers in the airport. The result of the computation is \(N_{Scan}=840\), an average of 28 scanners per node. Let us remark that, due to the heterogeneity of traffic along the paths of the airport, this initial subdivision does not correspond to the assignment of the same number of scanners to each node but to the one that generates a balanced load across all nodes. With the arrival rate of 1.4 scan/s at each node, and the average of 28 scanners per node, each scanner generates an average of 1 scan every 20 s.

To avoid instability, a new server is allocated in a node when its arrival rate is higher than the guard value for a time interval whose duration depends on the path considered. The load will be re-balanced among all the installed servers after a transient period, considering the arrival of new requests and the exit of executed requests. Figure 6.6 shows the behavior of the Response times of an Edge node with arrival rates ranging from 0.2 to 10 scan/s and a number of servers from 1 to 6. This diagram is fundamental for the horizontal scalability of the Edge nodes since it provides the number of servers needed to meet the performance constraint on Response times as a function of the load behavior. When an autoscaling component is used, it provides the values for triggering scale-up actions.
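A rough analytical counterpart of Fig. 6.6 can be obtained by assuming that the scans arriving at a node are balanced evenly across its servers and that each server behaves as an independent PS queue; the simulated values may of course differ. A sketch under these assumptions:

```python
def servers_needed(lam_node, p_c=0.4, R_target=3.0):
    """Minimum number of servers in an Edge node rack so that the class-E
    Response time stays below R_target, with the load split evenly among
    the servers and each server modeled as an open PS queue."""
    D_total = 0.5 + 0.1 * p_c           # total demand per scan on one server [s]
    k = 1
    while True:
        U = (lam_node / k) * D_total    # per-server utilization
        if U < 1 and 0.5 / (1 - U) <= R_target:
            return k
        k += 1

for lam in [1.4, 3.0, 5.0, 8.0]:
    print(f"{lam} scan/s -> {servers_needed(lam)} server(s)")
```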

Fig. 6.6 Response times of an Edge node for the analysis of a face-scan versus the arrival rate, for various numbers of servers; 40% of the arriving scans belong to the unknown category (\(p_c=0.4\))

— Obj.2: Investigate the behavior of the Response times of the Edge nodes as a function of the mix of scan categories in execution.

In systems with multiclass workloads, the bottleneck (i.e., the resource with the highest utilization) can migrate between resources depending on the mix of classes of requests being executed. The greater the difference between the maximum service demands of the classes (when they refer to different resources) the deeper the effects of bottleneck migration on system performance. For example, in our system when the load consists of class-E scans only (i.e., with \(p_c=0\)) the maximum Throughput is \(X_0^{max}= 1/D_E^{max}\) = 2 scan/s while with class-C scans only (i.e., with \(p_c=1\)) it is \(X_0^{max} = 1/D_C^{max}\) =1.25 scan/s (37.5% less!).

Thus, it is important to consider the resource utilizations of each class of requests. The capacity planning study must evaluate the projections on performance of all possible changes in the workload, not only in terms of intensity but also in the mix of classes of requests being executed.

In Obj.1 the fractions of scan categories arriving at the Edge nodes (initially they are all of class-E) were considered constant: 40% of them were of the unknown category (\(p_c=0.4\)). Now we relax this assumption and investigate the behavior of the Utilizations and Response times with all the possible mixes of scan categories. By applying the Utilization Law \(U_i=\lambda D_i\) to the Edge and Cloud servers of the open model of Fig. 6.4, we can easily obtain the mix of requests that balances the load on the two resources.

By simplifying the two equations in \(\lambda \) and equating the global service demands \(D_E\) and \(D_C\) (see Table 6.1), we have \(0.5+0.1\,p_c = 0.8\,p_c\). Thus, we derive that the fraction \(p_c = 0.71\) of incoming scans generates the equal utilization of the two resources. For \(p_c < 0.71\) the most utilized resource is the Edge node, while for \(p_c > 0.71\) it is the Cloud server. Figure 6.7 shows the behavior of the Utilizations and Response times of the Edge node (with one server) and the Cloud server for all the possible mixes of scan categories in execution, with arrival rate of scans \(\lambda =1.4\) scan/s.
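A short sketch that reproduces this derivation numerically, including the throughput bound for the extreme mixes mentioned above (plain Python, names ours):

```python
# Per-resource utilizations and throughput bound as a function of the mix p_c
# (arrival rate fixed at 1.4 scan/s, demands from Table 6.1).
lam = 1.4
for p_c in [0.0, 0.1, 0.4, 0.71, 0.9, 1.0]:
    D_edge, D_cloud = 0.5 + 0.1 * p_c, 0.8 * p_c
    U_edge, U_cloud = lam * D_edge, lam * D_cloud     # Utilization Law
    X_max = 1 / max(D_edge, D_cloud)                  # throughput bound for this mix
    print(f"p_c={p_c:.2f}  U_edge={U_edge:.2f}  U_cloud={U_cloud:.2f}  X_max={X_max:.2f}")

# Equi-utilization mix: 0.5 + 0.1*p_c = 0.8*p_c  ->  p_c = 0.5/0.7
print("balanced mix: p_c =", round(0.5 / 0.7, 2))     # 0.71
```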

The impact of the different mixes on the performance of the two classes of requests is evident.

Fig. 6.7 Utilization and Response time of the Edge node and of the Cloud server with respect to the mix of scan categories in execution, with arrival rate \(\lambda = 1.4\) scan/s

As computed above, the mix corresponding to the equal utilization of the two resources is obtained with \(p_c=0.71\). The Utilizations of the Edge node range from 0.7 (with \(p_c=0.1\)) to 0.8 (with \(p_c=0.9\)), and the corresponding Response times are 1.7 and 2.9 s. Note that the Cloud server Response time increases rapidly when its arrival rate of class-C requests approaches the saturation value of \(1/D_C^{max}=1/0.8=1.25\;\)scan/s. Since the arrival rate of class-C requests to the Cloud server is \(\lambda \,p_c\), its utilization is given by \(U_{C}= \lambda \, p_c\, D_{C,C}\). The value of \(p_c\) that generates the saturation can easily be obtained from this equation considering \(U_C =1\), \(\lambda =1.4\;\)scan/s, and \(D_{C,C}=0.8\) s. The result is \(p_c=0.892\). Clearly, the models with higher values of \(p_c\) are not in equilibrium and are unstable since one resource is saturated. To improve the performance of the system with higher values of \(p_c\) it is necessary to use more powerful VMs on the Cloud servers.
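For a different arrival rate, the same relation gives the saturating mix directly; a one-line check of the value above:

```python
# Mix that saturates the Cloud server VM, from U_C = lam * p_c * D_CC = 1.
lam, D_CC = 1.4, 0.8
p_c_sat = 1 / (lam * D_CC)
print(round(p_c_sat, 2))   # ~0.89
```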

6.1.4 Limitations and Improvements

The model described is clearly a simplified version of a global surveillance system model. However, with limited effort it can be improved in different directions. Among them are:

  • Fractions of scan categories: The assumption that the fraction of scans of unknown category is the same for all the Edge nodes is a limitation that can be easily relaxed. In this case, it is enough to make a model for each distinct Edge node with the fractions of scan categories arriving at the node. In many cases it is sufficient to identify groups of Edge nodes having similar characteristics with respect to the flow of passengers and the fraction of scan categories and to implement only their models.

  • Interarrival time distributions of scans: To capture the differences of arriving traffic of scans among the various categories several classes can be considered. So, for example, bursts can affect one category while another one can have a different distribution. For each class, follow the sequence Define customer classes, Edit, and select the distribution from the list, e.g., Burst general.

  • Interconnection network: Depending on the characteristics of the network connecting the Edge nodes to the cloud, it is possible to model it with a dedicated component, e.g., a delay station, with the mean service time and variance collected directly from the network.

  • Allocation/Deallocation of servers: A policy similar to that described for the dynamic allocation of servers to the Edge nodes can be used for their deallocation. In this case, a new guard value of the Response time of the Edge nodes, i.e., its minimum mean value, must be defined by the users and set in the autoscaler component. When it is reached, a server of the node can be deallocated and its load redistributed among the remaining ones, or the server can simply no longer receive new load.

  • Fluctuations: Depending on the environments considered, the traffic of arriving scans can be affected by fluctuations with very high peaks and deep valleys. In these cases, to avoid instability in the number of servers of the Edge nodes, it can be useful to define, for each of the two guard values used by the allocation/deallocation policies of the autoscaler, a range of tolerated values instead of a single mean value.

6.2 Autoscaling Load Fluctuations

tags: open/closed, two classes, Source/Queue/Place/Transition/Sink, JSIMg.

In this case study we describe a multi-formalism model [5, 20] (with Queueing Networks and Petri Nets stations integrated) that simulates an autoscaler component that manages congestion created by fluctuations in incoming traffic and computational demands. The focus is on the description of the dynamic routing mechanism (that is state-dependent) of the arriving requests as a function of the load fluctuations of an online web service center. The solution described allows cost savings, in terms of resources used, while preserving the expected system performance and can be applied with considerable savings to exascale data centers.

6.2.1 Problem Description

Most data centers of Internet Service Providers experience load fluctuations caused by the combined effects of variability in incoming traffic rate and the computation time of the requests [13]. Depending on the service provided, fluctuations can have very different intensities and time scales.  For example, in e-commerce sites, the increase of load due to seasonal sales can last several weeks with medium intensity and quarterly frequency, while unexpected events, such as special offers, create high spikes in requests with short duration and heavy computation time.

We can basically distinguish between long-term and short-term fluctuations. The former have low frequency, small/medium intensity and are generated by the typical growth trend of workloads. The latter have a short duration, high intensity and can occur at unpredictable times.

In such a variable scenario, the right-sizing problem, i.e., the identification of the minimum number of resources that must be used to achieve the performance objectives, is a very difficult problem. Over-provisioning may result in a waste of resources and money. On the other hand, under-provisioning can lead to violating customer expectations in terms of Quality of Service (QoS), with negative effects on business. Autoscaling techniques are increasingly used to dynamically allocate and release resources both in clouds (e.g., AWS Auto Scaler [1], Microsoft Azure autoscaler [27]) and in private data centers (e.g., [18, 22, 30]). Basically, these dynamic scaling techniques (usually divided into horizontal and vertical scaling techniques) monitor one (or more) performance indicators and, when a target value is reached (or approached), trigger decisions to adapt the number of resources as the load increases or decreases. In the following we consider only the increase case because it is the most critical for performance and, furthermore, the decisions taken in the decrease case are usually the opposite of those made in the first case.

When the target value of the performance indicator is detected, horizontal scaling typically allocates new resources while vertical scaling increases the capacity share of the resources.

Horizontal scalers provide good results when used with loads subject to long-term fluctuations, such as those generated by physiological workload trends, whose growth rate increases progressively and continuously. But their application to loads subject to short-term fluctuations is unsatisfactory. The presence of load spikes has a very negative impact on performance as it creates a sudden congestion of resources which is responsible for high Response times. Furthermore, spikes can cause horizontal scalers to make contradictory decisions in a short time, which could generate dangerous oscillations in the number of resources provided. These unstable conditions must be avoided as much as possible, as resource allocations are costly and time-consuming operations.

To address these drawbacks, we designed a hierarchical scaler (see, e.g., [33, 34]) with the two operational layers shown in Fig. 6.8. The objective of the horizontal scaler at Layer 1 is the typical one: to provide the minimum number of resources (referred to as Web Servers) that should be used to achieve the performance target. This scaler has been enhanced with a second operational layer, Layer 2, consisting of a Spike Server that allocates CPU capacity to execute load spikes according to a vertical scaling technique. A request can be executed by a Web Server or by the Spike Server depending on the load conditions.

Fig. 6.8 Hierarchical autoscaler for load spikes

At Layer 1, a new Web Server is allocated when the monitored performance indicator reaches, or is close to, its threshold value. To make decisions on whether to scale or not, we have considered the performance indicator mean Response time \(R_0\) of the data center, i.e., the mean time required by the execution of a request. Layer 2 operations are triggered when a load spike is expected to arrive at one of the Web servers.

While the evaluation of \(R_0\) is a well defined process, load spikes are usually not so easy to predict with reasonable accuracy. Instead of running complex and time-consuming analyses on the traces of arriving requests, we considered the signals that anticipate the arrival of a potential peak load. More precisely, we consider a Spike Indicator (SI) metric whose alarm threshold \(SI^{max}\), when reached, indicates that a peak load is likely to occur. Since a spike is anticipated by an increase of the load in the system, we associate SI with the number of concurrent requests in execution in the considered Web Server. This metric is very suitable for the autoscaler as it is easy to measure and can detect the creation of peaks in their early stages, not only in the arrival traffic flow but also in the request execution times. When SI reaches, or approaches, the threshold \(SI^{max}\), a scaling decision should be made quickly to alleviate the congestion of the Web Server: the new incoming requests are routed to the Spike Server. As a consequence, the load of the Web Server will decrease as the running requests complete their execution. The routing of the requests is switched back to the Web Server when SI decreases below \(SI^{max}\). To minimize the fluctuations in the number of resources allocated, a range of values that includes \(SI^{max}\) can be considered instead of a single value (which we use here for simplicity).
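The Layer 2 decision rule itself is simple; a minimal sketch (the names and the threshold value are illustrative, not JSIMg identifiers):

```python
# Minimal sketch of the Layer 2 routing rule: new requests go to the Spike
# Server while the number of requests in execution in the Web Server has
# reached the alarm threshold.
SI_MAX = 140            # alarm threshold of the Spike Indicator [requests]

def route(requests_in_web_server: int) -> str:
    """Return the destination of a newly arrived request."""
    if requests_in_web_server >= SI_MAX:
        return "SpikeServer"     # Web Server congested: divert the request
    return "WebServer1"          # normal operation

print(route(120), route(140))    # WebServer1 SpikeServer
```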

Clearly, the detection of the correct value of \(SI^{max}\) is a very critical operation for the effectiveness of the autoscaler. If too many false positive spikes are detected, the Spike Server tends to be congested. On the other hand, if too many spikes are missed (false negatives), the mechanism fails to reduce the congestion of the Web Servers. The \(SI^{max}\) value is influenced by the characteristics of the workload, both by the arrival patterns and by the execution times, and by the performance objectives. In the following we will describe one of the possible approaches to tune \(SI^{max}\).

The presence of the Spike Server has a very positive impact in reducing the System Response Time \(R_0\). In fact, the larger values of Response times, mainly due to the congestion states of the Web Server, are replaced with smaller values obtained from the executions of the Spike Server, which is typically not congested. This smoothing effect reduces the variance of Response times, their mean value, and therefore the number of scale actions and their oscillations. The efficacy of the introduction of Spike Server is related to the following basic principle that applies to open models: the increase in Response time due to an increase \(\Delta \lambda \) of load is greater than its decrease due to the same decrease \(\Delta \lambda \). This effect is due to the vertical asymptote to which the Response time tends as the queue component approaches saturation.
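This asymmetry is easy to verify numerically on an open queue; a small illustration assuming Poisson arrivals for simplicity (the demand value is the one used later in the model):

```python
# Asymmetry of the Response time around an operating point (open PS queue,
# Poisson arrivals): R(lam) = D / (1 - lam * D).
D, lam, delta = 0.16, 5.0, 0.5           # demand [s], base rate, +/- delta [req/s]
R = lambda l: D / (1 - l * D)
print(round(R(lam), 2),                  # 0.8 s at the base rate
      round(R(lam + delta) - R(lam), 2), # +0.53 s for +0.5 req/s
      round(R(lam) - R(lam - delta), 2)) # -0.23 s for -0.5 req/s
```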

The operating steps of the hierarchical autoscaler are:

  1. At Layer 1, the horizontal scaler monitors the performance metric System Response Time \(R_0\) and triggers congestion management actions when a threshold value has been exceeded. The value of \(R_0\) is computed by applying a moving window technique whose duration is a function of the characteristics of the workload. In the computation of \(R_0\) both the execution times of the Web Server and those of the Spike Server must be considered. According to the rules set at design phase, when the alarm threshold of \(R_0\) is reached, or approached, the scaling decisions concerning the provisioning of new Web Servers must be activated.

  2. The control of the arrival of load spikes (Layer 2) is always active, through the monitoring of the number of requests SI concurrently in execution in the Web Server. When the alarm threshold \(SI^{max}\) is reached, the dynamic routing of new arriving requests to the Spike Server is activated. When SI falls below \(SI^{max}\), the incoming requests are routed again to the Web Server. To avoid fluctuations, a range of tolerated values can be adopted instead of the single value \(SI^{max}\).

  3. If the System Response Time \(R_0\) does not drop below its alarm threshold with the spike control, then it is necessary to activate the actions triggered by the rules set in the autoscaler (typically an increase in the number of servers). A further decrease of \(R_0\) can be obtained by vertical scaling actions applied to the Spike Server, increasing the share of the CPU dedicated to the application (when this is possible).

 In this case study we focus on the model of the workload fluctuations and on the identification of the alarm threshold \(SI^{max}\) for the control of load peaks. Among the problems that can be studied are:

  • evaluation of the influence of fluctuations in arriving requests with different time scales and intensities on the system performance

  • impact of variability of service demands of requests on performance metrics

  • assess the influence on System Response Time \(R_0\) of the alarm threshold value \(SI^{max}\) for significant changes in workload arrival rate, e.g., up to about 40,000 req/h per Web Server

  • identification of the value \(SI^{max}\) that minimizes the System Response Time \(R_0\) for a given workload

  • behavior of the autoscaler (in terms of the number of scaling up actions) with respect to the size of the moving window considered for the computation of the metrics used as performance indicators (e.g., the System Response Time, the Utilization of the Web Server and of the Spike Server)

  • effects of vertical scaling actions of the CPU share of Spike Server on the number of servers provisioned as a function of arrival rates.

6.2.2 Model Implementation

The implemented multi-formalism model, consisting of both Queueing Network and Petri Net components, is shown in Fig. 6.9. Since this case study is focused on the autoscaling of load fluctuations, below we concentrate on the description of Layer 2 operations. At Layer 1 the horizontal scaler performs the typical provisioning actions of new servers when the performance indicators exceed the threshold values (see, e.g., Sect. 6.1), balancing the load between them according to the policy adopted.

Fig. 6.9 Model with one Web Server1 and one Spike Server for the auto-control of fluctuations

To simplify the presentation, we have introduced some assumptions that have small or no influence on the validity of the results. First, we modeled the app with only its most utilized resource, i.e., the bottleneck, which has the deepest impact on performance. The error introduced in the performance indexes by ignoring the other resources should be very low, as they are usually much less utilized than the bottleneck. Indeed, in many real-world cases, several important tasks of an app are allocated on a single (or very few) host server, typically very powerful and the most secure, which quickly becomes congested as the workload increases (e.g., the tasks that execute the front-end modules, the catalog and the cart services, the management of encryption/decryption keys for the payments, the 3D-secure procedure for online shops). The resource of the model that executes the requests is the queue station referred to as Web Server1. This is the resource that is replicated by the Layer 1 autoscaler provisioning actions. The Spike Server at Layer 2 is dedicated to the execution of the load spikes.

Furthermore, to better investigate the behavior of the spike control, we have considered in the model only one server (Web Server1) with the connected Spike Server. Clearly, the results obtained for this initial configuration, with one web server and one spike server, also apply to each web server in the data center (if there is more than one), regardless of their number. Indeed, all servers can be considered independent of each other as their arrival rates are computed by the horizontal scaler algorithm, which is typically executed by a dedicated resource. Since the CPU capacity of a Spike Server is shared among several Web Servers, it is necessary to apply adequate scale-up actions to the Spike Server as the number of Web Servers increases.

The workload consists of two classes of customers: the incoming requests submitted by the users of the application, and the tokens. The arriving user requests, referred to as ArrivReq and represented with an open class, are generated by the Source1 station and routed to place Arriving. The tokens, referred to as maxReqLink1, are modeled with a closed class and are associated with the requests in execution (the SI metric), one token per request. Their maximum number represents the maximum number of requests that can be executed concurrently by Web Server1 (referred to as alarm threshold \(SI^{max}\)) for the load spike control. At the beginning of the simulation, all tokens are located in place MaxReqServer1. The transition JoinWebServer1, see Fig. 6.10a, is enabled when a request arrives in place Arriving and there is at least one token available in place MaxReqServer1.

Fig. 6.10 Enabling and Inhibiting conditions of the three transitions

At each activation, a request is sent to Web Server1 and the number of available tokens in place MaxReqServer1 is decremented by one. When a request is completely executed, transition Rel routes it to Sink1, see Fig. 6.11b, and returns the token to place MaxReqServer1, increasing by one the number of requests that can be in execution. When the number of tokens in MaxReqServer1 is zero, the maximum number of requests in execution \(SI^{max}\) is reached and the autoscaler control routes new arriving requests to the Spike Server, i.e., the transition JoinWebServer1 is no longer enabled. This is achieved through the inhibiting arc between place MaxReqServer1 and transition JoinSpikesServer, see Fig. 6.10c. The value 1 in the inhibiting conditions of this transition means that when there are one or more tokens in place MaxReqServer1 the transition is blocked. The values \(\infty \) that appear in the inhibiting conditions indicate that the corresponding inhibitions are never met. To allow the computation of several interesting metrics, e.g., the Response times and the Throughput of the spikes, the requests executed by the Spike Server are routed to the dedicated sink Sink2. The firing rules, i.e., the throughputs, of the three transition stations are shown in Fig. 6.11.
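The token mechanism can be summarized in a few lines of state-update logic; a minimal sketch of how the marking of place MaxReqServer1 drives the routing (timing and service are of course handled by the queueing stations of the JSIMg model):

```python
# Sketch of the Petri Net token mechanism that controls the routing
# (state only; not a simulator).
class SpikeControl:
    def __init__(self, si_max: int):
        self.tokens = si_max          # initial marking of place MaxReqServer1

    def on_arrival(self) -> str:
        # JoinWebServer1 is enabled only if a token is available;
        # otherwise the inhibitor arc enables JoinSpikesServer.
        if self.tokens > 0:
            self.tokens -= 1          # one token per request in execution
            return "WebServer1"
        return "SpikeServer"

    def on_web_server_completion(self):
        # Transition Rel returns the token to place MaxReqServer1.
        self.tokens += 1

ctrl = SpikeControl(si_max=140)
print(ctrl.on_arrival())              # WebServer1 (139 tokens left)
```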

Fig. 6.11 Firing rules of the three transitions

To reproduce the fluctuations of the incoming traffic, the distribution of the interarrival times of the requests has been assumed hyperexponential with coefficient of variation cv = 4 and mean 0.15 s. The high variability of the service demands of the requests was modeled in both servers with a hyperexponential distribution with coefficient of variation cv = 4 and mean 0.16 s (see Fig. 6.12). The service demands of the tokens are set to zero so as not to interfere with the execution of the arriving requests. The scheduling discipline of the two queue stations modeling the CPUs with two classes of customers is Processor Sharing (PS). This discipline is commonly used to simulate the time quantum policy of processors that share their capacity among all the requests in execution, which can belong to different classes of the workload.
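The model only needs the mean and cv given above. For readers who want to reproduce such traffic outside the tool, a sketch of one possible fit, a two-phase hyperexponential with balanced means (other parameterizations with the same mean and cv, including the one used internally by JSIMg, are equally valid):

```python
import math, random

# Two-phase hyperexponential with balanced means fitted to a given mean and cv.
def h2_balanced(mean, cv):
    c2 = cv * cv
    p1 = 0.5 * (1 + math.sqrt((c2 - 1) / (c2 + 1)))
    p2 = 1 - p1
    mu1, mu2 = 2 * p1 / mean, 2 * p2 / mean      # branch rates
    return p1, mu1, mu2

def sample(p1, mu1, mu2):
    mu = mu1 if random.random() < p1 else mu2    # choose a branch, then exponential
    return random.expovariate(mu)

p1, mu1, mu2 = h2_balanced(mean=0.15, cv=4)      # interarrival times of the requests
xs = [sample(p1, mu1, mu2) for _ in range(200000)]
m = sum(xs) / len(xs)
var = sum((x - m) ** 2 for x in xs) / len(xs)
print(round(m, 3), round(math.sqrt(var) / m, 2)) # ~0.15, ~4
```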

Fig. 6.12 Parameters of the Web Server1 station

The objectives of the case study required the execution of different types of analysis. To analyze the behavior of the model, several single simulation runs were performed with the collection of traces (see, e.g., Figs. 2.10, 2.11) with the CSV values of the performance metrics over time. The usual capacity planning problems are solved with What-if analyses using various control parameters, e.g., the arrival rate of requests generated by Source1, and the value of the alarm threshold \(SI^{max}\) of spike control.

6.2.3 Results

Of all the possible objectives that can be achieved with the implemented model, we will describe the operations required by the following four:

— Obj.1: Implementation of the model of the autoscaler with two operational layers and evaluation of the correctness of its dynamic behavior to control load spikes.

— Obj.2: Given the arrival rate of requests of 400 req/min, evaluate the impact of different alarm thresholds \(SI^{max}\) of Web Server1 on performance metrics.

— Obj.3: Evaluate the behavior of System Response Time \(R_0\) as the workload grows to approximately 40,000 req/h.

— Obj.4: Analyze the impact of vertical scaling of Spike Server capacity on System Response Time \(R_0\).

Obj.1 shows the use of the CSV traces of performance metrics generated by the model executions for the analysis of its dynamic behavior. To efficiently use autoscaling techniques, it is very important to know the impact that the performance indicators monitored by the autoscalers have on the satisfaction of the service level agreements (SLAs). For example, what is the influence of the Spike Indicator SI on the System Response Time \(R_0\)? Obj.2 and Obj.3 address this issue. The impact of vertical scaling actions of the CPU share of the Spike Server on the number of scaling actions is described by Obj.4.

The description of the operations required to achieve the four objectives follows.

Obj.1: Model implementation of an autoscaler component that detects load peaks in Web Server1 and relieves its congestion by dynamically routing new requests to a Spike Server.

The structure of the model is described in the previous section.

To analyze the dynamic behavior of the model and to assess its correctness, we collected the CSV traces with the values of several metrics over time (see Figs. 1.8, 2.10, 2.11). Several simulations were carried out using controlled workloads of increasing complexity. Visual evidence of the correctness of the load controller is provided in Fig. 6.13, which plots the number of requests in execution in Web Server1 and in the Spike Server over time.

Fig. 6.13 Number of requests in Web Server1 (a) and Spike Server (b) in the interval 120\(\div \)1200 s, with the alarm threshold \(SI^{max}\) of the Spike Indicator set to 140 req (initial population of place MaxReqServer1) and arrival rate of 6.66 req/s

Fig. 6.14 Response times of Web Server1 (a) and Spike Server (b) in the interval 120\(\div \)1200 s, with the alarm threshold \(SI^{max}\) of the Spike Indicator set to 140 req and arrival rate of requests of 6.66 req/s

An interval of time of 1080 s, from 120 to 1200 s, has been considered. The alarm threshold \(SI^{max}\) (i.e., the maximum number of requests in execution in Web Server1) is set to 140 req. This value corresponds to the number of tokens of the closed class maxReqLink1 that at the beginning of the simulation are in place MaxReqServer1. Figure 6.13b shows that when \(SI^{max}\) is reached, e.g., in the interval 320–490 s, the state-dependent control of the autoscaler routes the new incoming requests to the Spike Server. As soon as some requests complete their execution in Web Server1, the SI indicator drops below 140, e.g., in the interval from 570 to 690 s, and the new incoming requests are directed to Web Server1 again.

The impact of the load fluctuations on the performance is clearly shown in Fig. 6.14, which plots the Response times of Web Server1 and of the Spike Server for the interval 120\(\div \)1200 s. For example, towards the end of a long period of high load, at about 480 s, a peak of 50 s of the Response time of Web Server1 occurs, see Fig. 6.14a. As expected, the Response times of the Spike Server, see Fig. 6.14b, are much lower than those of Web Server1. A significant decrease of the mean System Response Time \(R_0\) with the workload considered can be obtained simply by decreasing the alarm threshold \(SI^{max}\). This action tends to balance the utilizations of the two servers, decreasing the congestion of Web Server1 while increasing the load of the Spike Server (see the following objectives).

Obj.2: With the arrival rate of 400 req/min (6.66 req/s), compute the performance indexes of Web Server1 and Spike Server and the System Response Time \(R_0\) for the alarm thresholds \(SI^{max}\) ranging from 10 to 160 req. Identify the value of \(SI^{max}\) that should be provided as input to the autoscaler in order to obtain a mean System Response Time as close as possible to the target value of 8 s.

The parameterization of the workload is shown in Fig. 6.15. The flow of arriving requests (open class Arriv_Req) is generated by the Source station with a hyper-exponential distribution of the Interarrival times with mean 0.15 s, corresponding to the arrival rate of 6.66 req/s, and coefficient of variation cv = 4.

Fig. 6.15 Parameters of the Source station for the generation of the arriving flow of requests of 6.66 req/s and coefficient of variation cv = 4 (open class Arriv_Req), and setting of the alarm threshold \(SI^{max}=100\) req

The service times of the two servers are hyper-exponentially distributed with mean 0.16 s and cv = 4 to simulate the fluctuations of the service demands. The global population of the closed class maxReq_Link1 corresponds to the value of the alarm threshold \(SI^{max}\) for Web Server1 (in Fig. 6.15 it is \(SI^{max}=100\) req). The value of \(SI^{max}\) represents the maximum number of requests that can be in execution on Web Server1; once reached, it identifies a high-load state that causes the routing of the arriving requests towards the Spike Server.

To tune the autoscaler parameters we evaluate the effects of the alarm threshold \(SI^{max}\) on the System Response Time. We used a What-if analysis with control parameter \(SI^{max}\) ranging from 10 to 160 req with increments of 10, so overall 16 models are executed in sequence. We have considered such a wide range of values in order to provide a large set of data for the training set of the machine learning algorithm that will be applied in a second phase of the project. Some of the indexes detected are reported in Figs. 6.16, 6.17, 6.18. The number of times \(SI^{max}\) is reached decreases as its value grows from 10 to 160.

Fig. 6.16 Requests in execution versus alarm thresholds \(SI^{max}\) with load of 6.66 req/s

Fig. 6.17 Throughput of Web Server1 and Spike Server versus \(SI^{max}\) (from 10 to 160 req)

As \(SI^{max}\) increases, the Number of requests in execution, the Throughput, and the Response time of Web Server1 increase while the corresponding indexes of the Spike Server decrease. Let us remark that with the arrival rate of 6.66 req/s the Utilization of the Spike Server decreases from 0.4 (with \(SI^{max}=\) 10) to 0.12 (with \(SI^{max}=\) 160). This low Utilization explains the modest decrease of its Response times (see Fig. 6.18b) as \(SI^{max}\) increases from 10 to 160 req. The Utilization of Web Server1 increases almost linearly from 0.66 to 0.94 as \(SI^{max}\) increases. This is due to the scaling algorithm that dynamically routes the requests arriving at Web Server1 to the Spike Server when \(SI^{max}\) is reached. It is important to note that we are evaluating the behavior of these performance indexes by keeping the request arrival rate fixed (6.66 req/s and, as seen above, not particularly high), so the saturation effects are very limited. In the following Obj.3, we will evaluate the system performance with different arrival rates and discuss the effects of server saturation on the System Response Time.

The mean System Response Time \(R_0\) of the model in Fig. 6.9 is given by the sum of the mean Response times of Web Server1 and Spike Server weighted by the respective percentages of System Throughput \(X_0\):

$$\begin{aligned} R_0= R_{WebServer1} \;\frac{X_{WebServer1}}{X_0} \;+\; R_{SpikeServer} \;\frac{X_{SpikeServer}}{X_0} \end{aligned}$$

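A direct transcription of this weighting (the numeric values below are placeholders, not measured results):

```python
# System Response Time as the throughput-weighted mean of the two servers'
# Response times.
def system_response_time(R_web, X_web, R_spike, X_spike):
    X0 = X_web + X_spike
    return R_web * X_web / X0 + R_spike * X_spike / X0

# Placeholder values: the short Spike Server executions pull the mean down.
print(system_response_time(R_web=10.0, X_web=5.0, R_spike=1.0, X_spike=1.5))  # ~7.9 s
```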

Fig. 6.18 Response times of Web Server1 and Spike Server versus \(SI^{max}\) (from 10 to 160 req)

To identify the value of the alarm threshold \(SI^{max}\) that, with the arrival rate of 400 req/min (6.66 req/s), provides a System Response Time \(R_0 \le 8\) s, a What-if analysis performing repeated executions with \(SI^{max}\) as control parameter ranging from 10 to 160 req (16 models overall) has been used.

As shown in Fig. 6.19, with \(SI^{max}=90\) the mean System Response Time \(R_0\) is 7.98 s, too close to the target value of 8 s. A conservative answer to the question of Obj.2 is \(SI^{max}=80\), which provides \(R_0=7.09\) s.
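Selecting the threshold from the What-if output can be automated; a sketch using only the two points quoted above (a real run would provide all 16 values, and the 0.5 s safety margin is our illustrative choice):

```python
# Pick the largest alarm threshold whose simulated R0 stays below the target
# minus a safety margin. Values: the two points quoted from Fig. 6.19.
R0_by_threshold = {80: 7.09, 90: 7.98}       # SI_max [req] -> mean R0 [s]

def pick_threshold(results, target, margin):
    ok = [si for si, r0 in results.items() if r0 <= target - margin]
    return max(ok) if ok else None

print(pick_threshold(R0_by_threshold, target=8.0, margin=0.5))   # 80
```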

Fig. 6.19 System Response Time vs alarm threshold \(SI^{max}\) with arrival rate of 6.66 req/s; the Interarrival times are hyper-exponentially distributed with cv = 4

Obj.3: Assess the impact of various alarm thresholds \(SI^{max}\) on System Response Time \(R_0\) for significant changes in workload arrival rate, from 1 to 12 req/s (43,200 req/h).

To achieve this goal, a What-if analysis was performed for \(SI^{max}\) values from 40 to 160 req with arrival rates as control parameter ranging from 1 to 12 req/s (60 to 720 req/min). The positive impact of the dynamic control of the high-load states of Web Server1 on System Response Time \(R_0\) is highlighted in Fig. 6.20.

Fig. 6.20 System Response Time vs Arrival rate (Interarrival times with hyper-exponential distribution and cv = 4)

The lower curve represents \(R_0\) for \(SI^{max}=40\) req, while the upper one for \(SI^{max}=160\) req. These two values correspond respectively to the minimum and the maximum utilization of Web Server1. In Fig. 6.20 three different operational phases can be identified according to the workload intensity: light, medium, heavy.

In Phase 1, which includes arrival rates between 1 and 6 req/s, \(R_0\) is less than 6 s for all the \(SI^{max}\) values. It is important to note that without any scaling action the \(R_0\) corresponding to an arrival rate of 5 req/s is about 10 s, and small further increases of the arrival rate cause very large increases of \(R_0\). This is because Web Server1, with these arrival rates and no spike control, is highly utilized (the Response time tends to infinity) and we are approaching the Throughput bound. Let us recall that the maximum arrival rate that Web Server1 can process when it is saturated is given by \(\lambda _{max}=1/D_{WebServer1}= 6.25\) req/s (from the Utilization Law \(U_{WebServer1}= \lambda _{0}\, D_{WebServer1}\)).

The low arrival rates of Phase 1 drastically reduce the need for autoscaling actions, and thus the load of the Spike Server. As a consequence, the contribution of the Spike Server to the System Response Time is minimal. In fact, even towards the extreme of the interval, with the highest arrival rate of 6 req/s and \(SI^{max}=160\) req, the utilization of the Spike Server is very low and thus its executions have minimum impact on the computation of \(R_0=5.42\) s. The corresponding utilization of Web Server1 is \(U_{WebServer1}=0.93\).

Phase 2 includes arrival rates between 6 and 10 req/s and shows increasing values of \(R_0\) up to an arrival rate of about 8 req/s. This increase is mainly due to the increment of the arrival rate at Web Server1, which is now close to congestion. A further increase in the arrival rate from 8 to 10 req/s generates an increase in the number of high-load states of Web Server1 detected by the autoscaler, and therefore the number of requests routed to the Spike Server grows progressively. As a consequence, \(R_0\) decreases, since the contribution to its computation of the Spike Server executions, which are much shorter than those of Web Server1, becomes more substantial as its Throughput increases (with a medium utilization). Indeed, for example, with \(SI^{max}=160\) req the \(U_{SpikeServer}\) is 28% with 8 req/s arrival rate and 60% with 10 req/s. The corresponding Throughputs \(X_{SpikeServer}\) are 1.76 req/s and 3.77 req/s, and the Response times \(R_{SpikeServer}\) are 0.69 s and 1.54 s, respectively. \(R_0\) will return to growth as the utilization of the Spike Server increases and therefore its Response times increase (in Phase 3).

It should also be emphasized that the pattern of the requests arriving at the Spike Server is typically bursty, since in most cases it consists of load spikes, see, e.g., Fig. 6.13b, and it is known that the presence of bursts has a very negative influence on performance.

Phase 3 is characterized by two factors: the heavy workload (between 10 and 12 req/s) and the congestion of the Spike Server. Considering, for example, \(SI^{max}=160\) req, the utilizations of the Spike Server \(U_{SpikeServer}\) for arrival rates of 10 and 12 req/s are 60% and 97%, respectively, and its Response times \(R_{SpikeServer}\) are 1.54 and 7.54 s. Since the corresponding Throughputs \(X_{SpikeServer}\) are 3.77 and 5.69 req/s (representing approximately 50% of the System Throughput \(X_0\)), the impact of the Spike Server executions on the mean System Response Time \(R_0\) becomes substantial. As the \(SI^{max}\) values decrease from 160 to 40 req, the increase of \(R_0\) becomes more evident, since the load of the Spike Server increases.

The values shown in the previous figures are very important for setting the parameters of the autoscaler, or for a machine learning algorithm, in order to satisfy the target value of the selected performance metric. For example, consider an arrival rate of 9 req/s with cv = 4 and a target value of the scaling metric \(R_0 \le \) 8 s. From Fig. 6.20 it can be seen that this objective is achieved with the alarm threshold of 80 req. Note that the autoscaling policy tries to use a web server as much as possible as long as the specified target value of the scaling metric is met. With an increase of the arrival rate from 9 to 12 req/s this target value can no longer be met. In this case, a scaling action is needed. If the Spike Server has unused capacity, a vertical scaling can be activated (see the following Obj.4) by increasing the share devoted to the application. If this is not possible, then a new server must be allocated (through a scaling action at Layer 1) to handle the service demands. A possible decision logic is sketched below.
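The following fragment sketches that decision logic; the function name, the threshold values, and the returned action strings are illustrative assumptions, not part of the JSIMg model or of a real autoscaler API.

```python
# Sketch of the scaling policy discussed above (illustrative values only).
def scaling_decision(r0, r0_target=8.0, spike_share=0.40, spike_share_max=0.80):
    """Choose the next scaling action from the measured System Response Time R0."""
    if r0 <= r0_target:
        return "no action: target met, keep using Web Server1 as much as possible"
    if spike_share < spike_share_max:
        return "vertical scaling: increase the CPU share of Spike Server (Obj.4)"
    return "horizontal scaling: allocate a new web server at Layer 1"

print(scaling_decision(r0=6.5))    # target met -> no action
print(scaling_decision(r0=9.8))    # target missed, share can grow -> vertical scaling
```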

Obj.4: Assess the impact on System Response Time of a vertical scaling action that doubles the capacity share of Spike Server from 40% to 80%.

When the Spike Server approaches congestion (as in Phase 3 of Fig. 6.20), it causes a degradation of the System Response Time \(R_0\). In this case, before activating horizontal scaling actions by increasing the number of web servers, it can be very effective to apply a vertical scaling action by increasing, if possible, the CPU share of the Spike Server dedicated to the application. A vertical scaling action is typically much less expensive than a horizontal one and faster to apply. Clearly, in this case the CPU power of the Spike Server must be greater (at least two times or more) than that of Web Server1.

Since in the previous Obj.3 the CPU share was 40%, in this Obj.4 we evaluate the effects on \(R_0\) obtained by doubling this share to 80%. The only model parameter that must be changed is the service demand \(D_{SpikeServer}\) of the Spike Server, which must be set to 80 ms instead of 160 ms. The \(R_0\) values for arrival rates between 1 and 12 req/s and alarm thresholds \(SI^{max}\) of 80 and 160 req are shown in Fig. 6.21.
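The relation used to derive the new parameter is simply that the service demand scales inversely with the CPU share devoted to the application. A minimal sketch (the function name is ours):

```python
# Service demand of Spike Server as a function of its CPU share (sketch).
def scaled_demand(d_ref, share_ref, share_new):
    """D_new = D_ref * share_ref / share_new (demand inversely proportional to share)."""
    return d_ref * share_ref / share_new

d_40 = 0.160                              # demand with a 40% share [s]
d_80 = scaled_demand(d_40, 0.40, 0.80)    # 0.080 s with an 80% share
print(f"D_SpikeServer: {d_40 * 1000:.0f} ms -> {d_80 * 1000:.0f} ms")
```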

Fig. 6.21
figure 21

Impact of doubling the CPU capacity share of Spike Server (vertical scaling from 40% to 80%) on System Response time \(R_0\) with 80 and 160 req alarm thresholds

The dashed lines represent the \(R_0\) values with the original CPU share of 40% (considered in the previous Objectives), while the solid lines represent the corresponding values with the CPU share doubled to 80%. As expected, significant decreases in the \(R_0\) values are achieved in the Phase 3 area, where the Spike Server is more utilized. For example, with an arrival rate of 12 req/s and \(SI^{max}=80\) req, the target value \(R_0 \le 8\) s considered in Obj.3 can be reached with the 80% share (\(R_0\) = 6.2 s), while with the 40% share it cannot (\(R_0\) = 9.83 s).

6.2.4 Limitations and Improvements

  • Various application scenarios: The case study described is focused on the implementation of a model that exhibits dynamic behavior as a function of the load characteristics. With simple modifications/upgrades, it can be used in various application scenarios, for example to model the dynamic load split between the servers of a private and a public cloud, or to evaluate the performance impact of the number of cores in the various partitions of an HPC system.

  • Oscillations control: To minimize the oscillations in the number of provisioned resources, it would be better to use a range of values as a target for the scaling indicators rather than just the mean values.

  • Workload with heterogeneous apps: The modeling approach described can also be used with multiclass workloads. The rules for enabling, firing, and inhibiting can be specified for each individual class.

  • Vertical scaling: For a given arrival rate of requests, the effects of vertical scaling of Spike Server  can be investigated with a What-if analysis that uses as control parameter its service demands scaled according to the CPU share policy adopted.

  • Horizontal scaling: The model described can be enhanced with the implementation of the horizontal scaling provisioning policy at Layer 1. The structure used for Web Server1 must be replicated for each of the considered Web Servers, whose load is controlled by a new transition component that implements the replication policy.

  • Machine Learning: Efficient scaling policies in complex scenarios can be obtained by integrating several techniques, such as modeling and machine learning, into a single tool that dynamically tunes the parameters according to the varying load conditions. For example, from the results of a sequence of models obtained with a What-if analysis, a machine learning algorithm can derive the set of parameters that keeps the performance indicators as close as possible to their target values.

  • Use of Finite Capacity Region: The model described can also be implemented using a Finite Capacity Region (with max capacity set to \(SI^{max}\)) for each Web Server and implementing the firing rule in the transition that manages the flow of arriving requests according to the planned scheduling policy.

6.3 Simulation of the Workflow of a Web App

tags: open, three classes, Source/Queue/Class-Switch, JSIMg.

While in typical Queueing Network models the paths followed by the requests between the stations are defined according to probabilistic rules, with the Class-Switch parameter of the requests it is possible to describe the paths in JSIMg in a deterministic way. A similar behavior can also be modeled using the Petri Net stations (see, e.g., Sect. 6.2).

6.3.1 Problem Description

Regardless of the paradigm adopted in modern web application architectures (e.g., web services, microservices, serverless), software developers must describe the business logic of the apps through workflows representing the sequence of execution of the tasks. Depending on its layout, mapping a workflow to a queueing network model may not be an easy task. More precisely, we refer to the case in which a request, after being executed by a station and having flowed through the model, returns to that station and requires service times and routing very different from those required previously. The problem arises because JMT does not store the execution history of a request in terms of the paths followed between the various resources. To solve it, we use the class identifier parameter Class ID associated with each running request to track only its recent execution history.

In fact, each request in execution is assigned a Class ID that is used to describe its behavior and characteristics, such as type (open or closed), priority, and mean and distribution of service times. Routing algorithms are defined on a per-class basis. Of fundamental importance for the problem approached is that a request may change its Class ID during execution, when flowing through a Class-Switch station or when a specific routing algorithm is selected. Therefore, with the use of the Class ID parameter we can know the last station visited by a request and the path followed.

To describe this technique we consider a simplified version of the e-commerce application of an online food shopping company. The web services of the software platform are allocated on two powerful servers, referred to as Server A and Server B, of the private cloud infrastructure. Figure 6.22 shows the layout of the data center with the paths followed by the requests and the corresponding Classes. The sequence of execution of the paths for each request coincides with the numbering of the Class IDs.

Fig. 6.22
figure 22

The data center with the path followed by the requests and their Class IDs

Server A is a multicore system that executes several Front-End services. Among them are: customer authentication, administrative and CRM processes, interaction with the payment service (for the strong authentication for payments), checkout operations with the update of the DB, invoice generation, shipping and tracking services, and update of customer data. Server B is a multiprocessor blade system, highly scalable, fault tolerant, with a redundant configuration for continuous availability, equipped with a large RAM and SSD storage for the DBs. Among the most important services allocated on it are those for browsing the catalog, processing the shopping cart, and managing the DBs of products and customers. To provide the minimum Response time to customers, an in-memory DB is implemented to dynamically cache each customer's most recent purchases.

A third server, Server P, located in the data center of an external provider, is used for payment services.

To reduce the complexity of the description we consider a simplified version of the workflow of the e-commerce app (see Fig. 6.23), consisting only of the services that are needed to describe the problem approached and its solution. Figure 6.23 shows the services considered and the servers where they are allocated.

Fig. 6.23
figure 23

Short version of the workflow of an order submission to an online grocery store

According to the business logic of the e-commerce app, the complete execution of a request requires three visits to Server A, one to Server B, and one to Server P. At each visit to Server A, different web services are executed, which require different mean service times. The sequence of visits to the three servers A, B, and P during a complete execution is A-B-A-P-A. With a smart use of the Class IDs we can model this deterministic routing of the requests among the servers.

In addition to the implementation of the model for executing the workflow of Fig. 6.23, the capacity planning study requires:

  • the performance forecast of the e-commerce app with the current workload and the one-factor authentication level for payments for a wide range of arriving requests;

  • the impact assessment of a new web service for the Strong Customer Authentication (SCA)

     Indeed, according to the PSD2 (Payment Services Directive, EU) a new authentication service is planned to replace the current one to enhance the security of online payments with a two-factor authentication level;

  • the Throughput bound of the current system and the actions to be applied to process a workload 15% higher than the current one (with max arrival rate of about 5000 req/h).

6.3.2 Model Implementation

The model implemented with JSIMg is shown in Fig. 6.24. We use the Class ID of the requests to trace the path between stations followed during their executions. The sequence of paths modeled is shown in Figs. 6.22 and 6.23. The arriving requests from the Source station, generated with Class1 as Class ID, are sent to Server A. After the execution of the services scheduled for this first visit to Server A, the requests are routed to Server B and then to the Class-Switch station CS. This station, which has zero service time, changes the Class IDs of the incoming requests to new ones according to the probabilities described in its parameters. In our case, see Fig. 6.25, the Class IDs of the requests arriving from Server B (which are Class1) are changed to Class2 before being redirected to Server A.

Fig. 6.24
figure 24

The JSIMg model implemented

Fig. 6.25
figure 25

Class-Switch probabilities of the CS station

After the execution of the services scheduled for this second visit to Server A (which are different from those executed in the first visit, reserved for Class1 requests), the Class2 requests are routed to the payment server Server P of an external provider. At the end of this service, the Class IDs of the requests are changed to Class3 by the Class-Switch station CS, which routes them back to Server A. The service demands of this third visit to Server A are those of the Class3 requests. Then, the routing algorithm sends them to the Sink station, where they exit the model.
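The following toy fragment traces the class-driven path just described; the routing table and the class-switch map are a sketch of the mechanism, not of the JSIMg implementation (station and class names mirror those of the model).

```python
# Toy trace of the deterministic path A-B-A-P-A driven by the Class ID (sketch).
ROUTING = {                        # (station, class) -> next station
    ("Source",  "Class1"): "ServerA",
    ("ServerA", "Class1"): "ServerB",
    ("ServerB", "Class1"): "CS",   # CS switches Class1 -> Class2, then Server A
    ("ServerA", "Class2"): "ServerP",
    ("ServerP", "Class2"): "CS",   # CS switches Class2 -> Class3, then Server A
    ("ServerA", "Class3"): "Sink",
}
CLASS_SWITCH = {"Class1": "Class2", "Class2": "Class3"}

station, cls, path = "Source", "Class1", []
while station != "Sink":
    nxt = ROUTING[(station, cls)]
    if nxt == "CS":                # zero service time: only the Class ID changes
        cls, nxt = CLASS_SWITCH[cls], "ServerA"
    station = nxt
    path.append((station, cls))

print(path)
# [('ServerA', 'Class1'), ('ServerB', 'Class1'), ('ServerA', 'Class2'),
#  ('ServerP', 'Class2'), ('ServerA', 'Class3'), ('Sink', 'Class3')]
```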

In the model implemented we have not explicitly represented the network connections to the payment server Server P and the User think times. Indeed, the service times of the network components, typically modeled with Delay stations, are negligible compared to the service times of the other components of the model, and therefore their impact on performance is practically zero. As for the User think times, it should be emphasized that with this type of e-commerce app, related to online grocery shopping, their values, especially those on the browser side (i.e., between the selections of products), are highly variable from user to user as they are deeply influenced by the characteristics of individual customers (e.g., age, type of network connection, digital equipment used). Thus, having a reliable forecast of their values and distribution is practically impossible and somewhat useless. Furthermore, not considering the User think times increases the reliability of the System Response Time R as a metric to evaluate the differences between the various versions of web services and security protocols. We must keep in mind that in this case we simulate the worst-case scenario with respect to the stress of the resources, since the load is the maximum possible.

The Service times required by all the services executed during each visit have been parameterized with their global Service demands Ds. Three different workloads should be considered: the current one (called light workload), with an average of 25 products in the shopping cart per customer in each session and one-factor authentication for secure payments; the same workload with a new two-factor authentication system; and the new expected workload (called heavy workload) with a 15% increase in incoming traffic compared to the current one. Table 6.2 shows the Service demands of the first two.

Table 6.2 Service demands [s] to the servers of the current workload with one-factor (left) and with two-factor (right) authentication for payment security

The fluctuations in the number of items purchased per session and in the service times are captured by the distributions of the global service demands Ds. The values in the boxes are those modified by the two-factor payment system. To model the fluctuations in the number of items and in the service times required by their different types, we have assumed an exponential distribution of the service demands Ds. If necessary, it is possible to select distributions with the same mean and greater variance, for example hyper-exponential, with a single click in the station parameterization windows. The scheduling discipline of the servers is PS, Processor Sharing. The traffic intensities analyzed range from 0.5 to 1.2 req/s (about 4300 req/h).

6.3.3 Results

In what follows we will describe the activities regarding the following three objectives:

— Obj.1: Implementation of a model to execute the workflow of Fig. 6.23

— Obj.2: Capacity planning of the data center with the current workload and evaluation of the impact of a two-factor authentication system

— Obj.3: Computation of the Throughput bound and prediction of the performance of a new heavy workload that has a max arrival rate of 5000 req/h

— Obj.1: Implementation of a model to execute the workflow of the e-commerce app with deterministic paths.

The workflow with the tasks that are executed for an online order submission is shown in Fig. 6.23. To construct a simple example that provides visual evidence of the sequence of visits to the three servers A, B, and P, we have implemented the model of Fig. 6.24 assuming that all the parameters have constant values. In this example, the interarrival times are 3 s, the service demand of the first visit to Server A is 0.2 s (the request is of Class1), the service demand at Server B is 0.8 s (the request is of Class1), the service demand of the second visit to Server A is 0.4 s (the request is now of Class2), the service demand at Server P is 0.4 s (the request is of Class2), and the service demand of the third visit to Server A is 0.1 s (the request is now of Class3).

The temporal diagram of Fig. 6.26 provides a visual representation of the sequence of visits A-B-A-P-A to the three servers during the complete execution of a request. The values plotted in this diagram are obtained simply by flagging the checkbox Statistical Results (Stat.Res.) of the corresponding performance index selected in the Performance Indices window (see Fig. 1.8); the CSV files with the values of the Response Times are generated automatically.

Fig. 6.26
figure 26

Temporal sequence of visits to the three servers during the execution of a request

— Obj.2: Capacity planning of the data center with the current workload and evaluation of the impact of a two-factor authentication system for secure payments.

The traffic intensities considered range from 0.5 to 1.2 req/s (4320 req/h). The What-if analysis of Fig. 6.27 executes 15 independent models with arrival rates increasing by 0.05 req/s at each step.

Fig. 6.27
figure 27

Execution of 15 models with Arrival rates from 0.5 to 1.2 req/s

Figure 6.28 shows the Number of Requests N in concurrent execution and the System Response times R obtained with the Service demands of the current workload with the one-factor authentication security system (see Table 6.2).

Fig. 6.28
figure 28

Number of requests N in execution (a) and System Response times R [s] (b) with one-factor and two-factor authentication layers of security

The values of N range from 1.4 to 30.81 req, while those of R range from 2.9 to 25.8 s. Recall that, as already pointed out, the values of R do not include the User Think times, either at the browser or at the session level.

The bottleneck is Server B, whose utilization increases from 0.39, with 0.5 req/s, to 0.95, with 1.2 req/s. Its service demand, 0.8 s, is the maximum among all the servers, and therefore the Throughput bound of the system is \(X_0 =1/D_{max}=1.25\) req/s.

To assess the impact on performance of the new two-factor authentication system for secure payments, we executed the What-if analysis with the service demands of the new payment service (see Table 6.2). Figure 6.28 allows the visual comparison of the values of N and R obtained with the two-factor authentication with those obtained with the one-factor authentication. With 1.2 req/s the new values of N and R are 35.9 req and 34.8 s, respectively. The 9 s increase in R compared to the value obtained with the single-factor system is mainly due to the increase in the service demands of the new authentication service.

— Obj.3: Computation of the Throughput bound and performance prediction of a new heavy workload that has a max arrival rate of about 5000 req/h.

The current workload intensity is expected to increase by approximately 15% following the acquisition of a new online grocery store company. Arrival rates with a maximum value of about 5000 req/h are also expected.

The bottleneck with the current workload is Server B, which constrains the Throughput to be at most \(1/D_{max}=\) 1.25 req/s, less than the new required maximum of about 1.4 req/s. Among the possible actions to improve the Throughput bound, it has been decided to replace the current Server B with a new one that is twice as fast (equipped with new processors, more cores, and a larger RAM). As a consequence, the service demand of Server B becomes 0.4 s, half of the previous one. The new \(D_{max}\) is that of Server A, equal to 0.7 s, which becomes the bottleneck. So, the new Throughput bound is 1.42 req/s, which satisfies the constraint of 5000 req/h. Figure 6.29 shows the System Response time R of the two data center configurations with the old (upper curve) and the new (lower curve) Server B, respectively.
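The arithmetic behind this sizing decision can be summarized as follows (the conversion of the 5000 req/h requirement to req/s is ours):

\[
\lambda^{max}_{new} = \frac{5000\ \mathrm{req/h}}{3600\ \mathrm{s/h}} \approx 1.39\ \mathrm{req/s},
\qquad
\frac{1}{D_{ServerB}} = \frac{1}{0.8} = 1.25\ \mathrm{req/s} < 1.39,
\qquad
\frac{1}{D_{ServerA}} = \frac{1}{0.7} \approx 1.42\ \mathrm{req/s} > 1.39 .
\]

With the old configuration the required peak rate exceeds the bound imposed by Server B, while with the upgraded Server B the new bound, now set by Server A, satisfies it.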

Fig. 6.29
figure 29

System Response time R [s] of the two data center configurations

It must be pointed out that the performance gains obtained with the new Server B, which is twice as fast as the old one, are not as large as might be expected: for example, the Throughput bound increased by about 14% only. Indeed, with the new Server B the limit to the improvement is imposed by Server A, which has become the new bottleneck: its service demand of 0.7 s was the second highest of the original data center, i.e., Server A was the secondary bottleneck.

6.3.4 Limitations and Improvements

  • Workload characterization: identifying multiple classes of customers (rather than the single one used in this case study) may be better for providing customers with Response times that are more accurate with respect to their characteristics (in terms of the number of products purchased).

  • Flow of incoming customers: the pattern of arriving requests can be simulated with high precision, capturing the fluctuations with some of the distributions implemented in JMT (e.g., Hyper-exponential, Coxian, Phase-Type, Burst, Markovian Arrival Processes) or using the Replayer to replicate the data collected from a real workload.

  • Load balancing: regardless of the paradigm adopted in web application architectures, the identification of elementary tasks, their dependencies on other tasks and their allocation among web services, are the actions that play a key role in the load balancing of servers in a data center. 

6.4 A Crowd Computing Platform

tags: open/closed, multiple class, Source/Delay/Queue/FCR/Sink, Exp/Hyper-exp, JSIMg.

This case study describes an application of a simple but powerful structure that can be implemented with one of the JSIMg features: the Finite Capacity Region (FCR). It can be used either stand-alone, as described below, or as a part of more complex models [29], for example to simulate server downtime (due to failures or other causes of shutdown) in large data centers, to control the load on a set of servers, or to implement the ZigBee energy-saving feature [7].

6.4.1 Problem Description

The crowd paradigm has been used for centuries to solve problems whose difficulty is beyond the capacity of single individuals or organizations: a group (i.e., the crowd) of subjects cooperates to solve a problem. With the evolution of digital technologies, and particularly the Internet, crowd applications encompass a wide range of real-world problems of both scientific and non-scientific nature, from agriculture to health care, funding, searching, social productivity, distributed weather forecasting, problem solving, and idea sharing.

In this case study, we consider a crowd of individuals that collectively contribute with their digital devices (computers, servers, tablets, etc.) to the implementation of a large computing infrastructure. A device can be added to or removed from the infrastructure by each contributor. The members of this infrastructure belong to two groups: contributors and associates. The former are authorized to add and remove their equipment to/from the infrastructure of the crowd, which they can use free of charge. The latter can only use the infrastructure devices and are charged for their computations. Associate members have been introduced to increase the economic sustainability of the crowd, and their number is larger than that of the contributors. We will collectively refer to the members of the two categories as users, and we assume that the service demands of both categories are similar. In this ideal crowd computing platform (Fig. 6.30), the contributors receive from the crowd manager the app that allows them to add their computers to the platform, becoming nodes accessible by the community, or to remove them. The crowd manager is responsible for managing the resources of the platform through dedicated servers. Among others, scalability is one of the important features that these platforms exhibit. Given their characteristics, these types of infrastructures can also be referred to as open cloud computing systems.

Fig. 6.30
figure 30

Layout of the considered crowd computing scenario

In these applications, the processes of contribution (i.e., arrival) and removal (i.e., deletion) of the equipment to/from the platform are very peculiar and follow unpredictable distributions.

In the following sections, we focus on the simulation of these two processes and we analyze their impact on the performance of the crowd platform. More precisely, we evaluate the behavior of the System Response Times of the user requests as a function of the variance of the unavailability time of the nodes.

6.4.2 Model Implementation

The model implemented with JSIMg is shown in Fig. 6.31. The flow of computational requests submitted by contributor and associate members has been simulated with the requests of the open class User generated by the Source station Source1.

Fig. 6.31
figure 31

The crowd computing model: the CrowdPlatform region has a limited capacity, class-2 Node customers have higher priority than class-1 User customers

The Service demands of both groups of members have the same statistical characteristics, i.e., the same distribution (exponential) and the same mean \(D_{user}= 4\,\)s. Thus, we assume that all the computational requests belong to the same class.

The number of contributors is 200, and each of them can add/remove a system that can execute the user requests. We simulate the computational devices of the platform, i.e., the nodes, with the 200 servers of the single queue station CompServers. To model the add/remove behavior of the 200 nodes, we use the Finite Capacity Region (FCR) CrowdPlatform, with capacity \(N_{FCR}=200\) customers, and a closed class Node with 200 customers. The Node customers flow through the queue station Unavailable and the delay station Available.

Thus, the workload of the model consists of two classes of customers: User (open) and Node (closed). In JSIMg the queue of requests entering an FCR is unique; in Fig. 6.31 two queues are drawn only for reasons of graphical representation. The parameter settings of the two-class workload are shown in Fig. 6.32a. The open class User describes the computational requests submitted by all the users, contributors and associates, arriving at the platform with rate \(\lambda =4\,\)req/s and exponential distribution of Interarrival times (whose mean is \(1/\lambda = 0.25\,\)s). The closed class Node, with 200 customers and priority 1 (higher than that of the User class), has been added to represent the systems of the platform that may be available/unavailable to execute the User requests.

Figure 6.32b shows the parameters of the FCR. The maximum Region capacity is \(N_{FCR}=200\) customers, including both User and Node. The default values (infinite) of the maximum number of customers per class have no effect, since in any case the maximum of 200 customers in the FCR is a constraint that cannot be exceeded. The Drop policy is set to false for both classes since we do not want to drop the requests (both User and Node) arriving when the FCR is full, but we want to keep them in a queue waiting to be admitted.

Fig. 6.32
figure 32

User (open) and Node (closed with high priority) classes (a), and the FCR (b)

The queue station CompServers, located inside the FCR, has a single queue and 200 servers; each server executes one user request. Since the number of customers in the FCR is limited, i.e., \(N_{FCR} = N_{FCR,User} + N_{FCR,Node} \le 200\), any Node customer within the FCR (in the Unavailable station) decreases the number of servers available for user computations in the CompServers station.

The primary effect of removing a node is represented in the model by an increase of customers at the Unavailable station, and thus a decrease in the number of servers available for computations at the CompServers station. Similarly, the primary effect of adding a node is represented by a decrease of the Unavailable customers (a customer moves to the Available station outside the FCR) and an increase in the number of servers available for computations at the CompServers station. The result is an increase in the node activity and an improvement in the platform performance.

The behavior of the model is as follows (a minimal code sketch of this admission logic is given after the list):

  • if a User request arrives at CrowdPlatform when there is at least one server available in the CompServers station, i.e., when \(N_{CompServers} + N_{Unavailable} < 200\), then it is executed immediately;

  • if a User request arrives at CrowdPlatform when no computing server is available to users (i.e., when \(N_{CompServers} + N_{Unavailable} = 200\)), then it must wait until a User request completes its execution, or until a new node is added and the queue of requests already waiting for a server (i.e., in queue to enter the FCR) becomes empty;

  • when a Node removal request arrives at CrowdPlatform, i.e., a Node customer is released by the Available station, and \(N_{FCR} < 200\), then the number of servers available in the CompServers station is decreased by one unit;

  • when a Node removal request arrives at CrowdPlatform and \(N_{FCR} = 200\), then it must wait in the queue to enter the FCR until a User request completes its execution and releases the server, or until a new server is added (i.e., a Node customer exits the FCR). Indeed, although Node requests have higher priority than User requests, the scheduling discipline of the queue of requests waiting to enter the FCR is FIFO non-preemptive, i.e., an arriving node removal request does not interrupt the execution of a User request but waits for its completion to lock the server. The requests in the queue are served according to their priority.
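The fragment below is a minimal sketch of the admission rules just listed; the function names and the example counts are illustrative assumptions, not JSIMg code.

```python
# Minimal sketch of the admission logic of the CrowdPlatform FCR (capacity 200).
N_FCR_MAX = 200

def free_servers(n_compservers, n_unavailable):
    """Servers of CompServers still usable by User requests."""
    return N_FCR_MAX - n_compservers - n_unavailable

def on_user_arrival(n_compservers, n_unavailable):
    """A new User request starts immediately only if a server is free."""
    if free_servers(n_compservers, n_unavailable) > 0:
        return "execute immediately"
    return "wait in the FCR entry queue"

def on_node_removal(n_compservers, n_unavailable):
    """A Node removal locks a server if the FCR is not full, otherwise it queues."""
    if n_compservers + n_unavailable < N_FCR_MAX:
        return "one server less available (Node enters Unavailable)"
    return "wait in the FCR entry queue (non-preemptive FIFO, served by priority)"

print(on_user_arrival(150, 30))   # 20 servers free -> execute immediately
print(on_node_removal(170, 30))   # FCR full (200)  -> wait in the entry queue
```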

In this case study we focus on the behavior of the number of nodes of the platform that are available/unavailable for computations. Typically, in this type of application the time during which a node is unavailable follows an unpredictable distribution with a very large variance. This can be explained by considering that each contributor is independent of the others and follows custom working schedules. As a first approach, we consider a mean unavailability time \(S_{Unavailable}=1\,\)s (the Service time of the Unavailable queue station, located inside the FCR) and a hyper-exponential distribution. Several models with the same mean Service time and different coefficients of variation cv are executed. Recall that to model a hyper-exponential distribution in JSIMg it is sufficient to set its mean value and coefficient of variation (see, e.g., Fig. 5.10). The availability times are modeled with the service times of the delay station Available (located outside the FCR), with mean \(S_{Available}=\,60\,\)s and exponential distribution.
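JSIMg generates the hyper-exponential distribution internally from the mean and cv. For readers who want to reproduce such a workload outside JSIMg, the sketch below uses one common parameterization (a two-phase hyper-exponential with balanced means); it is an illustrative assumption, not necessarily the parameterization used by JMT.

```python
import numpy as np

def hyperexp_samples(mean, cv, size, rng=np.random.default_rng(0)):
    """Two-phase hyper-exponential with balanced means matching (mean, cv)."""
    cv2 = cv * cv
    p1 = 0.5 * (1 + np.sqrt((cv2 - 1) / (cv2 + 1)))   # probability of phase 1
    p2 = 1.0 - p1
    m1 = mean / (2 * p1)                               # mean of phase 1
    m2 = mean / (2 * p2)                               # mean of phase 2
    phase1 = rng.random(size) < p1
    return np.where(phase1, rng.exponential(m1, size), rng.exponential(m2, size))

x = hyperexp_samples(mean=1.0, cv=20.0, size=500_000)
print(x.mean(), x.std() / x.mean())   # approximately 1.0 and 20 (large samples needed)
```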

6.4.3 Results

The simple model implemented allows us to answer several capacity planning questions. For example: how does the platform Response time vary with the number of nodes? What is the impact on performance of the arrival rate of user requests and of the distribution of the interarrival times? Which is the bottleneck of the infrastructure that constrains the Throughput? What happens if we alleviate/remove the bottleneck? What will be the effect on Response times of an increase in the number of associate members? What are the scalability limits of the platform (hardware component capacity, software requests, distributions of service demands, variance of interarrival times, ...)?

In this case study we concentrate on the modeling of the addition/removal of nodes to the crowd platform. The primary effect of the interaction of these two processes is reflected by the dynamic changes in the number of nodes available/unavailable for the execution of user requests. As described in the previous section, we use the exponential distribution to model availability times and the hyperexponential distribution to model unavailability times.

Among the possible objectives of the capacity planning study, we describe in detail the following two.

Obj.1: Evaluate the impact of the variance of the unavailability times of the platform nodes (keeping the mean value constant) on the Response Time of User requests.

To achieve the objective of the study we cannot use the What-if feature, since in JSIMg the variance of a distribution is not one of the admitted control parameters. Thus, we ran five independent JSIMg models with different values of the variance of the unavailability times. More precisely, for the Service times of the Unavailable station we considered the same mean \(S_{Unavailable}= 1 \,\)s and five different coefficients of variation cv = 1, 5, 10, 15, and 20 of their hyper-exponential distribution.

The graphs in Fig. 6.33 show the results of five models of Fig. 6.31 obtained with different cv of \(S_{Unavailable}\). For each cv, the corresponding 99% confidence interval is also shown.

Fig. 6.33
figure 33

System Response time (a) and System Number of User requests in the platform (in execution and in queue for FCR) (b) versus coefficient of variation of Unavailability time

As expected, the high variance of the Unavailability time causes a degradation of the performance: as the cv increases, the System Response time of the User requests grows. For example, with cv = 20 the model yields \(R_{0,User}=\) 203.4 s (see Fig. 6.33a). Note that this index is at the System level because it also includes the queue time (if any) spent by the requests waiting to enter the FCR when all the nodes are unavailable.

Similarly, Fig. 6.33b shows that as cv increases, the mean number of User requests in the system also increases. Note that this index is defined as the System Number of customers because it includes both the User requests that are submitted to the platform but queued waiting for a node (to enter the FCR) and the requests that are in execution (inside the FCR). Indeed, its mean value can be higher than 200 (e.g., with cv = 20 it is \(N_{0,User}=834\,req\)).
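As a rough consistency check, Little's Law applied at the system level, with the Throughput equal to the arrival rate \(\lambda =4\,\)req/s (no requests are dropped), gives for cv = 20:

\[
N_{0,User} \;=\; X_{0,User}\, R_{0,User} \;\approx\; 4\ \mathrm{req/s} \times 203.4\ \mathrm{s} \;\approx\; 814\ \mathrm{req},
\]

in reasonable agreement with the simulated value of 834 req; the difference is of the order of the accuracy of the simulation estimates.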

Obj.2: Some questions of the capacity planning study require a detailed statistical analysis of the values of three performance indexes: the Number of nodes available in the platform, the Response time, and the Number of user requests arrived at the platform. The study of their behavior over time is also requested.

To achieve this objective it is necessary to tick the Stat.Res. checkboxes corresponding to the indexes to be analyzed in the Performance Indices definition window. In Fig. 6.34a a statistical analysis is requested for the indexes Number of customers of the Available station, System Response time of the User requests, and System Number of Customers, since the corresponding three Stat.Res. checkboxes are checked.

For each selected index, a CSV file with all its values is generated. In Fig. 6.34b a sample of the CSV file generated for the Number of customers in the Available station is shown. The three columns contain, respectively: the time stamp of the event, the actual number of customers in the station, and the time interval elapsed since the previous event.
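A possible post-processing of such a file is sketched below; the file name and the column labels are our assumptions (the layout is the three-column one just described), and the plot reproduces the kind of behavior-over-time view shown later in Fig. 6.36a.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sketch: load the CSV exported by JSIMg for the Available station
# (file name and column labels are assumptions; columns as described in the text).
df = pd.read_csv("Available_NumberOfCustomers.csv",
                 names=["timestamp", "customers", "interval"])

# The index is piecewise constant between events, hence the step plot.
plt.step(df["timestamp"], df["customers"], where="post")
plt.xlabel("time [s]")
plt.ylabel("nodes available in the platform")
plt.show()
```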

Fig. 6.34
figure 34

Selection of statistical analyses (see Stat.Res. check boxes) of three performance indexes (a) and a sample of the CSV file of Number of Customers of the Available station (b)

Fig. 6.35
figure 35

Statistical indexes computed for the Number of Unavailable nodes

For example, Fig. 6.35 shows the statistical indexes computed for the Number of customers in the Unavailable station. The histogram graph style has been selected.

Fig. 6.36
figure 36

Behavior of the number of nodes available in the crowd platform (a) and of the System Response times of User requests (b) in the time interval \(0\div 20000\,\)s

By processing the CSV files it is possible to analyze the behavior of the indexes over time. Figure 6.36a represents the behavior over time of the Number of Customers of class Node in the Available station (which corresponds to the number of nodes available in the crowd platform). Each increasing step means the occurrence of an arrival (a contribution) of a new system (node) or the completion of the execution of a User request that releases the server. Each decreasing step means that a system has been removed from the crowd or that a server has been assigned to a newly arrived User request. The values plotted have been obtained with a model with mean unavailability time \(S_{Unava}=1\,\)s and cv = 20.

Figure 6.36b shows, for the same time interval, the behavior of the System Response time of the User requests. The values of this index are highly fluctuating. As can be seen from Fig. 6.36, the high peaks of the Response times (see, e.g., the one ending at about 15000 s in Fig. 6.36b) occur after periods in which the number of nodes available for computations is very low or zero (see Fig. 6.36a). In fact, the primary effect of these periods is the fast increase of the queue of User requests waiting to enter the FCR, which results in a significant increase in their Response time.