1 Introduction

Consuming “Business Process Management as a Service” (BPMaaS) offers benefits that IT practitioners widely acknowledge. It reduces the operational burden and lets customers rely on the provider for the maintenance and provisioning of the service. From the BPMaaS provider's point of view, however, it increases operational complexity. The provider must ensure that all its customers receive the same attention and a defined quality of service at all times, while operating at the lowest possible cost.

A Business Process Management System (BPMS) deployment is complex: it requires application servers, process engines and database management systems. Clustered installations requiring load balancers can also be deployed for high availability. Each customer has a different usage pattern, and the number of tasks to execute evolves constantly. Each cloud compute resource is costly and has a capacity determined by its CPU, memory, storage and network speed, for a defined response time. In order to maintain an optimal infrastructure cost, process executions must be distributed over different cloud resources. Figure 1 shows an example with two customers (or tenants) that require a different capacity at each hour. We want to allocate the cheapest cloud resources and distribute the process executions of the tenants over these resources to optimize the operational cost.

Fig. 1. Multi-tenant resource allocation and distribution of tenants

In order to optimize resource usage, we must sometimes migrate tenants from one installation to another. In Fig. 1, at time 5, tenant 1 is moved onto the same resource as tenant 2 because both require less capacity and fit together on this resource. A migration disrupts the service on the customer side: we must stop the processes and move the data from one installation to the other [1]. We must therefore find the best distribution of tenants over resources for each time interval while controlling the number of migrations for each tenant. This problem becomes complex for a high number of tenants.

We propose an integer linear programming (ILP) model and a genetic algorithm that aim at finding the best allocation strategy while limiting the number of migrations per tenant. We improve our previous results [2] with a method that provides better cost elasticity.

In the next section, we describe our migration strategy model, our genetic algorithm approach coupled with the resolution of our model, and our previous iterative heuristic. We then study experimental results for both approaches compared to a baseline approach, and show how we improve them further. Then, we compare our results with the state of the art. The last part concludes and presents our future work.

2 A Migration Strategy Based Model

In this section, we present the context for BPMaaS, our hypotheses and our method to optimize resource cost. This can be seen as a resource allocation problem with a constraint on the number of reallocations. Moving a tenant from one resource to another generates a service disruption. We want to limit the number of such moves to ensure a stable quality of service for the customers. Thus, one of our problems is to find the right time to move tenants. We call it the migration strategy.

2.1 Context and Constraints

Our approach is tenant-centric. All the processes of a given customer (tenant) are executed on the same BPMS installation. It is easier to manage deployments by customer rather than by process: customer processes share business data and security configuration that we need to manage together. Our assumptions are the following:

BPMS Do Not Scale Infinitely. There is no such thing as an infinitely scalable BPMS installation. Even clustered installations reach a limit due to the transactional nature of the database interactions. Thus, we use several BPMS installations. In our approach, we assume that each tenant fits on the “biggest” resource available.

Provisioning and Deprovisioning Take Time. We cannot change the tenant distribution instantly, as compute instance instantiation, software installation, and data migration take time. Thus, we compute the resource allocation in a discretized manner, at fixed time intervals or time slots. A time slot is a significant period of time for the provider: it could be a few seconds or an hour.

A Tenant Is a Customer of the BPMaaS. Tenants run BPM processes composed of tasks. To execute them, the BPMS needs computing power, network bandwidth, disk and memory. It relies on separate compute instances for the database systems that provide data persistence, and on load balancers for clustered installations.

Task Throughput Is Our Performance Metric. It corresponds to the number of BPM tasks executed for one period of time (e.g. per second). This metric is meaningful for the customers.

Our Approach Is Offline. We assume we know the required BPM task throughput for each tenant and each time slot.

A Cloud Resource Has a Capacity Expressed in BPM Task Throughput. A cloud resource (or resource) is one or several cloud compute instances that we use for the database tier, BPM system tier, load balancer tier, etc. It supports a full BPMS installation. A cloud resource can host several tenants. In our case, we assume a tenant fits on one resource: tenants won’t be distributed on several cloud resources.

The Number of Migrations Must Be Limited. We call migration the action of moving a tenant's data and processes from one cloud resource to another. It requires the target cloud resource to be up and running, and it takes time to execute. If all the tenants of a cloud resource are migrated, the resource can be released. Migrations generate QoS breaks for the customers [1]. We limit their number for each tenant depending on the Service Level Agreement.

Without an optimization method, a solution would be to allocate each tenant to a resource that supports its maximum task throughput. We call this solution the baseline method. It is often used in production, but can become very expensive.
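The baseline method can be illustrated with a minimal sketch. It assumes that each tenant gets its own resource for the whole period and that the provider picks the cheapest type able to absorb the tenant's peak; the function name `baseline_cost` and the toy prices and capacities are hypothetical and do not come from the experiments.

```python
def baseline_cost(loads, resource_types):
    """Cost of the baseline: one dedicated resource per tenant, sized for its peak.

    loads          : dict tenant -> list of required throughput per time slot
    resource_types : list of (cost_per_slot, capacity) tuples (toy values here)
    """
    total = 0.0
    for tenant, series in loads.items():
        peak = max(series)
        # Cheapest resource type whose capacity covers the tenant's peak load.
        feasible = [cost for cost, cap in resource_types if cap >= peak]
        if not feasible:
            raise ValueError(f"no resource type can host tenant {tenant!r}")
        # The resource stays allocated during every time slot.
        total += min(feasible) * len(series)
    return total


# Hypothetical toy instance: two resource types, two tenants, three time slots.
toy_types = [(1.0, 10), (3.0, 50)]
toy_loads = {"t1": [2, 8, 4], "t2": [20, 5, 5]}
toy_baseline = baseline_cost(toy_loads, toy_types)  # 3 * 1.0 + 3 * 3.0
```

In this toy instance, tenant `t2` pays for the larger resource during all three slots although it needs it only once, which is the inefficiency the migration strategies try to remove.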

We proposed in [2] a method based on an iterative heuristic and time series segmentation. It computes the list of necessary cloud resources and a mapping of tenants on each resource, time slot per time slot. We present it in the next Section.

2.2 Allocation with an Iterative Heuristic and Time Series Segmentation

In this part, we briefly recall our previous method [2]. It has two parts: first, we choose a migration strategy, i.e. the time slots where each tenant can migrate; second, we apply a heuristic using resource prices and capacities, tenant needs, and their migration strategy in order to obtain the cloud resources and the placement of tenants.

Fig. 2. Example of a migration strategy (Color figure online)

For the first part, a time series segmentation [3] selects good migration times for each tenant. Figure 2 presents two examples of migration strategies, in red, for two tenants. In this figure, Tenant 1 can only be migrated at the end of time slots 1 and 4. For Tenant 2, it is at the end of time slots 2 and 3. This means that both tenants will be migrated at most twice. A tenant stays on its current resource as long as it has no migration authorization.

For the second part, our iterative time slot heuristic is based on Variable Cost and Size bin packing, to which we added repacking steps. It takes as input the tenant throughput per time slot, the cloud resource prices and capacities, and the migration strategies of the tenants.

This coupled approach provides better results than the baseline approach. Still, the comparison with the solution computed with a solver showed that it is far from the optimal cost. It is possible to enhance this solution. We propose to use a genetic algorithm to find migration strategies that reduce the resource cost, and propose an alternative to the iterative timeslot heuristic based on integer linear programming. We present this model in the next Section.

2.3 An Efficient Model for Migration Strategies

Our problem is to find the resource and tenant distributions for each time slot, given the tenants’ loads, the resource prices and capacities, and a migration strategy. The model we propose addresses this problem. Its principle is that a tenant does not migrate when it has no authorization to move. In this case, simple placement constraints apply; when the migration strategy authorizes a move, no such constraint exists.

We use the following notation:

  • \(\mathcal {T}\), the set of cloud resource types, with t its cardinality.

  • \(\mathcal {I}\), the set of tenants, with n its cardinality.

  • \(\mathcal {J} = \mathcal {T} \times \mathcal {I}\), the set of all possible cloud resources associated with each tenant. Its cardinality is \(m=t \times n\).

  • \(C_j\) and \(W_j\), respectively the cost and the capacity in terms of BPM task throughput of configuration j, with j in \(\mathcal {J}\).

  • \(w_i(k)\), the required capacity in terms of BPM task throughput for tenant i during time slot k.

  • \(\mathcal {K}\), the set of time slots, from 0 to D, where \(D+1\) is the number of time slots.

  • \({x_j}^i(k)\), the assignment of tenant i to configuration instance j during time slot k.

  • \({y_j}(k)\), the activation of configuration j during time slot k.

  • M, the maximum number of migrations between cloud resources allowed for each tenant over all time slots. This number depends on the SLA.

  • \(h_i(k)\) with \(0 \le k \le D-1\), equal to 1 if tenant i is allowed to migrate between time slots k and \(k+1\), and equal to 0 otherwise. The set of all \(h_i(k)\) (for each tenant and each time slot) is a migration strategy.

  • Each migration strategy uses the maximum number of migrations allowed per tenant: \(\forall i \in \mathcal {I}, \sum _{k=0}^{D-1} h_i(k) = M\).

The objective is to minimize the total cost of all active cloud resources over all time slots, as shown in Eq. 1.

$$\begin{aligned} \min \sum _{j \in \mathcal {J}} \sum _{k \in \mathcal {K}}{C_j y_j(k)} \end{aligned}$$
(1)

We must ensure that the following constraints are enforced:

$$\begin{aligned} \forall i \in \mathcal {I}, \forall k \in \mathcal {K}, \quad \sum _{j \in \mathcal {J}} {x_j}^i(k) =1 \end{aligned}$$
(2)
$$\begin{aligned} \forall j \in \mathcal {J}, \forall k \in \mathcal {K}, \quad \sum _{i \in \mathcal {I}} w_i(k){x_j}^i(k) \le W_j {y_{j}(k)} \end{aligned}$$
(3)
$$\begin{aligned} \forall j \in \mathcal {J}, \forall i \in \mathcal {I}, \forall k \in \{0, \dots , D-1\} \mid h_i(k) = 0, \quad {x_j}^i(k) = {x_j}^i(k+1) \end{aligned}$$
(4)
$$\begin{aligned} \forall i \in \mathcal {I}, \forall j \in \mathcal {J}, \forall k \in \mathcal {K}, \quad {x_j}^i(k) \in \{0,1\}, \; {y_j}(k) \in \{0,1\} \end{aligned}$$
(5)

Equation 2 represents the obligation for a tenant to be placed on exactly one cloud resource at each time slot. Equation 3 means that the sum of the capacities required by the tenants hosted on a cloud resource cannot exceed the capacity of this resource, and that a resource hosting a load must be active. Equation 5 states that the decision variables are binary. Equation 4 represents the migration strategy. The equality constraint means that, for a tenant i that is not authorized to migrate after time slot k, the assignment values \({x_j}^i(k)\) stay the same on time slots k and \(k+1\). When a tenant is authorized to migrate between resources, there is no such constraint for this tenant. Generalizing this to all resources produces the desired effect: tenants are migrated from one resource to another only at the time slots specified by the migration strategy. The pre-defined migration strategy is represented here by the parameter \(h_i(k)\).
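To make the role of each constraint concrete, here is a small pure-Python sketch that evaluates Eq. 1 for a given assignment and rejects assignments violating Eq. 3 or Eq. 4. It is not the solver-based implementation used in the experiments: the exhaustive search, the function names `total_cost` and `brute_force`, and the toy instance (named resource instances "small" and "big") are assumptions made for illustration only.

```python
from itertools import product


def total_cost(assignment, loads, cost, capacity, allowed):
    """Return the Eq. 1 cost of an assignment, or None if it is infeasible.

    assignment[i][k] : resource hosting tenant i at slot k (one per slot, so Eq. 2 holds)
    loads[i][k]      : w_i(k), required throughput of tenant i at slot k
    cost[j], capacity[j] : C_j and W_j of resource j
    allowed[i][k]    : h_i(k), 1 if tenant i may move between slots k and k+1
    """
    n_slots = len(next(iter(loads.values())))
    # Eq. 4: without an authorization, the tenant keeps its resource.
    for i, hosts in assignment.items():
        for k in range(n_slots - 1):
            if allowed[i][k] == 0 and hosts[k] != hosts[k + 1]:
                return None
    total = 0.0
    for k in range(n_slots):
        hosted = {}
        for i, hosts in assignment.items():
            hosted[hosts[k]] = hosted.get(hosts[k], 0) + loads[i][k]
        for j, load in hosted.items():
            if load > capacity[j]:  # Eq. 3 violated
                return None
            if load > 0:  # Eq. 3 forces y_j(k) = 1 only when the hosted load is positive
                total += cost[j]
    return total


def brute_force(loads, cost, capacity, allowed):
    """Enumerate every assignment; only usable on toy instances."""
    tenants = sorted(loads)
    n_slots = len(loads[tenants[0]])
    per_tenant = list(product(sorted(cost), repeat=n_slots))
    best = None
    for combo in product(per_tenant, repeat=len(tenants)):
        assignment = dict(zip(tenants, combo))
        value = total_cost(assignment, loads, cost, capacity, allowed)
        if value is not None and (best is None or value < best[0]):
            best = (value, assignment)
    return best


# Hypothetical toy instance: two resource instances, two tenants, two time slots.
cost = {"small": 1.0, "big": 3.0}
capacity = {"small": 10, "big": 30}
loads = {"a": [8, 4], "b": [8, 4]}

with_move = brute_force(loads, cost, capacity, {"a": [1], "b": [0]})
no_move = brute_force(loads, cost, capacity, {"a": [0], "b": [0]})
```

In this toy instance, authorizing tenant `a` to move after the first slot lets it join tenant `b` on the small resource once the loads drop, for a total cost of 5.0. Without any authorization, the cheapest feasible plan keeps both tenants on the big resource for a cost of 6.0.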

Finding cheap migration strategies is essential in our approach. This is the goal of our genetic algorithm, which we present in the following section.

2.4 Cost Optimization via Genetic Algorithms

Fig. 3. A genetic algorithm to find better strategies

A genetic algorithm is a well-known meta-heuristic belonging to the family of evolutionary algorithms and inspired by natural selection [4]. Its principle is to apply directed random evolutions to a population of individuals until it obtains one or several individuals with a good fitness value. Individuals are usually vectors of boolean values whose fitness can be evaluated. Iterations are triggered until an end condition is reached.

We want to find the best migration strategy for all the tenants and time slots. Figure 3 shows our approach. The general principle is to use our iterative time slot heuristic (or the restricted model described in the previous section) to evaluate migration strategies until we find the best one. To represent an individual, we vectorize a migration strategy by concatenating the migration strategies of each of the \(|\mathcal {I}|\) tenants (each one corresponding to a vector of D boolean values, \(D+1\) being the number of studied time slots). The size of the vector is \(D \times |\mathcal {I}|\), with each element being equal to zero or one. For instance, with two tenants and three time slots, the first tenant migrating on the second time slot and the second tenant on the third time slot, we have the following representation: \(\begin{bmatrix}0&1&0&0&0&1\end{bmatrix}\).
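The vectorization can be sketched as follows. The helper names `encode` and `decode` are hypothetical, and the sketch follows the worked example above, which uses three positions per tenant.

```python
def encode(strategy, tenants):
    """Concatenate the per-tenant 0/1 rows into one flat individual."""
    return [bit for tenant in tenants for bit in strategy[tenant]]


def decode(vector, tenants, row_length):
    """Inverse of encode: cut the flat individual back into per-tenant rows."""
    return {
        tenant: vector[n * row_length:(n + 1) * row_length]
        for n, tenant in enumerate(tenants)
    }


tenants = ["tenant1", "tenant2"]
strategy = {"tenant1": [0, 1, 0], "tenant2": [0, 0, 1]}
individual = encode(strategy, tenants)  # [0, 1, 0, 0, 0, 1]
```

Keeping each tenant's row contiguous makes it easy for the mutation and crossover operators to act on one tenant at a time.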

Fig. 4. Genetic algorithm phases

Figure 4 gives a brief description of the different steps of a genetic algorithm. Compared to the traditional approach, we have switched the mutation phase and the crossover phase. The co-hosted mutation requires knowing the cost of the migration strategies in the population, which we compute in the fitness evaluation phase. The crossover phase potentially generates unknown (not yet evaluated) migration strategies. Thus, we perform it after the mutation phase. In our case, the “parents” are mutated instead of the offspring. In the following, we describe the solution we designed.

Population Initialization. We initialize the population with several segmentation algorithm combinations (as we described in Sect. 2.2), and with random individuals with the correct number of migrations for each tenant.
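The random part of the initialization can be sketched as below; the segmentation-based individuals are not reproduced here. The function name `random_strategy` is hypothetical, and the values used in the example (47 positions and 8 migrations per tenant) are taken from the two-day, 48-slot setting of the experiments.

```python
import random


def random_strategy(n_tenants, n_positions, n_migrations, rng):
    """Flat individual with exactly n_migrations ones in each tenant's row."""
    vector = []
    for _ in range(n_tenants):
        row = [0] * n_positions
        # Draw distinct positions so that the tenant gets exactly n_migrations ones.
        for k in rng.sample(range(n_positions), n_migrations):
            row[k] = 1
        vector.extend(row)
    return vector


rng = random.Random(0)
individual = random_strategy(n_tenants=3, n_positions=47, n_migrations=8, rng=rng)
rows = [individual[t * 47:(t + 1) * 47] for t in range(3)]
```

Sampling positions without replacement guarantees that every generated individual respects the per-tenant migration bound from the start.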

Fitness Evaluation. We want to find the migration strategies that produce the lowest cost. The fitness score corresponds to the total cost of all the active resources over the time slots. To evaluate it for an individual, we compute the cloud resource allocation and the placement of the tenants on these resources. In our case, we run our iterative time slot heuristic [2], or a solver on our model presented in Sect. 2.3, for each individual (migration strategy) we need to evaluate. We keep the fitness score, the cloud resources and the tenant assignment in memory for the next steps.

Termination Condition. We use a time limit as termination condition. This allows us to compare different solutions based on this limit.

Parent Selection. For this step, we use a classical rank selection strategy. We sort the population by fitness and randomly select the parents, giving a higher priority to the individuals with a better rank (a lower fitness, i.e. a lower price, in our case).
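As an illustration, rank selection can be sketched with a linear weighting of the ranks. The exact weighting used in the experiments is not specified here, so the weights below and the function name `rank_select` are assumptions.

```python
import random


def rank_select(population, costs, n_parents, rng):
    """Draw parents with a probability that decreases with the cost rank."""
    order = sorted(range(len(population)), key=lambda i: costs[i])
    n = len(order)
    # Assumed linear weighting: the cheapest individual gets weight n, the most expensive 1.
    weights = [n - rank for rank in range(n)]
    picked = rng.choices(order, weights=weights, k=n_parents)
    return [population[i] for i in picked]


rng = random.Random(1)
parents = rank_select(["x", "y", "z"], [3.0, 1.0, 2.0], 1000, rng)
```

Because the draw depends on the rank rather than on the raw cost, a single very cheap individual does not dominate the selection.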

A Specific Mutation: Co-hosted Tenant Migration Mutation Strategy. In classic approaches, mutation randomly updates individuals, depending on a mutation rate, by switching scalar values from zero to one or the other way around [4]. Here, the goal is to generate brand new individuals in the population, with untested configurations. In our case, we cannot use a totally random approach, as the number of migrations for each tenant is bounded. However, even a randomized approach keeping a fixed number of migrations does not provide the desired effect, as we can see on the left side of Fig. 5. We have noticed that, most of the time, resources are not released because only one of their tenants is authorized to migrate. This limits the savings brought by releasing resources. We therefore developed an alternative mutation better suited to our problem.

Fig. 5. Basic tenant mutation vs co-hosted mutation

It consists in shifting the authorization to migrate of each co-hosted tenant to the time slot indicated for the reference tenant’s resource. To achieve this goal, for each co-hosted tenant, we browse the past time slots until we find an authorization to migrate or until we reach the beginning of the time slot space. If we find one, we set it to zero while setting the “destination” time slot to one. If the “destination” time slot is already set to one, we leave the tenant unchanged. The example on the right side of Fig. 5 describes this principle. There, it is possible to migrate all the tenants of resource R1 to the cheaper resource R2, and thus reduce costs (footnote 1).
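The shifting step can be sketched as follows. How the reference tenant, its co-hosted tenants and the destination time slot are chosen is left out; the sketch receives them as arguments. The function name `cohosted_mutation` and the toy strategy are hypothetical.

```python
def cohosted_mutation(strategy, cohosted, slot):
    """Shift the latest earlier authorization of each co-hosted tenant to `slot`.

    strategy : dict tenant -> list of 0/1 migration authorizations
    cohosted : tenants sharing the resource of the reference tenant
    slot     : "destination" time slot where the reference tenant may migrate
    """
    mutated = {tenant: list(bits) for tenant, bits in strategy.items()}
    for tenant in cohosted:
        bits = mutated[tenant]
        if bits[slot] == 1:
            continue  # already authorized at the destination: nothing to do
        # Browse the past time slots until an authorization is found.
        for k in range(slot - 1, -1, -1):
            if bits[k] == 1:
                bits[k] = 0
                bits[slot] = 1
                break
    return mutated


# Toy example: the reference tenant t1 may migrate at slot 2.
strategy = {"t1": [0, 0, 1, 0], "t2": [1, 0, 0, 0], "t3": [0, 0, 0, 1]}
mutated = cohosted_mutation(strategy, ["t2", "t3"], slot=2)
```

Here `t2` has an earlier authorization that is moved to slot 2, while `t3` has none in the past and is left untouched. In both cases, the number of authorizations per tenant is preserved, so the migration bound still holds after the mutation.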

Specific Offspring Generation: The Tenant Crossover Strategy. The crossover phase consists in randomly mixing individuals (parents) of the current population in order to generate new individuals (children) having characteristics of both parents. The crossover technique that we use consists in switching the migration times of random tenants. First, two children identical to the two parent migration strategies are generated. Then, for a specified number of randomly chosen tenants, the migration times are swapped between the two children.

Generational Replacement. We replace the entire population with the offspring, except for the best individuals of the original population (named elites), which replace the least fit offspring in the future population.
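A minimal sketch of this crossover on flat individuals is given below, assuming that every tenant occupies a contiguous row of equal length. The function name `tenant_crossover` and the parameter `n_swaps` (the number of tenants to swap) are hypothetical.

```python
import random


def tenant_crossover(parent_a, parent_b, n_tenants, n_swaps, rng):
    """Copy both parents, then swap the rows of n_swaps random tenants."""
    row_length = len(parent_a) // n_tenants
    child_a, child_b = list(parent_a), list(parent_b)
    for t in rng.sample(range(n_tenants), n_swaps):
        lo, hi = t * row_length, (t + 1) * row_length
        child_a[lo:hi], child_b[lo:hi] = parent_b[lo:hi], parent_a[lo:hi]
    return child_a, child_b


parent_a = [0, 1, 0, 0, 0, 1]
parent_b = [1, 0, 0, 0, 1, 0]
child_a, child_b = tenant_crossover(parent_a, parent_b, 2, 1, random.Random(0))
```

Swapping whole tenant rows, rather than cutting the vector at an arbitrary point, keeps the number of migrations of each tenant unchanged in both children.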
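The replacement with elitism can be sketched as below. The function name `replace_generation` is hypothetical, and individuals are reduced to plain numbers in the example so that the cost function is the identity.

```python
def replace_generation(population, offspring, cost, n_elites):
    """Next population: the elites plus the best remaining offspring."""
    elites = sorted(population, key=cost)[:n_elites]
    # The elites take the place of the most expensive (least fit) offspring.
    kept = sorted(offspring, key=cost)[:max(0, len(offspring) - n_elites)]
    return elites + kept


next_population = replace_generation(
    population=[5, 1, 9], offspring=[4, 8, 2], cost=lambda x: x, n_elites=1
)
```

The population size stays constant, and the best known strategy can never be lost from one generation to the next.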

In the next Section, we present our experiments and the results.

3 Experimentation

We conducted tests with the cloud resource prices and sizes, and the seeds, of our previous work [2]. We consider 12 configurations, each composed of two Amazon Web Services resources: one database resource (RDS) for the database, and one compute instance (EC2) for the application server. Prices range from 0.177$ per hour for a BPM task throughput of 16.4 tasks per second to 4.126$ per hour for a BPM task throughput of 129.279 tasks per second.

For the customer part, we vary the number of tenants (10, 25, 50 and 100), and we use different throughputs in terms of BPM tasks per second. These throughputs are based on the usage of anonymous customers of the BPMS Bonita (footnote 2). We consider 6 configurations, needing a throughput respectively between 1 and 120, 14 and 16, 0 and 120, 1 and 3, 5 and 120, and 0 and 4 (footnote 3).

We generate each tenant’s initial time slot throughput randomly, following a uniform distribution between the two throughputs. Our next step is to generate the variation of throughput between time slots by adding or removing a random value limited to one quarter of the difference between the maximum and the minimum throughput. For the genetic algorithm, we used the Python library Inspyred [5], which integrated well with our environment.
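The load generation can be sketched as follows. The distribution of the variation and the handling of values leaving the throughput range are not detailed above, so the uniform variation and the clamping used here are assumptions, as is the function name `generate_load`.

```python
import random


def generate_load(low, high, n_slots, rng):
    """Random throughput series for one tenant, between low and high."""
    max_step = (high - low) / 4  # variation limited to a quarter of the range
    series = [rng.uniform(low, high)]
    for _ in range(n_slots - 1):
        value = series[-1] + rng.uniform(-max_step, max_step)
        # Assumption: values are clamped to stay within the tenant's range.
        series.append(min(high, max(low, value)))
    return series


series = generate_load(low=1, high=120, n_slots=48, rng=random.Random(0))
```

Seeding the generator makes each set of tenant loads reproducible, which matches the use of fixed random seeds in the experiments.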

Experiment Parameters. In order to obtain significant and realistic results, we used the following parameters:

  • each test is launched for 10 different random seeds (i.e. tenants’ loads).

  • a time slot size of one hour, as it was the reference duration of AWS cost model for computing instances at the time of the experiment.

  • we choose to consider 4 migrations per day. A migration produces an interruption of around 10 s depending on the quantity of data.

  • we consider a 2 days period (thus limiting migrations to 8, for 48 time slots).

  • we consider the following parameters for the genetic algorithm: 5 elite individuals, a mutation rate of 0.4, a population size of 20, and a number of mutation points equal to the number of tenants divided by 5. These parameters were chosen following tests on multiple values for a limited number of seeds each. Details on this choice cannot be included for space reasons.

  • we limit the genetic algorithm computation to 600 or 1800 s and the solver computation time to 5 s.

3.1 Results

In Fig. 6, we show the relative gain of this approach compared to our previous approach (segmented approach) in red (in the upper part of the figure), and to a baseline approach in blue (in the lower part of the figure). The gain is better for 10 tenants than for 100 tenants since the system has more time to search for the cheapest solution. For 10 tenants, we obtain more than 10% enhancement compared to the previous approach, and more than 45% compared to the baseline approach. However for 100 tenants, we have only a 1% enhancement.

Fig. 6. Mean genetic algorithm gain on best initial segmented population for 600 s of running time (Color figure online)

It appears that either the iterative usage of the heuristic, the genetic algorithm or the two of them is more efficient for a small number of tenants for the same number of generations. This is why we conducted experiments where we apply the proposition to subsets of the tenants and we aggregated the results as described in the next Subsection.

3.2 The Splitting Strategy

For this solution, we split the set of tenants into small groups selected randomly. We have tested different group sizes with various numbers of tenants, and we applied the previous method while keeping the same total computation time. Figure 7 shows the results we obtained with the genetic algorithm and the iterative heuristic. The x axis corresponds to the size of the groups of tenants. The y axis shows the relative gain compared to the results with no partition. A group size equal to the number of tenants corresponds to no split; in this case, the gain is zero.
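The random partition can be sketched as follows. The function name `split_tenants` is hypothetical, and how the computation time is shared between the groups is not covered by this sketch.

```python
import random


def split_tenants(tenants, group_size, rng):
    """Randomly partition the tenants into groups of at most group_size."""
    shuffled = list(tenants)
    rng.shuffle(shuffled)
    return [
        shuffled[start:start + group_size]
        for start in range(0, len(shuffled), group_size)
    ]


groups = split_tenants(range(50), group_size=5, rng=random.Random(0))
```

Each group is then optimized independently, and the costs of the groups are added up to obtain the cost of the whole set of tenants.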

Fig. 7. Gain depending on splitting strategy for various split quantities.

We obtain the best results with partitions of 5 tenants in all cases. For the experiments we ran, the gain varies from 5% to 15%. We have no good explanation for this result, although we can reproduce it. Our tests with the solver give the same best group size as with the heuristic. In the next section, we present our results with groups of 5 tenants.

3.3 Results for Solver and Iterative Heuristic

We implemented our model (presented in Subsect. 2.3) using the optimization library PuLP (footnote 4) with the Gurobi solver (footnote 5). For execution time and cost reasons, we were not able to test every set of parameters. For instance, with our current implementation, we managed to obtain results with the solver only up to a group size of 25 tenants. Indeed, the duration of the initialization and the required memory make it impossible to run with more tenants. Thus, we have limited our tests to groups of 5 tenants, for a total of 50 and 100 tenants. As we can see, the results stay close to the results of the heuristic. Figure 8 shows the absolute gain we obtained, and the corresponding percentage compared to the baseline approach cost, for 600 s and 1800 s of running time. We also present the non-split result for the segmented approach (results of the previous paper), and the split segmented approach, where we apply time series segmentation on the groups of 5 tenants instead of all the tenants simultaneously.

Fig. 8. Mean cost comparison for 50 and 100 tenants per group of 5

For 1800 s of execution time of the genetic algorithm, the split heuristic gives the best results. Mean distribution costs are 51.34% for 50 tenants and 51.72% for 100 tenants of the cost of the baseline approach. The solver gives good but more expensive results (respectively 55% and 54.19%). For 600 s of execution time, the results are more balanced: they vary between 54.2% and 55.64%. The genetic algorithm does not enhance the results much for either approach after 600 s: 3% for the heuristic and less than 1% for the solver. Still, it enhances the initial split segmented results from 61.3% to 51.34% for 50 tenants, and from 59.34% to 51.72% for 100 tenants.

We observe that the split segmented approach yields a gain of more than 2% and lets the genetic algorithm reach better results. Without splitting, we gain around 1% for 600 s of genetic algorithm compared to the original population (non-split segmented). When splitting, the genetic algorithm results in a gain of 7.1% for 50 tenants and 4.69% for 100 tenants compared to the split segmented strategy. The absolute gain compared to the baseline approach remains worthwhile for 2 days: we save 1702$ for 50 tenants and 3319$ for 100 tenants, for a baseline cost of respectively 3498$ and 6874$. The respective gain compared to our previous work is 425$ and 763$.
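The absolute and relative figures can be cross-checked with a few lines of arithmetic. This reading assumes that 3498$ and 6874$ are the costs of the baseline approach and that the percentages are remaining costs relative to it; the variable names are illustrative.

```python
# Figures quoted in the text, in dollars for the 2-day period.
baseline_cost = {50: 3498, 100: 6874}
saving = {50: 1702, 100: 3319}

relative_cost = {
    n: round(100 * (baseline_cost[n] - saving[n]) / baseline_cost[n], 2)
    for n in (50, 100)
}
# relative_cost == {50: 51.34, 100: 51.72}
```

Under this reading, the savings correspond to the 51.34% and 51.72% mean distribution costs reported for 1800 s of execution time.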

4 Related Work

Many researchers have studied elasticity in the cloud and elasticity for BPM or orchestration systems. Schulte et al. [6] did a general review on the topic and gave directions for future research. In this paper, we focus on the resource allocation and scheduling problem and use a tenant-centric approach based on BPM task throughput, instead of the process-centric view of other approaches. Rekik et al. [7] propose an integer programming model based on general hardware metrics for BPM elasticity in the cloud. They base their approach on resource allocation and BPM task scheduling. They do not consider multiple time slots, tenant migration or multi-tenancy. Other attempts at BPM elasticity in the cloud exist, such as [8,9,10]. Though not cloud-related, Djedovic et al. propose in [11] a genetic algorithm for scheduling BPM tasks on their corresponding resources. It uses a representation of each resource. They want to minimize the waiting time and the global resource cost. Junhke et al. [12] propose a task-focused genetic algorithm for scheduling BPEL workflows in distributed clouds.

On subjects other than BPM, the machine reassignment problem (footnote 6) is close to our problem. It considers the reassignment of software to virtual machines, including the migration cost. Gavranović et al. [13] obtained the best results in this challenge. However, this problem is based on hardware metrics, and aggregates the migration cost in the objective function. Our problem is not exactly virtual machine allocation, since the hardware is already defined. Numerous other attempts target virtual machines, such as [14]. Automated approaches based on the retrieval of cloud offers and on the hardware requirements of software, such as [15], are also valuable.

These works do not consider multi-tenancy, multiple instance types, and migrations simultaneously, except in the form of a data transfer cost for [10] or an aggregated migration cost for [13]. In this paper, we present an evolution of our previous work [2]. That work was based on time series segmentation for deducing the “good” time slots to migrate tenants, and on the iterative use of an enhanced version of our time slot heuristic [16]. We also presented the corresponding ILP (Integer Linear Programming) model. Results were encouraging compared to a baseline approach, but could be improved in light of the results that we obtain with a solver. We could not compare with other approaches, since most of them do not consider the migration of data as an issue. They scale up by adding compute resources to the process engine, considering that access to the database is not a problem. From our experience, at some point, the database is always a bottleneck.

5 Conclusion

In this paper, we proposed a method for cost optimization of BPMaaS deployments based on tenant migration strategies and a genetic algorithm. We presented a new integer programming optimization model. Both make it possible to obtain substantial gains for BPMaaS providers. The result we obtain when we group the tenants is interesting. It may be explained by the size of the objective space. The fact that it is reproducible for different numbers of tenants shows that testing multiple group sizes may allow providers to save on the operation cost. Moreover, using other metaheuristics such as simulated annealing or hill climbing could provide even better results.

Our method was tested with BPM task throughput but could work with other metrics that can be expressed as a scalar for both the cloud resources and the tenants. We can consider for instance the number of processes, or the number of HTTP requests that lead to transactional processing. Our method can then be generalized to systems unrelated to BPMS that use multi-tenancy and tenant-related persisted data. A BPMS execution engine behaves more or less like a transactional web application. Our approach is offline and requires anticipating the tenant load. For many business cases, this is a valid assumption: the server load depends on the number of employees and the number of cases they can execute every day or hour, with little variation. An obvious next step would be to couple our algorithm with prediction systems. This would provide an effective online algorithm that could adapt to unforeseen variations.