An introduction to stochastic bin packing-based server consolidation with conflicts

The energy consumption of large-scale data centers or server clusters is expected to grow significantly in the next couple of years contributing to up to 13% of the worldwide energy demand in 2030. As the involved processing units require a disproportional amount of energy when they are idle, underutilized, or overloaded, balancing the supply of and the demand for computing resources is a key issue to obtain energy-efficient server consolidations. Whereas traditional concepts mostly consider deterministic predictions of the future workloads or only aim at finding approximate solutions, in this article, we propose an exact approach to tackle the problem of assigning jobs with (not necessarily independent) stochastic characteristics to a minimal amount of servers subject to further practically relevant constraints. As a main contribution, the problem under consideration is reformulated as a stochastic bin packing problem with conflicts and modeled by an integer linear program. Finally, this new approach is tested on real-world instances obtained from a Google data center.


Motivation and problem statement
Nowadays, data centers are representing one of the most significant elements in the next stage of growth for the information and communication technology (ICT) industry (Corcoran and Andrae 2013). By way of example, as the importance of cloud computing has been steadily increasing over the past couple of years, already today a considerable portion of the global IP traffic is processed and stored in data centers. According to a forecast made by Cisco Systems (Barnett et al. 2018), the global data center IP traffic is expected to grow more than threefold between 2017 and 2022, leading to a compound annual growth rate (CAGR) of 26%, see also Fig. 1 for a general overview and Fig. 2 for some of the key reasons of this considerable increase.
Naturally, to cope with this huge amount of traffic, a very large number of processing and storage servers is required in the data centers. More problematically, already nowadays these computational units inevitably consume a significant amount of energy (Arjona et al. 2014;Koomey 2008), which is going to increase exponentially over the next couple of years, see Fig. 3. From an overall point of view, in a pessimistic scenario, data centers will contribute to about 13% of the global energy demand in 2030 (compared to roughly 1.5% in 2010), see (Jones 2018).
Trying to keep the environmental consequences of this increase within a tolerable limit, concepts and measures to reduce energy consumption and emissions [such as the integration of renewable energies in data centers (Goiri et al. 2015;Oro et al. 2015)] have been extensively dealt with in the literature, see (Andrae and Edler 2015) and further references therein. However, note that most of these  Barnett et al. (2018) 1 3 "green energy" approaches are not designed for (and thus not successful in) reducing the absolute energy demand.
Another approach to improve the energy efficiency of data centers or server clusters is motivated by the observation that processing units consume a   Fig. 4), but also recently appeared in a modified form in Jones (2018) 1 3 An introduction to stochastic bin packing-based server… disproportional amount of energy whenever they are idle, underutilized, or overloaded, see (Hähnel et al. 2018) or (Wu 2013, Fig. 2.2). Moreover, independent studies revealed that existing servers are typically not used optimally for fear of not being able to guarantee high availability during peak times (Dargie 2015;Manvi and Krishna Shyam 2014;Möbius et al. 2014). Consequently, efficient server consolidation strategies are a key element to obtain an improved resource utilization. In recent years, several approaches have been presented in the literature, but all of them share the challenging task to accurately estimate the future workloads to balance the demand for and the supply of computing resources. Whereas traditional strategies tend to allocate the given services with respect to a deterministic prediction of the expected workloads, see ) and references therein, recent measurements and studies suggest that a considerable amount of data center workload for different applications is highly volatile (Benson et al. 2010;Kandula et al. 2009), see also Fig. 4 for the fluctuations of a realworld example.
However, reliable and reasonable deterministic estimators are difficult to find without running the risk of wasting resources based on too pessimistic predictions (Chen et al. 2011;Wang et al. 2011). To better display the uncertainty of the future resource demands, characterizing the services in a probabilistic way turned out to be a more promising approach (Hähnel et al. 2018;Monshizadeh Naeen et al. 2020;Wang et al. 2011;Yu et al. 2020). More precisely, we consider the following offline scenario: Given a fixed number n of jobs (services, tasks, etc.) whose resource demands are described as a stochastic process X ∶ × T → ℝ n , where ( , A, ℙ) is a probability space and T ∶= [0, ] describes a bounded time horizon with > 0 . We aim at computing the lowest number of servers (machines, processors, cores, etc.) of capacity C > 0 that is able to accommodate all jobs subject to further constraints, the most important of which separating conflicting jobs and ensuring that

Related literature and contributions
From a mathematical point of view, the setup mentioned above can basically be referred to as a stochastic bin packing problem (SBPP). In that interpretation, the items would have nondeterministic item lengths, while the bin capacity is fixed to some constant. The ordinary bin packing problem (BPP), or the neighboring cutting stock problem (CSP), is one of the most important classical representatives in combinatorial optimization and still attracts significant scientific interest according to several data bases, see (Delorme et al. 2015, Fig. 1) for a trend of the related publications. Starting with early works (Gilmore and Gomory 1961;Kantorovich 1960), over the last decades, the BPP and the CSP have been studied extensively within the literature. By way of example, we refer the reader to some (by far not exhaustive) surveys (Delorme et al. 2016;Valério de Carvalho 2002) and standard references about approximation algorithms (Coffman et al. 2013(Coffman et al. , 1984, branch-and-bound based techniques (Belov and Scheithauer 2006;Valério de Carvalho 1999;Vance 1998;Vance et al. 1994), classical pseudo-polynomial integer linear programming (ILP) formulations (Dyckhoff 1981;Martinovic et al. 2018;Valério de Carvalho 2002), or modern and advanced approaches (Brandão and Pedroso 2016;Clautiaux et al. 2017;Delorme and Iori 2020;Wei et al. 2020). Moreover, in the last couple of years, (deterministic) generalizations with respect to a temporal dimension have been proposed in various articles (Aydin et al. 2020;de Cauwer et al. 2016;Dell'Amico et al. 2020).
As regards the stochastic bin packing problem, probably two of the earliest references are given by Coffman et al. (1980) and Shapiro (1977). Therein, the item sizes are once drawn according to a specific probability distribution, and then exemplarily scheduled based on a next-fit heuristic. Whereas a true time dimension or the volatility of item sizes over time is not considered in these works, the potential applicability of bin packing (or related problems) to multiprocessor scheduling is already pointed out with respect to makespan minimization (Coffman et al. 1978). In recent years, also server consolidation or load balancing has been addressed in connection 1 3 An introduction to stochastic bin packing-based server… with the SBPP. However, the approaches presented in the related literature are different from our's because of: -In many cases, specific assumptions on the distributions of the given workloads are explicitly required, see, e.g., Kleinberg et al. (2000) for Bernoulli-type random variables or Goel and Indyk (1999) for exponentially distributed workloads. -The stochastic independence of the workloads is often assumed (Cohen et al. 2019;Monshizadeh Naeen et al. 2020;Wang et al. 2011;Yu et al. 2020). -A significant amount of publications deals with the so-called effective item sizes (Kleinberg et al. 2000;Wang et al. 2011), meaning that again the random variables are replaced by an appropriately defined deterministic value (that tries to use information provided by the distribution). Sometimes, these effective item sizes are still too difficult to handle, so that (easier) lower and upper bounds for these values are considered instead.
Moreover, whichever the case may be, all of these articles only address the approximate solution based on heuristics rather than providing models or strategies to exactly solve the problem under consideration. To the best of our knowledge, the latter has first been attempted by Martinovic et al. (2019), where two exact solution approaches for normally distributed and independent workloads have been presented. Note that, as extensively described by Martinovic et al. (2019), the introduced approach can also be applied to handle a wide variety of other distributions, as long as they are somewhat "stable" under convolution. However, many of these other distributions would lead to ordinary bin packing problems with possibly modified (deterministic) item sizes or bin capacity (Martinovic et al. 2019, Remark 1). Hence, also in this work, we will focus on normally distributed workloads which is a common approach (Cohen et al. 2019;Jin et al. 2012;Wang et al. 2011) or reasonable approximation, see (Martinovic et al. 2019, Remark 3) or (Yu et al. 2020, Fig. 4), and also warrantable for many real-world data, see also Fig. 5. To say it more clearly: in this paper, the given real workloads are approximated by a normal An exemplary schematic of four real-world CPU utilization characteristics. Obviously, approximating these workloads by perfect normal distributions will lead to sufficiently accurate descriptions of the jobs distribution by matching the first two moments. Then, the assignment is calculated based on the idealized workloads. In general, this does not necessarily have to result in reasonable approximations for the original data-but since, in our scenario, the true workloads can be approximated very well by a normal distribution, it is permissible to proceed in this way. Recently, the method from Martinovic et al. (2019) has been compared to other consolidation strategies with respect to different performance and execution metrics (e.g., job completion time, system overload probability), see (Hähnel et al. 2018). In each category, our approach (Martinovic et al. 2019) incurred a modest penalty with respect to the best performing approach in that category, but overall resulted in a remarkable performance clearly demonstrating its capacity to achieve the best trade-off between resource consumption and performance. Given the fact that exact formulations for server consolidation are currently (still) limited to moderately sized instances, the main challenge in this area is scalability. Hence, it is of paramount importance to foster theoretical approaches contributing to further increase the size of problems that can be solved in a reasonable amount of time, even if their numerical properties do not yet allow an instantaneous and unrestricted application in fast real-time scheduling for large server clusters. Moreover, from a practical perspective, knowing the exact solution is a requirement to accurately judge the quality of lower bounds and heuristic strategies.

Remark 2
From an overall point of view, the applicability of exact approaches mainly depends on the precise characteristics of the considered data center. For example, as reported in Dargie (2019), the Enterprise Cloud Infrastructure (ECI) at the Centre for Information Services and High Performance Computing (ZIH Dresden) 1 consists of 59 computing servers and additional 29 storage servers. Last year, it was hosting 1100 commercial virtual machines (i.e., jobs in a more abstract sense). The virtual machines were compute-intensive and their computation time was relatively long. By comparison, one of the Alibaba's Production Clusters (APC) 2 hosted more than 44000 Linux Containers on 3985 servers. Comparing the ratio of jobs to servers, the first yields 12.6, whereas the second is around 11. That means, as far as the server workload was concerned, we can consider both systems large scale. Indeed, server consolidation on the first makes sense, because (the total number of jobs is not too large and more importantly) the execution duration was significantly long justifying all the computation costs, whereas almost all the jobs on the Alibaba servers were short-lived making long-term consolidation difficult to achieve. In the latter case, a fast heuristic solution is definitely required. Altogether, the exact methods presented in this article are particularly intended for data centers being predominantly confronted with rather long-lasting jobs, as, in this scenario, the costs of calculating an optimal solution will be recouped by the resulting savings (over the long period of execution).

3
An introduction to stochastic bin packing-based server… Whether one regards an entire server with a large number of processor cores or a single multi-core processor, it is imperative to co-locate virtual machines (or simply jobs, to use a more abstract term) in such a way that they neither contend for resources unnecessarily nor underutilize them considerably. Indeed, ideally, the colocated jobs should complement one another (such as one is active when another is inactive). While, in our previous paper (Martinovic et al. 2019), we addressed the optimal assignment of jobs to processor cores by assuming that each job generates a stochastic workload, we did not, however, regard the mutual interactions of the jobs. This resulted in a very extended selection process when dealing with a large number of jobs. In this paper, we, among others, also take the mutual characteristics of the workloads of co-located jobs into account, especially to identify pairs of jobs having overlapping resource utilization characteristics which must not be co-located. Such exclusion not only facilitates the consolidation of a large number of jobs, but also avoids contentious jobs from sharing a processor core or server. More precisely, the main contributions (in particular, compared to Martinovic et al. (2019)) of this article are the following: -We consider a more general and application-oriented scenario, where the given workloads do not have to be stochastically independent. -We present the concept of overlap coefficients to reduce the number of conflicting jobs being allocated to the same server. -The computational experiments are based on real data from a Google data center (Reiss et al. 2011). By theoretical and numerical arguments, we particularly discuss the optimal choice of a parameter determining how many pairs (of jobs) are forbidden to be co-located.
As we will show within the paper, we can take into account these contributions by considering a stochastic bin packing problem with conflicts (SBPP-C). Moreover, the new exact ILP model can cope with much larger instance sizes than the less general formulation from Martinovic et al. (2019). This can be used in particular to assess the quality of heuristic approaches for a larger class of instances. This article is structured as follows: In the next section, we properly introduce the concept of overlap coefficients and present the mathematical basics of our approach. Afterward, in Sect. 3, an exact ILP formulation as well as a lower and an upper bound are proposed. In Sect. 4, we present the results of numerical simulations and explain the methodology and assumptions used therein. Finally, we give some concluding remarks and an outlook on future research.

Preliminaries and notation
Throughout this work, we will consider a given number n ∈ ℕ of jobs (or services, tasks, items), indexed by i ∈ I ∶= {1, … , n} , whose workloads can be described by a stochastic process X ∶ × T → ℝ n , where ( , A, ℙ) is a probability space and T ∶= [0, ] , > 0 , represents a time horizon (i.e., an activity interval for the jobs). Moreover, as motivated in the previous section, the jobs are assumed to follow a normal distribution. More formally, we have X t ∼ N n ( , ) for all t ∈ T , where ∶= ( i ) i∈I and ∶= ( ij ) i,j∈I are a known mean vector and a known positive semidefinite, symmetric covariance matrix, respectively, of an n-dimensional multivariate normal distribution N n . In particular, this implies that any individual workload Remark 3 For the sake of completeness, observe that the opposite is not true, in general. More precisely, a vector formed by n normally distributed random variables does not have to be normally distributed (in dimension n).
These jobs shall once be assigned to a minimal number of servers (or machines, processors, cores) with given capacity C > 0 , i.e., it is not allowed to reschedule the jobs at any subsequent instant of time. 3 Similar to the ordinary BPP, we use incidence vectors a ∈ n to display possible item combinations. Here, a i = 1 holds if and only if job i, i ∈ I , is part of the considered subset. To represent a feasible combination of jobs (for a single server), this vector has to satisfy two important conditions: (A) (stochastic) capacity constraint: For a given threshold > 0 , we have to demand ℙ[X ⊤ t a > C] ≤ for all t ∈ T to limit the probability of overloading the bin capacity, see also Fig. 6. Fig. 6 The consolidation of two (independent) normally distributed workloads on one processor. This assignment satisfies the capacity constraint (A) whenever > 0.082 is considered 1 3 An introduction to stochastic bin packing-based server… (B) non-conflict constraint: Let F ⊂ I × I describe a set of forbidden item combinations (to be specified later). Then, a i + a j ≤ 1 has to be true for all pairs (i, j) ∈ F . The motivation behind this constraint is to basically separate those pairs of jobs, which are likely to influence each other's performance.
Definition 1 Any vector a ∈ n satisfying Conditions (A) and (B) is called a (feasible) pattern or a (feasible) consolidation. The set of all patterns is denoted by P.
In what follows, we aim at finding a more convenient and computationally favorable description of the pattern set P. To this end, knowing the distribution of the random variable X ⊤ t a , t ∈ T , is required in Condition (A). Fortunately, for any t ∈ T , this linear transformation of the normally distributed random vector X t ∼ N n ( , ) is again normally distributed (even if the individual components of X t are not stochastically independent!) (Balakrishnan and Nevzorov 2003, Chapter 26), see also Fig. 6, meaning that holds for all t ∈ T . Consequently, we obviously have is a representative random vector (in terms of the distribution) for the workloads. Hence, from now on, we do not always have to explicitly mention the time indices t ∈ T (or the time horizon T, in general) in the following formulas and discussions.
Based on these observations, it is possible to briefly refer to the server consolidation problem as a stochastic bin packing problem with conflicts (SBPP-C). To this end, we introduce the following term.
Definition 2 A tuple E = (n, c, C, , , F, ) consisting of -a deterministic server (bin) capacity C ∈ ℕ, -an error bound ∈ (0, 1) for the violation of the bin capacity, -n ∈ ℕ jobs (items) with (not necessarily independent) normally distributed workloads (weights) c ∼ N n ( , ), -a set F of forbidden item combinations is called an instance of the SBPP-C.
Remark 4 Obviously, we have to demand ℙ[c i > C] ≤ for all i ∈ I to ensure the solvability of E. Moreover, without loss of generality, the bin capacity (and the workloads) can be normalized to C = 1.
Thus, we can state

Lemma 1 Let E be an instance of the SBPP-C. Then, a ∈ n satisfies Condition (A) if and only if
Proof Due to (1) and the definition of the quantile function, we obviously have Hence, it is possible to rephrase Condition (A) as a deterministic (but still nonlinear) inequality; a fact already recognized in very early publications on chance constraints, see (Hillier 1967, p.37) or (Kataoka 1963, p.184). At least for some values of , an easier representation can be obtained by the following observation: Proof Let a ∈ n satisfy (2) which is equivalent to Condition (A). Due to 0 < ≤ 0.5 , we certainly have q 1− ≥ 0 , so that (2) directly leads to Moreover, by squaring this inequality, we obtain Rearranging the terms leads to where a i = a 2 i for a i ∈ and ij = ji have been used in the last line. In the reverse direction, basically, the same steps can be applied. Here, the property C ≥ ⊤ a is important to take square roots on both sides of (C − ⊤ a) 2 ≥ q 2 1− a ⊤ a without causing a case study. ◻ Consequently, as already seen for a less general case in (Martinovic et al. 2019, Theorem 2), Condition (A) can be expressed as a pair of one linear and one quadratic constraint. Moreover, note that the assumption 0 < ≤ 0.5 does not incur an actual restriction, since, typically, only a modest error bound is given for practically meaningful instances (Hähnel et al. 2018).
As regards Condition (B) from the feasibility definition, we only have to clarify how to obtain an appropriately chosen set F of forbidden item combinations. To this end, note that demanding Condition (A) only states an upper bound for the overloading probability of a server. However, this does not mean that for a specific realization ∈ , the consolidated jobs cannot have a workload c(w) ⊤ a larger than C. In particular, this can happen (even for all instants of time t ∈ T ) if many workloads are larger than their expectations, i.e., if c i ( ) > i is true for several i ∈ I with a i = 1 . Practically, this would then lead to some latency in the execution of the services. Hence, it is desirable to somehow "avoid" these performance-degrading situations. To tackle this problem, as already mentioned in the introduction, one of the main novelties of our approach is given by the consideration of overlap coefficients.

Definition 3 For given random variables
denoting the expected value and Example 1 To demonstrate the intention of the overlap coefficient by means of a preferably simple introductory example, let us first define a uniformly distributed random variable ∈ U([0, 2)) . Then, we consider the three following random variables: where 1 A is the indicator function of A ⊆ ℝ . Obviously, any of these variables represents a Bernoulli trial with probability p = 0.5 , so that the mean value and the variance are given by 0.5 and 0.25, respectively. It can easily be calculated that we obtain X,Y 1 = −1 and X,Y 2 = −0.9 for the respective overlap coefficients, meaning that the level of interaction between X and Y 1 is lower. On the other hand, we also have since X + Y 1 = 2 is impossible, whereas X + Y 2 = 2 holds precisely for ∈ 0, 1 10 . Altogether, both combinations would be feasible for ≥ 0.05 , but the situation is much better for the less interactive pair {X, Y 1 } (compared to {X, Y 2 } ) with the lower overlap coefficient, because here the capacity is never exceeded.

Lemma 3 Given two random variables Y, Z as described above, then we have
Proof This is an immediate consequence of the Cauchy-Schwarz inequality and the fact that we have R 2 (y, z) = 1 for all y, z ∈ ℝ . Indeed, we obtain ◻ Remark 5 Contrary to the ordinary correlation coefficient YZ , defined by the new value YZ does not "penalize" the situation, where both jobs Y and Z possess a relatively small workload (compared to the expectations Y and Z ), since this situation is less problematic in server consolidation. This means that only those cases where both Y and Z require more resources than expected will contribute to a positive overlap coefficient.
Based on these observations, we intend to limit the (pairwise) overlap coefficients of services that are executed on the same server by some value S ∈ [−1, 1] . Since we would like to exclude situations where the considered jobs are both operating above their expectations, a small value of S seems to be preferable. However, this could lead to too strong restrictions, meaning that the required number of servers becomes much larger.
As it will turn out, choosing S ≈ 0 can be considered reasonable (for the numerical data and the intended application dealt with in this article) for several reasons. While the practical reasons will be discussed in more detail in the

3
An introduction to stochastic bin packing-based server… computational part (see Sect. 4), the main theoretical justification is based on the following observation.
Proof Without loss of generality, we can assume = (0, 0) ⊤ . Otherwise, we could continue with the centered random variables Y − Y and Z − Z without changing the covariance structure of the random vector. At first, we note that the function R (from the definition of the overlap coefficient) can be expressed as a combination of indicator functions Consequently, the overlap coefficient of normally distributed random variables is always less than or equal to 0. Observe that, in the computational experiments, the overlap coefficients are attached to the real workloads of the jobs (which do not follow an ideal normal distribution), whereas the assignment of the jobs is calculated based on the normal approximation of the jobs. Hence, our input data will also contain positive overlaps, but choosing a threshold S = 0 (or at least S ≈ 0 ) does not affect those pairs that are formed by jobs (almost) following a perfect normal distribution.
Example 2 In addition to the highly simplified scenario considered in Example 1, we would now like to look at a more interesting case for our purposes. To this end, for two normally distributed random variables X, Y, we study the relationship between the overlap coefficient XY and the probability ℙ[X + Y > 1] to exceed the capacity of a server. More precisely, we randomly pick the parameters = ( 1 , 2 ) ⊤ and = ( ij ) of a bivariate normally distributed random vector (X, Y) ⊤ according to uniform distributions 4 1 , 2 ∈ [0.3, 0.4] and √ 11 , √ 22 ∈ [0.05, 0.15] . Moreover, based on the previous choices, the covariance 12 = 21 is randomly drawn from an interval symmetric to zero, so that the positive definiteness of is ensured. The results of a total of 500 test runs are summarized in Fig. 7. We can clearly see that (in the vast majority of cases) small values of XY correspond to small probabilities ℙ[X + Y > 1] , so the general trend observed in Example 1 also applies to normally distributed input data. Finally, we note that indeed all XY were negative, just as predicted in the previous theorem.
For a given threshold S ∈ [−1, 1] , the set of forbidden item combinations F ∶= F(S) , that is Fig. 7 For each test run, a ⋆ is drawn at position (x, y) , where x represents the overlap coefficient and y represents the probability to exceed the capacity of the server 1 3 An introduction to stochastic bin packing-based server… where ij represents the overlap coefficient between distinct jobs i ≠ j ∈ I , can be computed beforehand, since any required information are input data of an instance.
The following result now summarizes the main observations of this section and states an appropriately convenient description of the pattern set P.
Lemma 4 Let E be an instance of the SBPP-C with 0 < ≤ 0.5 . Then, a = (a i ) i∈I ∈ P holds if and only if the following constraints are satisfied: Note that the quadratic terms a i a j appearing in (9) can be replaced by additional binary variables (and further linear constraints) to obtain a fully linear description of the pattern set. To this end, different reformulation techniques have recently been investigated from a theoretical and practical point of view, see Furini and Traversi (2019). In that article, the approach originally presented by Glover and Woolsey (1974) is shown to offer a good balance in terms of computational properties (e.g., the strength of the obtained LP bounds) and modeling aspects (e.g., the numbers of required additional variables and constraints). Consequently, we only consider this linearization strategy in the next section.

An exact solution approach
To model the SBPP-C, we propose an integer linear program (ILP) with binary variables that is similar to the Kantorovich model (Kantorovich 1960) of the ordinary bin packing problem. More formally, given an upper bound u ∈ ℕ for the required number of servers (bins), we introduce decision variables y k ∈ , k ∈ K ∶= {1, … , u} , to indicate whether server k is used ( y k = 1 ) or not ( y k = 0 ). Moreover, we require assignment variables x ik ∈ , (i, k) ∈ Q , to model whether job i is executed on server k ( x ik = 1 ) or not ( x ik = 0 ), where Remark 6 Obviously, the x-variables could be defined for any pair (i, k) ∈ I × K , but to reduce the number of symmetric solutions, we implicitly renumber the servers in such a way that job i = 1 is scheduled to server k = 1 , job i = 2 is either scheduled to server k = 1 or to a new server k = 2 , and so on. With this approach, considering index set Q is sufficient.
Similar to this preprocessing of some x-variables, a lower bound ∈ ℕ for the optimal objective value can be used to define y 1 = y 2 = … = y = 1 in advance.
As already pinpointed at the end of the previous section, the quadratic terms in (9) will be replaced by additional binary variables k ij with k ∈ K and (i, j) ∈ T k where to only consider those index tuples (i, j, k) that are compatible with the indices of the x-variables.
Remark 7 For each quadratic term x ik x jk appearing in the feasibility conditions of pattern x k = (x ik ) , the three constraints k ij ≤ x ik , k ij ≤ x jk , and x ik + x jk − k ij ≤ 1 have to be added to ensure x ik x jk = 1 if and only if k ij = 1.
Altogether, with the abbreviated coefficients for i ∈ I and j > i appearing in (9), the exact model for the SBPP-C results in

An introduction to stochastic bin packing-based server…
Although this model seems to be quite complex, its structure is easily understandable. The objective function minimizes the sum of all y-variables, that is the number of servers required to execute all jobs feasibly. Conditions (11) ensure that each job is assigned exactly once. According to Lemma 4, for any server k ∈ K , conditions (12)-(14) guarantee that the corresponding vector x k = (x ik ) represents a feasible pattern. Remember that here we already replaced the quadratic terms x ik ⋅ x jk by the new binary variables k ij , so that conditions (15)- (17) have to be added to couple the x-and the -variables. Based on the observations made at the beginning of this section, conditions (18) already fix some of the appearing variables to reduce the number of symmetric solutions.

Remark 8
Of course, having O(n 3 ) binary variables and O(n 3 ) linear constraints, the above model can be considered relatively difficult to solve. However, in the previous publication (Martinovic et al. 2019, Tables 2-4), dealing with a less general scenario, it was shown on the basis of extensive tests that this additional effort in modeling offers significant numerical advantages compared to a nonlinear formulation. For this reason, we limit ourselves in this article to the examination of the linearized approach.
For a given instance E, there are different ways to obtain lower and upper bounds that can be used to formulate the assignment model. Whereas upper bounds for minimization problems are usually found by heuristics, lower bounds can be obtained by (combinatorial) investigations of the input data. Here, we choose an (adapted) material bound and the First Fit Decreasing (FFD) heuristic to compute the values and u, since (among other possibilities) especially the latter (i.e., the FFD approach) turned out to usually lead to good approximations, see (Martinovic et al. 2019) for a preliminary study on their performances for a less general related problem.

Lemma 5 Let E be an instance of the SBPP-C. Then, the value
defines a lower bound for the optimal objective value z ⋆ of the SBPP-C.
Let z ⋆ denote the optimal value of the given instance E. Then, any pattern a j , j = 1, … , z ⋆ (that is used in this solution) has to satisfy the feasibility conditions presented in Lemma 4. By representing a pattern with its corresponding set of active indices I j ∶= {i ∈ I ∶ a ij = 1} , we obtain a partition I 1 , … , I z ⋆ of I with for all j ∈ {1, … , z ⋆ } . An aggregation of all these inequalities finally leads to ◻ As regards the ordinary BPP, this bound is known to lead to rather poor approximations of the true optimal value, in general, since the absolute difference between both values can become arbitrarily large. Having the BPP as a special case of the SBPP-C, a similarly bad performance of should be expected in our calculations.

Remark 9
Actually, in the stochastic setting with conflicts, the situation is even worse. While we have a worst-case performance ratio of 2 in the BPP case, here the statement holds. This can be verified by considering an instance with n deterministic jobs having i = 1∕n (for all i ∈ I ∶= {1, … , n} ). Assuming that every combination of the items belongs to the set F, then we have z ⋆ = n and = 1 . Hence, for n → ∞ , the performance ratio can become arbitrarily large.
However, finding more appropriate lower bounds is not straightforward. By way of example, a reasonably performing combinatorial lower bound (known from stochastic bin packing (Martinovic et al. 2019)) cannot be applied in our scenario.
Remark 10 Contrary to Martinovic et al. (2019), it is not possible to use the lower bound in this setting. By way of example, let us consider an instance with n = 2 items satisfying 1 = 2 = 0.5 , 11 = 22 = 0.1 , and 12 = 21 = −0.1 . Moreover, we assume An introduction to stochastic bin packing-based server… C = 1 and = 0.1 . This leads to z ⋆ = 1 , since all jobs can be assigned to one server. However, we also obtain ̃ = 2 , so that this term cannot be a valid lower bound, in general.
To obtain an upper bound, we construct one feasible solution based on the following FFD algorithm, where the items are sorted with respect to non-increasing mean values.
Note that, for many cutting and packing problems, feasible solutions based on FFD heuristics are known to lead to reasonable approximations with respect to the optimal value, see (Dósa et al. 2013;Martinovic et al. 2019). By way of example, we have for any instance E of the ordinary bin packing problem (Dósa et al. 2013). However, the question whether the generally favorable performance of FFD also applies to the optimization problem investigated here can only be answered with the help of either good lower bounds or by knowing the actual optimum value. Since, as explained before, lower bounds of reasonable quality are missing so far, the possibility to calculate exact solutions is needed to evaluate the quality of heuristic approaches. Thus, the presentation of an exact formulation is useful even if the heuristic numerically proves to be nearly equivalent (in terms of the objective value), because without the knowledge of an exact solution, the very good approximation could not be manifested.
Before we finally move on to the test calculations, we would like to briefly discuss the fact that the previously mentioned heuristics and exact approaches determine assignments based on idealized normal distributions. We have already collected many theoretical arguments for the validity of this approximation in the previous sections, but now we would also like to shortly address the practical perspective at least by a numerical example. In that regard, we are primarily concerned with the question of whether the calculated assignments might not be applicable to the original workloads at all, because (in contrast to the perfect normal distribution used for the calculation) they could exceed the server capacity with a higher probability. As a conclusion of this section, this issue will now be investigated numerically. The necessary theoretical explanations can be found in Appendix A.
⌋ Example 3 Let us assume that the original workloads X, Y are given by correlated lognormally distributed random variables. For the sake of completeness, we mention that any lognormally distributed random variable W is defined by where G is a standard normal distribution, see Appendix A for more details. Obviously, it is possible to approximate the pair (X, Y) by a bivariate normal vector (N X , N Y ) by matching the first two moments and respecting the covariance structure. Let > 0 be fixed, then it would be interesting to know whether the feasibility condition ℙ[N X + N Y > 1] ≤ (with respect to the approximated workloads) also implies the feasibility condition ℙ[X + Y > 1] ≤ (with respect to the original workloads). To this end, we again consider 500 test runs with randomly drawn parameters ∈ [log(0.3), log(0.7)] and ∈ [0.05, 0.15] appearing in the construction exp( + ⋅ G) of a lognormal random variable. 5 For any of these scenarios, we collect the values P 1 ∶= ℙ[N X + N Y > 1] and P 2 ∶= ℙ[X + Y > 1] , the last of which can only be evaluated numerically, since the sum of two lognormally distributed random variables does not follow any particular known distribution. For the concrete details of this calculation, we refer again to Appendix A, where we also justify that our approximations are warrantable for the input data described here. A visualization of the obtained results can be found in Fig. 8 together with the function f (x) = x . Here, we clearly see that in almost all cases, P 1 is less than or equal to P 2 underlining that our approach to deal with perfect normal distributions does not Fig. 8 For each test run, a ⋆ is drawn at position (x, y) , where x and y represent the probability P 1 and P 2 to exceed the server capacity when using the approximated and original workloads, respectively. (The red line represents the function f with f (x) = x) 1 3 An introduction to stochastic bin packing-based server… effect the feasibility of the obtained solutions when (instead) the original workloads would have been considered.

Data set and methodology
To better highlight the computational properties of the presented approach, we provide the results of numerical experiments. To this end, we consider real-world data based on 500 workloads (jobs) appearing in a Google data center. These measurements were conducted over a period of 30 days (in May 2011), see (Reiss et al. 2011), and comprise a total number of roughly 12500 physical machines (or servers in our terminology) and 24 ′ 281 ′ 242 tasks (i.e., jobs). The most important characteristics of all jobs (e.g., start and stop time, resource consumptions, memory access, etc.) form a csv-file of roughly 167 GB and can be accessed online, see (Reiss et al. 2011) for the details. Obviously, considering all jobs would be too data-intensive, so that a reasonable subset of these tasks has to be chosen. Here, particularly, the following criteria were applied in the selection process: -As the jobs published in Reiss et al. (2011) have been collected over a period of 30 days, given a fixed job i, many of the other jobs were not executed at the same time. More precisely, there are many jobs j ≠ i starting after i has already been executed or terminating before i has actually begun. Consequently, such jobs can run on the same server, because they are operating in different time intervals and do neither influence each other nor the server capacity at the same instant of time. -The vast majority of the given jobs only possess very low resource consumptions, so that they hardly influence the total energy demand of the data center. By way of example, only 0.0118% (resp. 0.59%) of all jobs are responsible for roughly 15% (resp. 80%) of the CPU utilization.
Based on these properties, we first selected a (preliminary) subset containing the 2857 most resource-intensive jobs causing approximately 15% of the total CPU utilization in the data center. Hence, an efficient consolidation of these tasks could already improve the overall energy consumption significantly. As observed in Patel et al. (2015), the workloads from the Google data center can be partitioned into a small number of different groups of jobs, meaning that the jobs within one and the same group only differ slightly in terms of their characteristics (e.g., i and ii ). Hence, we selected a final subset of 500 representative jobs (from the 2857 jobs chosen before) whose time intervals are still similar, so that they could indeed influence each other if executed on the same server. This set of 500 jobs, the precise characteristics of which can be found in two histograms in Fig. 11 in Appendix B, forms the data basis for the computations reported in the next subsection. Being able to optimally allocate (a subset of) these representative jobs already provides valuable information to efficiently group the remaining (similar) jobs.
In our numerical experiments, for given n ∈ ℕ , we always constructed 20 instances by randomly drawing n jobs from our data basis. Then, we implemented the approaches from Sect. 3 in MATLAB R2015b and solved the obtained ILP models by means of its CPLEX interface (version 12.6.1) on an Intel Core i7-8550U with 16 GB RAM. Here, particularly the overlap coefficients ij , i, j ∈ I , and a reasonable threshold S are required. While the values ij (of the true workloads) are input information given by (4), an appropriately chosen parameter S should be in accordance with the considered input data. To this end, in Figs. 9 and 10, the distribution of the overlap coefficients is depicted as a histogram 6 (for the two data sets specified above). Because of these results, a value S ≈ 0 should be chosen to not exclude too many item combinations (which leads to servers only containing one single job) or to not allow arbitrary pairs (so that the overlap coefficients do not play any role). To stress the suitability of this parameter choice, we added an additional information (drawn as a red line) to the figures: Of course, it could happen that some jobs do not appear at all in the pairs (i, j) which are satisfying ij ≤ S for S ≈ 0 . Obviously, these jobs would later be exclusively assigned to a separate server so that the energy consumption is increased. However, the red line depicted in the figures counts the total number of jobs that appear at least once in the pairs (i, j) used to build the histogram. As we can clearly see, choosing S close to zero 7 leads to a situation, where at least one non-conflicting pair for any job i ∈ I is given. Hence, from a theoretical point of view, any job can be executed with at least one other job on the same server in a feasible consolidation.

Remark 11
This observation does not imply that an optimal consolidation has each server equipped with at least two jobs.
However, based on these two arguments (that are in accordance with the considered data sets) and the theoretical observation from Theorem 1, we will only consider values S ∈ [−0.1, 0.1].
For any instance, we collected the following data: -̃ ,ũ : lower and upper bound (for the approach from Martinovic et al. (2019)), -, u : lower and upper bound (as described in Sect. 3) -z ⋆ : optimal value (obtained by the assignment model), -n v , n c : numbers of variables and constraints (in the assignment model), t: time to solve the ILP (in seconds).
Note that the values ̃ and ũ are forming an interval for the optimal objective value z that would be obtained with the less application-oriented approach from Martinovic et al. (2019). For the instance sizes presented in the next subsection, the true optimal value of the former approach is not available, since, in Martinovic et al. (2019), only 1 3 An introduction to stochastic bin packing-based server… instances up to n = 18 could be solved to proven optimality while having relatively large running times (more than 10 minutes on average for n = 18).

Results and discussion
Based on the experiences made by Martinovic et al. (2019), we selected = 0.25 within our computations. Moreover, the considered workloads are normalized to a server capacity of C = 1 , and a performance threshold of S = 0 is chosen to preferably avoid the consolidation of "positively correlated jobs" (in the interpretation of the overlap coefficients), also taking into account the observation from Theorem 1. Furthermore, we choose a time limit of t (1) max = 300 s within our computations. In Table 1, we only refer to the average values (based on 20 instances each) instead of listing the results of any single instance. Obviously, for increasing values of n, the instances become harder with respect to the numbers of variables and constraints, so that more time is needed to solve the problems to optimality. However, any considered instance could be coped with in the given time limit.
Our main observations are given by: -Contrary to the results in Martinovic et al. (2019), the quality of the lower bound is much worse in this generalized setting, as we see 2.4 < z ⋆ ∕ < 2.9 for the average values from Table 1. The main reason for this bad performance is given by the fact that the lower bound does neither reflect any of the forbidden item combinations nor the covariances of the jobs, so that it does not use any of the new problem-specific input data.
-The upper bound obtained by the FFD heuristic is still very close to the exact optimal value, as already observed in Hähnel et al. (2018), Martinovic et al. (2019) (or for the ordinary BPP (Dósa et al. 2013)). More precisely, we can state u∕z ⋆ < 1.05 for the average values in Table 1. Here, the precise pattern definition (including the covariances and forbidden combinations) is always applied, so that the obtained consolidations satisfy all feasibility conditions. We would like to emphasize that a valuation of the upper bound u based solely on the value would not lead to any substantial information in this case. Only by knowing the actual optimal value z ⋆ the good quality of the approximation can be observed.
-In this generalized setting, it is possible to deal with much larger instance sizes.
Most probably, this is caused by the new set of inequalities (to avoid forbidden item combinations) which can be modeled without requiring new variables. Hence, if there are many of these constraints, the set of feasible solutions is (considerably) restricted which usually reduces the numerical efforts. Note that these additional constraints can also help to fix additional variables in different nodes of the branching trees. 1 3 An introduction to stochastic bin packing-based server… -Having a look at the intervals [̃ ,ũ] and the optimal value z ⋆ of the generalized problem, we can roughly observe z ⋆ ≈ 2 ⋅ � z , where z ∈ [̃ ,ũ] is the unknown optimal value of the less general approach. -For every n from our table, we see that n∕z ⋆ ∈ [2, 3] holds, meaning that (on average) a server is equipped with two or three jobs. This observation is in accordance with Fig. 11 in Appendix B and strongly related to the fact that we are considering the very resource-intensive jobs (and, on top, the further restrictions caused by the set F) from the Google data trace. Note that such instances are challenging especially due to the large number of binary variables caused by, among others, a relatively large cardinality of the set K (that is, a large number of possible servers passed to the optimization problem).
Altogether, although a generalized (and more complicated) scenario is considered here, instances of larger problem sizes can now be solved in reasonably short times. Consequently, this new approach does not only contribute to a more realistic description of the consolidation problem itself (since additional application-oriented properties are respected), but also to a wider range of instances that can be solved to optimality.

Remark 12
Obviously, in very large data centers, the challenge is to sometimes assign millions of jobs in a relatively short time period, and the approach presented here does not easily scale to this complexity. However, as observed earlier, the characteristics of these jobs can typically be grouped into a few different classes. Hence, the optimal assignment of a representative set of jobs (which our approach is able to compute) can already be very helpful to also schedule the remaining (similar) jobs in a reasonable manner.
In a second experiment, we would like to investigate the influence of the new threshold parameter S in more detail. So far, we could have got the impression that incorporating forbidden item combinations potentially boosts the performance of the ILP formulation (compared to the former approach from Martinovic et al. (2019)). To this end, for the two exemplary choices 8 n ∈ {25, 40} , we consider the instances used in the respective column of Table 1, and vary the value of S among five different constellations. Since, for S = 0 , these instances turned out to be quite easy, we selected a smaller CPLEX time limit t (2) max = 60 s for all computations of this experiment. Moreover, we added an additional indicator opt counting the number of instances that could be solved to optimality in that time. If an instance could not be solved successfully in 60 s, its data are, however, included in the averages. In these cases, we use t = t (2) max as the solution time and the best objective value available at the end of the time limit as (an approximation for) z ⋆ . Hence, for these instances, we are underestimating t while possibly overestimating z ⋆ .
Based on these computational results, the following main observations can be made: -By construction, the values of , ̃ , and ũ do not contain any information about forbidden item combinations, and hence, they do not change with varying threshold S. -The surprisingly good performance of the FFD approach (leading to the upper bound u) can be noticed for all choices of S. -Obviously, a larger value of S leads to a reduced number of item conflicts, so that a smaller number of servers is required, both in the approximate and exact solution obtained by the FFD heuristic and the ILP model, respectively. -The absolute increase in terms of z ⋆ is always the largest for the step from S = −0.05 to S = −0.10 . The reason for this observation is related to the red lines in Figs. 9 and 10, where we stated that only for S ≥ −0.07 , each job is guaranteed to have at least one non-conflicting partner. Hence, the solution for S = −0, 10 naturally contains some single-job servers (which can mostly be avoided for the other values of S), so that the increase in terms of z ⋆ is particularly high. -We can observe that the numbers of variables and constraints become smaller when S increases. This is mainly caused by two effects: On one hand, a higher value of S naturally leads to a fewer number of forbidden item combinations, so that there is a smaller number of constraints of type (14) in the ILP. On the other hand, this less restrictive consolidation strategy leads to a smaller value of the upper bound u which determines the size of the set K, and thus strongly influences the numbers of variables and constraints. -However, especially when considering the values opt and t, a lower number of variables and constraints does not necessarily have to lead to an ILP model easier to solve for CPLEX. More precisely, having S = 0 seems to be the most favorable setup for our optimization. We attribute this observation to two opposing effects: On one hand, with increasing value of S, the optimization problem becomes less restrictive, since more item combinations are possible making it harder to solve, in general. On the other hand, we empirically noticed that for a given instance, the lower bound always matches the optimal value of the LP relaxation (at the root node). As the lower bound does not depend on S (but the optimal value z ⋆ actually does), we conclude that, for decreasing values of S, the bounds become worse, so that the branch-and-bound procedure applied by CPLEX is impaired. Consequently, in a rough summary, the parameter S manages the trade-off between the cardinality of the feasible set of points and the quality of the LP bounds. -In both tables (but more clearly in Table 3), a "skewness" in terms of the counter opt can be observed. More precisely, CPLEX is always possible to solve more instances to proven optimality for the positive values of S (compared to their negative counterparts). We interpret this as an indication that, among the two opposing effects mentioned in the previous point, the quality of the LP bound seems to be more important for the solution of our instances.
Altogether, the choice S = 0 is not only reasonable from a theoretical point of view, but also from a practical perspective, since it most probably offers the best compromise between the complexity of the ILP model and the solution times.

Remark 13
As stated in the list above, the test scenarios for Table 2 and Table 3 always resulted in an equality between the lower bound and the optimal value z ⋆ LP of the LP relaxation. However, this does not hold in general. In a further series of 50 instances with n = 25 (not reported here), we happened to find an instance having = 4 and z ⋆ LP ≈ 4.05272 . Hence, the (rounded-up) LP bound can be strictly larger than . Given the quadratic number of required input data (that is, especially the entries of the covariance matrix) for a complete description, we decided to not present the setup of this single instance here. However, for the interested reader, it should be mentioned that the full details can be obtained from the authors upon request.

Conclusions
In this article, we considered a server consolidation problem with (not necessarily independent) jobs whose future workloads are given in a stochastic way. Moreover, we introduced the concept of overlap coefficients to avoid that mutually influencing jobs are executed on the same server, as this would lead to considerable performance degradations, e.g., in terms of latency. From a mathematical point of view, we showed that the problem under consideration can be reformulated as a stochastic bin packing problem with conflicts. Within this framework, an exact ILP model as well as a lower and an upper bound were presented. Based on numerical experiments with real-world data, this new approach was shown to outperform an earlier and less general method (Martinovic et al. 2019) in terms of the instance sizes that can be solved to optimality within a reasonable amount of time. However, it also turned out that for some parameter constellations, the solution times may still be too large to be applied in dynamic scenarios, so that the practical contributions of our research paper could be summarized as follows: -In data centers, which are predominantly confronted with very long-lasting jobs, the exact procedures can be used, either to handle the complete instance (if the total number of jobs is moderate) or to find an optimal assignment for a representative set of jobs (as it is also alluded to in Remark 12) which can then be used to schedule the remaining jobs in the same fashion. In these cases, the additional efforts to compute an optimal solution are worthwhile because the optimal solution can then be executed for a long time, so that, from an overall point of view, energy will still be saved. -Heuristics (like the FFD approach presented here) should be used in data centers that have to deal with either a large fluctuation or a tremendous number of jobs. The justification that these heuristics lead to useful approximations, however, is based, among other things, on the possibility to calculate exact solutions at least for moderate instance sizes. For this reason, exact procedures are also valuable (from a theoretical point of view) if heuristics should ultimately be preferred for the concrete practical application.
To tackle the challenge of evaluating heuristic solutions also for larger instance sizes, finding improved lower bounds (preferably using all of the problem-specific input data) or alternative (pseudo-polynomial) modeling frameworks is part of our future research. Moreover, based on the new concept of overlap coefficients, we should also think about appropriate means to take the overall interaction of all jobs of a server (and not only the pairwise relationship) into account. The most difficult challenge, however, is to unify the theories for the temporal BPP and the stochastic BPP to obtain a fully application-oriented description of the job-to-server assignment problem (involving job-dependent activity intervals). To this end, an approach taking into account the theory of stochastic processes much more than it was introductorily done in this article will be required in addition.

A Technical details for Example 3
Before presenting the actual calculations, we need to state two important auxiliary results.
Lemma 6 Let Z = e Z + Z ⋅G Z denote a lognormally distributed random variable with G Z ∼ N(0, 1) , Z ∈ ℝ , and Z > 0 . Moreover, we define Then, the CDF of X + Y can be approximated by the CDF of Z = e Z + Z ⋅G Z with G Z ∼ N(0, 1) and if the argument of that CDF is scaled by a factor of 1 2 .
Proof Let us first note that the function x ↦ e x is convex, so that we have for all a, b ∈ ℝ . The closer a and b are, the smaller is the difference between both sides of this inequality. Now, we obtain by convexity. On the other hand, please note that 1 2 ⋅ ( X ⋅ G X + Y ⋅ G Y ) represents a normally distributed random variable with mean value and variance i.e., we finally have for a lognormally distributed random variable Z = e Z + Z ⋅G Z with G Z ∼ N(0, 1) and For our instances, we typically have X , Y ≪ 1 and X ≈ Y (see, for instance, the histograms in Appendix B), so the inequality involving the convexity can be assumed to be a good approximation, so that we can roughly say which proves the claim. ◻ With these ingredients at hand, we can now handle the two probabilities appearing in Example 3: -The lognormally distributed pair (X, Y) ⊤ (with the same notation as in Lemma 7) can be approximated by a bivariate normal distribution (N X , N Y ) with where m X and m Y are defined analogously to m Z from Lemma 6. By that, we can calculate P 1 = ℙ[N X + N Y > 1] from Example 3 simply by the CDF of the normally distributed random variable N X + N Y ∼ N( , 2 ) with This precisely leads to -The calculation of P 2 = ℙ[X + Y > 1] requires Lemma 6 and Lemma 7, and consequently can only be approximated. However, in the light of our discussions, we saw that these approximations are of reasonable quality for the input data appearing in our numerical tests. Altogether, we obtain for the lognormally distributed random variable Z = e Z + Z ⋅G Z with G Z ∼ N(0, 1) . Here, the parameters Z and Z are defined as is Lemma 7. Based on these data, m Z and s Z are then calculated according to the rules presented in Lemma 6.