1 Introduction

The processing of big data sets crucially depends on powerful hardware environments. Big data is typically processed in data and computing centers. Nonetheless, today even a single PC can solve problems with data volumes that were considered huge just a few years ago. In addition to speed, energy consumption has become a major concern in computing environments. Information and communications technology (ICT) systems consume a significant amount of energy. Currently, personal computers, data centers and communication networks use 5–9% of the total electricity worldwide [10, 31, 34]. It is anticipated that the electricity used by ICT could exceed 20% of the global total by 2030. Data centers consume about 200 terawatt hours per year, which corresponds to 1.5% of the global electricity demand [10, 34]. This is more than the energy consumption of many (European) countries.

At their core, powerful hardware environments consist of processing units such as servers, PCs and – at the bottom level – CPUs. They may operate separately and sequentially but in most cases form parallel and, in particular, massively parallel systems. Nowadays, standard PCs and laptops are equipped with multicore architectures. Moreover, in computing and data centers the available processors are interconnected so that hundreds or thousands of CPUs can work on the same application.

In this chapter we review algorithmic techniques for energy savings in hardware and, in particular, processor systems. The study of such approaches has received considerable interest over the past 15 years; see e.g. [3, 14, 21, 32] and references therein. Essentially, there exist two general techniques for energy conservation in processor systems.

(1) Dynamic speed scaling: Many modern microprocessors can run at variable speed or frequency. Examples are the Intel SpeedStep and AMD PowerNow! processors as well as the VIA Technologies LongHaul CPUs and the AsAP 1 chips. The speed changes are implemented at the hardware and operating system levels. High processor speed implies high performance; however, the higher the speed, the higher the energy consumption. The goal is to use the full speed/frequency spectrum of a processor so as to minimize the overall energy consumption, while providing a certain service.

(2) Power-down mechanisms: A well-known technique for energy savings is to transition a given system – such as the display of a desktop, a laptop, or simply a CPU – into a standby or hibernate mode if it has been idle for a while. The design of power-down strategies becomes particularly challenging in multi-processor environments, where the active and idle periods of the components have to be coordinated so that the system can satisfy a desired processing demand.

In dynamic speed scaling, energy is conserved by optimally exploiting the speed spectrum of processors. Power-down mechanisms reduce energy consumption by transitioning idle systems into low-power sleep states. In the following sections we address both of the above techniques, focusing on results that were achieved within our project of the SPP 1736.

2 Dynamic Speed Scaling

Dynamic speed scaling has been studied extensively in the algorithms community. Prior work has considered single-processor environments as well as multi-processor platforms with homogeneous CPUs. In this context a fundamental algorithmic optimization problem was introduced in a seminal paper by Yao, Demers and Shenker [39]. Specifically, we are given a single variable-speed processor. If the processor runs at speed s, then the required power is (proportional to) \(f(s) = s^\alpha \), where \(\alpha > 1\) is a constant. In practice, \(\alpha \) is typically a small value in the range [2, 3]. In fact the cube-root rule for CMOS devices states that the speed s of a processor is proportional to the cube-root of the power or, equivalently, that power is proportional to \(s^3\). Obviously, when considering a time horizon, energy consumption is power integrated over time.
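To see the convexity at work: processing work w at constant speed s takes \(w/s\) time and hence consumes \((w/s)\cdot s^\alpha = w\, s^{\alpha -1}\) energy, so lower speeds are always more energy-efficient whenever deadlines permit. A minimal sketch with hypothetical numbers:

```python
# Energy to process work w at constant speed s under power f(s) = s**alpha:
# time = w / s, energy = time * f(s) = w * s**(alpha - 1).
def energy(w: float, s: float, alpha: float = 3.0) -> float:
    return (w / s) * s ** alpha

# Under the cube-root rule (alpha = 3), doubling the speed quadruples
# the energy required for the same amount of work.
assert energy(8.0, 2.0) == 4 * energy(8.0, 1.0)
```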

Yao et al. [39] define a deadline-based scheduling problem. We are given a sequence \(\sigma = J_1, \ldots , J_n\) of jobs, where each job \(J_j\) is specified by a release time \(r_j\), a deadline \(d_j\) and a work volume \(w_j\). If a job \(J_j\) is processed at fixed speed s, then it takes \(w_j/s\) time units to complete the job. Preemption of jobs is allowed. The goal is to find a feasible schedule, respecting the deadline constraints, that minimizes the total energy consumption. For simplicity it is assumed that a processor can run at any speed; in particular, there are no upper and lower bounds on the speeds. Moreover, speed changes are instantaneous. Yao et al. [39] prove that the offline variant of the problem, where all jobs are known in advance, is polynomially solvable.

In the online variant of the problem, the jobs are revealed at their release times. At any time a scheduling algorithm has to make decisions without knowledge of any future jobs. Given a job sequence \(\sigma \), let \(A(\sigma )\) denote the energy consumed by A on \(\sigma \) and let \(OPT(\sigma )\) be the minimum energy consumption required for \(\sigma \). Online algorithm A is called c-competitive [38] if there exists a constant d such that \(A(\sigma ) \le c\cdot OPT(\sigma )+d\) holds for every job sequence \(\sigma \). The constant d must be independent of \(\sigma \). We remark that, for the results presented in this article, the stated competitive ratios hold without an additive constant. Yao et al. [39] devised two elegant online algorithms, called Average Rate and Optimal Available. They showed that Average Rate achieves a competitive ratio of \(\alpha ^\alpha 2^{\alpha -1}\), for any \(\alpha \ge 2\). Bansal et al. [21] analyzed Optimal Available and proved a competitive ratio of \(\alpha ^\alpha \).

Speed scaling on homogeneous parallel processors, considering again deadline-based scheduling, was studied in [6, 12, 23]. It is assumed that job migration is allowed, i.e. whenever a job is preempted, it may be moved to a different processor. Hence, over time, a job may be executed on various processors as long as the respective processing intervals do not overlap. Albers  et al.  [6] show that the offline problem can be solved optimally in polynomial time using a combinatorial algorithm. Furthermore they extend the algorithm Optimal Available and prove a competitiveness of \(\alpha ^\alpha \). An extension of Average Rate attains a competitive ratio of \(\alpha ^\alpha 2^{\alpha -1}+1\).

2.1 Speed Scaling on Heterogeneous Processors

In [7 SPP, 8 SPP] we present a comprehensive study of dynamic speed scaling in heterogeneous multi-processor environments. This is a very timely problem as data and computing centers typically host a variety of hardware architectures. Prior to our work, Bampis  et al.  [18] examined a setting where the power functions of all the processors are convex. For the offline problem they devise an algorithm that returns a solution within an additive \(\epsilon \) of the optimum and runs in time polynomial in the size of the instance and \(1/\epsilon \). Gupta  et al.  [29, 30] study speed scaling on heterogeneous platforms with the objective to minimize energy and the total flow time of jobs.

In [7 SPP, 8 SPP] we focus again on classical deadline-based scheduling and assume that m power-heterogeneous processors \(P_1, \ldots , P_m\) are given. Let \(f_p(s)\), \(1\le p \le m\), be the power function of processor \(P_p\), depending on speed s. We consider two classes of functions.

  1. General power functions: The function \(f_p(s)\) of each processor \(P_p\) is an arbitrary continuous and monotonically increasing function of s.

  2. Standard power functions: Each processor \(P_p\) has a power function of the form \(f_p(s)= s^{\alpha _p}\), where \(\alpha _p>1\) is a constant. Let \(\alpha = \max _{1\le p \le m} \alpha _p\).

We assume that job preemption and migration is allowed. In the following let \(t_1< t_2< \ldots< t_l < t_{l+1}\) be the sorted sequence of all possible different release times and deadlines of jobs. Let \(I_i = [t_i,t_{i+1})\), for \(i=1,\ldots , l\).

2.2 The Offline Problem with General Power Functions

In a first step we develop an algorithm for the offline problem that is based on linear programming and applies to a wide family of continuous power functions. Our linear program (LP) formulation is more compact than the configuration LP proposed in [18]. The latter contains an exponential number of variables and requires the use of the Ellipsoid method, which may not be very efficient in practice. Moreover, the formulation in [18] can be solved only for convex functions.

In order to define our LP, let \(s_{LB}\) and \(s_{UB}\) be a lower bound and an upper bound on the speed of any processor in an optimal schedule. We could choose \(s_{LB} = w_{\min }/\sum _i|I_i|\) and \(s_{UB} = \sum _j w_{j}/\min _i|I_i|\). Given any constant \(\epsilon >0\), we geometrically discretize the interval \([s_{LB},s_{UB}]\) and define the set of discrete speeds

$$D = \{s_{LB}, s_{LB}(1+\epsilon ), s_{LB}(1+\epsilon )^2,\ldots ,s_{LB}(1+\epsilon )^k\},$$

where \(k = \min \{i \mid s_{LB}(1+\epsilon )^i\ge s_{UB}\}\). This set contains \(O({1\over \epsilon }\log ({s_{UB} \over s_{LB}}))\) speed levels.

We consider the wide class of continuous power functions satisfying the following invariant. For any small constant \(\epsilon >0\), there exists a small value \(\epsilon '>0\) such that \(f((1+\epsilon )s)\le (1+\epsilon ')f(s)\) holds for any speed \(s\in [s_{LB},s_{UB}]\). Intuitively, a small increase in the speed does not increase the power function by too much. In the case of standard power functions we have that \(\epsilon ' = (1+\epsilon )^\alpha -1\). Hence \(\epsilon '\) may depend on \(\epsilon \) and the power function; it is not necessarily smaller than 1. We first show that there exists a \((1+\epsilon ')\)-approximate schedule such that, at any time, every processor uses a speed level that belongs to D.
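For standard power functions this relationship can be checked directly. A small sketch (with hypothetical bounds \(s_{LB}=1\), \(s_{UB}=10\)) constructs the discretized speed set D and verifies the invariant with \(\epsilon ' = (1+\epsilon )^\alpha -1\):

```python
import math

def speed_levels(s_lb: float, s_ub: float, eps: float) -> list:
    """Geometric discretization of [s_lb, s_ub] with ratio (1 + eps).
       k is the smallest i with s_lb * (1 + eps)**i >= s_ub."""
    k = math.ceil(math.log(s_ub / s_lb, 1 + eps))
    return [s_lb * (1 + eps) ** i for i in range(k + 1)]

D = speed_levels(1.0, 10.0, 0.5)
assert D[0] == 1.0 and D[-1] >= 10.0

# For a standard power function f(s) = s**alpha, the invariant
# f((1 + eps) * s) <= (1 + eps') * f(s) holds with equality for
# eps' = (1 + eps)**alpha - 1:
alpha, eps, s = 3.0, 0.5, 2.0
eps_prime = (1 + eps) ** alpha - 1
assert abs(((1 + eps) * s) ** alpha - (1 + eps_prime) * s ** alpha) < 1e-9
```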

For the definition of our LP, for each interval \(I_i\) and each job \(J_j\) such that \(I_i\subseteq [r_j,d_j)\), we introduce a variable \(x_{i,j,p,s}\), which corresponds to the total amount of time that \(J_j\) is processed during \(I_i\) on processor \(P_p\) using speed s.

$$\begin{aligned} \min \quad & \sum _{i,j,p,s} x_{i,j,p,s}\, f_p(s) \\ \mathrm {s.t.} \quad & \sum _{i,p,s} x_{i,j,p,s}\, s \ge w_j \quad \forall j \\ & \sum _{p,s} x_{i,j,p,s} \le |I_i| \quad \forall i,j \\ & \sum _{j,s} x_{i,j,p,s} \le |I_i| \quad \forall i,p \\ & x_{i,j,p,s} \ge 0 \quad \forall i,j,p,s \end{aligned}$$

A solution to the above LP specifies an operation of job \(J_j\) on processor \(P_p\) with processing time \(\sum _s x_{i,j,p,s}\) during interval \(I_i\). Hence, for each \(I_i\), we obtain an instance of the preemptive open shop problem, which can be solved in polynomial time using the algorithm by Gonzalez and Sahni [28].
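To make the LP concrete, the following sketch solves a deliberately tiny instance: one job with work 2 in one interval of length 4, one processor with \(f(s)=s^2\), and two discretized speeds \(\{1,2\}\). SciPy's generic LP solver stands in here; the instance and all numbers are hypothetical. The LP picks the slowest speed that still completes the work in time:

```python
from scipy.optimize import linprog

speeds = [1.0, 2.0]                 # discretized speed set D
cost = [s ** 2 for s in speeds]     # f(s) = s**2, energy per unit of time
w, interval_len = 2.0, 4.0

# min sum_s x_s * f(s)   s.t.  sum_s x_s * s >= w,  sum_s x_s <= |I|,  x >= 0
res = linprog(c=cost,
              A_ub=[[-s for s in speeds],   # -(sum x_s * s) <= -w
                    [1.0] * len(speeds)],   # sum x_s <= |I|
              b_ub=[-w, interval_len],
              bounds=[(0, None)] * len(speeds))
assert res.status == 0
# Optimal: run 2 time units at speed 1, total energy 2.
assert abs(res.fun - 2.0) < 1e-6
```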

Theorem 1

There exists an algorithm that produces a \((1+\epsilon ')\)-approximate schedule in \(O(\mathrm{poly}(n,m,{1\over \epsilon }, \log ({s_{UB} \over s_{LB}})))\) time.

2.3 The Offline Problem with Standard Power Functions

In this section we focus on standard power functions \(f_p(s)= s^{\alpha _p}\), \(1\le p \le m\). Such functions were considered by Yao  et al.  [39]. In fact, most of the literature on dynamic speed scaling focuses on this family of functions. As a main result in [7 SPP, 8 SPP] we prove that the offline problem can be solved in polynomial time using a fully combinatorial algorithm that is based on repeated maximum flow computations. In a first step we show that there exists an optimal schedule that exhibits four specific properties. These properties will be essential in the design of our algorithm.

First we demonstrate that, for any job \(J_j\), \(1\le j \le n\), the processor speeds at which the job is executed are related through the derivatives of the power functions. More specifically, if \(J_j\) is partially executed by processors \(P_p\) and \(P_q\) with speeds \(s_{j,p}\) and \(s_{j,q}\), respectively, then \(f'_p(s_{j,p})= f'_q(s_{j,q})\). This follows from the convexity of the power functions when analyzing the energy consumed by \(J_j\) on processors \(P_p\) and \(P_q\). Therefore, for any job \(J_j\), we may define \(Q_j = f'_p(s_{j,p})\), the hypopower of \(J_j\), independently of the processor \(P_p\) executing it.

  • Property 1: Each job \(J_j\) is executed with constant hypopower \(Q_j\).

The next property implies that, at any time, the available jobs with the greatest hypopower are executed.

  • Property 2: For any pair of jobs \(J_j,J_k\) and \(t\in [r_j,d_j)\cap [r_k,d_k)\) such that \(J_j\) is executed at time t and \(J_k\) is not executed at t, it holds that \(Q_j\ge Q_k\).

We assume that the density \(\delta _j := w_j/(d_j-r_j)\) of each job \(J_j\) satisfies \(\delta _j \ge \max _{p,q} (\alpha _p/\alpha _q)^{1/(\alpha _q-1)}\). Observe that \(\delta _j\) is equal to the minimum average speed necessary to complete \(J_j\) if no other jobs were present. With the assumption on the job densities we can then show that in an optimal schedule, for each job \(J_j\) and processor \(P_p\), the speed \(s_{j,p}\) is at least 1. This allows us to define an order on the processors. We number the processors \(P_1,\ldots , P_m\) such that, for any \(s\ge 1\), it holds that \(f_1(s)\le \ldots \le f_m(s)\). This implies \(\alpha _1\le \ldots \le \alpha _m\) and \(f'_1(s)\le \ldots \le f'_m(s)\). We say that \(P_p\) is cheaper than \(P_q\) if \(p<q\). The next property states that cheap processors execute, in general, jobs with greater hypopower, compared to expensive processors.

  • Property 3: Let I be an interval and \(J_j,J_k\) be any pair of jobs executed by processors \(P_p\) and \(P_q\) during I, respectively. If \(p<q\), then \(Q_j \ge Q_k\).

The final property states that at each time the cheapest processors are occupied.

  • Property 4: For each interval \(I_i\), there exists an \(m_i\) with \(0\le m_i\le m\) such that \(P_1, \ldots , P_{m_i}\) are occupied throughout \(I_i\) while \(P_{m_i+1}, \ldots , P_m\) are idle.

We proceed with the description of our algorithm. To this end we define problem instances specified by triples \(({\mathbb J},{\mathbb P},{\mathbb I})\). Here \({\mathbb J}\) is a set of jobs, \({\mathbb P}\) is a set of processors and \({\mathbb I}\) is a set of disjoint intervals. Initially, \({\mathbb J}=\{J_1,\ldots ,J_n\}\), \({\mathbb P}=\{P_1,\ldots ,P_m\}\) and \({\mathbb I}=\{I_1,\ldots ,I_l\}\). In general, during each \(I_i\in {\mathbb I}\), there is a subset \({\mathbb J}(I_i) \subseteq {\mathbb J}\) of alive jobs \(J_j\) with \(I_i\subseteq [r_j,d_j)\) and a subset \({\mathbb P}(I_i) \subseteq {\mathbb P}\) of available processors that are unused throughout \(I_i\). Let \(n_i = |{\mathbb J}(I_i)|\) and \(a_i = |{\mathbb P}(I_i)|\).

Let \(S^*\) be an optimal schedule satisfying Properties 1–4. Consider any interval \(I_i\in {\mathbb I}\). For \(S^*\), the value \(m_i\) in Property 4 satisfies \(m_i = \min \{n_i,a_i\}\) because the number of used processors can exceed neither the number of available processors nor the number of alive jobs. This equation specifies the exact amount of time, say \(t_p\), that a processor \(P_p \in \mathbb {P}\) is used in \(S^*\), as well as the corresponding intervals. The most energy-efficient, though not necessarily feasible, way to schedule the jobs in \({\mathbb J}\) is to use the same constant hypopower Q satisfying

$$\sum _{p\in {\mathbb P}} t_p \left( {Q\over \alpha _p}\right) ^{1\over \alpha _p-1}= \sum _{J_j\in {\mathbb J}} w_j.$$

We assume for simplicity that the value of Q satisfying the above equation can be computed with arbitrary precision.
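Since the left-hand side of the equation above is continuous and strictly increasing in Q, its value can be approximated by bisection to any desired precision. A sketch, with hypothetical function and instance names:

```python
def find_Q(t, alpha, total_work, lo=1e-9, hi=1e9, iters=200):
    """Bisect for the hypopower Q satisfying
       sum_p t[p] * (Q / alpha[p]) ** (1 / (alpha[p] - 1)) == total_work.
       Running processor P_p with hypopower Q means speed
       (Q / alpha[p]) ** (1 / (alpha[p] - 1)), so the left-hand side is
       the total work done, which is increasing in Q."""
    def work(Q):
        return sum(tp * (Q / ap) ** (1.0 / (ap - 1.0))
                   for tp, ap in zip(t, alpha))
    for _ in range(iters):
        mid = (lo + hi) / 2
        if work(mid) < total_work:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# One processor with alpha = 3, used for 2 time units, total work 4:
# its speed (Q/3)**(1/2) must equal 2, hence Q = 12.
assert abs(find_Q([2.0], [3.0], 4.0) - 12.0) < 1e-6
```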

If there is a feasible schedule in which all jobs are executed with constant hypopower Q, then this schedule is optimal and we are done. As we will explain below, this feasibility problem and the calculation of the corresponding schedule can be solved using a maximum flow computation. If such a feasible schedule does not exist, then \(({\mathbb J},{\mathbb P},{\mathbb I})\) can be partitioned into two independent subproblems \(({\mathbb J}_{\ge Q},{\mathbb P}_{\ge Q},{\mathbb I})\) and \(({\mathbb J}_{<Q},{\mathbb P}_{<Q},{\mathbb I})\). Here \({\mathbb J}_{\ge Q}\) and \({\mathbb J}_{<Q}\) are the subsets of \({\mathbb J}\) that are executed with hypopower at least Q and smaller than Q, respectively, in the optimal schedule \(S^*\). In each interval \(I_i\in {\mathbb I}\), Properties 2 and 3 specify the subsets of available processors \({\mathbb P}_{\ge Q}(I_i),{\mathbb P}_{<Q}(I_i)\subseteq {\mathbb P}\) dedicated to the jobs of \({\mathbb J}_{\ge Q}\) and \({\mathbb J}_{< Q}\) that are alive during \(I_i\). The jobs of \({\mathbb J}_{\ge Q}\) occupy the cheapest \(\min \{a_i, |{\mathbb J}_{\ge Q}(I_i)|\}\) processors during \(I_i\), while the jobs of \({\mathbb J}_{< Q}\) use the remaining processors of \({\mathbb P}(I_i)\).

The feasibility of \(({\mathbb J},{\mathbb P},{\mathbb I})\) w.r.t. the hypopower Q is based on a maximum flow computation in an appropriate network \(N({\mathbb J},{\mathbb P},{\mathbb I},Q)\). Consider an interval \(I_i\in {\mathbb I}\) and a processor \(P_p\in {\mathbb P}(I_i)\). If \(P_p\) runs with hypopower Q in \(I_i\), then its speed is \(s_{i,p} = (Q/\alpha _p)^{1/(\alpha _p-1)}\). We slightly abuse notation and let \(s_{i,p}\) be the speed of the p-th cheapest available processor during \(I_i\) and \({\mathbb P}(I_i)\) be the set of the \(m_i\) cheapest available processors during \(I_i\).

In the network, there is a source node \(u_0\), a node \(u_j\) for each \(J_j\in {\mathbb J}\), a node \(v_{i,p}\) for each pair of interval \(I_i\in {\mathbb I}\) and processor \(P_p\in {\mathbb P}(I_i)\), a node \(v_i\) for each interval \(I_i\in {\mathbb I}\), and a sink node \(v_0\). The network contains the arc \((u_0,u_j)\) with capacity \(w_j\) for each job \(J_j\in {\mathbb J}\), the arc \((u_j,v_{i,p})\) with capacity \((s_{i,p}-s_{i,p+1})|I_i|\) for each interval \(I_i\), job \(J_j\in {\mathbb J}(I_i)\) and processor \(P_p\in {\mathbb P}(I_i)\), the arc \((v_{i,p},v_i)\) with capacity \(p(s_{i,p}-s_{i,p+1})|I_i|\) for each interval \(I_i \in {\mathbb I}\) and processor \(P_p \in {\mathbb P}(I_i)\) as well as the arc \((v_i,v_0)\) with infinite capacity for each \(I_i\in {\mathbb I}\). We set \(s_{i,m+1} := 0\). The network is depicted in Fig. 1; a similar network was introduced by Federgruen and Groenevelt [25].

Fig. 1. The flow network
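The feasibility test itself only requires a standard maximum-flow routine: all work can be scheduled with hypopower Q if and only if the maximum flow saturates every source arc, i.e., equals \(\sum _j w_j\). The sketch below checks this for a hypothetical instance with one interval of length 1 and two available processors (speeds 3 and 1); a textbook Edmonds-Karp implementation stands in for a library routine:

```python
from collections import deque

def max_flow(n, edges, s, t):
    """Edmonds-Karp on an adjacency-matrix capacity graph.
       edges: list of (u, v, capacity)."""
    cap = [[0.0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
    flow = 0.0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return flow
        # Bottleneck along the path, then augment.
        b, v = float('inf'), t
        while v != s:
            b = min(b, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            cap[parent[v]][v] -= b
            cap[v][parent[v]] += b
            v = parent[v]
        flow += b

# Hypothetical instance: one interval |I| = 1, two processors with speeds
# s_1 = 3, s_2 = 1 (and s_3 := 0), jobs with work w = (2, 1).
# Nodes: 0 = u0 (source), 1-2 = jobs, 3-4 = (interval, p), 5 = interval, 6 = v0.
INF = float('inf')
edges = [(0, 1, 2.0), (0, 2, 1.0),   # (u0, u_j), capacity w_j
         (1, 3, 2.0), (1, 4, 1.0),   # (u_j, v_{i,p}), cap (s_p - s_{p+1})|I|
         (2, 3, 2.0), (2, 4, 1.0),
         (3, 5, 2.0), (4, 5, 2.0),   # (v_{i,p}, v_i), cap p(s_p - s_{p+1})|I|
         (5, 6, INF)]                # (v_i, v0), infinite capacity
# Feasible: the maximum flow equals the total work 2 + 1 = 3.
assert max_flow(7, edges, 0, 6) == 3.0
```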

If there does not exist a feasible schedule for \(({\mathbb J},{\mathbb P},{\mathbb I})\) with hypopower Q, then the partition into \(({\mathbb J}_{\ge Q},{\mathbb P}_{\ge Q},{\mathbb I})\) and \(({\mathbb J}_{<Q},{\mathbb P}_{<Q},{\mathbb I})\) is based on the following crucial property. Let \({\mathbb J}' \subseteq {\mathbb J}_{<Q}\) be any subset of jobs. A job \(J_j \in {\mathbb J}\setminus {\mathbb J}'\) belongs to \({\mathbb J}_{\ge Q}\) if and only if, in the network \(N({\mathbb J}\setminus {\mathbb J}',{\mathbb P},{\mathbb I},Q)\), there exists a minimum \((u_0,v_0)\)-cut that does not contain the arc \((u_0,u_j)\). This allows us to identify \({\mathbb J}_{\ge Q}\) and \({\mathbb J}_{<Q}\). The technical details are omitted here. In summary, Algorithm 1 gives a pseudocode description of our strategy. The following theorem states the main result.

Theorem 2

Algorithm 1 generates an optimal schedule and runs in polynomial time, namely \(O(n^4m)\).

Algorithm 1 (pseudocode)

2.4 An Online Algorithm

The online algorithm Average Rate (AVR), proposed by Yao  et al.  [39] for single-processor speed scaling with power function \(f(s) = s^\alpha \), works with the concept of job densities. Again, the density \(\delta _j\) of job \(J_j\) is equal to \(\delta _j = w_j/(d_j-r_j)\). Recall that this is the minimum average speed necessary to complete the job if no other jobs were present. At any time t, the processor speed s(t) is set to the accumulated density of active jobs, i.e. \(s(t) = \sum _{j:t\in [r_j,d_j)} \delta _j\). With this speed profile, available jobs are scheduled according to the Earliest Deadline First policy.

In order to generalize AVR to the multi-processor setting, we consider a variation of the above single-processor algorithm, which uses the same processor speed at any time but applies a different job selection rule. Assume w.l.o.g. that all release times and deadlines are integers. Moreover, assume that \(r_{\min } = \min _{1\le j \le n} r_j = 0\) and \(d_{\max } = \max _{1\le j\le n} d_j = T\). We partition the time horizon into unit-length intervals \(I_t = [t,t+1)\), \(0\le t <T\). For each job \(J_j\) with \(I_t \subseteq [r_j,d_j)\), the algorithm assigns a work volume of \(\delta _j\) to interval \(I_t\). Then it produces an arbitrary schedule of the total work assigned to \(I_t\) using a fixed speed of \(s(t) = \sum _{j:I_t\subseteq [r_j,d_j)} \delta _j\) during the whole \(I_t\). This modified algorithm attains the same competitive ratio as the original algorithm AVR because both strategies always employ the same speed and consume the same energy.
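The speed profile of the modified algorithm is straightforward to state in code. The following sketch (the helper name and instance are hypothetical) computes the common AVR speed over the unit intervals:

```python
# Sketch of the (modified) AVR speed profile: in every unit interval
# I_t = [t, t+1), the speed is the accumulated density of the alive jobs.
def avr_speeds(jobs, horizon):
    """jobs: list of (release, deadline, work) with integer times.
       Returns the speed s(t) for t = 0, ..., horizon - 1."""
    speeds = []
    for t in range(horizon):
        # density w / (d - r) of every job alive during [t, t+1)
        s = sum(w / (d - r) for (r, d, w) in jobs if r <= t < d)
        speeds.append(s)
    return speeds

# Two jobs: J1 = (0, 2, 2) with density 1, J2 = (1, 3, 4) with density 2.
assert avr_speeds([(0, 2, 2.0), (1, 3, 4.0)], 3) == [1.0, 3.0, 2.0]
```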

Next we turn our attention to the setting with multiple heterogeneous processors. Based on the above algorithm variation, we say that a schedule S is an AVR-schedule if, for every job \(J_j\) and interval \(I_t\subseteq [r_j,d_j)\), the total amount of work of \(J_j\) executed during \(I_t\) on all the processors in S is equal to \(\delta _j\). We prove that, for each input sequence \(\sigma = J_1,\ldots , J_n\), there exists a feasible AVR-schedule \(S_{ AVR }\) on heterogeneous processors with general power functions, as described in Sect. 2.2, whose energy consumption is at most \(\max _p c_p +1\) times that of the optimum schedule for \(\sigma \). Here \(c_p\) is the competitive ratio of the single-processor AVR algorithm when executed on processor \(P_p\) with power function \(f_p(s)\).

We are ready to describe our algorithm H-AVR for heterogeneous processors. The main idea is to generate a \((1+\epsilon )\)-approximate AVR-schedule using the LP-algorithm described in Sect. 2.2. More specifically, given the assignment of work into intervals implied by the definition of AVR-schedules, for each interval \(I_t=[t,t+1)\) we compute an offline \((1+\epsilon )\)-approximate schedule for this subinstance of the heterogeneous speed-scaling problem.

Theorem 3

H-AVR is \((1+\epsilon )(\max _p c_p +1)\)-competitive for speed scaling with heterogeneous processors, where \(c_p\) is the competitiveness of the single-processor AVR algorithm when applied to processor \(P_p\) with general power function \(f_p(s)\).

Corollary 1

H-AVR is \((1+\epsilon )(\alpha ^\alpha 2^{\alpha -1} +1)\)-competitive for speed scaling with heterogeneous processors having standard power functions.

2.5 Further Results

We briefly review work by postdoctoral scientists when they were funded within our project. Article [19] explores dynamic speed scaling, assuming that job preemptions are not allowed. In some applications it might not be feasible, or might be too expensive, to interrupt and later resume the execution of a job. For the setting with a single processor, we develop a polynomial time algorithm achieving an improved approximation guarantee of \((1+\epsilon )^\alpha B_\alpha \), where \(B_\alpha \) is a generalization of the Bell number [19]. For multi-processor environments we develop the first approximation algorithm for the fully power-heterogeneous setting, where each processor \(P_p\) has an individual power function \(f_p(s) = s^{\alpha _p}\). The performance factor is equal to \(B_\alpha ((1+\epsilon )(1+w_{\max }/w_{\min }))^\alpha \). Here \(w_{\max }\) and \(w_{\min }\) are the maximum and minimum work volumes of the jobs. Again \(\alpha = \max _{1\le p\le m} \alpha _p\).

In [11] we examine the scenario where jobs must be executed subject to an energy budget. The goal is to maximize the throughput. As a main result we develop polynomial time algorithms based on dynamic programming. In [26] we introduce the new problem of scheduling jobs over scenarios. In [27] we study a dynamic market scheduling problem where an intermediary interacts with an unknown sequence of agents.

3 Power-Down Mechanisms in Data Centers

Power-down strategies for a single device have been investigated by Irani  et al.  [33] and Augustine  et al.  [17]. The goal is to minimize the energy consumed in an idle period when the device is not in use. In our work we focus on power-down mechanisms in massively parallel systems and, in particular, data centers.

Energy management is a key issue in data center operations [24]. Electricity costs are a dominant and rapidly growing expense in such centers; about 30–50% of their budget is invested into energy. Surprisingly, the servers of a data center are only utilized 20–40% of the time on average [16, 22]. Even when idle, servers in the active mode consume about half of their peak power. Hence a fruitful approach for energy conservation and capacity management is to transition idle servers into standby and sleep states. Servers have a number of low-power states [1]. However, state transitions, and in particular power-up operations, incur energy/cost. Therefore, dynamically matching the varying demand for computing capacity with the number of active servers is a challenging problem.

3.1 Heterogeneous Servers

In [4 SPP, 5 SPP] we formulate and study an optimization problem that arises in the energy management of data centers, hosting a large number of heterogeneous servers. Each server has an active state and several standby/sleep states with individual power consumption rates. The demand for computing capacity varies over time. Idle servers may be transitioned to low-power modes so as to rightsize the pool of active servers. The goal is to find a state transition schedule for the servers that minimizes the total energy consumed. On a small scale the same problem arises in multi-core architectures with heterogeneous processors on a chip. One has to determine active and idle periods for the cores so as to minimize the consumed energy.

More formally, we define the optimization problem Dynamic Power Management (DPM). A problem instance \(I = ({\mathbb S},{\mathbb D})\) is specified by a set of servers and varying computing demands over a time horizon. Let \({\mathbb S} = \{S_1,\ldots , S_m\}\) be a set of heterogeneous servers. Each server \(S_i\), \(1\le i \le m\), has an active state as well as one or several standby/sleep states. The states of \(S_i\) are denoted by \(s_{i,0}, \ldots , s_{i,\sigma _i}\). Here \(s_{i,0}\) is the active state and \(s_{i,1}, \ldots , s_{i,\sigma _i}\) are the low-power states. The modes have individual power consumption rates. Let \(r_{i,j}\) be the power consumption rate of \(s_{i,j}\), i.e., \(r_{i,j}\) energy units are consumed per time unit while \(S_i\) resides in \(s_{i,j}\). The states are numbered in order of decreasing rates such that \(r_{i,0}> \ldots > r_{i,\sigma _i}\ge 0\). A server can transition between its states. Let \(\varDelta _{i,j,j'}\) be the non-negative energy needed to move \(S_i\) from state \(s_{i,j}\) to state \(s_{i,j'}\), for any pair \(0\le j,j' \le \sigma _i\). The transition energies satisfy the triangle inequality, i.e., the energy to move directly from \(s_{i,j}\) to \(s_{i,j'}\) is upper bounded by that of visiting an intermediate state \(s_{i,k}\). Formally, \(\varDelta _{i,j,j'} \le \varDelta _{i,j,k} +\varDelta _{i,k,j'}\).

Over a time horizon the computing demands are given by a demand profile \({\mathbb D}=(T,D)\). Tuple \(T=(t_1,\ldots , t_n)\) contains the points in time when the computing demands change. We have \(t_1< t_2< \ldots < t_n\), so that the time horizon is \([t_1,t_n)\). Tuple \(D=(d_1, \ldots , d_{n-1})\) specifies the demands. More precisely, \(d_k\in \mathbb {N}_0\) servers are required for computing during interval \([t_k,t_{k+1})\), for any \(1\le k \le n-1\). Thus at least \(d_k\) servers must reside in the active state during \([t_k,t_{k+1})\). We have \(d_k \le m\), for any \(1\le k \le n-1\), so that the requirements can be met.

Given \(I = ({\mathbb S},{\mathbb D})\), a schedule \(\varSigma \) specifies, for each \(S_i\) and any \(t\in [t_1,t_n)\), in which state server \(S_i\) resides at time t. Schedule \(\varSigma \) is feasible if during any interval \([t_k,t_{k+1})\) at least \(d_k\) servers are in the active state, \(1\le k \le n-1\). The energy \(E(\varSigma )\) incurred by \(\varSigma \) is the total energy consumed by all the m servers. Whenever server \(S_i\), \(1\le i \le m\), resides in state \(s_{i,j}\) it consumes energy at a rate of \(r_{i,j}\). Whenever the server transitions from state \(s_{i,j}\) to state \(s_{i,j'}\), the incurred energy is \(\varDelta _{i,j,j'}\). The goal is to find an optimal schedule, i.e., a feasible schedule \(\varSigma \) that minimizes \(E(\varSigma )\). We assume that initially, immediately before \(t_1\), and at time \(t_n\) all servers reside in the deepest sleep state, i.e. \(S_i\) is in \(s_{i,\sigma _i}\), \(1\le i \le m\).
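The objective \(E(\varSigma )\) can be evaluated directly from this definition. A sketch for a single server (hypothetical helper; for simplicity it assumes that state transitions occur only at the times of T, a restriction that is justified later in the text):

```python
# Energy of one server's schedule: residence energy at rate r_{i,j} plus
# transition energies, starting and ending in the deepest sleep state.
def server_energy(times, states, rates, delta):
    """times: t_1 < ... < t_n; states[k] = state index during [t_k, t_{k+1});
       rates[j] = power rate of state j (decreasing in j);
       delta[j][j2] = transition energy from state j to state j2."""
    deepest = len(rates) - 1
    e, prev = 0.0, deepest
    for k, st in enumerate(states):
        e += delta[prev][st]                        # transition energy
        e += rates[st] * (times[k + 1] - times[k])  # residence energy
        prev = st
    return e + delta[prev][deepest]                 # back to deepest sleep

# Two states: active (rate 2) and sleep (rate 0); power-up costs 5,
# power-down is free. Active in [0, 3), asleep in [3, 4):
rates = [2.0, 0.0]
delta = [[0.0, 0.0], [5.0, 0.0]]
assert server_energy([0, 3, 4], [0, 1], rates, delta) == 11.0
```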

In DPM the demand for computing capacity is specified by the number of servers needed at any time. In data centers it is common practice that the number of required servers is determined as a function of the current total workload, ignoring specific jobs. DPM focuses on energy conservation instead of individual job placement. Again, in the active state, a processor has a fixed energy consumption rate. We investigate DPM as an offline problem, i.e. the varying computing demands are known in advance. From an algorithmic point of view it is important to explore the tractability and approximability of the problem. The offline setting is also relevant in practice. Data centers usually analyze past workload traces to identify long-term patterns. The findings are used to specify demands in future time windows.

Given a problem instance I, we first characterize optimal solutions. Property 1 below implies that there exists an optimal schedule in which a server never changes state while being in low-power mode. Property 2 states that there exists an optimal schedule executing state transitions only when the computing demands change. A server powers up if it transitions from a low-power state to the active state (indexed 0). A server powers down if it moves from the active state to a low-power state.

  • Property 1: There exists an optimal schedule with the following property. Suppose that \(S_i\) powers down at time t and next powers up at time \(t'\). Then between t and \(t'\) \(S_i\) resides in a single state \(s_{i,j}\), where \(j>0\).

  • Property 2: There exists an optimal schedule that satisfies Property 1 and performs state transitions only at the times of T.

Finally we may assume w.l.o.g. that the power-down energies \(\varDelta _{i,0,j}\) are equal to 0, for \(1\le i \le m\) and \(1\le j \le \sigma _i\). If this is not the case, we can simply fold the power-down energy \(\varDelta _{i,0,j}>0\) into the corresponding power-up energy \(\varDelta _{i,j,0}\).

3.2 Servers with Two States

In [4 SPP, 5 SPP] we first investigate the variant of DPM in which each server \(S_i\) has exactly two states, an active state \(s_{i,0}\) and a sleep state \(s_{i,1}\), \(1\le i\le m\). As a main result we show that an optimal schedule can be computed in polynomial time using an algorithm that resorts to a min-cost flow computation.

In a first step we argue that we may assume w.l.o.g. that the power consumption rates in the sleep states are equal to 0. If this is not the case and \(r_{i,1} >0\), for some i, then we can subtract \(r_{i,1}\) from both \(r_{i,0}\) and \(r_{i,1}\). This changes the energy consumption by a fixed amount of \(r_{i,1}(t_n-t_1)\) over the entire time horizon. To simplify notation let \(r_i:= r_{i,0}\) be the power consumption rate of \(S_i\) in the active state, \(1\le i \le m\). Moreover, let \(\varDelta _i := \varDelta _{i,1,0}\) be the energy needed to transition \(S_i\) from the sleep state to the active state.
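This normalization can be sanity-checked numerically: subtracting \(r_{i,1}\) from both rates shifts the energy of every schedule by the same constant, so optimal schedules are preserved. A small sketch with hypothetical rates:

```python
# Normalizing away a nonzero sleep-state rate for a two-state server:
# subtracting r_sleep from both rates changes every schedule's energy by
# exactly r_sleep * (t_n - t_1), independently of the schedule itself.
def normalize(r_active, r_sleep):
    return r_active - r_sleep, 0.0

r_active, r_sleep, horizon = 5.0, 1.0, 10.0
ra, rs = normalize(r_active, r_sleep)

# A schedule that is active for 4 time units and asleep for 6:
before = r_active * 4 + r_sleep * 6
after = ra * 4 + rs * 6
assert before - after == r_sleep * horizon
```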

In the following let \(I = ({\mathbb S},{\mathbb D})\) be a given problem instance. We develop an algorithm that computes an optimal schedule. Based on Property 2, we focus on schedules that perform state transitions only at the times of T. Given I, our strategy constructs a flow network N(I) that we describe in the next paragraphs.

Fig. 2. The component \({C_i}\) for server \({S_i}\)

Network Components. Network N(I) contains a component \(C_i\), for each server \(S_i\), \(1\le i \le m\). Such a component \(C_i\), which is depicted in Fig. 2, consists of an upper path and a lower path. The upper path represents the active state of \(S_i\); the lower path models the server’s sleep state. The computing demands change at the times \(t_1< \ldots < t_n\) in T. For any \(t_k\), \(1\le k \le n\), there is a vertex \(u_{i,k}\) on the upper path. Vertices \(u_{i,k}\) and \(u_{i,k+1}\) are connected by a directed edge \((u_{i,k},u_{i,k+1})\) of cost \(r_i(t_{k+1}-t_k)\), \(1\le k \le n-1\). This cost is equal to the energy consumed if \(S_i\) is in the active state during \([t_k,t_{k+1})\). Similarly, for any \(t_k\), \(1\le k \le n\), there is a vertex \(l_{i,k}\) on the lower path. In order to ensure that at least \(d_k\) servers are in the active state during \([t_k,t_{k+1})\), if \(k<n\), we need two auxiliary vertices \(l_{i,k}^a\) and \(l_{i,k}^b\). These vertices are again connected by directed edges. There is an edge \((l_{i,k},l^a_{i,k})\), followed by two edges \((l^a_{i,k},l^b_{i,k})\) and \((l^b_{i,k},l_{i,k+1})\), for any k with \(1\le k \le n-1\). The cost of each of these edges is 0 because the energy consumption in the sleep state is 0.

The lower and the upper paths are connected by additional edges that model state transitions. Recall that all servers are in the sleep state at times \(t_1\) and \(t_n\). For any k with \(1\le k \le n-1\), there is a directed edge \((l_{i,k},u_{i,k})\) of cost \(\varDelta _i\), representing a power-up operation of \(S_i\) at time \(t_k\). For any k with \(1<k\le n\), there is a directed edge \((u_{i,k},l_{i,k})\) of cost 0, modeling a power-down operation of \(S_i\) at time \(t_k\). The capacity of each edge of \(C_i\) is equal to 1.

The Entire Network. In N(I) components \(C_1,\ldots ,C_m\) are aligned in parallel and connected to a source \(a_0\) and a sink \(b_0\). The general structure of N(I) is depicted in Fig. 3. There is a directed edge from \(a_0\) to \(l_{i,1}\) in \(C_i\), for any \(1\le i \le m\). Furthermore, there is a directed edge from \(l_{i,n}\) to \(b_0\), for any \(1\le i\le m\). Each of these edges has a cost of 0 and a capacity of 1. Vertex \(a_0\) has a supply of m, and \(b_0\) has a demand of m. Hence m units of flow must be shipped through \(C_1, \ldots , C_m\). Since all edges have a capacity of 1, one unit of flow must be routed through each \(C_i\), \(1\le i \le m\). Whenever the unit traverses the upper path, \(S_i\) is in the active state. Whenever the unit traverses the lower path, \(S_i\) is in the sleep state.

In order to ensure that at least \(d_k\) servers are in the active state during \([t_k,t_{k+1})\), \(1\le k \le n-1\), we introduce additional sources and sinks. Network N(I) has a source \(a_k\) and a sink \(b_k\) with supply/demand \(d_k\), for any \(1\le k \le n-1\). There is a directed edge from \(a_k\) to \(l^a_{i,k}\) on the lower path of each \(C_i\), \(1\le i \le m\). Furthermore, there is a directed edge from each \(l^b_{i,k}\) to \(b_k\), \(1\le i \le m\). The cost and capacity of each of these edges is equal to 0 and 1, respectively. Since \(d_k\) flow units have to be shipped from \(a_k\) to \(b_k\), there must exist at least \(d_k\) components \(C_i\) in which the flow unit from \(a_0\) to \(b_0\) traverses the upper path from \(u_{i,k}\) to \(u_{i,k+1}\). Hence the corresponding servers are in the active state during \([t_k,t_{k+1})\).

Fig. 3. The network N(I)

Obviously, any feasible schedule \(\varSigma \) in which state transitions are performed only at the times of T corresponds to a feasible flow of cost \(E(\varSigma )\) in N(I). Unfortunately, the reverse statement is not true. Since N(I) is a single-commodity flow network, a feasible flow f does not necessarily represent a feasible schedule: flow shipped out of a source \(a_k\) need not be routed to \(b_k\), \(0\le k \le n-1\). In particular, flow leaving \(a_k\) may be routed to a sink \(b_{k'}\), where \(k'>k\), or to \(b_0\). Observe that in N(I) all edge capacities and supplies/demands are integer values. Hence in N(I) there exists a minimum-cost flow that is integral, i.e., the flow along any edge takes an integer value. Moreover, there exist polynomial-time combinatorial algorithms that compute such an integral minimum-cost flow [2]. In [4 SPP, 5 SPP] we prove that any feasible integral flow f of cost C in N(I) can be transformed so that it corresponds to a feasible schedule \(\varSigma \) consuming energy C. More specifically, using (non-trivial) flow modification operations, we ensure that each network component \(C_i\) ships exactly one flow unit in each interval \([t_k,t_{k+1})\). The transformation takes a polynomial number of steps.

Theorem 4

Let I be an instance of DPM in which each server has exactly two states. An optimal schedule for I can be computed in polynomial time by a combinatorial algorithm that uses a minimum-cost flow computation.

3.3 Servers with Multiple States

In [4 SPP, 5 SPP] we also investigate DPM in the general scenario that each server has multiple sleep states. In this case DPM becomes NP-hard. We extend our approach based on flow computations to design an approximation algorithm. More specifically, we develop a second algorithm that works with a more complex network in which each component has several lower paths, representing the various low-power states of a server. Furthermore, we need a second commodity to ensure that computing demands are met. With only a single commodity, flow units could switch between lower paths at no cost, and infeasible schedules would result.

Given a fractional two-commodity minimum-cost flow, our algorithm executes advanced flow rounding and packing procedures. First, by repeatedly traversing components, the algorithm modifies the flow so that it becomes integral on the upper paths. Then the flow on the lower paths is packed. The final integral flow allows the construction of a schedule for DPM. Our algorithm achieves an approximation factor of \(\tau \), where \(\tau \) is the number of server types in the problem instance. Specifically, the servers can be partitioned into \(\tau \) classes such that, within each class, the servers are identical. Of course, the servers of a class are independent and not synchronized. In practice, a data center has a large collection of machines but a relatively small number of different server architectures.

Theorem 5

Let I be an instance of DPM with \(\tau \) server types. A schedule whose energy consumption is at most \(\tau \) times the minimum one for I can be computed in polynomial time based on a min-cost two-commodity flow computation.

3.4 Homogeneous Servers

In [9] we investigate another algorithmic problem with the objective of dynamically resizing a data center. Specifically, we resort to a framework that was introduced by Lin, Wierman, Andrew and Thereska [35, 37].

Consider a data center with m homogeneous servers, each of which has two states, an active state and a sleep state. An optimization is performed over a discrete, finite time horizon consisting of time steps \(t=1,\ldots , T\). At any time t, \(1\le t\le T\), a non-negative convex cost function \(f_t(\cdot )\) models the operating cost of the data center. More precisely, \(f_t(x_t)\) is the incurred cost if \(x_t\) servers are in the active state at time t, where \(0\le x_t\le m\). This operating cost captures, e.g., the energy cost and the service delay for the incoming workload, both of which depend on the number of active servers.

Furthermore, at any time t there is a switching cost, taking into account that the data center may be resized by changing the number of active servers. This switching cost is equal to \(\varDelta (x_t-x_{t-1})^+\), where \(\varDelta \) is a positive real constant and \((x)^+=\max (0,x)\). Again, we assume that transition costs are incurred when servers are powered up from the sleep state to the active state; a cost for powering down servers may be folded into this cost. The constant \(\varDelta \) incorporates, e.g., the energy needed to transition a server from the sleep state to the active state, as well as delays resulting from a migration of data and connections. We assume that at the beginning and at the end of the time horizon all servers are in the sleep state, i.e., \(x_0=x_{T+1}=0\). The goal is to determine a vector \(X=(x_1,\ldots ,x_T)\), called a schedule, specifying at any time the number of active servers, that minimizes

$$\begin{aligned} \sum _{t=1}^T f_t(x_t) + \varDelta \sum _{t=1}^T (x_t-x_{t-1})^+. \end{aligned}$$
(1)
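For concreteness, objective (1) can be evaluated directly for a candidate schedule; the cost functions and numbers below are made up:

```python
def total_cost(xs, fs, Delta):
    """Objective (1): sum of operating costs f_t(x_t) plus switching
    costs Delta * (x_t - x_{t-1})^+, with the convention x_0 = 0."""
    op = sum(f(x) for f, x in zip(fs, xs))
    switch = Delta * sum(max(0, x - xp) for xp, x in zip([0] + xs, xs))
    return op + switch

# e.g. T = 3 steps, operating cost f_t(x) = (l_t - x)^2 + x for a load l_t
loads = [2, 4, 1]
fs = [lambda x, l=l: (l - x) ** 2 + x for l in loads]
print(total_cost([1, 2, 0], fs, Delta=2))  # 9 operating + 4 switching = 13
```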

All previous work [13, 15, 20, 35,36,37] on the data-center optimization problem assumes that the server numbers \(x_t\), \(1\le t\le T\), may take fractional values. That is, \(x_t\) may be an arbitrary real number in the range [0, m]. From a practical point of view this is acceptable because a data center has a large number of machines. Nonetheless, from an algorithmic and optimization perspective, the proposed algorithms do not compute feasible solutions. Important questions arise if the \(x_t\) are indeed integer valued: (1) Can optimal solutions be computed in polynomial time? (2) What is the best competitive ratio achievable by online algorithms?

In [9] we present the first study of the above data-center optimization problem assuming that the \(x_t\) take integer values. In a first step we examine the offline variant of the problem, where the convex functions \(f_t\), \(1\le t \le T\), are known in advance. Lin  et al. [37] developed an algorithm based on a convex program that computes optimal solutions if fractional values \(x_t\) are allowed.

Fig. 4. Construction of the graph

Considering the discrete setting with integer-valued \(x_t\), we prove that optimal solutions can also be computed in polynomial time. Our algorithm differs from the convex optimization approach by Lin et al. [37]. More precisely, our strategy works with an underlying directed, weighted graph \(G=(V,E)\). Let \([k] := \{1, 2, \dots , k\}\) and \([k]_0 := \{0, 1, \dots , k\}\) with \(k \in \mathbb {N}\). For each \(t\in [T]\) and each \(j \in [m]_0\), there is a vertex \(v_{t,j}\), representing the state that exactly j servers are active at time t. Furthermore, there are two vertices \(v_{0,0}\) and \(v_{T+1,0}\) for the initial and final states \(x_0=0\) and \(x_{T+1}=0\). For each \(t\in \{2,\ldots ,T\}\) and each pair \(j,j'\in [m]_0\), there is a directed edge from \(v_{t-1,j}\) to \(v_{t,j'}\) of weight \(f_t(j')+\varDelta (j'-j)^+\). This weight is the operating cost incurred at time t plus the switching cost of changing the number of active servers from j to \(j'\) between times \(t-1\) and t; hence it equals the cost contribution of time t in the objective function, see (1). Similarly, for \(t=1\) and each \(j'\in [m]_0\), there is a directed edge from \(v_{0,0}\) to \(v_{1,j'}\) with weight \(f_1(j')+\varDelta (j')^+\). Finally, for \(t=T\) and each \(j\in [m]_0\), there is a directed edge from \(v_{T,j}\) to \(v_{T+1,0}\) of weight 0. The structure of G is depicted in Fig. 4. In the following, for each \(j\in [m]_0\), vertex set \(R_j =\{ v_{t,j} \mid t \in [T]\}\) is called row j.

A path from \(v_{0,0}\) to \(v_{T+1,0}\) represents a schedule. If the path visits \(v_{t,j}\), then \(x_t=j\) servers are active at time t. The total length (weight) of a path is equal to the cost of the corresponding schedule. An optimal schedule can therefore be determined using a shortest path computation, which takes time polynomial in T and m in the particular graph G. However, this running time is not polynomial in the input size because the encoding length of an input instance is linear in T and \(\log m\), in addition to the encoding of the functions \(f_t\). In [9] we present a polynomial time algorithm that improves an initial schedule iteratively using binary search. In each iteration the algorithm constructs and uses only a constant number of rows of G.
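A plain shortest-path dynamic program over the layers of G illustrates this graph view (it is not the faster binary-search algorithm of [9]; its running time is \(O(Tm^2)\), and the instance below is made up):

```python
def optimal_schedule(m, fs, Delta):
    """Shortest path v_{0,0} -> v_{T+1,0} in the layered graph, via DP.
    fs[t] is the convex operating-cost function f_{t+1}."""
    T = len(fs)
    INF = float("inf")
    cost = [0] + [INF] * m               # layer t = 0: only x_0 = 0 is reachable
    prev = []                            # prev[t][j'] = best x_t given x_{t+1} = j'
    for t in range(T):
        new, arg = [INF] * (m + 1), [0] * (m + 1)
        for j2 in range(m + 1):          # candidate x_{t+1} = j'
            for j in range(m + 1):       # candidate x_t = j
                # edge weight f_{t+1}(j') + Delta * (j' - j)^+
                c = cost[j] + Delta * max(0, j2 - j) + fs[t](j2)
                if c < new[j2]:
                    new[j2], arg[j2] = c, j
        cost = new
        prev.append(arg)
    xT = min(range(m + 1), key=lambda j: cost[j])  # final edge to v_{T+1,0} is free
    xs = [xT]
    for t in range(T - 1, 0, -1):        # backtrack x_{T-1}, ..., x_1
        xs.append(prev[t][xs[-1]])
    xs.reverse()
    return cost[xT], xs

# made-up workload: f_t(x) = (l_t - x)^2 + x for loads l_t
loads = [2, 4, 1]
fs = [lambda x, l=l: (l - x) ** 2 + x for l in loads]
print(optimal_schedule(4, fs, Delta=2))  # (13, [1, 2, 0])
```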

Theorem 6

An optimal schedule can be computed in polynomial time, namely in time \(O(T\log m)\).

In [9] we also examine the online variant of the data-center optimization problem, where the functions \(f_t\), \(1\le t \le T\), are revealed over time. We extend the algorithm Lazy Capacity Provisioning proposed by Lin et al. [37] and prove that it achieves a competitive ratio of 3. We also show that this is best possible: no deterministic online algorithm can attain a competitive ratio smaller than 3.