Introduction

A data center is a large-scale distributed system in which Information Technology (IT) and cooling subsystems form a complex and massive structure. It provides user-friendly, reliable, and flexible services for various Internet applications, and it has become one of the supporting technologies that push the IT industry forward. However, with the prevalence of cloud computing, the energy consumption hidden behind these convenient services has emerged as a problem, and the situation is becoming increasingly serious. A typical data center consumes as much energy per day as 25,000 households, which is 100 to 200 times more than an office occupying the same space [1]. In 2015, up to 2.3 billion kWh of electricity was consumed by the Microsoft data center located in the USA, making it the third largest energy consumer among US enterprises (https://news.microsoft.com/download/presskits/cloud/docs/CloudFS.docx). In addition, as energy-hungry systems, gigantic data centers not only consume tremendous amounts of energy but also give rise to a series of related problems, such as air pollution, water contamination, and soil destruction.

It has been widely accepted that a data center consists of a number of interacting subsystems, of which the most energy-hungry are the IT and cooling subsystems. Energy consumed by IT equipment is primarily used to process service requests and dissipates in the form of heat. The cooling facilities remove this heat to keep the IT equipment healthy, so the two subsystems are coupled via heat. A number of studies deal with energy minimization in data centers for each subsystem separately. For example, [2–6] studied IT energy consumption, while others [7–9] focused on optimizing cooling energy consumption. However, these studies concentrated on the energy consumption of a subset of system components and overlooked the interactions among them. To address this issue, joint optimization techniques [10–12], which aim to minimize the overall energy consumption, have been proposed.

Nevertheless, most existing approaches either neglected the temperature constraint or ensured server reliability by imposing a “hard” server temperature constraint. “Hard” means that the server temperature must be strictly lower than a predefined threshold at all times. In practice, however, IT equipment is more durable than commonly assumed. Zhang et al. [13] demonstrated that the performance and reliability of IT equipment are not greatly undermined at higher ambient temperatures. Pervila et al. [14] placed computing equipment in a harsh outdoor environment, and the results showed that servers still function well when the outside temperature fluctuates dramatically. Thus, the “hard” temperature constraint underestimates the failure tolerance of servers. In practice, the server temperature can occasionally exceed the preset threshold as long as the average temperature stays within a safe operating range. A “hard” temperature constraint may therefore be too pessimistic and induce unnecessary energy consumption.

In this work, we formulate a “soft” Server Temperature-Constrained Energy Minimization (STCEM) problem. Since the system runs in a stochastic environment, solving the STCEM problem is not an easy task. To address this issue, we leverage Lyapunov Optimization (LO) theory to design two approximation algorithms, i.e., linear and quadratic control policies, to obtain a near-optimal solution. More specifically, we introduce virtual queues for the average temperature constraints. By guaranteeing the mean rate stability of the virtual queues, the “soft” server temperature constraints are enforced [15]. To verify the efficiency of the proposed policies, we run extensive simulations with real-world workloads [16] and analyze the total and component energy efficiency in detail. In the process, several interacting metrics, including server temperature, supplied cold air temperature, the number of powered-on servers, and Power Usage Effectiveness (PUE), are examined through a series of comparison studies under the two control policies. Simulation results show that our proposed approach can effectively cut down data center energy consumption. In particular, the quadratic control policy outperforms the linear control policy in terms of both energy consumption and average server temperature.

Related works

Typically, the dominant energy consumers in a data center are the IT subsystem and the cooling subsystem, which account for 56% and 30% of total energy, respectively [17]. In this section, we review state-of-the-art IT and thermal management techniques for data center energy minimization. Since the IT and cooling systems interact via heat, we also discuss the thermal model and the supplied air temperature in the data center.

IT subsystem management

To meet the Quality of Service (QoS) required by cloud users under fluctuating IT workloads, data center operators usually over-provision computing resources according to the peak workload. As a result, IT resources are severely underutilized (20−30% [18] in typical data centers). Moreover, server energy consumption is not proportional to utilization: an idle server consumes about 60% of the energy of a fully utilized one [19]. Controlling the sleep/active state of servers has proved to be an effective way to save IT energy. Meisner et al. [19] presented an energy conservation approach named PowerNap, which switches servers between active and sleep modes to follow the fluctuating workload. However, when the actual idle interval is shorter than the wake-up latency, frequent switching between active and sleep modes may be counterproductive for energy saving. To address this issue, Duan et al. [20] proposed a prediction scheme that dynamically estimates the length of the CPU idle interval and thereby selects the most cost-efficient operation mode to cut down idle energy.

Another approach to improving the energy efficiency of the IT subsystem is Dynamic Voltage and Frequency Scaling (DVFS), which reduces energy consumption by dynamically scaling the supply voltage and adjusting the CPU frequency according to the CPU workload. Many existing studies have advanced energy-efficient scheduling by leveraging DVFS. To optimize energy efficiency, some studies focused on calculating the optimal CPU frequency and controlling the supply voltage, e.g., [21–23], in which several scheduling algorithms are designed and evaluated to find the optimal frequency. Moreover, combining Virtual Machine (VM) migration and consolidation techniques with DVFS management is a natural extension. For example, Takouna and Meinel [24] made use of the memory DVFS mechanism for VM consolidation to cut down energy consumption. Arroba et al. [25] proposed a DVFS-aware consolidation policy that considers the processor frequency when workload is dynamically allocated to servers. It is demonstrated that joint DVFS and workload management saves up to 39.14% of energy.

Cooling subsystem management

According to the heat exchange medium, cooling approaches for data centers can be classified into three categories [26].

  • Air cooling. Computer Room Air Conditioners (CRACs) supply cold air to the servers; the air absorbs waste heat and carries it to the outside environment.

  • Liquid cooling. Heat-generating components are surrounded by closed loops filled with liquid, which carries the heat away by heat transfer. Thanks to the higher specific heat of liquid, liquid cooling is more efficient than air cooling.

  • Immersion cooling. Computing equipment is encapsulated in containers, which are immersed in an electrically non-conductive but thermally conductive liquid. Immersion cooling saves up to 99% energy with respect to the traditional data center.

Owing to its simplicity and low implementation cost, air cooling is the most common approach in industry. Cooling energy is primarily consumed by chillers, CRACs, pumps, cooling towers, and fans, which compose a “standard” cooling subsystem. A number of works concentrate on energy-minimization techniques for air cooling systems. For instance, Iyengar and Schmid [8] modeled the energy consumption of every component based on physics. They also formulated a holistic energy model of the HVAC subsystem by summing the energy of all components within a cooling system. Their models are capable of estimating energy consumption and heat transfer. Some experiments [27–29] leveraged active tiles to balance the cooling airflow between under-provisioning and over-provisioning. With the assistance of active tile fans, they tried to provide exactly the airflow volume required by servers. Taking full advantage of local climate resources, some studies concentrated on saving cooling energy with air-side economizers [30, 31]. Khalaj et al. [32] conducted cooling energy simulations for nine different air economizers and carried out measurements at 23 locations in Australia. The measurement data indicated that about 85% of the cooling energy is saved on average, and the average PUE drops from 1.42 to 1.22.

Thermal model

A typical configuration of a data center with a raised floor and perforated tiles is illustrated in Fig. 1 [33]. Early implementations of this structure did not seal the cold aisle, so the inlet airflow temperatures of server racks varied drastically from site to site due to the recirculation of hot air. Yeo et al. [34] showed that the difference in server inlet airflow temperature between locations in an open aisle data center can be as high as around 20 °C when the supplied cold air temperature is 15 °C. To model the recirculation behavior, researchers proposed a thermal network approach, in which the inlet airflow temperature is a linear function of the supplied cold airflow temperature and the rack outlet temperature [35, 36]. The coefficients of the thermal network model can be obtained via field measurements or Computational Fluid Dynamics (CFD) simulations. To avoid the inefficiency of open aisles, later studies use cold aisle containment to mitigate the recirculation [37, 38]. It has been shown that an enclosed cold aisle has a positive effect on the uniformity of the thermal distribution and on minimizing cooling energy consumption [39].

Fig. 1. A typical data center model [33]

On the other hand, the traditional rack structure is illustrated in Fig. 2a, where fans are integrated into each server [40–42]. The server manufacturer rather than the data center operator is in charge of fan control because of server ownership, i.e., servers generally belong to the data center users rather than the data center operators. As a consequence, the data center operator only needs to keep the supplied air temperature below a threshold specified by the server manufacturer. However, with recently emerged rack designs in which fans are moved from servers to racks, as shown in Fig. 2b [43], the data center operator must use the server temperature instead of the supplied air temperature as the constraint. Hence, server temperature control is incorporated into our thermal model.

Fig. 2. Typical server design and modern rack design. (a) A typical server design: fans installed in servers. (b) Fan-equipped rack [43]: shared cooling fan wall on rack front

System model

Figure 1 [33] presents a typical data center model with the configuration of a cold aisle and the airflow pattern. A number of Cloud Users (CUs) rent servers from a Cloud Provider (CP) to obtain the required computing capacity. In response, the CP turns on a certain number of servers, which generate waste heat. To remove the waste heat inside the data center, rack fans draw in the cold air provided by the CRAC and blow it through the servers. The cold air then absorbs the waste heat and is ejected from the rack rear. The CRAC consumes a large amount of energy in this process. To guarantee the QoS and system reliability, data center operation must comply with the QoS and “soft” temperature constraints.

Energy consumption model in data center

Data center energy consumption principally derives from server operation and cooling equipment. In this article, we do not take auxiliary infrastructure (lighting, fire extinguishing systems, etc.) into account. The total energy consumption is therefore approximated by the sum of the energy consumed by the IT and cooling subsystems.

Server energy consumption model

We assume a total number of J users requesting service from a data center. At time t, \(p_{j}(t)\), \(L_{j}(t)\), and \(m_{j}(t)\) are defined as the power cost of a server, the number of user requests, and the number of servers providing service for user \(j\in\{1,2,\ldots,J\}\), respectively. The energy consumption of a single server can be written as follows

$$ p_{j}(t)=a_{1}\frac{L_{j}(t)}{m_{j}(t)}+a_{2}, $$
(1)

where \(a_{1}\) is the marginal energy consumption of a server and \(a_{2}\) is the server’s basic energy consumption generated by non-workload-related components, e.g., the power supply unit and storage devices [44]. The energy consumption of a server cluster handling the requests of user j can be written as

$$ P_{j}(t)=m_{j}(t)p_{j}(t)=a_{1}L_{j}(t)+a_{2}m_{j}(t). $$
(2)

Then the total energy consumption for data center servers is

$$ P(t)=\sum\limits_{j=1}^{J}P_{j}(t)=\sum\limits_{j=1}^{J}\left(a_{1}L_{j}(t)+a_{2}m_{j}(t)\right). $$
(3)

CRAC energy consumption

As in [40], we define the power consumption of a CRAC at time t as

$$ C(t)=\frac{P(t)}{CoP\left(T^{c}(t)\right)}, $$
(4)

where \(T^{c}(t)\) is the temperature of the cold air supplied by the CRAC. The Coefficient of Performance (CoP) describes the cooling efficiency of the CRAC at \(T^{c}(t)\). This paper adopts the following quadratic CoP model [40]

$$ CoP\left(T^{c}(t)\right)=b\left(T^{c}(t)\right)^{2}+cT^{c}(t)+d. $$
(5)
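To make Eqs. (3)–(5) concrete, the following Python sketch evaluates the IT power, the CoP, and the resulting CRAC power. The workloads and server counts are hypothetical, and the CoP coefficients b, c, and d are placeholder values assumed purely for illustration, since this section does not list them; \(a_{1}=a_{2}=40\) W follows the simulation setup.

```python
def server_power(loads, servers, a1=40.0, a2=40.0):
    """Total IT power P(t) of Eq. (3): sum over users of a1*L_j + a2*m_j."""
    return sum(a1 * L + a2 * m for L, m in zip(loads, servers))


def cop(t_cold, b=0.0068, c=0.0008, d=0.458):
    """Quadratic CoP model of Eq. (5); b, c, d are assumed placeholder values."""
    return b * t_cold ** 2 + c * t_cold + d


def crac_power(it_power, t_cold):
    """CRAC power C(t) of Eq. (4): IT power divided by the CoP."""
    return it_power / cop(t_cold)


loads = [0.5, 2.0]    # L_j(t), hypothetical workloads
servers = [1, 3]      # m_j(t), hypothetical server counts
p_it = server_power(loads, servers)

for t_cold in (15.0, 25.0):
    total = p_it + crac_power(p_it, t_cold)
    print(f"T_c = {t_cold:.0f} C: P = {p_it:.1f} W, P + C = {total:.1f} W")
```

Under these assumed coefficients the CoP grows with the supplied air temperature, so the same IT load costs less cooling power at 25 °C than at 15 °C; this is the trade-off that the temperature constraint below has to balance.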

Constraints

QoS constraint

We assume that all requests from the same user share a service queue. Therefore, the system is an M/M/N queueing system (Fig. 3) and the response time can be calculated as \(\frac {1}{m_{j}\mu _{j}-L_{j}(t)}\). For user j, the average service rate \(\mu_{j}\) is obtained by dividing the CPU speed s (in instructions/second) by \(K_{j}\), i.e., \(\mu _{j}=\frac {s}{K_{j}}\). Each user is assigned a delay upper bound \(D_{j}\), and the QoS constraint can be expressed as

$$ \frac{1}{m_{j}\mu_{j}-L_{j}(t)}\leq D_{j}. $$
(6)

Fig. 3. An M/M/N queueing model
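Rearranging Eq. (6) gives the smallest number of servers that keeps the response time within the delay bound, \(m_{j}\geq (1/D_{j}+L_{j}(t))/\mu_{j}\). A minimal Python sketch of this calculation follows; rounding up to an integer is an assumption made here, and the sample numbers are illustrative (the 100 requests/s service rate and 50 ms bound match the simulation setup).

```python
import math


def min_servers(arrival_rate, service_rate, delay_bound):
    """Smallest integer m with 1 / (m*mu - L) <= D, i.e. m >= (1/D + L) / mu."""
    return math.ceil((1.0 / delay_bound + arrival_rate) / service_rate)


def response_time(arrival_rate, service_rate, servers):
    """Response-time expression of Eq. (6); only valid when m*mu > L."""
    capacity = servers * service_rate
    if capacity <= arrival_rate:
        raise ValueError("unstable queue: m * mu must exceed L")
    return 1.0 / (capacity - arrival_rate)


L, mu, D = 1000.0, 100.0, 0.05   # requests/s, requests/s per server, seconds
m = min_servers(L, mu, D)
print(m, response_time(L, mu, m))
```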

“Soft” server temperature constraint

In steady state, the temperature of a running server depends on the server inlet air temperature \(T^{c}(t)\) and the energy consumption of the server \(p_{j}(t)\):

$$ T_{j}^{cpu}(t)=T^{c}(t)+\varsigma p_{j}(t), $$
(7)

where ς (Kelvin·secs/Joules) refers to the heat exchange rate. As mentioned above, in practice the instantaneous server temperature can occasionally violate the upper bound \(T_{cpu}^{max}\) without undermining system reliability. Hence, we use an average server temperature constraint instead of the “hard” one, i.e.,

$$ E \left\{T^{c}(t)+\varsigma p_{j}(t)\right\}\leq T_{cpu}^{max}. $$
(8)
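The difference between the “hard” constraint and the averaged constraint (8) can be illustrated with a few time slots. In the sketch below, the per-slot server powers, the supplied air temperature, and the threshold of 70 °C are hypothetical values chosen for illustration; ς = 0.625 follows the simulation setup.

```python
def cpu_temperature(t_cold, power, sigma=0.625):
    """Steady-state server temperature of Eq. (7): T_c + sigma * p_j."""
    return t_cold + sigma * power


t_cold = 20.0                       # supplied cold air temperature (hypothetical)
t_cpu_max = 70.0                    # temperature threshold (hypothetical)
powers = [70.0, 78.0, 84.0, 76.0]   # p_j(t) in four time slots (hypothetical)

temps = [cpu_temperature(t_cold, p) for p in powers]

# "Hard" constraint: every slot must stay below the threshold.
hard_ok = all(t <= t_cpu_max for t in temps)
# "Soft" constraint of Eq. (8): only the time average must stay below it.
soft_ok = sum(temps) / len(temps) <= t_cpu_max

print(temps, hard_ok, soft_ok)
```

Here one slot reaches 72.5 °C, so the “hard” constraint fails, while the average of 68.125 °C satisfies the “soft” one.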

We assume the thermal distribution is uniform across server inlets, where the cold air temperature equals the CRAC temperature set-point. Previous studies such as [36, 45] developed thermal models in which the server inlet temperature is a linear combination of the CRAC cold air temperature and the outlet temperatures of other servers, reflecting hot air recirculation between cold and hot aisles. These models are constructed for open aisle configurations. In the enclosed aisle designs considered in this paper, however, the thermal characteristics are different. Arghode et al. [39] showed through CFD simulations that physically separating the hot and cold aisles results in uniform and lower server inlet temperatures, especially in the over-provisioned case.

To provide further evidence for this argument, we measured the inlet air flow temperature at various locations in two experiments in a real data center of UniCloud [16]. Temperature sensors are deployed at the rack inlets and grouped into three horizontal layers with a vertical inter-layer distance of 0.6 m (Fig. 4). The distance between the bottom layer and the raised floor is also 0.6 m. The cold aisles in both computer rooms are enclosed. Temperature data are collected every 10 min, and each experiment lasts 1.5 h. The resolution of the temperature sensors is 0.1 °C. The field measurements plotted in Fig. 5 indicate that, although the air flow temperature at the CRAC outlet is generally lower than at other locations, the temperature is rather stable across server racks, and there is no strong correlation between location and air flow temperature. In addition, recent techniques such as active tiles [29] and the Down-flow Plenum [46] further reduce nonuniformity in the temperature distribution. Therefore, our thermal model (8) is applicable in practice.

Fig. 4. Layout of sensor deployment. (a) Computer room 1. This computer room is dedicated to servers. The horizontal distance between the first rack and the CRAC is 3.3 m. The inter-rack distance is 2.4 m. (b) Computer room 2. This is a small computer room dedicated to network devices. The horizontal distance between the first rack and the CRAC is 3.8 m. The inter-rack distance is 1.4 m

Fig. 5. Measured inlet air flow temperature. (a) Computer room 1, (b) Computer room 2

Total energy minimization problem

Now, we can summarize the holistic energy minimization problem as follows

$$ \min \lim \limits_{T \rightarrow \infty}\frac{1}{T}\sum\limits_{t=0}^{T-1}E\left\{P(t)+C(t)\right\}, $$
(9)

s.t.

$$ \frac{1/D_{j}+L_{j}(t)}{\mu_{j}}-m_{j}(t)\leq 0, $$
(10)
$$ \lim \limits_{T \rightarrow \infty}\frac{1}{T}\sum\limits_{t=0}^{T-1}E\left\{T^{c}(t) + \varsigma\left(a_{1}\frac{L_{j}(t)}{m_{j}(t)} + a_{2}\right)-T_{cpu}^{max}\right\}\leq 0, $$
(11)
$$\begin{array}{*{20}l} m_{j}^{min}&\leq m_{j}(t)\leq m_{j}^{max}, \forall j, \end{array} $$
(12)
$$\begin{array}{*{20}l} T^{min}&\leq T^{c}(t)\leq T^{max}, \end{array} $$
(13)
$$\begin{array}{*{20}l} L_{j}^{min}&\leq L_{j}(t)\leq L_{j}^{max}, \forall j. \end{array} $$
(14)

where \(m_{j}^{min}=\frac {1/D_{j}^{max}+L_{j}(t)}{\mu _{j}}\) is the minimal number of servers guaranteeing QoS, and \(m_{j}^{max}\) is the maximal number of servers allowed by the monetary budget. \(T^{c}(t)\) is the cold air temperature supplied by the CRAC, which ranges between the lower bound \(T^{min}\) and the upper bound \(T^{max}\).

The above problem is hard to solve for the following reasons: 1) The probability density function of \(L_{j}(t)\) needed to compute Eq. (10) is unknown. 2) The applicability of classical dynamic programming is limited by scalability issues. Hence, we must explore an alternative approximation method to solve this problem.

Dynamic control strategies

Given the above limitations of traditional dynamic programming, we use a Lyapunov Optimization approach to develop an approximation algorithm for problem (9)–(14).

Virtual queue and thermal constraint

Conversion for the average temperature constraint

At time t, the updated length of virtual queue is

$$ Z(t+1)=\max \left[ Z(t)+ \bar{y}(t),0 \right], $$
(15)

where \(\bar {y}(t)\) is the mean rate of a virtual queue. If the queue is stable, the constraint \(\lim \limits _{t \rightarrow \infty }\bar {y}(t) \leq 0\) is satisfied. We now explain why it is reasonable to translate the average CPU temperature constraint into the mean rate stability of virtual queues.

Multiplying formula (8) through by \(m_{j}(t)\) and using \(m_{j}(t)p_{j}(t)=a_{1}L_{j}(t)+a_{2}m_{j}(t)\) from (2), we can also express the “soft” server temperature constraint as

$$ E\left\{m_{j}(t)T^{c}(t)+\varsigma\left(a_{1}L_{j}(t)+a_{2}m_{j}(t)\right)\right\}-T_{cpu}^{max}m_{j}(t)\leq0 $$
(16)

Using the left-hand side of (16) as the increment in (15), the updated queue is modeled as Eq. (17), which relates the server temperature to the virtual queue. According to [15], the mean rate stability of the virtual queues \(\left (\lim \limits _{t \rightarrow \infty } \frac {\mathrm {E}\left \{Z_{j}(t)\right \}}{t}=0\right)\) guarantees the “soft” server temperature constraint.

$$ \begin{aligned} \mathrm{Z}_{j}(t+1)&=\max \left\{Z_{j}(t)+m_{j}(t)T^{c}(t)\right.\\ &\quad\left.+\varsigma\left(a_{1}L_{j}(t)+a_{2}m_{j}(t)\right)-T_{cpu}^{max}m_{j}(t),0\right\}. \end{aligned} $$
(17)
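The queue update (17) can be read as a running account of accumulated temperature excess: the backlog grows in slots where the scaled server temperature exceeds the threshold and drains otherwise. The Python sketch below steps the update over a short, hypothetical workload sequence; the threshold of 65 °C, the supplied air temperature, and the server count are assumed values.

```python
def update_queue(z, m, t_cold, load, t_cpu_max,
                 a1=40.0, a2=40.0, sigma=0.625):
    """One step of the virtual queue update in Eq. (17)."""
    excess = m * t_cold + sigma * (a1 * load + a2 * m) - t_cpu_max * m
    return max(z + excess, 0.0)


z = 0.0
history = []
for load in [2.0, 3.5, 4.0, 2.0, 1.0]:   # L_j(t), hypothetical
    z = update_queue(z, m=4, t_cold=20.0, load=load, t_cpu_max=65.0)
    history.append(z)

print(history)
```

With these numbers the backlog builds up during the two heavily loaded slots and returns to zero once the load drops, which is the behavior that mean rate stability requires in the long run.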

Applying LO theory to the STCEM problem

The above argument shows that the mean rate stability of the virtual queues ensures a safe server temperature. This suggests solving the problem indirectly. Building on this conclusion, we next discuss how to minimize energy consumption in data centers.

The Lyapunov function of the virtual queue lengths is defined as

$$ L\left(Z(t)\right)=\frac{1}{2}\sum\limits_{{j}=1}^{J}Z_{j}(t)^{2}. $$
(18)

The corresponding Lyapunov drift function is as follows

$$ {\begin{aligned} \Delta \left(Z(t)\right)&=E\left\{L(t+1)-L(t)\mid Z(t)\right\}\\ &=E\left\{\frac{1}{2}\sum\limits_{{j}=1}^{J}Z_{j}(t+1)^{2}-\frac{1}{2}\sum\limits_{{j}=1}^{J}Z_{j}(t)^{2}\mid Z(t)\right\}. \end{aligned}} $$
(19)

Transforming (19) by replacing Z j (t + 1) with (17) results in

$$\begin{array}{*{20}l} \Delta \left(Z(t)\right)=&E \left\{\frac{1}{2}\sum\limits_{{j}=1}^{J}\max \left[{\vphantom{T_{cpu}^{max}m_{j}}}Z_{j}(t)+m_{j}(t)T^{c}(t)\right.\right.\\ &\left.+\varsigma \left(a_{1}L_{j}(t)+a_{2}m_{j}(t)\right)-T_{cpu}^{max}m_{j}(t),0\right]^{2} \\ &\left.-\frac{1}{2}\sum\limits_{{j}=1}^{J}Z_{j}(t)^{2}\mid Z(t)\right\}, \end{array} $$
(20)

which yields

$$ {\begin{aligned} \Delta \left(Z(t)\right)&\leq \sum\limits_{{j}=1}^{J}E\left\{\frac{1}{2}\left(T^{c}(t)+\varsigma a_{2}-T_{cpu}^{max}\right)^{2}m_{j}^{2}(t)\right.\\ &\quad+\left(\varsigma a_{1}L_{j}(t)+Z_{j}(t)\right)\left(T^{c}(t)+\varsigma a_{2}-T_{cpu}^{max}\right)m_{j}(t)\\ &\quad\left.+\frac{1}{2}\varsigma^{2} a_{1}^{2}L_{j}^{2}(t)+\varsigma a_{1}L_{j}(t)Z_{j}(t)\mid Z(t)\right\}. \end{aligned}} $$
(21)

The Lyapunov penalty function given \(Z(t)\) is

$$ VE\left\{\left(P(t)+C(t)\right)\mid Z(t)\right\}. $$
(22)

Substituting Eqs. (3) and (4) in (22), we get the Lyapunov penalty function of “soft” server temperature constraint

$$\begin{array}{*{20}l} VE&\left\{P(t)+C(t)\mid Z(t)\right\}\\ &=E\left\{\sum\limits_{j=1}^{J}{Va}_{2}\left(1+\frac{1}{CoP(T^{c}(t))}\right)m_{j}(t)\right.\\ &\quad\left.+{Va}_{1}\left(1+\frac{1}{CoP(T^{c}(t))}\right)L_{j}(t)\mid Z(t)\right\}.\notag \end{array} $$
(23)

The Lyapunov drift-plus-penalty function of virtual queue Z(t) is

$$ \Delta\left(Z(t)\right)+VE\left\{P(t)+C(t)\mid Z(t)\right\}, $$
(24)

where V>0 represents the weight of energy consumption relative to the server temperature constraint. Based on the LO theory, problem (9) can be transformed into minimizing the upper bound of (24)

$$ \min \Delta\left(Z(t)\right)+VE\left\{P(t)+C(t)\mid Z(t)\right\}, $$
(25)

s.t. (10), (12), (13), and

$$ \lim \limits_{t\rightarrow\infty} \frac{E\left\{Z_{j}(t)\right\}}{t}=0. $$
(26)

Equation (26) is the converted form of the CPU average temperature constraint. Hence, we can indirectly optimize total energy consumption of data centers by minimizing the Lyapunov drift-plus-penalty function.

Linear resource control strategy

Theorem 1

A linear bound of (24) can be written as (27),

$$ {\begin{aligned} &\Delta\left(Z(t)\right)+VE\left\{P(t)+C(t)\mid Z(t)\right\}\\[-1pt] &\leq B_{L}+E\left\{\sum\limits_{j=1}^{J} \left[{\vphantom{\left(1+\frac{1}{b(T^{c}(t))^{2}+cT^{c}(t)+d}\right)}}\left(\varsigma a_{1}L_{j}(t)+Z_{j}(t)\right)m_{j}(t)T^{c}(t)+{Va}_{2}m_{j}(t)\right.\right.\\[-1pt] &\left.\left.\quad\times\left(1+\frac{1}{b(T^{c}(t))^{2}+cT^{c}(t)+d}\right)\right]\mid Z(t){\vphantom{\sum\limits_{j=1}^{J}}}\right\}\\[-1pt] &\quad+E\left\{\sum\limits_{j=1}^{J}\left(\varsigma a_{1}L_{j}(t)+Z_{j}(t)\right)\left(\varsigma a_{2}-T_{cpu}^{max}\right)m_{j}(t)\mid Z(t)\right\}\\[-1pt] &\quad+E\left\{\sum\limits_{j=1}^{J}{Va}_{1}L_{j}(t)\left(1+\frac{1}{b(T^{c}(t))^{2}+cT^{c}(t)+d}\right)\mid Z(t)\right\}\\[-1pt] &\quad+E\left\{\sum\limits_{j=1}^{J}\varsigma a_{1}L_{j}(t)Z_{j}(t)\mid Z(t)\right\}. \end{aligned}} $$
(27)

where

$$\begin{array}{*{20}l} B_{L}&=\sum\limits_{j=1}^{J}\left(\frac{1}{2}\left(T^{max}+\varsigma a_{2}-T_{cpu}^{max}\right)^{2}\left(m_{j}^{max}\right)^{2}\right. \\[-1pt] &\quad+\left.\frac{1}{2}\varsigma^{2}a_{1}^{2}\left(L_{j}^{max}\right)^{2}\right). \end{array} $$
(28)

Proof

Transform (17) into (29).

$$\begin{array}{*{20}l} &Z_{j}(t+1)-Z_{j}(t)\\ \geq &m_{j}(t)T^{c}(t)+\varsigma \left(a_{1}L_{j}(t)+a_{2}m_{j}(t)\right)-T_{cpu}^{max}m_{j}(t). \end{array} $$
(29)

Squaring both sides of (17) and using \(\max[x,0]^{2}\leq x^{2}\) yields (30).

$$ {\begin{aligned} &Z_{j}^{2}(t+1)-Z_{j}^{2}(t)\\ &\quad\leq \left(m_{j}(t)T^{c}(t)+\varsigma \left(a_{1}L_{j}(t)+a_{2}m_{j}(t)\right)-T_{cpu}^{max}m_{j}(t)\right)^{2}\\ &\quad+2Z_{j}(t)\left(m_{j}(t)T^{c}(t)+\varsigma \left(a_{1}L_{j}(t)+a_{2}m_{j}(t)\right)-T_{cpu}^{max}m_{j}(t)\right). \end{aligned}} $$
(30)
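Inequality (30) rests on the fact that \(\max[x,0]^{2}\leq x^{2}\). As a quick sanity check (not a proof), the Python sketch below samples integer backlogs Z ≥ 0 and signed increments y, both hypothetical, and counts how often the left side of (30) exceeds the right side. Integers are used so that the comparison is exact.

```python
import random


def queue_step(z, y):
    """Virtual queue update Z(t+1) = max(Z(t) + y(t), 0), as in Eq. (17)."""
    return max(z + y, 0)


random.seed(0)
violations = 0
for _ in range(10_000):
    z = random.randint(0, 1000)      # current backlog, Z >= 0
    y = random.randint(-500, 500)    # signed increment (temperature excess)
    lhs = queue_step(z, y) ** 2 - z ** 2
    rhs = y ** 2 + 2 * z * y
    if lhs > rhs:
        violations += 1

print("violations:", violations)
```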

Taking the conditional expectation of (30) and summing over j=1 to j=J results in

$$\begin{array}{*{20}l} &\sum\limits_{j=1}^{J}E\left\{Z_{j}^{2}(t+1)-Z_{j}^{2}(t)\mid Z(t)\right\}\\ &\quad\leq \sum\limits_{j=1}^{J}E\left\{{\vphantom{T_{cpu}^{max}}}\left(\vphantom{-T_{cpu}^{max}m_{j}(t)} m_{j}(t)T^{c}(t)+\varsigma \left(a_{1}L_{j}(t)+a_{2}m_{j}(t)\right)\right.\right.\\ &\qquad\left.-T_{cpu}^{max}m_{j}(t)\right)^{2} +2Z_{j}(t)\left(m_{j}(t)T^{c}(t)+\varsigma \left(a_{1}L_{j}(t)\right.\right.\\ &\qquad\left.\left.\left.+a_{2}m_{j}(t)\right)-T_{cpu}^{max}m_{j}(t)\right)\mid Z(t)\right\}. \end{array} $$
(31)

In form, the left side of (31) matches twice the Lyapunov drift function of the virtual queues, i.e., Eq. (19). Thus, replacing the left side by 2Δ(Z(t)) and simplifying yields the following relation

$$\begin{array}{*{20}l} &\Delta \left(Z(t)\right)\\ \leq &\sum\limits_{j=1}^{J}E\left\{\frac{1}{2}\left(T^{c}(t)+\varsigma a_{2}-T_{cpu}^{max}\right)^{2}m_{j}(t)^{2}\right.\\ &+\left(\varsigma a_{1}L_{j}(t)+Z_{j}(t)\right)\left(T^{c}(t)+\varsigma a_{2}-T_{cpu}^{max}\right)m_{j}(t)\\ &\left.+\frac{1}{2}\varsigma^{2}a_{1}^{2}L_{j}(t)^{2}+\varsigma a_{1}L_{j}(t)Z_{j}(t)\mid Z(t)\right\}. \end{array} $$
(32)

Applying the bounds (12)–(14) to Eq. (32), we reach the result (33).

$$ {\begin{aligned} &\Delta \left(Z(t)\right) \leq B_{L}\\ &\quad+\sum\limits_{j=1}^{J}E\left\{\left(\varsigma a_{1}L_{j}(t)+Z_{j}(t)\right)\left(T^{c}(t)+\varsigma a_{2}-T_{cpu}^{max}\right)\right.\\ &\quad\left.\! \times m_{j}(t)+\varsigma a_{1}L_{j}(t)Z_{j}(t)\mid Z(t)\right\}. \end{aligned}} $$
(33)

Adding (23) to (33) yields (27). □

Theorem 1 shows that, at time t, the linear control algorithm only needs to minimize the right-hand side of (27).

As we can see, the right-hand side of (27) depends only on the decision variables \(T^{c}(t)\) and \(m_{j}(t)\), and for a fixed \(T^{c}(t)\) it is linear in \(m_{j}(t)\). By the monotonicity of a linear function, \(m_{j}(t)\) takes one of two values: 1) if the coefficient of \(m_{j}(t)\) is greater than zero, we take its minimum, \(m_{j}^{min}=\frac {1/D_{max}+L_{j}(t)}{\mu _{j}}\); 2) otherwise, we set \(m_{j}(t)\) to \(m_{j}^{max}=1.1*L^{max}\), where \(L^{max}\) is the maximum workload among all users over t∈[0,1,…,T−1]. We then enumerate each possible \(T^{c}(t)\) and select the one associated with the minimal objective function value. The resulting \(T^{c}(t)\) and \(m_{j}(t)\) are used as the CRAC temperature set-point and the number of servers, respectively. In this way, the optimization of energy consumption is translated into minimizing the right-hand side of Eq. (27). The linear control algorithm based on the LO theory is presented in Algorithm 1.
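The steps described above can be sketched in Python as follows. This is an illustrative reading of the text, not a reproduction of Algorithm 1: the function name, the temperature grid, the CoP coefficients, the threshold of 65 °C, and all sample inputs are assumptions. Terms of (27) that do not depend on the decisions are omitted because they do not affect the minimizer.

```python
def cop(t_cold, b=0.0068, c=0.0008, d=0.458):
    """Quadratic CoP of Eq. (5); b, c, d are assumed placeholder values."""
    return b * t_cold ** 2 + c * t_cold + d


def linear_policy(loads, queues, m_min, m_max, t_grid, t_cpu_max, V,
                  a1=40.0, a2=40.0, sigma=0.625):
    """Return (T_c, [m_j]) minimising the decision-dependent part of (27)."""
    best = None
    for t_cold in t_grid:
        overhead = 1.0 + 1.0 / cop(t_cold)         # 1 + 1/CoP(T_c)
        margin = t_cold + sigma * a2 - t_cpu_max   # T_c + sigma*a2 - T_cpu^max
        objective, servers = 0.0, []
        for L, Z, lo, hi in zip(loads, queues, m_min, m_max):
            # Coefficient multiplying m_j(t) on the right-hand side of (27).
            coef = (sigma * a1 * L + Z) * margin + V * a2 * overhead
            m = lo if coef > 0 else hi             # linear in m: endpoint optimum
            servers.append(m)
            objective += coef * m + V * a1 * L * overhead
        if best is None or objective < best[0]:
            best = (objective, t_cold, servers)
    return best[1], best[2]


t_grid = [float(t) for t in range(15, 26)]         # CRAC range [15, 25] C
loads, m_min, m_max = [2.0, 3.0], [3, 4], [6, 8]   # hypothetical inputs

# Empty queues and a large V: energy dominates (few servers, warm air).
energy_first = linear_policy(loads, [0.0, 0.0], m_min, m_max, t_grid, 65.0, V=1000.0)
# Large backlog and a small V: temperature dominates (many servers, cold air).
thermal_first = linear_policy(loads, [1e4, 1e4], m_min, m_max, t_grid, 65.0, V=1.0)
print(energy_first, thermal_first)
```

The two calls show the bang-bang character of the linear policy: the server count jumps between its bounds depending on the sign of the coefficient, with no intermediate values.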

Quadratic resource control strategy

In the above section, we use the linear control strategy based on the LO theory to approximately compute the minimum energy consumption in data centers. Because the bound is linear, the optimal value of the variable, corresponding to the minimal energy consumption, is either the minimum or the maximum of its range. This leads to large fluctuations in the transient CPU temperature, which harms the stability and reliability of servers. To address this problem, we present a second method, the quadratic control strategy.

Theorem 2

A quadratic bound of (24) is given by (34),

$$ {\begin{aligned} &\Delta\left(Z(t)\right)+VE\left\{P(t)+C(t)\mid Z(t)\right\}\\ &\quad\leq B_{Q}+E\left\{\sum\limits_{j=1}^{J}\left[\frac{1}{2}\left(T^{c}(t)+\varsigma a_{2}-T_{cpu}^{max}\right)^{2}\right.\right.\\ &\qquad\left.\left. m_{j}(t)^{2}+\left(\varsigma a_{1}L_{j}(t)+Z_{j}(t)\right)T^{c}(t)m_{j}(t){\vphantom{\frac{a^{a}}{a}}}\right]{\vphantom{\sum\limits_{j=1}^{J}}}\mid Z(t)\right\}\\ &\qquad+E\left\{\sum\limits_{j=1}^{J}\left[\vphantom{\left(1+\frac{1}{b\left(T^{c}(t)\right)^{2}+cT^{c}(t)+d}\right)}\left(\varsigma a_{1}L_{j}(t)+Z_{j}(t)\right)\left(\varsigma a_{2}-T_{cpu}^{max}\right)\right.\right.\\ &\qquad\left.\left.+{Va}_{2}\left(1+\frac{1}{b\left(T^{c}(t)\right)^{2}+cT^{c}(t)+d}\right)\right]m_{j}(t)\mid Z(t){\vphantom{\sum\limits_{j=1}^{J}}}\right\}\\ &\qquad+E\left\{\sum\limits_{j=1}^{J}{Va}_{1}L_{j}(t)\left(1+\frac{1}{b\left(T^{c}(t)\right)^{2}+cT^{c}(t)+d}\right)\mid Z(t)\right\}\\ &\qquad+E\left\{\sum\limits_{j=1}^{J}\varsigma a_{1}L_{j}(t)Z_{j}(t)\mid Z(t)\right\}. \end{aligned}} $$
(34)

where

$$ B_{Q}=\sum\limits_{j=1}^{J}\frac{1}{2}\varsigma^{2} a_{1}^{2}\left(L_{j}^{max}\right)^{2}. $$
(35)

Proof

Applying the bound (14) to Eq. (32), we obtain (36).

$$ {\begin{aligned} \Delta\left(Z(t)\right) \leq B_{Q} &+E\left\{\sum\limits_{j=1}^{J}\frac{1}{2}\left(T^{c}(t)+\varsigma a_{2}-T_{cpu}^{max}\right)^{2}m_{j}(t)^{2}\mid Z(t)\right\}\\ &+E\left\{\sum\limits_{j=1}^{J}\left(\varsigma a_{1}L_{j}(t)+Z_{j}(t)\right)\left(T^{c}(t)+\varsigma a_{2}-T_{cpu}^{max}\right)\right.\\ &\times\left.m_{j}(t)+\varsigma a_{1}L_{j}(t)Z_{j}(t)\mid Z(t){\vphantom{\sum\limits_{j=1}^{J}}}\right\}. \end{aligned}} $$
(36)

Adding (23) to (36) yields (34). □

Theorem 2 shows that, at time t, the quadratic control algorithm only needs to minimize the right-hand side of (34).

For a fixed \(T^{c}(t)\), the right-hand side of (34) is a quadratic function of the single unknown \(m_{j}(t)\), and the coefficient of the quadratic term is positive. To find the extremum of a quadratic function, we take the derivative with respect to the unknown variable. For example, for \(y=Ax^{2}+Bx+C\) (A≠0, x∈[m,n]), the derivative with respect to x is \(y'=2Ax+B\), and the value of x satisfying \(y'=0\) gives the minimum of y when A>0. Applied to (34), the value of \(m_{j}(t)\) falls into the three cases expressed in (37), where \(m_{j}^{min}=\frac {1/D_{max}+L_{j}(t)}{\mu _{j}}\), \(m_{j}^{max}=1.1*L^{max}\), and \(L^{max}\) is the maximum workload among all users over t∈[0,1,…,T−1]. We then enumerate \(T^{c}(t)\) and select the value associated with the minimal objective function value. Compared with the linear control strategy, this policy overcomes the drawback of only taking extreme values for \(m_{j}(t)\) by tightening the upper bound of the Lyapunov drift function. The quadratic control strategy is presented in Algorithm 2. It provides a flexible and practical method for allocating servers, and it also reduces the swing of the CPU temperature.

$$ {\begin{aligned} &m_{j}(t)\\&=\left\{ \begin{array}{ll} m_{j}^{min}&m_{j}(t)<m_{j}^{min}\\ \frac{-{Va}_{2}\left(1+\frac{1}{CoP(T^{c}(t))}\right)-\left(T^{c}(t)+\varsigma a_{2}-T_{cpu}^{max}\right)\left(\varsigma a_{1}L_{j}(t)+Z_{j}(t)\right)}{\left(T^{c}(t)+\varsigma a_{2}-T_{cpu}^{max}\right)^{2}}&m_{j}^{min}\leq m_{j}(t)\leq m_{j}^{max}\\ m_{j}^{max}&m_{j}(t)>m_{j}^{max} \end{array} \right. \end{aligned}} $$
(37)
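The case distinction in (37) amounts to computing the stationary point of the quadratic in \(m_{j}(t)\) and clamping it to the feasible range. The Python sketch below illustrates this under stated assumptions; it is not a reproduction of Algorithm 2. The function names, the CoP coefficients, the threshold of 65 °C, and the sample inputs are hypothetical, and the server count is left fractional because the text does not say how it is rounded.

```python
def cop(t_cold, b=0.0068, c=0.0008, d=0.458):
    """Quadratic CoP of Eq. (5); b, c, d are assumed placeholder values."""
    return b * t_cold ** 2 + c * t_cold + d


def quadratic_servers(margin, linear, lo, hi):
    """Minimise 0.5*margin^2*m^2 + linear*m over [lo, hi], as in (37)."""
    if margin == 0.0:                      # degenerate: objective is linear in m
        return lo if linear > 0 else hi
    stationary = -linear / margin ** 2     # root of the derivative
    return min(max(stationary, lo), hi)    # clamp to the feasible range


def quadratic_policy(loads, queues, m_min, m_max, t_grid, t_cpu_max, V,
                     a1=40.0, a2=40.0, sigma=0.625):
    """Return (T_c, [m_j]) minimising the decision-dependent part of (34)."""
    best = None
    for t_cold in t_grid:
        overhead = 1.0 + 1.0 / cop(t_cold)         # 1 + 1/CoP(T_c)
        margin = t_cold + sigma * a2 - t_cpu_max   # T_c + sigma*a2 - T_cpu^max
        objective, servers = 0.0, []
        for L, Z, lo, hi in zip(loads, queues, m_min, m_max):
            linear = (sigma * a1 * L + Z) * margin + V * a2 * overhead
            m = quadratic_servers(margin, linear, lo, hi)
            servers.append(m)
            objective += (0.5 * margin ** 2 * m ** 2 + linear * m
                          + V * a1 * L * overhead)
        if best is None or objective < best[0]:
            best = (objective, t_cold, servers)
    return best[1], best[2]


t_grid = [float(t) for t in range(15, 26)]         # CRAC range [15, 25] C
t_best, m_best = quadratic_policy([2.0, 3.0], [100.0, 100.0],
                                  [3, 4], [6, 8], t_grid, 65.0, V=10.0)
print(t_best, m_best)
```

Unlike the linear policy, the stationary point can fall strictly inside the feasible range, which is what allows the server count to change gradually rather than jump between its bounds.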

Performance evaluation

In this section, we conduct extensive simulations to evaluate the effectiveness of the control policies presented in this paper. We analyze the effect of the control strategies on the total and component energy consumption in a data center. We also compare the linear and quadratic control policies in terms of the number of servers, CPU temperature, and CRAC temperature.

System setup

We use a workload trace from a real data center, Ordos UniCloud Co. Ltd [16], which is shown in Fig. 6. The trace records one week of request arrivals for 4 interactive applications at a granularity of 1 h. In practice, the request arrival rate ranges from 1000 requests/s to 150000 requests/s. For simplicity, we set the mean service rate of a CPU to 100 requests/s. In addition, we set \(a_{1}=a_{2}=40\) W, the upper bound of the response time to 50 ms, and ς=0.625 K·s/J. We assume the operating range of the CRAC is [15 °C, 25 °C].

Fig. 6 Workload for 4 applications

Results and analysis

Baseline policy

To evaluate the effectiveness of the linear and quadratic control strategies, we introduce a baseline policy and compare both strategies against it.

The baseline policy is as follows: 1) From the upper bound of the response time \(D_{j}^{max}\), we obtain the number of servers \(m=\frac {1/D_{j}^{max}+L_{j}(t)}{\mu _{j}}\), which is the minimal number of servers allocated to user j. 2) Substituting m into \(T=T_{cpu}^{max}-\varsigma \left (a_{1}\frac {\sum _{j=1}^{J}L_{j}(t)}{\sum _{j=1}^{J}m_{j}(t)}+a_{2}\right)\) yields the peak air temperature T for the CRAC. Algorithm 3 computes the number of servers allocated to the 4 users and the cold air temperature supplied by the CRAC.
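
The two baseline steps can be sketched as below. This is only an illustration of the two formulas, not a reproduction of Algorithm 3. The function names and the numbers in the usage example are hypothetical, and the sketch leaves the server count as a real number because the text does not state a rounding rule. The units and normalization of the workload in the temperature formula are not specified in this section, so the example uses toy normalized loads.

```python
from typing import Sequence


def baseline_servers(load: float, mu: float, d_max: float) -> float:
    """Minimal number of servers for one user: m = (1/D_max + L) / mu."""
    if mu <= 0 or d_max <= 0:
        raise ValueError("mu and d_max must be positive")
    return (1.0 / d_max + load) / mu


def baseline_supply_temperature(
    loads: Sequence[float],
    servers: Sequence[float],
    a1: float,
    a2: float,
    varsigma: float,
    t_cpu_max: float,
) -> float:
    """Peak CRAC air temperature: T = T_cpu_max - varsigma * (a1 * sum(L)/sum(m) + a2)."""
    total_servers = sum(servers)
    if total_servers <= 0:
        raise ValueError("at least one server is required")
    per_server_load = sum(loads) / total_servers
    return t_cpu_max - varsigma * (a1 * per_server_load + a2)


if __name__ == "__main__":
    # Step 1: hypothetical user with 900 requests/s, mu = 100 requests/s, D_max = 50 ms.
    print(baseline_servers(900.0, 100.0, 0.05))
    # Step 2: toy normalized loads and server counts, a1 = a2 = 40, varsigma = 0.625.
    print(baseline_supply_temperature([2.0, 2.0], [4.0, 4.0], 40.0, 40.0, 0.625, 60.0))
```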

Total energy consumption

Following the Lyapunov Optimization (LO) theory, we minimize problem (25) to obtain the minimal energy consumption. Fig. 7 shows how the total energy consumption under the different control policies varies as the weight parameter V increases.

  1. The line representing the baseline is parallel to the X-axis. The fixed number of servers and the invariable cooling air temperature result in a constant total energy consumption.

  2. The energy consumption is large where V is small in Fig. 7. In particular, under the linear control policy it is even higher than the baseline. We attribute this to an excessive emphasis on congestion control in the virtual queues, i.e., on optimizing the Lyapunov drift function Δ(Z(t)). To relieve congestion, i.e., to reduce delay, the cloud provider allocates more servers to users. As a consequence, the servers consume more energy, and waste heat increases with it. To keep the server temperature within a safe range and guarantee system reliability, the cooling system has to remove this heat, so the energy consumed by the cooling system also increases. In contrast, as V grows, more weight is placed on minimizing energy consumption than on response time. Hence, fewer servers are used, and both waste heat and energy consumption decrease.

  3. The energy consumption under the linear control policy is dramatically higher than under the quadratic policy when \(V<10^{4}\). In other words, the energy consumption under the quadratic control policy is closer to the optimum. We solve the objective function only approximately, by tightening the bound. Since \(B_{L}>B_{Q}\), the bound calculated by the linear algorithm is looser than the one computed by the quadratic algorithm.

  4. The energy consumption curve under the linear control policy almost coincides with the curve under the quadratic control policy when \(V>10^{4}\). Once V exceeds a threshold, the optimization favors total energy consumption. Moreover, the energy term then dominates the delay term by orders of magnitude.

Fig. 7 Comparison of total energy consumption under different control policies

To compare the energy consumption under the different control policies, Table 1 lists the specific values and Table 2 the saving proportions. In conclusion, the quadratic control policy is more efficient than the linear control policy in terms of total energy consumption.

Table 1 Comparison of energy consumption in data center
Table 2 Energy saving proportion of the data center with respect to the baseline policy
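
A saving proportion with respect to the baseline is commonly computed as the relative reduction in energy. The sketch below assumes this definition, which the text does not spell out. The function name and the numbers are hypothetical and are not taken from Tables 1 and 2.

```python
def saving_proportion(baseline_energy: float, policy_energy: float) -> float:
    """Relative energy saving of a policy versus the baseline (assumed definition)."""
    if baseline_energy <= 0:
        raise ValueError("baseline energy must be positive")
    return (baseline_energy - policy_energy) / baseline_energy


if __name__ == "__main__":
    # Hypothetical energies in arbitrary units; a negative result would mean
    # the policy consumes more than the baseline.
    print(f"{saving_proportion(200.0, 150.0):.2%}")
```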

Component energy and PUE

The component energy is plotted in Fig. 8 and follows a trend similar to that of the total energy consumption (Fig. 7). The number of servers has a direct effect on server energy consumption. Most of the energy used for CPU computing dissipates as waste heat, which raises the server temperature. To guarantee normal server operation, the cooling system must work harder to keep the server temperature within a safe range. Component energy and total energy are therefore positively correlated, as Figs. 7 and 8 show. In addition, as V grows, the number of active servers gradually decreases. As a result, the average server temperature and the cooling temperature rise, as illustrated in Figs. 9 and 10. Considering these performance metrics, the quadratic control policy is more appropriate.

Fig. 8 Energy comparison for different components under linear and quadratic control. a Server energy, b Cooling energy

Fig. 9 The relationship between number of servers and average server temperature. a linear, b quadratic

Fig. 10 The average temperature comparison between IT and cooling subsystem. a linear, b quadratic

Figure 11 compares the two policies in terms of PUE, a typical performance metric computed by dividing the total energy by the computing energy. A PUE value close to 1 indicates a higher proportion of computing energy. PUE=1 is the ideal state, in which all energy is consumed by useful work, i.e., CPU operation. Fig. 11 shows that the PUE value under the linear control policy is lower than that under the quadratic control policy, which suggests that the linear control policy is superior in terms of PUE. However, as shown in Figs. 7 and 8, the quadratic control policy is more efficient in both total energy and component energy. The conclusion based on PUE thus contradicts the actual energy consumption. Hence, PUE, although widely accepted by experts in the field, is not sufficient on its own to determine whether a data center is efficient.

Fig. 11 Power usage efficiency vs. V
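
A small numerical sketch shows how a policy can have a lower PUE and still consume more energy in total. It assumes that total energy is the sum of computing (server) energy and cooling energy. The function name and all numbers are hypothetical and are not simulation results from this paper.

```python
def pue(server_energy: float, cooling_energy: float) -> float:
    """PUE = total energy / computing energy, with total = server + cooling (assumed)."""
    if server_energy <= 0:
        raise ValueError("server energy must be positive")
    return (server_energy + cooling_energy) / server_energy


if __name__ == "__main__":
    # Hypothetical policy A: many servers, so computing energy dominates the total.
    a_server, a_cooling = 100.0, 40.0
    # Hypothetical policy B: fewer servers, lower total, but cooling is a larger share.
    b_server, b_cooling = 60.0, 30.0
    print("A: total", a_server + a_cooling, "PUE", pue(a_server, a_cooling))
    print("B: total", b_server + b_cooling, "PUE", pue(b_server, b_cooling))
```

In this toy case policy A has the lower PUE although policy B uses less energy overall, which is the kind of disagreement between PUE and total energy described above.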

Number of active servers

The control variable \(m_{j}(t)\), i.e., the number of servers allocated to users, has a decisive influence on the total energy consumption of the data center. Fig. 12 shows the number of servers serving users under the linear and quadratic control policies, using application 1 with V=1000 as an example. The number of servers under the quadratic control policy is nearly always smaller than under the linear control policy, owing to the flexibility of the control variable's value. The blue line also fluctuates drastically compared with the yellow line. Extensive simulations show a similar phenomenon for the other applications and for a wider range of V. The average number of active servers is depicted in Fig. 13. Overall, as V grows, the number of active servers under the linear control policy declines and approaches that under the quadratic control policy.

Fig. 12 Servers allocated to user 1 with V=1000

Fig. 13 Comparison in average number of active servers under different control vs. V

Average server temperature

Table 3 shows how the average server temperature varies with increasing V. The result is attributed to the amount of workload assigned to each server. A larger V means that more attention is paid to optimizing energy consumption. As a result, fewer servers are allocated to users, and more workload is assigned to each server. CPU computing under high load leads to a higher server temperature. This relationship is reflected in Fig. 9. Besides, the trend of the server temperature corresponds to that of the energy consumption. When V∈[100,10000], since the server energy consumption under the linear policy is higher than that under the quadratic policy, the server temperature under the linear control policy is also much higher.

Table 3 Comparison of average server temperature under linear and quadratic control vs. V

To guarantee server stability and reliability, we set the upper bound of the server temperature to 60 °C. As shown in Fig. 14, under the linear control policy the average server temperature exceeds the bound when V>1000, whereas under the quadratic control policy it does not exceed the bound until V is close to 5000. In terms of thermal management, the quadratic control policy is therefore more effective than the linear control policy.

Fig. 14 The average server temperature vs. V

The probability distribution functions of the server inlet temperature under the linear and quadratic control policies are displayed in Figs. 15 and 16, respectively. For practical relevance, only the curves for V=500, 1000, 5000, and 10000 are depicted. Both figures confirm that the server temperature rises with increasing V. Taking 60 °C as an example, the proportion of instantaneous server temperatures below 60 °C becomes smaller and smaller from V=500 to V=10000 (Fig. 15). Moreover, more than half of the server temperatures under the linear control policy violate the “soft” server temperature constraint of 60 °C, whereas under the quadratic control policy almost all of them are below 60 °C. Table 4 lists the probability that the instantaneous server temperature is below 60 °C. The quadratic control policy is clearly superior to the linear one in guaranteeing the “soft” server temperature constraint (≤60 °C).

Fig. 15 The probability distribution function of server instantaneous temperature under linear policy vs. V

Fig. 16 The probability distribution function of server instantaneous temperature under quadratic policy vs. V

Table 4 Probability of instantaneous server temperature below 60 °C vs. V
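
A probability of this kind can be estimated from a series of temperature samples as the fraction of samples that satisfy the constraint. The sketch below is one straightforward way to do so. The function name and the sample values are hypothetical and are not the simulation data behind Table 4.

```python
from typing import Sequence


def fraction_within_limit(temperatures: Sequence[float], limit: float = 60.0) -> float:
    """Empirical probability that an instantaneous temperature is <= limit."""
    if not temperatures:
        raise ValueError("at least one temperature sample is required")
    within = sum(1 for t in temperatures if t <= limit)
    return within / len(temperatures)


if __name__ == "__main__":
    # Hypothetical instantaneous server temperatures in °C.
    samples = [55.0, 58.0, 61.0, 63.0]
    print(fraction_within_limit(samples))
```

One minus this value gives the empirical violation proportion of the “soft” constraint.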

Conclusion

In this paper, we formulate the total energy minimization problem subject to the “soft” server temperature constraint. Based on the Lyapunov Optimization theory, we design linear and quadratic control policies to obtain a near-optimal solution. The “soft” server temperature constraint is translated into the mean rate stability of virtual queues. Furthermore, we evaluate the system performance through extensive simulations with various parameters. The results indicate that the quadratic control policy is closer to the optimum, both in saving energy and in complying with the temperature constraint. A weight parameter V balances energy consumption against the server temperature constraint. In our simulations, setting V around 5000 under the quadratic control policy gives the best trade-off: the proportion of energy saving is up to 41.83%, and the violation proportion of the “soft” server temperature constraint is 17.7%.