1 Introduction

In recent decades, climate change and global warming have been significantly exacerbated by the growth in energy demand from residential and non-residential buildings. The Global Alliance for Buildings and Construction reported that, in 2021, the building and construction sector was responsible for more than one third of global energy consumption (IEA 2021; GlobalABC 2021), despite the reduction in commercial and industrial activities caused by the COVID-19 pandemic. According to the International Energy Agency (IEA 2021), buildings are responsible for 17% and 10% of global direct and indirect CO\(_2\) emissions, respectively.

Heating, ventilation, and air conditioning (HVAC) systems are one of the main sources of energy consumption in buildings, representing more than 50% of their associated energy demand in developed countries (Pérez-Lombard et al. 2008). This consumption is potentially affected by a lack of precise control over these systems, whose proper and efficient operation is vital to ensure energy savings (Wang et al. 2020; Mawson and Hughes 2021; Gholamzadehmir et al. 2020).

Many strategies have been implemented to improve HVAC control efficiency. Most common control solutions are based on Rule-Based Controllers (RBC) combined with Proportional-Integral-Derivative (PID) control (Geng and Geary 1993), which stand out for their simplicity and efficiency (Borase et al. 2021). By using heuristic rules, these controllers can guarantee proper comfort temperature ranges while reducing energy consumption (ASHRAE 2021). However, this type of controller is far from offering optimal behavior: RBCs are mostly reactive, hardly cover the complexity of environments influenced by many variables (and therefore requiring many rules), and are guided by fixed, predetermined control sequences that rely heavily on expert knowledge (Wang and Hong 2020; Salsbury 2005). Generally, RBCs lack scalability, as they address building energy optimization at a local rather than a global level. A controller taking so many variables into account would result in an overly complex RBC whose rules would be practically impossible to generalize at the building level (Privara et al. 2013; Serale et al. 2018).

Model-based solutions, such as Model Predictive Control (MPC) (Yao and Shekhar 2021; Salakij et al. 2016; Morari and Lee 1999), are an alternative to reactive controllers. These controllers use physical models of buildings to simulate their thermal dynamics and analytically derive optimal HVAC control. MPC usually considers not only the characteristics of the environment but also other constraints and contextual data, such as occupancy or weather. This ability to characterize and predict environmental conditions makes MPC outperform reactive controllers (Yao and Shekhar 2021; Kümpel et al. 2021; Efheij et al. 2019). Nevertheless, MPC also suffers from certain limitations. In addition to the high computational power it demands, precise system calibration is necessary to achieve the expected performance, which does not scale easily since each building is unique in its layout and thermodynamic properties (Gomez-Romero et al. 2019). This poses a challenge in terms of cost-effectiveness, as many factors must be considered: materials, occupancy, end-use, location, orientation, etc. Consequently, relatively few buildings currently implement MPC strategies compared to ‘if-then-else’, ‘on/off’ or ‘bang-bang’ RBC controllers, as well as PID where digital control and variable frequency drives are available (Serale et al. 2018).

Given the shortcomings of these methods, Reinforcement Learning (RL) has been recently proposed as a viable alternative for complex control problems. RL is a computational learning method focused on the interaction of an agent with its environment, either real or simulated. This is an iterative learning method, based on trial and error, where a reward function makes the agent lean towards preferable actions or states. Therefore, the agent’s goal will be to discover which actions lead to the maximization of the expected reward (Sutton and Barto 2018). The combination of RL with deep neural networks has led to a growing application of Deep Reinforcement Learning (DRL) in numerous domains (Mnih et al. 2015; Gibney 2016; Gupta et al. 2021), including HVAC control (Barrett and Linder 2015; Azuatalam et al. 2020; Perera and Kamalaruban 2021; Yang et al. 2021; Fu et al. 2022). Accordingly, DRL can learn sophisticated control strategies from data (Biemann et al. 2021), generally obtained from building simulations, while using more computationally efficient building models than MPCs (Deng et al. 2022).

Nevertheless, as highlighted in Vázquez-Canteli and Nagy (2019) and Findeis et al. (2022), most DRL proposals for HVAC control in the literature pick one or a few algorithms without substantial motivation, lack a comprehensive analysis under controlled and assorted conditions, and cannot be easily reproduced. Motivated by this gap, this paper performs a comprehensive experimental evaluation of state-of-the-art DRL algorithms applied to HVAC control. The main contribution of this work is to offer insights on which algorithms are most promising in different building energy control scenarios, and what possibilities arise for their further improvement. To this aim, we focus on the performance of the algorithms (evaluation metrics, comfort-consumption trade-off, convergence, etc.), but also on their robustness against changing conditions and their capabilities for transference to different scenarios. The study relies on Sinergym, an open-source building simulation and control framework for training DRL agents (Jiménez-Raboso et al. 2021). Sinergym offers a standardized and flexible design that facilitates the comparison of algorithms under different environments, reward functions, and state and action spaces, as well as the replication of experiments.

The remainder of this paper is structured as follows. Section 2 reviews related work and highlights the novelty of our study. Section 3 introduces the main fundamentals of DRL and its application to HVAC control. Sections 4 and 5 describe the environment and the experiments conducted, with the subsequent discussion of the results in Sect. 6. Finally, Sect. 7 details the main conclusions derived from this research.

2 Related work and novelty

The application of DRL to HVAC control is a developing field that has generated significant interest and growth within a broad community of researchers. Although the first application of RL to building control dates back to Mozer (1998), the field has only gained substantial momentum in recent years, together with the rise of DRL.

Extensive research about the application of DRL in building energy control and HVAC can be found in recent literature reviews, such as Vázquez-Canteli and Nagy (2019), Yu et al. (2021), Wang and Hong (2020), Han et al. (2019), Leitao et al. (2020), Mason and Grijalva (2019), Rajasekhar et al. (2020), Zhang et al. (2018), Yang et al. (2020), Mocanu et al. (2018), where several applications, perspectives, and objectives within this field are summarized and studied in detail.

Focusing on the DRL methods employed, we find many applications of well-established algorithms, such as Deep Q-Networks (DQN) (Lissa et al. 2021; Yoon et al. 2019; Gupta et al. 2021; Sakuma et al. 2020), Deep Deterministic Policy Gradient (DDPG) (Gao et al. 2020; Zou et al. 2020), or actor-critic methods (Morinibu et al. 2019; Wang et al. 2017). However, we share the view of Biemann et al. (2021) on the current situation in this field: (1) the widespread use of DQN—instead of more modern algorithms such as Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC) or Proximal Policy Optimization (PPO)—in this continuous control problem could be hindering significant progress; and (2) these are generally isolated implementations that compare the performance of the chosen algorithm against a simplistic baseline and solve specific problems in well-delimited domains. In fact, there is little work benchmarking multiple DRL algorithms under standard conditions, as is common in other fields of machine learning (Vázquez-Canteli and Nagy 2019; Wölfle et al. 2020; Islam et al. 2017).

One of the few exceptions is the aforementioned work by Biemann et al. (2021), which develops a comparison between different DRL algorithms in a two-zone data center environment. This is a well-known and commonly used test scenario in DRL-based HVAC control (Moriyama et al. 2018; Zhang et al. 2019b), in which the objective is the management of the temperature setpoints and fan mass flow rates of a data center. Specifically, the algorithms compared were SAC, TD3, PPO, and Trust Region Policy Optimization (TRPO), evaluated in terms of energy savings, thermal stability, robustness, and data efficiency. Their results revealed the predominance of off-policy algorithms and, specifically, of SAC, which achieved consumption reductions of approximately 15% while guaranteeing comfort and data efficiency. Finally, robustness was remarkable when the DRL algorithms were trained using different weather models, which increased their generalization and adaptability.

Another remarkable and recent study is that of Brandi et al. (2022), which addresses the control of an HVAC system based on the charging and discharging of a refrigerated storage facility. In this case, two RBCs and an MPC controller were compared with two SAC implementations: one trained offline (in simulation) and the other online (on the real building). The authors obtained the best results for MPC and offline DRL, and observed that the online-trained SAC progressively approached the performance of its offline variant over time. A Long Short-Term Memory (LSTM) neural network architecture was used to predict the value of the variables composing the agents’ observations over a given prediction horizon. The authors conclude that DRL-based solutions are more unstable than MPC solutions, although they exhibit adaptation to recurrent patterns without explicit supervision.

Regarding generalization capacity and robustness, few papers address this issue. In Xu et al. (2020), transfer learning (Torrey and Shavlik 2010; Taylor and Stone 2009) is proposed to train and evaluate a DQN-based agent in a building layout, and then transfer it to a slightly different one. In this work, the controller uses two sub-networks: a front-end network, which captures building-agnostic behavior, and a back-end network, specifically trained for each building. This way, transfer learning avoids cold-starts and tuning in the second building, allowing for rapid deployment and greater efficiency. Different experiments are carried out, such as (1) transfer from n-zone to n-zone buildings with different materials and layouts; (2) transfer from n-zone to m-zone; and (3) transfer from n-zone to n-zone with different HVAC equipment. A second related work is Lissa et al. (2021), which focuses on spatial and geographical variations in HVAC control systems. Among their main achievements, we can find a significant reduction in the time required to reach comfort temperatures, although the authors do not delve into energy efficiency.

At this point, we can identify certain limitations and research gaps in the relevant but scarce literature addressing the experimental evaluation of DRL algorithms in HVAC control:

  • Studies such as Biemann et al. (2021) and Brandi et al. (2022) only provide results in a single environment. In contrast, benchmarking should ideally be performed over a sufficiently representative set of scenarios and test configurations.

  • Something similar happens with the application of transfer learning: the existing case studies are limited and use different environments, so they do not enable comparisons to be made. We believe its application should be explored in greater depth, specifically by promoting, as far as possible, the generalization capacity and robustness of different agents across a broader set of testing environments.

  • There are additional opportunities close to transfer learning that have not yet been addressed, such as the application of sequential learning. The objective here is to compare whether a controller that progressively learns to manage an HVAC system under different weather conditions can be more efficient than one directly trained on a fixed setting.

  • Finally, we consider it relevant to study how the manipulation of the reward function used by DRL agents affects their performance, and what consequences these changes may have in terms of consumption and comfort violation.

3 Theoretical background

3.1 Deep reinforcement learning

A reinforcement learning problem is defined as a Markov Decision Process (MDP) consisting of the following elements:

  • An agent, which learns from the interaction with its environment over a discrete sequence of time steps \({\mathcal {T}} = \{0,1,2,...\}\) and that pursues a certain objective.

  • An environment, defined as any element external to the agent. It is a dynamic process that produces relevant data for the agent. A state \(s_t \in {\mathcal {S}}\) represents the current situation of the environment at time step t.

  • A set of actions \({\mathcal {A}}\), which determine the dynamics of the environment. Every action \(a_t \in {\mathcal {A}}\) allows the agent to transition between states.

  • A reward function that evaluates the goodness of an action or state for the agent. The reward signal \(r_t \in {\mathbb {R}}\) drives the agent’s training.

  • A policy function \(\pi : {\mathcal {S}} \rightarrow {\mathcal {A}}\), which determines the actions to be taken by the agent in the face of different states.

The goal of an RL agent is to achieve optimal behavior towards accomplishing a task, which involves learning an optimal policy that maximizes the cumulative reward (Zhang and Yu 2020; Sutton and Barto 2018). This optimal policy can be obtained through dynamic programming in problems with relatively small action and/or state spaces. For more complex cases, there are a variety of algorithms based on the iterative improvement of the agent’s policy or the successive approximation of the expected reward of states and actions, e.g. Monte Carlo, SARSA, or the widely used Q-learning (Watkins and Dayan 1992).
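As an illustration of the tabular case, the sketch below shows a single Q-learning update with an ε-greedy behaviour policy. The state/action counts and hyperparameters are arbitrary placeholders, not values used elsewhere in this paper.

```python
import numpy as np

n_states, n_actions = 10, 4          # assumed sizes of a small discrete MDP
Q = np.zeros((n_states, n_actions))  # action-value table
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def q_learning_step(s, a, r, s_next):
    """Single Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(s):
    """Behaviour policy: explore with probability epsilon, otherwise act greedily."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))
```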

However, these methods generally become infeasible as the complexity of the environment increases, typically due to considerable growth in the number of actions or states that compose the environment, which can only be mitigated by ad hoc methods that are difficult to generalize. Here is where deep learning comes into play: neural networks can be applied to learn abstract representations of states and actions, as well as to approximate the expected rewards based on historical data. The combination of RL with deep neural networks has given rise to Deep Reinforcement Learning (DRL), which has led to numerous applications of reinforcement learning in real environments. Thus, DRL algorithms combine the best of both worlds, endowing reinforcement learning with the representational power, efficiency, and flexibility of deep learning (Zai and Brown 2020).

In recent years, we have observed several advances in the development of DRL algorithms that represent the state of the art, such as Deep Q-Networks (DQN) (Mnih et al. 2013), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al. 2015), Asynchronous Advantage Actor-Critic (A3C) and Advantage Actor-Critic (A2C) (Mnih et al. 2016), Trust Region Policy Optimization (TRPO) (Schulman et al. 2017b), Proximal Policy Optimization (PPO) (Schulman et al. 2017a), Twin Delayed DDPG (TD3) (Fujimoto et al. 2018), and Soft Actor-Critic (SAC) (Haarnoja et al. 2018). Table 1 summarises the main features and limitations of these algorithms.

Table 1 Some of the most widely used DRL algorithms, including their main features and limitations

As we will see in the following subsection, some of these algorithms are powerful tools for solving continuous control problems with an indefinite number of states and actions (Duan et al. 2016), as is the case of HVAC control.

3.2 HVAC control problem formulation

The feasibility of HVAC control through DRL was first demonstrated by using reduced action spaces and simplified building models (Zhang et al. 2019a; Moriyama et al. 2018). The objective of this problem is to find an optimal control policy that maximises the comfort of occupants while minimising energy consumption. The problem is formulated as a Partially Observable MDP (POMDP), and the goal is to find a policy that reaches as many desirable states as possible while avoiding energetically costly actions. Additional factors, such as energy price or the use of renewable energy sources, can also be considered in the optimization process. However, for the sake of simplicity, we will focus on comfort and consumption as the primary targets.

Based on a set of observed variables that define the ambient conditions of the environment (e.g., outdoor/indoor temperature, CO\(_2\) concentration, humidity, or occupancy), the following objectives are considered:

  • Regarding power demand P, we seek a policy that leads to its minimization:

    $$\begin{aligned} \pi ^* = \underset{\pi _\theta }{argmin}\sum ^T_{t=1}\ P_t \end{aligned}$$
    (1)
  • In the case of comfort, we try to minimize the distance C between the current state of the building, \(S_t\), and a target state, \(S_{target}\):

    $$\begin{aligned} \pi ^* = \underset{\pi _\theta }{argmin}\sum ^T_{t=1}\ C(S_t, S_{target}) \end{aligned}$$
    (2)

A state will be desirable if the variables that compose it are within certain pre-established preferences. Therefore, we say that a comfort violation occurs if the value of these variables differs from the established limits.

In this problem, P corresponds to the consumption of the building’s thermal equipment—e.g., heat pump, boiler—while \(C(S_t, S_{target})\) refers to the difference between the actual and desired temperatures in the controlled zones of the building.

Therefore, we can combine the minimization of power demand and the maximization of comfort in a single expression:

$$\begin{aligned} \pi ^* = \underset{\pi _\theta }{argmin}\sum ^T_{t=1} \omega \ C(S_t,S_{target}) + (1 - \omega ) \ P_t \end{aligned}$$
(3)

with \(\omega\) and \((1-\omega )\) being the weights assigned to comfort and power demand, respectively. We can now define our reward function such that:

$$\begin{aligned} r(S_t, A_t) = (1 - \omega ) \ \lambda _P \ P_t + \omega \ \lambda _C \ C(S_t, S_{target}) \end{aligned}$$
(4)

where \(P_t\) is the power demand of the HVAC system at time step t; \(C(S_t, S_{target})\) represents the distance from the current zone temperature to the desired comfort limits; \(\omega\) represents the weighting assigned to each part of the reward; and \(\lambda _P\) and \(\lambda _C\) are two constant factors used for scaling the magnitudes of power demand and temperature.

It is common to express this reward in negative terms, since its maximization is sought, which leads us to rewrite it as:

$$\begin{aligned} r(S_t, A_t) = -(1 - \omega ) \ \lambda _P \ P_t - \omega \ \lambda _C \ C(S_t, S_{target}) \end{aligned}$$
(5)

Note that the reward function employed will vary depending on the variables to be considered in the problem. As discussed later, the design of this function is one of the main decisions in DRL.

Finally, the actions to be performed by the agent will depend on the problem at hand. For example, in problems where the actions consist of adjusting the heating and cooling setpoints of the HVAC system, the action space may be discrete (a finite set of actions, each being a tuple of fixed setpoint values) or continuous (each setpoint is a real number to be adjusted). Related problems, such as air flow regulation (Raman et al. 2020) or lighting control (Chen et al. 2018), can be addressed with a similar setup.
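For illustration, both formulations of the setpoint action space could be declared with the Gym-style spaces API as follows. The setpoint values and bounds are arbitrary examples, not those of the environments used later in this paper.

```python
import numpy as np
from gym import spaces

# Discrete formulation: a finite catalogue of (heating, cooling) setpoint pairs, in degrees C.
SETPOINT_CATALOGUE = [(15.0, 22.5), (16.0, 23.5), (17.0, 24.5), (18.0, 25.5), (19.0, 26.5)]
discrete_space = spaces.Discrete(len(SETPOINT_CATALOGUE))

# Continuous formulation: each setpoint is a real value within an allowed range.
continuous_space = spaces.Box(
    low=np.array([12.0, 21.0], dtype=np.float32),   # [heating_min, cooling_min]
    high=np.array([23.5, 30.0], dtype=np.float32),  # [heating_max, cooling_max]
    dtype=np.float32,
)
```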

3.3 Simulation models

Most of the DRL solutions proposed in the literature are not directly trained on a real building, but make use of simulation tools. In the HVAC context, EnergyPlus (https://energyplus.net/) and Modelica (https://modelica.org/) are the most widely used frameworks. The use of this type of simulator is motivated by the fact that training a DRL agent in a real scenario would be too inefficient, as there is a need to establish a complete mapping between states, actions, and rewards, also considering extreme cases (Brandi et al. 2020). In fact, it has been observed that 20–50 days of training are needed to converge to an acceptable control policy (Fazenda et al. 2014; Costanzo et al. 2016; Vázquez-Canteli et al. 2019), thus making it impractical to train DRL agents directly while they are deployed.

Typically, simulators are combined with deep learning (e.g., TensorFlow, Keras, PyTorch) or DRL libraries (e.g., Stable Baselines, RLlib, TensorFlow Agents) to pre-train and test algorithms in simulated environments prior to deployment (Valladares et al. 2019; Vázquez-Canteli et al. 2019). However, communication between DRL agents and simulation engines is rarely straightforward, which leads us to tools such as Boptest (Blum et al. 2021), Energym (Scharnhorst et al. 2021), RL-testbed (Moriyama et al. 2018) or Sinergym. These tools enable the communication between DRL agents and energy simulators, providing the means to train and evaluate the algorithms in different settings.

In the case of Sinergym, this software acts as a wrapper for EnergyPlus, offering several utilities that enable DRL agent execution, data logging, the configuration of simulation environments, and the customization of states, actions and rewards. Its similarities and distinctive features with respect to other alternatives are detailed in Table 2.
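As an example of this wrapper role, the sketch below runs a random agent in a Sinergym environment through the classic Gym interaction loop. The environment identifier is illustrative, since the registered names depend on the installed Sinergym version.

```python
import gym
import sinergym  # importing Sinergym registers its EnergyPlus-backed environments with Gym

# Illustrative environment ID; actual registered names depend on the Sinergym release.
env = gym.make("Eplus-5Zone-hot-continuous-v1")

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()           # a random agent, for illustration only
    obs, reward, done, info = env.step(action)   # one simulation step (classic Gym 4-tuple API)
env.close()
```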

Table 2 Frameworks for DRL-based building energy optimization

4 Data and environments

All the experiments were formulated and executed using the Sinergym framework (Jiménez-Raboso et al. 2021). In the following subsections we present the configuration details, including state and action spaces, reward functions and weather data.

4.1 Building models

The following building models, included in Sinergym, were chosen as test scenarios:

  • 5ZoneAutoDXVAV (Wei et al. 2017; Ding et al. 2020). This is a single-story rectangular office building with dimensions of 30.48 m \(\times\) 15.24 m (see Fig. 1a). The building is divided into five zones: four exterior zones and one interior zone. The average height of the building is 3.048 m and the total surface area is 463.6 m\(^2\). It is equipped with a packaged Variable Air Volume (VAV) unit with a Direct Expansion (DX) cooling coil and gas heating coils, with fully auto-sized inputs, as the HVAC system to be controlled.

  • 2ZoneDataCenterHVAC (Moriyama et al. 2018; Zhang et al. 2019b; Li et al. 2019; Biemann et al. 2021). A single-story data center with a surface of 491.3 m\(^2\) (see Fig. 1b). The building is divided into two asymmetrical zones (east and west). Each data center zone has an HVAC system consisting of an air economizer, direct and indirect evaporative coolers, a single-speed DX cooling coil, a chilled-water coil, and a VAV unit with a no-reheat air terminal.

Fig. 1 Building models used as benchmark environments

In both cases, we aim to regulate indoor temperatures by balancing comfort and power demand while also considering the influence of one on the other.

4.2 Weather

Given the importance of weather conditions in building control strategies (Ghahramani et al. 2016), the experimentation was conducted considering three weather types integrated in Sinergym, based on Typical Meteorological Year (TMY3) data obtained from the U.S. Department of Energy: Prototype Building Models | Building Energy Codes Program (2021). Note that these climates vary in their average temperatures and humidity levels, thus providing a diverse and representative test set. These are:

  • Hot dry: climate corresponding to Arizona (USA), with an average annual temperature of 21.7 °C and an average annual relative humidity of 34.9%.

  • Mixed humid: climate corresponding to New York (USA), with an average annual temperature of 12.6 °C and an average annual relative humidity of 68.5%.

  • Cool marine: climate corresponding to Washington (USA), with an average annual temperature of 9.3 °C and an average annual relative humidity of 81.1%.

Sinergym offers the possibility of adding random variations to the weather data between episodes by using Ornstein–Uhlenbeck (OU) processes (Benth and Šaltytė-Benth 2005; Jiménez-Raboso et al. 2021). This enables a more varied training aimed at preventing overfitting and improving generalization. Thus, in this work, stochasticity was achieved by applying OU with the following parameters: \(\sigma =1.0\), \(\mu =0\) and \(\tau =0.001\).
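A discretized Ornstein–Uhlenbeck process with these parameters can be simulated as in the sketch below, which uses a standard Euler–Maruyama update; it is an illustrative approximation rather than Sinergym's internal implementation.

```python
import numpy as np

def ou_series(n_steps, sigma=1.0, mu=0.0, tau=0.001, dt=1.0, x0=0.0, rng=None):
    """Euler-Maruyama simulation of an Ornstein-Uhlenbeck process:
    x_{t+1} = x_t + tau * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(1, n_steps):
        x[t] = x[t - 1] + tau * (mu - x[t - 1]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

# Additive noise for one weather variable (e.g., outdoor dry-bulb temperature)
# over a year of 15-minute steps.
noise = ou_series(35_040)
```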

4.3 Observation and action spaces

At each simulation step, the agent receives information from the environment in the form of an observation \(o_t \in {\mathcal {O}} \subset {\mathcal {S}}\). The information received consists of a subset of variables that provide information about the current state of the building, including comfort and consumption metrics, occupancy, time information, indoor and outdoor temperatures, humidity, wind, or solar radiation. These variables may vary in number or zone depending on the building we are trying to control. However, EnergyPlus uses common identifiers shared by different buildings, which facilitates the management of this information.

Table 3 contains the variables that make up an observation for 5ZoneAutoDXVAV and 2ZoneDataCenterHVAC, as well as their ranges. Both the observed and controlled variables are extracted from the simulation of each building, with 20 for 5ZoneAutoDXVAV and 29 for 2ZoneDataCenterHVAC. Depending on the building and simulation conditions, each variable can take values in a wide range that cannot be determined a priori. Therefore, most of the variables are assumed to take values in \([-5e6, 5e6]\), as these are the minimum and maximum limits of the simulator outputs. However, all the observation values were normalized to [0, 1] using the normalization wrapper provided by Sinergym.
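A minimal version of this min-max normalization is sketched below; Sinergym ships it as a ready-made wrapper, so this stand-alone function is only illustrative.

```python
import numpy as np

def normalize_observation(obs, low, high):
    """Scale each observation variable to [0, 1] given its known (or assumed) range."""
    obs = np.asarray(obs, dtype=np.float64)
    return np.clip((obs - low) / (high - low), 0.0, 1.0)

# Example: three variables assumed to lie in the simulator-wide default range [-5e6, 5e6].
low, high = -5e6, 5e6
print(normalize_observation([21.5, 0.0, 1200.0], low, high))
```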

Table 3 Observation variables in 5ZoneAutoDXVAV and 2ZoneDataCenterHVAC

Note that a comprehensive analysis of the variables required for training will depend on the problem’s context and the scope of the controller’s application. In this case, in order to train all agents with as much information as available, we did not conduct a detailed selection of these observation variables.

Regarding the action space, each action consists of a set of values representing the HVAC heating and cooling setpoints. In the case of 5ZoneAutoDXVAV, an action \(a_t \in {\mathbb {R}}^2\) involves the selection of two values representing the heating and cooling setpoints for the whole building. Conversely, for 2ZoneDataCenterHVAC, an action \(a_t \in {\mathbb {R}}^4\) consists of four values corresponding to the selected setpoints for the east and west zones of the facility. Their identifier, definition, and ranges are displayed in Table 4.

Table 4 Action variables for 5ZoneAutoDXVAV and 2ZoneDataCenterHVAC

Finally, based on several tests, we opted for a continuous action space instead of a discrete one. A discrete action space can overly constrain the agent’s behaviour if the number of allowed actions is limited. In addition, the use of fixed setpoints can reduce the agent’s flexibility by explicitly imposing expert knowledge. In our case, while expert knowledge is present in the range of values that these variables can take, there are no restrictions within this range, allowing each setpoint to take arbitrary continuous values, as long as the heating setpoint remains below the cooling setpoint. Lastly, a continuous action space covers the range of a discrete one, which means that the possibility of choosing similar actions is not lost, while offering a wide margin for improvement.
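One simple way to keep arbitrary continuous actions consistent with the constraint above is to post-process each proposed pair of setpoints, as in the hypothetical helper below (this is not part of Sinergym, and the minimum gap is an assumed value).

```python
def enforce_setpoint_order(heating, cooling, min_gap=0.5):
    """Ensure the heating setpoint stays below the cooling setpoint by at least `min_gap` degrees C.
    If the agent proposes an inconsistent pair, it is split symmetrically around the midpoint."""
    if heating > cooling - min_gap:
        mid = (heating + cooling) / 2.0
        heating, cooling = mid - min_gap / 2.0, mid + min_gap / 2.0
    return heating, cooling

# Example: an inconsistent proposal (24.0, 23.0) becomes (23.25, 23.75).
print(enforce_setpoint_order(24.0, 23.0))
```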

4.4 Reward

For the reward calculation, we chose the linear reward function included by default in Sinergym:

$$\begin{aligned} r_t = - (1 - \omega ) \ \lambda _P \ P_t - \omega \ \lambda _C \ (\mid T_t - T_{up} \mid _+ + \mid T_t - T_{low} \mid _+) \end{aligned}$$
(6)

This formula is analogous to the one shown in Eq. 5, where \(\omega\) is the weight assigned to comfort and, consequently, \(1 - \omega\) the weight assigned to power consumption; \(\lambda _P\) and \(\lambda _C\) are both scaling constants; \(P_t\) is the total HVAC power demand rate of the facility (measured in W); \(T_t\) is the current indoor temperature (in °C); and \(T_{up}\) and \(T_{low}\) are the limits of the target comfort range. Thus, the greater the consumption, or the further the indoor temperature lies outside the comfort range, the worse the reward and the more the agent is penalized.
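A stand-alone transcription of Eq. 6 is sketched below. It is a simplified illustration rather than Sinergym's actual reward class, and the scaling constants are placeholder values.

```python
def linear_reward(power, temp, t_low, t_up, omega=0.5, lambda_p=1e-4, lambda_c=1.0):
    """Linear reward of Eq. 6: negative weighted sum of power demand and clipped comfort violation.
    `power` in W, temperatures in degrees C; lambda_p and lambda_c are placeholder scaling factors."""
    comfort_violation = max(temp - t_up, 0.0) + max(t_low - temp, 0.0)
    return -(1.0 - omega) * lambda_p * power - omega * lambda_c * comfort_violation

# Example: a warm zone (27.5 degrees C against an upper limit of 26) with a 4 kW demand.
print(linear_reward(power=4000.0, temp=27.5, t_low=23.0, t_up=26.0))
```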

The linear reward allows us to maintain a real trade-off between comfort and consumption. Thus, assuming precise scaling factors, the preference for each reward term is determined solely by the value of \(\omega\), and no term is more important than the other by default.

Recent studies such as (Kadamala et al. 2024) address the issue of reward selection without reaching a clear consensus on which option is more effective in practice, so we opt for the most equitable reward definition. Other alternatives may lead to a bias towards a particular reward term. For example, this is the case with Sinergym’s exponential reward, where deviations from target temperatures have a greater impact than increases in power consumption.

For the majority of the experiments in Sect. 5, we use the default value of \(\omega = 0.5\), which involves seeking a complete balance between comfort and power requirements. We also analyse the effect of using different weightings in Sect. 5.6.

4.5 Algorithms

We briefly summarize the DRL algorithms and rule-based controllers used throughout the experimentation. The proposed control problem is addressed from a single-agent perspective, since multi-agent HVAC control is mostly applied to building communities (Pinto et al. 2021; Vazquez-Canteli et al. 2020) or to problems where the action space is too large for a single agent (Fu et al. 2022). For the environments considered in this work (with a maximum of two controlled zones in 2ZoneDataCenterHVAC), such control is not strictly required.

The DRL algorithms selected were PPO, TD3, and SAC. As it is beyond the scope of this paper to go into their theoretical details, we refer the reader to the original publications where these are presented in detail: Schulman et al. (2017a) (PPO), Fujimoto et al. (2018) (TD3), and Haarnoja et al. (2018) (SAC).

PPO is an improvement over the on-policy algorithm TRPO. On-policy algorithms have a single policy that improves progressively according to the exploration of states and/or actions. These algorithms are generally sample-inefficient due to the data discarded when updating their policy. This may hinder their performance in domains such as HVAC control, where data collection is slow, as detailed in Biemann et al. (2021). This is an assumption that we will empirically test in our experiments.

On the other hand, we have off-policy algorithms such as TD3 and SAC, which are more sample-efficient and can therefore be expected to produce better results in the HVAC control problem. Off-policy algorithms are based on two policies: a behaviour policy, which is exploratory, and a target policy, which is the one actually used by the agent and is based on the knowledge gathered by the exploratory one. Thus, while TD3 is an improved version of DDPG commonly used in HVAC control (Li et al. 2019; Gao et al. 2020; Biemann et al. 2021; Fu and Zhang 2021), SAC offers a promising alternative in this domain, as shown by recent studies such as (Brandi et al. 2022; Yu et al. 2020; Biemann et al. 2021; Coraci et al. 2021).

Regarding the rule-based controllers used as baselines, the following approaches produced the best results for each building:

  • For 5ZoneAutoDXVAV, a controller using static setpoints that depend on the season of the year, following the ASHRAE standard (ASHRAE 2004) for thermal comfort in dwellings. The static setpoints [26, 29] °C are set for the period from June to September, while the setpoints [20, 23.5] °C are maintained for the rest of the year.

  • In the case of 2ZoneDataCenterHVAC, an integral control based on degree-by-degree correction of the current setpoints, depending on whether the temperature of each zone is within the specified range. This range corresponds to ASHRAE standard (ASHRAE 2016) for recommended temperature ranges in data centers, which establishes [18, 27] °C as reference setpoint values in these facilities.

Although these RBCs are able to guarantee proper control in the proposed scenarios, it should be noted that (1) they are not easily scalable, as they have to be defined manually; (2) they are not able to consider wide optimization horizons, as the rules used are essentially reactive; and (3) in more complex scenarios (e.g. larger number of variables to be controlled), their performance could be significantly compromised.

For an in-depth study on how these RBCs are implemented, we refer the reader to the corresponding Sinergym controllers module (https://github.com/ugr-sail/sinergym/blob/main/sinergym/utils/controllers.py), where their source code can be consulted.

5 Experiments

The following subsections detail the different experiments conducted. The objective pursued was to thoroughly study (1) which algorithms are able to guarantee better thermal control; (2) which DRL algorithms offer higher robustness to weather conditions for which they have not been trained; (3) the application of sequential learning and its effectiveness to obtain better controllers, and (4) the comfort-consumption trade-off and its influence on the performance of the controllers.

Although the most significant results will be illustrated graphically, space limitations have led us to transfer part of the complementary graphic material to an external repository: https://github.com/ugr-sail/paper-drl_building, available for detailed analysis of simulation data, code, and additional charts.

5.1 Methodology

Figure 2 summarizes the experimentation phases followed in this work. First, the performance of RBCs and DRL algorithms was compared in the 5ZoneAutoDXVAV and 2ZoneDataCenterHVAC buildings under three different weather types, that is, six different environments.

Fig. 2 Proposed experiments

Following this comparison, further experiments were carried out under more complex settings. On the one hand, the best DRL agent obtained in the previous step was used to test how well it adapted to weather conditions different from those used during its training. We refer to this experiment as the “robustness test”, as it allowed us to evaluate the agent’s ability to generalise to situations not experienced during training.

Subsequently, we explored the application of sequential learning by progressively training an agent over multiple weather conditions, comparing its performance with that of an agent specialised in a single environment.

Finally, we compared the performance of the best DRL controllers of each building under different definitions of the reward function, mainly by varying the weights of the comfort and consumption terms and observing the results obtained.

5.2 Experimental settings

Both the training and evaluation of the DRL algorithms were executed using the Google Cloud computing platform and Sinergym version 1.8.2 (https://pypi.org/project/sinergym/1.8.2/). Regarding simulation time, each episode corresponds to one year of building simulation and consists of 35,040 time steps (i.e., 365 days \(\times\) 24 h/day \(\times\) 4 timesteps/hour), resulting in an observation sample and subsequent setpoint adjustment every 15 min. This is an appropriate value for the problem addressed, although it can be configured as desired within Sinergym settings.

In addition, the evaluation metrics common to all experiments and used to compare the agents are the following:

  • Mean episode reward: calculated as the arithmetic mean of the rewards obtained in each time step of an episode.

  • Mean power demand: mean electricity demand rate of the building’s HVAC system (in W), as provided by EnergyPlus.

  • Comfort violation: percentage of time during which the ambient temperature is outside a desired comfort range, and mean value (in °C) of temperature deviations from comfort limits. In the case of 5ZoneAutoDXVAV, the comfort ranges used were: [23, 26] °C for June to September, and [20, 23.5] °C for the rest of the year. Meanwhile, for 2ZoneDataCenterHVAC, a single comfort range was used: [18, 27] °C, as recommended by ASHRAE’s standard (ASHRAE 2016), in order to ensure the safety of the building’s equipment.

Other metrics, such as the evolution of indoor and outdoor temperatures, will also give an insight into how the control adjusts between setpoints and how performance evolves, as detailed below.
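For reference, these metrics can be computed from per-timestep logs roughly as in the sketch below, which assumes arrays of rewards, power readings and zone temperatures, and a fixed comfort range for simplicity (in 5ZoneAutoDXVAV the range actually changes seasonally).

```python
import numpy as np

def episode_metrics(rewards, power_w, temps, t_low, t_up):
    """Compute the evaluation metrics for one episode from per-timestep logs."""
    temps = np.asarray(temps, dtype=np.float64)
    deviation = np.maximum(temps - t_up, 0.0) + np.maximum(t_low - temps, 0.0)
    violated = deviation > 0.0
    return {
        "mean_reward": float(np.mean(rewards)),
        "mean_power_demand_w": float(np.mean(power_w)),
        "comfort_violation_pct": 100.0 * float(np.mean(violated)),
        "mean_violation_degC": float(deviation[violated].mean()) if violated.any() else 0.0,
    }
```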

5.3 Comparison between control algorithms

The first experiment conducted was a comparison between DRL algorithms and RBCs. This involved training and subsequently evaluating the DRL algorithms in different environments, in order to identify which algorithms perform better in each of them. The environments used for training and evaluation consisted of all combinations of buildings (5ZoneAutoDXVAV, 2ZoneDataCenterHVAC) and weathers (hot dry, mixed humid, and cool marine), resulting in six different scenarios.

The configuration settings for this experiment were as follows:

  • Machine used for training: a Google Cloud machine type (https://cloud.google.com/compute/docs/machine-types) equipped with a 2.2 GHz Intel Cascade Lake processor, 8 vCPUs, and 8192 MiB of RAM.

  • Number of training episodes: 20. This was found to be a sufficient number of episodes for all algorithms to achieve convergence.

  • Frequency of evaluation: 4. Specifies the number of training episodes after which an evaluation and selection of the best model is performed.

  • Evaluation length: 3. Refers to the number of episodes used to perform the evaluations.

  • Random seed: 42. This seed was used in all experiments to facilitate the replication of the results.

  • Normalization of the observations in [0, 1], using the Sinergym normalization wrapper.

  • Multiple combinations of hyperparameters were tested until finding those that offered the best results for each DRL algorithm and environment combination, as detailed in Appendix 1. Endorsing the idea of Brandi et al. (2020) and Agarwal et al. (2021), the choice of hyperparameters in DRL is a complex task that greatly affects the training process of the agents, so candidate configurations must be tested over several episodes and their performance compared. This philosophy was followed until acceptable sets of hyperparameters were found. A minimal sketch of this training and periodic-evaluation setup is shown after this list.
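The following sketch illustrates how such a run could be assembled with Stable-Baselines3, using an evaluation callback that mirrors the frequency, length and seed listed above. The environment identifier and save path are illustrative assumptions.

```python
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback

# Illustrative environment name; actual IDs depend on the Sinergym release.
train_env = gym.make("Eplus-5Zone-hot-continuous-v1")
eval_env = gym.make("Eplus-5Zone-hot-continuous-v1")

steps_per_episode = 35_040  # one simulated year at 15-minute resolution

# Evaluate every 4 training episodes over 3 evaluation episodes, keeping the best model.
eval_callback = EvalCallback(
    eval_env,
    n_eval_episodes=3,
    eval_freq=4 * steps_per_episode,
    best_model_save_path="./best_model/",
)

model = SAC("MlpPolicy", train_env, seed=42)
model.learn(total_timesteps=20 * steps_per_episode, callback=eval_callback)
```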

Regarding the training process, Fig. 3 shows how SAC offers the best results in all 5ZoneAutoDXVAV scenarios, with early convergence to its optimal values. TD3 also shows fast convergence, although without surpassing SAC, and offers results similar to, or even slightly lower than, those of PPO, whose convergence requires a greater number of training episodes.

Fig. 3 Mean episode rewards during 20 training episodes in 5ZoneAutoDXVAV

Regarding 2ZoneDataCenterHVAC, Fig. 4 shows a higher instability of TD3 in the first episodes, followed by a better performance than SAC and PPO. In this environment, SAC again demonstrates early convergence, outperforming PPO in the cool and hot climates and equaling it in the mixed case.

Fig. 4 Mean episode rewards during 20 training episodes in 2ZoneDataCenterHVAC

Once the algorithms were trained and reached convergence, they were evaluated during 20 episodes using the best model obtained during training. Figures 5 and 6 show the results obtained for each environment, considering the metrics already presented in Sect. 5.2: mean rewards per episode, power consumption and comfort violation. We will analyze the results obtained taking as baseline the performance of both RBCs and random agents (RAND). The values represented in the boxplots are summarized in Appendix 2.

Fig. 5 Results for 20 evaluation episodes in 5ZoneAutoDXVAV

In the case of 5ZoneAutoDXVAV, as shown in Fig. 5, the best comfort-consumption balance is guaranteed by the RBC. Since this is an ad hoc solution implemented for each building, we are interested in knowing which DRL algorithm comes closest to its performance. In this case, as already observed during training, SAC is the closest agent to the RBC in all climates, followed by PPO and TD3, which obtain similar mean rewards in all scenarios. Thus, none of the agents manages to outperform the RBC with regard to this metric.

Considering the comfort and consumption values, we observe that RBC’s main competitor, SAC, manages to reduce power demand without major penalties in comfort violation, especially in hot weather. On the other hand, TD3 is the most energetically costly agent, while PPO maintains a significant balance between both metrics without demonstrating outstanding performance.

Fig. 6 Results for 20 evaluation episodes in 2ZoneDataCenterHVAC

Looking at Fig. 6, corresponding to 2ZoneDataCenterHVAC, we find greater competitiveness with respect to the RBC, which is outperformed in all scenarios by TD3 in terms of reward. TD3 achieves, in turn, significant power savings with respect to the other agents, albeit at the cost of a greater comfort violation. This could be less desirable in a building where the temperature must always be kept within a specific range for safety reasons, in which case more conservative options such as SAC, PPO or the RBC itself could be more suitable.

When we refer to conservative solutions, we mean those that offer indoor temperatures further away from the limits of comfort ranges. This makes it possible to deal with temperature inertia at the cost of investing more power in ensuring stable temperatures. For example, if we compare the indoor temperatures throughout the simulation for TD3 (Fig. 7) and RBC (Fig. 8), we observe that the former temperatures are closer to the upper comfort limit than those of the RBC. This behavior may be interpreted as the DRL agent trying to optimize consumption by taking a higher risk in ensuring comfort, thus leading to indoor temperatures closer to the limit allowed by the reward function.

A more reliable alternative for ensuring compliance with comfort limits might involve modifying the reward function, giving greater importance to comfort over consumption. This approach is specifically addressed in Sect. 5.6.

Fig. 7 Evolution of temperatures during a year of simulation using TD3 in 2ZoneDataCenterHVAC-hot. Comfort thresholds are marked with horizontal dotted red and blue lines. (Color figure online)

Fig. 8 Evolution of temperatures during a year of simulation using RBC in 2ZoneDataCenterHVAC-hot. Comfort thresholds are marked with horizontal dotted red and blue lines. (Color figure online)

Finally, it is worth noting the difficulties of the DRL algorithms in adapting to comfort ranges that vary at specific periods of the episode. This is the case of 5ZoneAutoDXVAV and the variations in the comfort ranges defined for the warm and cold months, as explained in Sect. 5.2. For instance, Fig. 9 shows how the sudden change in the desired temperature range poses difficulties for SAC, which does not adapt well enough to the new comfort requirements.

Fig. 9 Evolution of temperatures during a year of simulation using SAC in 5ZoneAutoDXVAV-mixed. Comfort thresholds are marked with horizontal dotted red and blue lines. (Color figure online)

Having provided a first insight into the performance of the different algorithms, in the following subsections we will address issues related to performance improvement and generalization capabilities of DRL agents.

5.4 Robustness test

We now test the performance of agents executed in environments that differ from those in which they were trained. This allows us to assess how far an agent is able to generalize and extrapolate the knowledge acquired in a given climate to a different one.

According to Fig. 10, we observe that the agents that perform best in each climate are those that have been trained in that same climate, as would be expected. After applying Mann–Whitney U tests, we obtain significant differences between the rewards obtained in each test environment (\(p < 0.05\) in all cases).
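As an illustration of this test, the snippet below applies the Mann–Whitney U test to two hypothetical samples of per-episode evaluation rewards; the numbers are made up solely to show the procedure.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-episode evaluation rewards for two agents in the same test climate.
rewards_trained_here = [-0.21, -0.20, -0.22, -0.19, -0.21]
rewards_trained_elsewhere = [-0.26, -0.25, -0.27, -0.24, -0.26]

stat, p_value = mannwhitneyu(rewards_trained_here, rewards_trained_elsewhere, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")  # p < 0.05 would indicate a significant difference
```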

In all cases, the following order of performance holds: cool is the climate for which the greatest rewards are obtained, followed by mixed and hot. Even the agents trained in mixed and hot climates obtain a higher reward in cool than in their respective training climates, and the agent trained in the hot climate obtains a worse reward in its own climate than in the others.

Fig. 10 Evaluation rewards for SAC agents trained in different weathers, compared with RBC mean rewards in 5ZoneAutoDXVAV

As a summary, Table 5 gathers the results obtained for each SAC training and evaluation combination in 5ZoneAutoDXVAV, which we will further discuss in Sect. 6.2.

Table 5 Rewards for every training and evaluation combination using SAC in 5ZoneAutoDXVAV

5.5 Sequential learning

Another objective of this work is to address the application of sequential learning, an approach that involves the progressive training of an agent under different weather conditions. Our goal is to verify whether an agent trained progressively across different environments is capable of achieving better, or at least similar, performance to an agent trained on a single problem.

Thus, based on the results obtained in Sect. 5.3, we compare the performance of an agent trained directly on a single climate with that of one trained sequentially on every available climate. We use SAC under the same configurations described in the previous experiments. The training sequence considered is: (1) cool, (2) mixed, and (3) hot.

As shown in Fig. 11, the performance of SAC trained under sequential learning is mostly inferior to that of standard SAC according to the proposed metrics.

Fig. 11 Results for SAC and sequential learning SAC in 5ZoneAutoDXVAV

It is noteworthy that the best performance obtained by SAC with sequential learning occurs in hot weather, which is actually the last weather used to train the agent. This could lead us to believe that the phenomenon of “catastrophic forgetting” (French 1999) might occur, as we will discuss in Sect. 6.3.

5.6 Comfort-consumption trade-off

As anticipated in Sect. 5.3, the chosen reward function, as well as the weights given to comfort and consumption during training, is expected to influence the agents' performance.

In this section, we compare how different weights for comfort and consumption influence the performance of the agents, with all other conditions unchanged. Specifically, building on the results described in Sect. 5.3, we use the best performing agent on each building (SAC in 5ZoneAutoDXVAV, and TD3 in 2ZoneDataCenterHVAC) to compare the performance of the following agents:

  • An agent trained only on the basis of the comfort requirements (\(\omega = 1\) in Eq. 6), regardless of consumption.

  • Another trained with a weight of 75% for consumption and 25% for comfort (\(\omega = 0.25\)).

  • An agent trained with the same weight for consumption and comfort (\(\omega = 0.5\)).

  • And, finally, an agent trained with a weight of 75% for comfort and 25% for consumption (\(\omega = 0.75\)).

Fig. 12 Results for SAC with different comfort weights in 5ZoneAutoDXVAV

Fig. 13 Results for TD3 with different comfort weights in 2ZoneDataCenterHVAC

The results obtained are shown in Figs. 12 and 13. From the figures we observe that a greater emphasis on comfort in the reward function reduces comfort violations, as would be expected. Improvements are especially significant when moving from a 25% to a 50% comfort weight, and become smaller as the comfort weight increases further (although they are always improvements).

However, there are two exceptions where assigning a 100% weight to comfort results in a slight worsening with respect to a 75% weight: SAC in the hot climate and TD3 in the cool climate. In Sect. 6.4 we discuss possible causes for these exceptions.

Finally, improving comfort means—apart from the exceptions mentioned above—an increase in power demand. In this multi-objective problem, the trade-off is disrupted as soon as one objective is given greater importance than the other.

6 Discussion

In this section, we will discuss the results obtained after experimentation, not only describing them but also identifying gaps to be addressed in future work.

6.1 Rewards and comparison between algorithms

We begin by analyzing the results obtained by the different control algorithms in the set of proposed environments. Overall, the results are consistently stable, with minimal variation over the 20 evaluation episodes. This stability across multiple evaluation episodes allows us to confirm that no random factors significantly affect the agents' results (see Appendix 2).

As shown in Sect. 5.3, in the case of 5ZoneAutoDXVAV, no DRL algorithm was able to outperform the RBC considering a reward that gives equal importance to comfort and consumption (50–50%). It should also be noted that the buildings used in this experiment involve relatively simple control, which can be easily translated into a limited set of rules. However, as discussed in Sect. 5, the results are particularly promising when we consider that there was neither an exhaustive selection of observed variables nor an in-depth choice of hyperparameters.

In Fig. 9 we also observe how DRL algorithms have problems adapting to setpoints that change during certain periods of the simulation, such as the warm months. If we also consider that the higher the temperature, the more difficult it is to guarantee the comfort-consumption balance (see the rewards for hot weather environments in Fig. 5), the loss of performance is self-evident. This loss of performance can be caused by external factors that affect the agent's ability to act, such as building characteristics or an increased power demand from the HVAC system for cooling.

On the contrary, looking at the results in 2ZoneDataCenterHVAC, TD3 manages to outperform RBC in terms of average reward. The fact that a DRL agent is able to outperform an RBC in 2ZoneDataCenterHVAC, but not in 5ZoneAutoDXVAV could be justified by the greater complexity of the control problem, as the latter may require a more sophisticated strategy, involving a larger number of control variables.

As we anticipated earlier, the main reason why TD3 gets a better reward is because of the large reduction in consumption compared to RBC. However, this seems to imply a higher comfort violation, which is ultimately profitable for the agent.

An important aspect to question at this point is under what conditions we consider a comfort violation to occur. If we look at Fig. 7, we can see that although many comfort violations occur for TD3 throughout the year, their magnitude is quite low, mainly due to temperature inertia combined with the fact that power savings imply approaching the comfort limits. Therefore, if we only looked at Fig. 6, ignoring the actual temperatures and deviations could lead us to believe that TD3 significantly sacrifices comfort in exchange for reducing consumption, thus reaching critical temperatures, which is not the case. As evidence, Fig. 14a shows the temperature violations incurred by TD3 with respect to the comfort ranges in 2ZoneDataCenterHVAC, proving that the temperature deviations exceed the upper comfort limit by less than 0.7 °C and never fall below the lower limit.

Fig. 14 Comfort violations by TD3 and SAC in different scenarios

Something similar occurs in the case of the SAC in 5ZoneAutoDXVAV. Looking at Fig. 14b, we observe that the comfort violations above the upper limit reach a maximum of about 1.6 °C, while the comfort violations below the lower limit reach about 2.4 °C. In the latter case, the comfort violations are higher because the agent adapts with difficulty to the change of the comfort range during summer months, usually staying below the desired minimum temperature.

Finally, we should not overlook a common factor related to the reward function used for both buildings: the values of the scaling factors \(\lambda _P\) and \(\lambda _C\) (see Eq. 5), which were chosen empirically. This may lead to slight inaccuracies in the reward calculation. Nevertheless, assuming the same values for all algorithms allows the results to be compared under similar conditions.

6.2 Robustness and generalization

The results of the robustness and generalisation experiments yield some interesting insights. On the one hand, we confirm that the agents that performed best in a given environment were those that had been trained in the same environment. That is to say, an agent executed in an environment different from the one in which it was trained reported a loss of performance in terms of average reward.

On the other hand, the rewards obtained in each test environment (see Fig. 10) show that the climate associated with the highest reward was cool, followed by mixed and hot. However, these values should be interpreted with caution: the fact that the rewards obtained in the hot climate were lower for each agent than in the cool climate does not necessarily mean that the agents are unable to perform properly in hot environments, but may be due to external factors. For example, a reward of around \(-\)0.4 in hot weather could be comparable to one of around \(-\)0.2 in cool weather simply because the HVAC system consumes less when heating than when cooling, which is beyond the agents’ ability to act. This leads us to suggest that a column-by-column study of Fig. 10 may not be as informative as a row-by-row comparison of the rewards.

These results also lead us to question whether a specific and fixed climate is the best approach when training DRL algorithms in this domain. Future work should try to clarify whether the differences in the control exerted by agents trained in different climates are really significant (we are currently dealing with variations of hundredths in the mean rewards), as well as address alternatives to training based on single climates. For example, an alternative may involve using a dataset that pools records from several different climates, and from which one climate observation is randomly sampled at each time step. This idea has already been addressed in other works, such as Du et al. (2021), where a DRL agent was trained by sampling observations from heating and cooling scenarios, thus leading to more adaptive action strategies. Consequently, this solution may greatly enhance agents' generalization during training.

6.3 Effectiveness of sequential learning

The search for an improvement in training by applying sequential learning also raises important issues regarding its application and results. As shown in Fig. 11, this approach resulted in a performance degradation with respect to standard SAC in each of the climates. Thus, the proposed method did not lead to the expected performance improvements.

A notable finding is that the agent trained with sequential learning shows the smallest loss of performance in the hot climate, which leads us to question whether this is caused by the fact that it was the last climate used for training. If so, we would be facing the aforementioned problem of catastrophic forgetting, which implies a bias in the agent's learning towards the most recent training data, discarding what was learned for previous climates.

In fact, if we compare SAC trained with sequential learning with the rest of the SAC agents trained in single climates (Fig. 15), we notice that its rewards are always worse, with the exception of the hot climate, where its results are close to those of SAC trained in mixed and superior to those of SAC trained in cool. Again, this improved control in the hot climate could be a direct consequence of catastrophic forgetting, as it is the last climate used in the agent's training, partially overriding what the agent learned for the cool and mixed environments.

Fig. 15 Evaluation rewards for SAC trained with sequential learning in all climates, and SAC trained in single climates, compared with RBC mean rewards in 5ZoneAutoDXVAV

To evaluate whether this phenomenon occurs, we tested the performance of an agent trained with sequential learning following the reverse order: a SAC agent was first trained in hot weather, then in mixed, and finally in cool. Its performance is shown and compared with the rest of the SAC agents in Fig. 16, where we can observe a significant improvement in the reward for cool weather, at the expense of worse performance in the remaining climates.

These results reinforce the idea that there is a bias towards the last training climate, in this case cool. This leads us to believe that catastrophic forgetting does indeed occur, preventing diversified training.
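For reference, the sequential training scheme discussed here can be sketched as follows in Python, assuming stable-baselines3 and placeholder environment IDs and budgets (this is not the exact configuration of our experiments):

import gymnasium as gym
from stable_baselines3 import SAC
# import sinergym  # would register the building environments, if available

# Curriculum order; reverse the list to reproduce the hot -> mixed -> cool run
CLIMATE_ORDER = [
    "Eplus-5Zone-cool-continuous-v1",   # placeholder IDs
    "Eplus-5Zone-mixed-continuous-v1",
    "Eplus-5Zone-hot-continuous-v1",
]

model = None
for env_id in CLIMATE_ORDER:
    env = gym.make(env_id)
    if model is None:
        model = SAC("MlpPolicy", env, verbose=0)
    else:
        model.set_env(env)                   # keep learned weights, switch climate
    model.learn(total_timesteps=100_000,     # placeholder budget per climate
                reset_num_timesteps=False)
    env.close()

model.save("sac_sequential_cool_mixed_hot")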

Fig. 16 Comparison between standard SAC rewards and SAC agents trained under sequential learning in different orders

It is also noteworthy that this new agent, despite being trained in all three climates, only manages to outperform the SAC agent trained solely in cool weather, so applying sequential learning may not be worthwhile in these circumstances. However, given the presence of catastrophic forgetting, future work should address the influence of the architecture of the neural networks used by the agents, as well as the way in which the training data is handled. Again, we reiterate the idea that sampling from a set of observations from different climates may be a suitable option for this type of problem.

6.4 Comfort-consumption trade-off

From the results shown in Figs. 12 and 13, we can see how the weight assigned to comfort (and, consequently, to consumption) directly influences the importance of each term in the training and final performance of the algorithms.

As anticipated, stronger comfort assurance implies, in most cases, a corresponding increase in power demand. However, there are interesting cases, such as SAC in 5ZoneAutoDXVAV under cool weather, where it is possible to reduce the comfort violation by increasing its importance in the reward function without significant drawbacks in terms of consumption.

Furthermore, the choice of weighting for comfort and consumption depends entirely on the problem and the environment at hand. There are, however, cases such as the one mentioned above where significant improvements in comfort can be achieved with hardly any penalty in consumption. The weighting is therefore a decision that needs to be approached cautiously and, if possible, several options should be evaluated in simulation.
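To make this trade-off explicit, the following Python sketch illustrates a linearly weighted comfort-consumption reward together with a weight sweep; the comfort range, scales and units are assumptions for illustration and do not reproduce the exact formulation used in our experiments:

def linear_reward(power_w, temp, comfort_range=(20.0, 23.5),
                  comfort_weight=0.5, energy_scale=1e-4):
    """Weighted penalty-style reward (always <= 0): the comfort term
    penalises degrees outside the comfort range, the energy term
    penalises power demand."""
    low, high = comfort_range
    violation = max(low - temp, 0.0) + max(temp - high, 0.0)
    comfort_term = -violation
    energy_term = -energy_scale * power_w
    w = comfort_weight
    return w * comfort_term + (1.0 - w) * energy_term

# Sweeping the comfort weight and retraining an agent per value is one way
# to evaluate the trade-off in simulation before committing to a weighting.
for w in (0.5, 0.75, 0.95, 0.97, 0.99, 1.0):
    print(w, linear_reward(power_w=3200.0, temp=24.1, comfort_weight=w))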

If we look at the rewards obtained by the different SAC agents in 5ZoneAutoDXVAV, we observe that a greater emphasis on comfort generally leads to better overall performance. As noted above, this is because the improvements in comfort have little impact on power consumption, so the overall reward benefits.

Nevertheless, in the case of TD3 in 2ZoneDataCenterHVAC, the comfort and consumption weighting changes do pose more of a challenge in terms of trade-off, with 50–50% being the best weighting choice in all climates.

Let us now study the two exceptions mentioned in Sect. 5.6: SAC in 5ZoneAutoDXVAV under hot weather, and TD3 in 2ZoneDataCenterHVAC under cool weather, where a 75% weight for comfort in the reward function gives better results in terms of comfort violations than assigning a 100% weight to this term.

Taking the SAC case as an example, Fig. 17 shows that adding intermediate comfort weights (95%, 97% and 99%) between those previously considered (75% and 100%) leads to a range of similar results with variations of only a few hundredths. The average comfort violations per episode therefore reveal some convergence in the agents' performance: despite increasing the importance of comfort in the reward function, their capacity is limited and they cannot further improve comfort assurance. The same holds for TD3 in 2ZoneDataCenterHVAC under cool weather, for which similar experiments were conducted.

Fig. 17 Mean episode comfort violations for SAC with additional comfort weights in 5ZoneAutoDXVAV

Finally, we conclude that these two exceptions do not denote a noticeable loss of performance, but rather an oscillation around a limit that the agents are physically unable to overcome.

6.5 Summary and implications

We conclude with an overview of the results obtained and their potential implications for researchers and practitioners.

The results presented in Sects. 5.3 and 6.1 show that DRL-based agents match the performance of RBCs in simple scenarios such as 5ZoneAutoDXVAV, where there is little room for improvement, and that they perform better in environments of increasing complexity, such as 2ZoneDataCenterHVAC. We therefore argue that the margin for improvement of DRL over a reactive controller should be examined in the light of the complexity of the building being controlled. This complexity may stem from the system to be controlled, including its number of zones and its observed and control variables. Other aspects include climate variability and external factors such as occupancy or scheduling, which greatly affect the scalability and complexity of RBCs.


DRL agents were able to match reactive controllers in simple environments and surpass them in complex environments. We therefore consider the complexity of the building being controlled a major factor to be taken into account when selecting a controller.


The robustness of agents deployed in environments with a climate different from that in which they were trained is another relevant issue (see Sects. 5.4 and 6.2). Although there were performance losses, whether these losses are negligible will depend on the specific case.

On the other hand, we observed in Sects. 5.5 and 6.3 that a sequential learning approach did not bring improvements, but rather biases towards the last environment used to train the agent. While this may be due to overfitting, the usefulness of this type of training needs to be studied further.


Deploying a controller in an environment other than the one used in its training resulted in performance losses that might be acceptable depending on the case study. These losses were higher, and possibly impractical, in agents trained progressively through different climates, showing biases towards the last training environment employed.


One of the main advantages that DRL controllers offer compared to reactive control techniques is their flexibility in customizing the comfort-consumption trade-off by simply altering their reward function. In Sects. 5.6 and 6.4, we observed where the limits of the comfort-consumption strategies an agent can learn lie.

Moreover, while only these two factors have been considered in this paper, there are interesting studies, such as Brandi et al. (2022), where estimates of power consumption and energy prices are also taken into account, yielding a DRL agent adapted to sporadic consumption peaks and stationary events. This reveals the customization capabilities of DRL agents in building environments without requiring a complex model of each controller or an extensive rule base.


DRL agents can flexibly balance comfort and consumption preferences, and can incorporate other criteria simply by adding extra terms to their reward.
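As an illustration of how such extra criteria could be incorporated, the sketch below adds a hypothetical energy-price term to a weighted comfort-consumption penalty; all weights, scales and units are assumptions and not part of our experimental setup:

def priced_reward(power_w, temp, price_per_kwh,
                  comfort_range=(20.0, 23.5), timestep_hours=0.25,
                  w_comfort=0.6, w_energy=0.2, w_cost=0.2):
    """Three-term penalty reward: comfort, energy and energy cost."""
    low, high = comfort_range
    violation = max(low - temp, 0.0) + max(temp - high, 0.0)
    comfort_term = -violation                      # degrees outside range
    energy_term = -1e-4 * power_w                  # scaled power penalty
    cost_term = -price_per_kwh * (power_w / 1000.0) * timestep_hours
    return w_comfort * comfort_term + w_energy * energy_term + w_cost * cost_term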


When considering the use of DRL controllers in real environments, we must ensure that the variables available in simulation are also available in the deployment environment. Most of the variables used in this study could be easily monitored in a real building, or replaced by proxies if not available (e.g. comfort metrics).

Purely online deployment and training of these DRL agents in real environments is not recommended, since it requires long periods of time to reach a minimally efficient behavior policy. Instead, the approach usually followed is to train the agents in simulated environments, avoiding cold starts, and then deploy them in real environments, where they continue to receive feedback from the information collected online.
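A minimal sketch of this sim-to-real workflow, assuming stable-baselines3 and a hypothetical gym-style interface to the real building (the environment ID, file names and budget are placeholders):

import gymnasium as gym
from stable_baselines3 import SAC

# Hypothetical interface to the real building's monitoring and actuation systems
real_building_env = gym.make("RealBuildingEnv-v0")   # placeholder ID

# Load the agent pre-trained in simulation and keep learning from online data
model = SAC.load("sac_pretrained_simulation", env=real_building_env)
model.learn(total_timesteps=35_040,        # e.g. one year of 15-min control steps
            reset_num_timesteps=False)     # continue from the simulated training
model.save("sac_finetuned_deployment")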


In order to deploy a DRL agent trained in a simulated environment, the agent must have access, either directly or indirectly, to variables similar to those used in its training. The DRL setup facilitates continuous learning and adaptation after deployment.

7 Conclusions and future work

In this paper, we have addressed the performance comparison of multiple DRL algorithms in building HVAC control. Using the Sinergym framework, these algorithms were tested in various buildings and climates, considering both the search for comfort-consumption balance and scenarios in which one objective prevailed over the other. The robustness of the agents in different climates was also evaluated, as well as the potential benefits of progressive training to improve agents’ performance.

In the search for proper comfort-consumption balance, SAC and TD3 were able to perform similar control to reactive controllers in most scenarios, even improving power savings at the cost of low comfort penalties. Since the agents were trained with completely raw observations (neither exhaustive variable selection nor preprocessing beyond normalization) and without extensive hyperparameter tuning, immediate future work may involve a detailed study of which variables should compose the agents' observations, as well as a deeper analysis to determine which hyperparameters best fit each building-climate-algorithm combination. In any case, a relevant highlight derived from these results is that positive results can be obtained for arbitrary environments without the need for significant fine-tuning of observations and hyperparameters.

Regarding the robustness and generalisability of DRL agents, it was found that modifying the evaluation climate of an agent with respect to the one used in its training negatively influences its performance. Interestingly enough, these are minor losses that may be generally acceptable, avoiding the need to use a wide variety of climates in the training of the agents. Thus, a proposal for future work will involve training by sampling a pool of observations from a wide variety of climates, thus increasing the generalization capabilities of the agents and reducing training time and efforts.

Moreover, the use of sequential learning for training was not successful in most cases, mainly due to the agents' tendency to specialize in the control of the most recently learned environments. This phenomenon of catastrophic forgetting motivates future research from different perspectives, such as exploring the influence of the network architecture of DRL agents on the occurrence of this phenomenon, or studying possible improvements along this progressive learning process (e.g. experience replay and regularization).

Finally, the study of the comfort-consumption trade-off, and its translation into the agents' reward function, offered some significant highlights. On the one hand, it was observed that there are cases where improvements in comfort have hardly any penalties in power consumption, which leads us to suggest that it cannot be assumed that increasing the importance of comfort in the reward function will always lead to an increase in power demand. This is where reward engineering comes into play, so future research should be conducted to compare different rewards and their influence on the comfort-consumption trade-off. On the other hand, it was observed that there are comfort weights beyond which agents are physically incapable of improving temperature assurance in the desired ranges. Generally, as the comfort weight tended towards 100% in the reward function, we began to find some convergence in the comfort penalties without considerable detriments or improvements.

As a conclusion of this work, and in view of the state of the art, we encourage future contributions in the area to be tested under multiple environments and configurations. Frameworks like Sinergym can be very useful in this regard.

Addressing the problem of building energy optimization from the perspective of multi-agent systems is another interesting approach for which standardized comparisons are also necessary. These solutions offer several advantages in large buildings where coordination between multiple independent HVAC controllers is required (Nagarathinam et al. 2020; Yu et al. 2020).

From a wider perspective, it would also be interesting to extend these studies with configurations where the use of green energies is considered as a variable to be optimized, given the rise of smart grids and renewable energies. For this purpose, the continued use of common evaluation frameworks and methodologies will enable the comparison of results, the standardization of the field, and the joint progress towards the improvement of building energy optimization.