1 Introduction

Artificial Intelligence (AI) is taking an increasingly important role in industry and society. AI techniques have recently been introduced in autonomous driving, personalized shopping, and fraud prevention, to name just a few examples. A key challenge faced by today’s society, and one for which AI can bring an important advancement, is environmental sustainability. Climate change, pollution, biodiversity decline, poor health, and poverty have led governments and companies in recent years to focus their efforts and investments more and more on solutions to environmental sustainability problems, which are usually characterized by an inefficient and increasing use of resources. Environmental sustainability can be defined as a set of constraints regarding the use of renewable and nonrenewable resources on the one hand, and pollution and waste assimilation on the other (Goodland 1995). In this regard, in 2015, the United Nations published the “2030 Agenda for Sustainable Development”, the centerpiece of which is a set of 17 Sustainable Development Goals (United Nations 2015) to be fully achieved by 2030 to attain sustainable development in the economic, social, and environmental contexts, and to eliminate all forms of poverty.

AI-based algorithms can control autonomous drones used in water monitoring (Steccanella et al. 2020; Marchesini et al. 2021; Bianchi et al. 2023), extract new insights about environmental conditions from acquired data (Castellini et al. 2020; Azzalini et al. 2020), improve the healthiness of indoor environments (Capuzzo et al. 2022), or forecast demand in district heating networks (Bianchi et al. 2019; Castellini et al. 2021, 2022). Several AI techniques have been employed to address various environmental sustainability challenges. These approaches enable the efficient management of distributed resources within smart grids (Roncalli et al. 2019; Orfanoudakis and Chalkiadakis 2023), improve the power flow for DC grids (Blij et al. 2020), increase the utilization of renewable resources for electric vehicle charging (Koufakis et al. 2020), and mitigate carbon emissions in urban transportation by fostering ridesharing and reducing traffic congestion (Bistaffa et al. 2021, 2017). Furthermore, a crucial aspect of climate change prevention involves optimizing the energy consumption associated with heating and cooling residential properties. To tackle this issue, AI-based methods have been developed to enhance the efficiency of home systems (Panagopoulos et al. 2015; Auffenberg et al. 2017) and quantify the thermal efficiency of residences (Brown et al. 2021). Among this broad spectrum of AI techniques, in this survey we focus on Reinforcement Learning (RL) (Sutton and Barto 2018), which has recently obtained impressive successes, achieving human-level performance in several tasks, for instance in the context of games (Silver et al. 2016, 2017).

One of the most important and interesting challenges in today’s RL research is the application of RL algorithms to real-world domains, where uncertainty makes strategy learning and adaptation much more complex than in game environments. In particular, the application of RL to environmental sustainability has attracted, in the last decade, strong interest from both the computer science community and the communities of environmental sciences and business. Reducing carbon emissions requires increasing the usage of renewable resources, such as solar and wind power. While these resources are economically efficient, their stochastic and intermittent nature poses challenges in replacing nonrenewable energy sources within energy networks. RL, through systematic trial-and-error interaction with dynamic environments, offers a promising approach for learning optimal policies that can adapt to changing system dynamics and effectively manage environmental uncertainty. Thus, an RL agent is capable of handling variations in operating conditions, for instance, due to a change in resource availability or weather conditions.

This work surveys the recent use of RL to improve environmental sustainability. It provides a comprehensive overview of the different application domains where RL has been used, such as energy and water resource management, and traffic management. The goal is to show practitioners the state-of-the-art RL methods that are currently used to solve environmental sustainability problems in each of these domains. For each paper analyzed, we consider

  • The problem tackled,

  • The RL approach used,

  • The challenges faced,

  • The formalization of the RL problem (i.e., type of state/action space, type of transition model, type of RL method, performance measures used to evaluate the results).

The paper is structured as follows. Section 2 presents the surveys already available on topics close to RL and environmental sustainability. Section 3 presents the basic concepts of RL as well as a formalization of the main concepts. In Sect. 4, we present the research methodology used in our survey. Section 5 describes the results of our research, considering different levels of detail. In particular, in Sects. 5.1.1, 5.1.2 and 5.1.3, we provide a quantitative analysis of the state-of-the-art related to the application of RL in environmental sustainability over the last two decades. Then, Sect. 5.1.4 outlines domains where RL techniques are applied and the RL-based approaches employed to address environmental sustainability. In Sect. 5.2, our focus shifts to a subset of 35 main papers, for which we analyze the application domains of proposed RL techniques, provide technical insights into problem formalization, discuss the performance metrics used for evaluation, and consider the challenges addressed. Section 5.3 provides an in-depth analysis of each of these main papers. Finally, in Sect. 6 we discuss our findings, and in Sect. 7 we draw conclusions and summarize future directions.

2 Related work

The literature already provides some surveys on the application of RL to problems related to environmental sustainability, but all these works either focus only on specific aspects of environmental sustainability or also consider AI methods other than RL. For instance, Ma et al. (2020) focus on Energy-Harvesting Internet of Things (IoT) devices, offering insights into recent advancements addressing challenges in commercialization, standards development, context sensing, intermittent computing, and communication strategies. Charef et al. (2023) conduct a study considering various AI techniques, including RL, to enhance energy sustainability within IoT networks. They categorize studies based on the challenges they address, establishing connections between challenges and AI-based solutions while delineating the performance metrics used for evaluation. Within the domain of Architecture, Engineering, Construction, and Operation, Rampini and Re Cecconi (2022) concentrate on the application of AI techniques, including RL, in Asset Management. Their work reviews studies related to several aspects such as energy management, condition assessment, operations, risk, and project management, identifying key points for future development in this context. Alanne and Sierla (2022) shift their focus to smart buildings, discussing the learning capabilities of intelligent buildings and categorizing learning application domains based on objectives. They also survey the application of RL and Deep Reinforcement Learning (DRL) in decision-making and energy management, encompassing aspects like control of heating and cooling systems and lighting systems. Within the context of smart buildings and smart grids, Mabina et al. (2021) examine the utilization of Machine Learning (ML), including RL, for optimizing energy consumption and electric water heater scheduling, emphasizing the advantages of these approaches in Demand Response (DR) due to their interaction with the environment. Himeur et al. (2022) investigate the integration of AI-big data analytics into various tasks such as load forecasting, water management, and indoor environmental quality monitoring, focusing on the role of RL and DRL in optimizing occupant comfort and energy consumption. Yang et al. (2020) focus on the application of RL and DRL techniques to sustainable energy and electric systems, addressing issues such as optimization, control, energy markets, cyber security, and electric vehicle management.

In the realm of transportation systems, Li et al. (2023) explore various topics, including cooperative mobility-on-demand systems, driver assistance systems, autonomous vehicles (AVs), and electric vehicles (EVs). Sabet and Farooq (2022) study the state-of-the-art in the context of Green Vehicle Routing Problems, which involve reducing greenhouse gas (GHG) emissions and addressing issues like charging activities, pickup and delivery operations, and energy consumption. Moreover, the authors note that most of the works leverage metaheuristics, while the use of RL methods is uncommon. Chen et al. (2019) tackle sustainability concerns within the Internet of Vehicles, leveraging 5th generation mobile network (5G) technology, Mobile Edge Computing architecture, and DRL to optimize energy consumption and resource utilization. Rangel-Martinez et al. (2021) assess the application of ML techniques, including RL, in manufacturing, with a focus on energy-related fields impacting environmental sustainability. Sivamayil et al. (2023) explore a wide range of RL applications (e.g., Natural Language Processing, health care, etc.), emphasizing Energy Management Systems with an environmental sustainability perspective. Mischos et al. (2023) investigate Intelligent Energy Management Systems across diverse building environments, considering control types and optimization approaches, including ML, DL, and DRL. Yao et al. (2023) discuss the application of Agent-Based Modeling and Multi-Agent System modeling in the transition to Multi-Energy Systems, highlighting RL and suggesting future research directions in Multi-Agent Reinforcement Learning (MARL) for energy systems.

While these works address specific aspects of environmental sustainability using RL methods, our review takes a comprehensive approach, analyzing all contexts in which RL techniques have recently contributed to enhancing environmental sustainability. Our goal is to provide practitioners with insights into state-of-the-art methods for addressing environmental sustainability challenges across various application domains, including energy and water resource management and traffic management. In summary, the main contribution of this survey consists of offering an overview of RL application domains within the context of environmental sustainability.

3 Reinforcement learning: preliminaries and main definitions

In this section, we present the basic concepts of RL as well as a formalization of the main concepts. RL, a prominent machine learning paradigm, focuses on learning a policy that maximizes cumulative reward, i.e., a mapping that specifies which action should be selected, given the environment configuration, to achieve the best possible outcome. The key elements of RL are listed in the following (a minimal interaction loop tying these elements together is sketched after the list):

  • The agent is the entity that makes decisions and performs actions in the environment;

  • The environment represents the system with which the agent interacts and provides the agent with feedback on the performed action;

  • The policy is a function that defines the agent’s behavior considering the environment configuration (i.e., a map between what the agent observes and what the agent should do);

  • The reward is a numerical signal that provides feedback on the action performed by the agent;

  • The value function specifies state values, namely, how valuable it is to reach a state, considering also future states reachable from it;

  • The model of the environment (optional) is a stochastic function providing the probability of the next state given the current state and action; it allows simulating the behavior of the environment in response to the agent’s actions.
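
To make these elements concrete, the following minimal sketch shows a single episode of agent-environment interaction. The toy environment, the random policy, and the reward values are illustrative placeholders introduced only for exposition; they are not taken from any of the surveyed works.

```python
# A minimal sketch of the agent-environment loop tying together the elements
# above. The environment, policy, and reward below are illustrative
# placeholders, not taken from any of the surveyed works.
import random

class ToyEnvironment:
    """Two-state environment: the agent tries to reach state 1 and stay there."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves towards the "good" state, action 0 moves away from it
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.0   # feedback signal from the environment
        return self.state, reward

def policy(state):
    """A deliberately naive policy: map the observed state to an action."""
    return random.choice([0, 1])

env = ToyEnvironment()
state = env.reset()
total_reward = 0.0
for t in range(10):                    # one short episode
    action = policy(state)             # the agent selects an action
    state, reward = env.step(action)   # the environment returns feedback
    total_reward += reward             # cumulative reward the agent seeks to maximize
print("return of this episode:", total_reward)
```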

RL methods (Sutton and Barto 2018) can be categorized into two main groups: model-free and model-based (Moerland et al. 2020). Over the past two decades, model-free methods have demonstrated significant success. Meanwhile, model-based approaches have become a focal point in current research due to their potential to enhance sample efficiency, i.e., to reduce the number of interactions with the environment required for learning. This efficiency is achieved by explicitly representing the model of the environment and incorporating relevant prior knowledge (Castellini et al. 2019; Zuccotto et al. 2022a, b). Additionally, model-based methods offer the advantage of addressing the risks associated with taking actions in partially observable environments (Mazzi et al. 2021, 2023; Simão et al. 2023) or partially known environments (Castellini et al. 2023).

A common framework to formalize the RL problem is the Markov Decision Process (MDP) (Puterman 1994). An MDP is a tuple \((S, A, T, R, \gamma )\) where S is a finite set of states, A is a finite set of actions, \(T:S \times A \rightarrow \Pi (S)\) is the transition model, where \(\Pi (S)\) is the space of probability distributions over states, \(R: S \times A \rightarrow {\mathbb {R}}\) is the reward function, and \(\gamma \in [0,1)\) is the discount factor. The agent’s goal is to maximize the expected discounted return \({\mathbb {E}}[\sum _{t=0}^{\infty } \gamma ^t R(s_t,a_t)]\) by acting optimally, namely, choosing in each state \(s_t\), at time t, the action \(a_t\) that maximizes the expected discounted return. The solution of an MDP is an optimal policy, namely, a function that optimally maps states into actions. A policy is optimal if it maximizes the expected discounted return. The discount factor \(\gamma\) reduces the weight of long-term rewards, guaranteeing convergence. In the case of partially observable environments, an extension of the MDP framework, namely the POMDP (Kaelbling et al. 1998), can be used. A POMDP is a tuple \((S, A, O, T, \Omega , R, \gamma )\) where the elements shared with the MDP are augmented by \(\Omega\), a finite set of observations, and \(O: S \times A \rightarrow \Pi (\Omega )\), the observation model. In contrast to MDPs, in POMDPs the agent is not able to directly observe the current state \(s_t\); instead, it maintains a probability distribution over the states S, called belief, which is updated at each time step. The belief summarizes the agent’s previous experience, i.e., the sequence of actions and observations that led the agent from an initial belief \(b_0\) to the current belief b. The solution of a POMDP is an optimal policy, namely, a function that optimally maps belief states into actions. In the following, we survey applications of RL to environmental sustainability, hence we investigate how the elements described in this section (e.g., the MDP modeling framework, RL algorithms, etc.) have been used so far to solve problems related to environmental sustainability.
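
To make the MDP notation concrete, the sketch below runs tabular Q-Learning, the model-free method that turns out to be the most frequently used in the surveyed works (see Sect. 5.1.4), on a small illustrative MDP. The transition probabilities, rewards, and hyperparameters are arbitrary values chosen only to exercise the update rule.

```python
# Tabular Q-Learning on a small illustrative MDP (S, A, T, R, gamma).
# The transition probabilities and rewards are arbitrary choices made only to
# exercise the update rule; they do not come from any surveyed application.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
gamma, alpha, epsilon = 0.95, 0.1, 0.1

# T[s, a, s'] = probability of reaching s' when taking action a in state s
T = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
              [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9]],
              [[0.0, 0.0, 1.0], [0.5, 0.0, 0.5]]])
# R[s, a] = immediate reward for taking action a in state s
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [5.0, 0.0]])

Q = np.zeros((n_states, n_actions))
s = 0
for step in range(20000):
    # epsilon-greedy action selection
    a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=T[s, a])       # sample the transition model
    # Q-Learning update: bootstrap on the greedy value of the next state
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("learned Q-values:\n", np.round(Q, 2))
print("greedy policy (action per state):", Q.argmax(axis=1))
```

Under standard conditions on the learning rate and exploration, the update \(Q(s,a) \leftarrow Q(s,a) + \alpha [R(s,a) + \gamma \max _{a'} Q(s',a') - Q(s,a)]\) converges to the optimal value function, from which the optimal policy is obtained greedily.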

4 Review methodology

In this section, we outline the research methodology used for this study. It consists of 5 steps: (i) the definition of the research questions, (ii) the paper collection process, (iii) the definition of inclusion and exclusion criteria, (iv) the identification of relevant studies based on the inclusion and exclusion criteria, and (v) data extraction and analysis.

Research questions. The first step involves defining the research questions we want to answer on the application of RL techniques for environmental sustainability. The goal of our questions is twofold: to offer a quantitative analysis of the state of the art related to the application of RL to environmental sustainability and to analyze the use of these techniques focusing on sustainability. Specifically, we aim to answer the following questions:

  • RQ1: How many academic studies have been published from 2003 to 2023 about RL for environmental sustainability?

  • RQ2: What were the most relevant publication channels used?

  • RQ3: In which countries were the most active research centers located?

  • RQ4: What were the application domains and the methodologies used?

  • RQ5: How was the RL problem formalized (i.e., type of state/action space, type of transition model, and type of dataset used)?

  • RQ6: Which evaluation metrics were used to assess the performance?

  • RQ7: What were the challenges addressed?

The databases we use to collect papers are those of the search engines Scopus and Web of Science. To limit the scope of the search to the application of RL approaches to environmental sustainability, we define the following search strings:

  • “reinforcement learning AND sustainable AND environment”;

  • “reinforcement learning AND environmental AND sustainability”;

  • “reinforcement learning AND environment AND sustainability”;

  • “reinforcement learning AND environmental AND sustainable”.

The search on the two databases led to a total of 375 papers, 236 collected from Scopus and 139 from Web of Science.

Selection criteria for the initial set of (181) papers. To refine the results of the search, we outline the following inclusion and exclusion criteria.

Inclusion criteria. To determine studies eligible for inclusion in this work, we consider the following criteria:

  • It is written in English;

  • It is clearly focused on RL for environmental sustainability;

  • In the case of duplicate articles, the most recent version is included.

Exclusion criteria. To further refine our search, we apply the following exclusion criteria: the study is an editorial, a conference review, or a book chapter.

Following these criteria, we found 181 papers (104 articles, 70 conference papers, and 7 reviews). We combine the information in the index keywords of these papers with their number of citations and their publication year. In particular, we compute the number of occurrences of each keyword to identify the application domains and methodologies most used in the literature. To this aim, we standardize the keywords to avoid spelling variations. Then, we combine these values with the number of citations and the publication year to identify the most recent and relevant studies. In cases where index keywords are missing, we use author keywords. For the only three papers that have neither author nor index keywords, we use the title to derive related keywords.
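
As an illustration of this keyword standardization and counting step, the short sketch below shows one possible implementation; the example records and merge rules are invented and do not reproduce the actual keyword lists of the 181 papers.

```python
# Illustrative sketch of keyword standardization and occurrence counting.
# The records and merge rules below are invented examples; the actual analysis
# was run on the index/author keywords of the 181 selected papers.
from collections import Counter

papers = [
    {"keywords": ["Reinforcement Learning", "Smart Grids", "Energy Consumption"]},
    {"keywords": ["reinforcement-learning", "smart grid", "traffic signal control"]},
    {"keywords": []},   # papers without keywords fall back to title-derived terms
]

def standardize(keyword):
    """Normalize case and hyphenation, then merge known spelling variations."""
    keyword = keyword.lower().replace("-", " ").strip()
    merge_rules = {"smart grids": "smart grid"}   # illustrative merge rules
    return merge_rules.get(keyword, keyword)

counts = Counter(standardize(kw) for paper in papers for kw in paper["keywords"])
print(counts.most_common())
```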

Selection criteria for the set of (35) main papers. To identify papers for the in-depth analysis, we apply the following criteria, which consider the most important keyword occurrences (i.e., the most frequent keywords), the publication year, and the number of citations relative to the publication year.

  • Presence of at least one keyword with no less than 10 occurrences;

  • Publication year from 2013 to 2023;

  • Number of citations:

    • Papers published in 2022–2021, at least 3 citations;

    • Papers published in 2020–2019, at least 10 citations;

    • Papers published in 2018–2013, at least 20 citations.

Following these criteria, we selected 35 studies that have been explored in-depth, and answers to the research questions defined above have been reported.

In the following sections, we first consider the initial 181 papers found using the search strings defined above and applying the inclusion/exclusion criteria. In Sect. 5.1.1 we answer question RQ1 for those papers, in Sect. 5.1.2 we answer question RQ2, in Sect. 5.1.3 we answer question RQ3, and in Sect. 5.1.4 we answer question RQ4. Namely, we first analyze the number of papers that focus on RL for sustainability published in the last 20 years, then we identify the main international conferences, workshops, and journals used to disseminate research, subsequently, we find the research centers that are particularly active in this research/application topic, and finally, we analyze the application domains and RL methodologies used. From Sect. 5.2, we start focusing only on the 35 main papers identified using the main paper selection criteria. In particular, we answer question RQ4 in Sect. 5.2.1, question RQ5 in Sect. 5.2.2, question RQ6 in Sect. 5.2.3, and question RQ7 in Sect. 5.2.4. Namely, for these main papers, we first analyze the application domains of RL techniques and the RL-based approaches used to tackle environmental sustainability; then we analyze the way in which the problem has been formalized; subsequently, we investigate the evaluation measures used; finally, we identify the main challenges addressed. Notice that questions RQ1, RQ2, and RQ3 have not been answered considering only the 35 main papers because these questions aim to provide a quantitative analysis of the state of the art as a whole, and this subset of articles is part of the 181 papers used to answer these three questions.

5 Results of the review

This section reports the results of the analysis provided in this survey, first for the initial set of 181 papers, then for the subset of the main 35 papers.

5.1 Analysis of the initial set of 181 papers

The initial set of papers, selected using the search strings of Sect. 4, is analyzed by answering questions RQ1, RQ2, RQ3, and RQ4.

5.1.1 RQ1: How many academic studies have been published from 2003 to 2023 about RL for environmental sustainability?

This research question aims to quantify the interest of the international scientific community in applying RL methods to environmental sustainability problems over the last 20 years. As shown in Fig. 1, the number of publications (pink dots) remained relatively low until 2018, with fewer than five publications per year. Since 2019, there has been rapid growth, reaching 53 papers in 2022, showing the increasing interest in this topic during the last few years. It is important to notice that the data for the year 2023 are updated to April 2023 and do not represent a decrease in the number of studies published. The application of the inclusion and exclusion criteria leaves no publications in the years 2004, 2005, 2010, and 2011. In Fig. 1, we also show that the increase in the number of publications fits an exponential pattern (green line) with a growth rate of 0.42 in the number of publications (from 2 papers in 2007 to 53 in 2022). To compute the regression model, we do not consider 2023 since its information is partial.
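
For illustration, a log-linear least-squares fit of this kind can be computed as sketched below. The yearly counts in the example are invented placeholders (years with zero publications would have to be excluded or otherwise handled), and since the exact fitting procedure is not detailed here, the sketch should not be expected to reproduce the reported 0.42 growth rate.

```python
# Sketch of an exponential (log-linear) regression on yearly publication counts.
# The counts below are invented placeholders, not the values behind Fig. 1, and
# 2023 is excluded because its information is partial.
import numpy as np

years = np.arange(2007, 2023)                       # 2007..2022, 2023 excluded
counts = np.array([2, 1, 3, 1, 1, 2, 3, 4, 4, 5,    # invented yearly counts
                   6, 8, 14, 22, 35, 53], dtype=float)

# Fit log(count) = a + r * (year - 2007), i.e., count ~ exp(a) * exp(r * t)
t = years - years[0]
r, a = np.polyfit(t, np.log(counts), 1)             # slope r is the growth rate
print(f"estimated exponential growth rate: {r:.2f}")
print("fitted counts:", np.round(np.exp(a + r * t), 1))
```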

Fig. 1 Academic studies published from 2003 to 2023. Pink dots represent the number of publications per year used to compute the regression model represented by the green line

Table 1 Journals and conferences with at least two publications

5.1.2 RQ2: What were the most relevant publication channels used?

With this research question, we aim to show the main channels used to disseminate research on the application of RL techniques to environmental sustainability problems. In Table 1, we show the journals and conferences with at least 2 publications. As can be seen, the topics of the journals and conferences are very varied. In particular, some of these communication channels are specific to sustainability, e.g., “Sustainability (Switzerland)” and “Sustainable Cities and Society”, and many are related to environmental aspects, such as “IOP Conference Series: Earth and Environmental Science” and “IEEE Transactions on Green Communications and Networking”. Moreover, in the third column of Table 1, we provide an overview of the scope of the publication channels. To this aim, we analyze the information presented on the website of each conference and journal about its scope, indicating whether it has a technical/informatics or application-oriented perspective (“CS” or “APP”, respectively) or a combination of them (“CS + APP”). As can be seen, most of the publication channels are application-oriented (2 conferences + 12 journals), followed by those that present a combined scope (2 conferences + 8 journals); finally, a few of them (3 conferences) have a more technical/informatics perspective.

5.1.3 RQ3: In which countries were the most active research centers located?

This research question aims to show the countries whose research centers are most concerned with the application of RL methods to environmental sustainability issues. With this in mind, we leverage the information in the Scopus and Web of Science databases about the 181 papers that were not excluded by the application of the inclusion and exclusion criteria. In Fig. 2, we show only the countries with at least 5 publications and, as we can see, the highest number of papers comes from research centers located in China (33 papers), followed by the United States (29 papers) and the United Kingdom (17 papers). It is important to note that most of these works are developed in collaboration between research centers in multiple countries, so we count each paper once for every collaborating country. To show co-author relationships, in Fig. 3, we represent only countries with at least 5 occurrences among the analyzed documents. Each country is depicted as a circle, a link between 2 circles represents a co-authorship relation, and the line weight is proportional to the number of papers in the co-authorship relationship. As we can see, the countries with more links are the United States (9 links), followed by Australia (7 links), and China and India (6 links).

Fig. 2 Number of publications per Country on RL approaches for environmental sustainability

Fig. 3 Co-author relationships with Country as a unit of analysis. Nodes represent countries, and links depict co-authorship relationships. The thickness of a link is proportional to the number of papers in the co-authorship relationship

5.1.4 RQ4: What were the application domains and the methodologies used?

This research question aims to analyze the application domains and the RL methodologies used for tackling issues related to environmental sustainability. To this aim, we analyze the index keywords of the 181 papers that were not excluded by applying the inclusion and exclusion criteria, and the author keywords for works with no index keywords. In Fig. 4, we show the application domains with more than 10 index keyword occurrences. We group the keywords into macro areas: for instance, in “Energy” we include keywords like “energy”, “energy conservation”, “energy consumption”, etc., while in “Electric energy” we group keywords such as “electric energy storage”, “electric load dispatching”, “smart grid”, etc. (see Appendix for details). The figure clearly shows that there is a wide variety of application domains, but most of the applications deal with sustainability issues related to energy fields.

Regarding the proposed approaches, we follow the same procedure described above for the application domains, grouping keywords that refer to the same method. For example, in “Actor-Critic” we group keywords such as “actor critic”, “advantage actor-critic (A2C)”, and “soft actor critic”. As we can see in Fig. 5, the most widely used RL method for dealing with environmental sustainability in different application domains is a state-of-the-art model-free algorithm, namely Q-Learning (Watkins 1989). It is important to note that, in the figure, we show only RL approaches, but there are also index keywords related to other approaches, like “genetic algorithm”, “simulated annealing”, etc.

Fig. 4 Overview of application domains. For each application domain (y-axis), we show the number of occurrences of keywords belonging to its macro-area (x-axis)

Fig. 5 Overview of RL methods used. For each RL method (y-axis), we show the number of occurrences of corresponding keywords (x-axis)

Moreover, we perform a bibliometric analysis on the co-occurrence of index keywords by using VOSviewer (Perianes-Rodriguez et al. 2016). A co-occurrence means that two keywords occur in the same work. After a data cleaning process, VOSviewer detects 17 clusters by considering keywords with at least 3 occurrences. In Fig. 6, each cluster corresponds to a color, and each element of the cluster, namely a keyword, is depicted by a circle in the cluster color. For instance, the blue cluster is made of several blue nodes, each of which contains a keyword (e.g., electric vehicles, charging (batteries)) belonging to the cluster. The size of the circle and of the circle label depends on the number of occurrences of the related keyword. Lines between items depict co-occurrences of keywords in a paper. Each cluster groups keywords identifying an application domain and/or the approaches used to tackle related environmental sustainability issues. For example, cluster 1 (red colored, on the top-right) is somewhat related to traffic signal control for traffic management through the application of control strategies. Cluster 2 (green colored, on the left) is related to power management and energy harvesting in wireless sensor networks.
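
The co-occurrence counts underlying such an analysis can be obtained as in the sketch below; the keyword sets are invented examples, and the actual clustering was performed with VOSviewer rather than with this code.

```python
# Sketch of counting keyword co-occurrences (two keywords appearing in the same
# paper) before clustering. The keyword sets below are invented examples.
from collections import Counter
from itertools import combinations

papers_keywords = [
    {"traffic signal control", "deep reinforcement learning", "emissions"},
    {"smart grid", "energy harvesting", "wireless sensor networks"},
    {"deep reinforcement learning", "smart grid", "electric vehicles"},
]

cooccurrences = Counter()
for keywords in papers_keywords:
    for pair in combinations(sorted(keywords), 2):   # every unordered keyword pair
        cooccurrences[pair] += 1

for pair, n in cooccurrences.most_common(5):
    print(pair, n)
```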

Fig. 6 Bibliometrics analysis on the co-occurrence of index keywords. Each color outlines a cluster, and each circle of the cluster color represents a keyword, while edges represent co-occurrences of keywords in the same work

5.2 Analysis of the 35 main papers

In this section, we focus on the 35 papers chosen using the selection criteria for the main papers (see Sect. 4). First, we provide a high-level analysis of the application domain and the RL approaches used to address environmental sustainability issues (research question RQ4). Then, we give an overview of the RL problem formalization (i.e., type of state/action space, type of transition model, type of RL method) (research question RQ5). Subsequently, we analyze the performance measures used to evaluate the results (research question RQ6). Finally, we evaluate the main challenges faced (research question RQ7).

5.2.1 RQ4: What were the application domains and the methodologies used?

In Table 2, we summarize the application domains and the RL approaches used in the selected works. First, we group the 35 main works according to their main related application domains (first column). It is important to note that application domains may overlap; consequently, we report all the application domains common to the papers in the same group. Then, we indicate for each paper (second column) the method behind the proposed technique (third column). The selected papers tackle environmental sustainability issues in the application domains shown in Fig. 4. In particular, the most relevant application domain relates to the macro area of “Energy”. Indeed, it involves more than half of the papers in the table, considering both the works in which it represents the main application domain and those in which it is related to the main application domain. In Table 2, we also show that 16 out of the 35 selected papers use DRL approaches such as Deep Q-Network (DQN) (Mnih et al. 2015) and Double Deep Q-Network (DDQN) (van Hasselt et al. 2016), and another 2 rely on DRL techniques in multi-agent contexts, such as Multi-Agent Deep Deterministic Policy Gradient (MADDPG) (Lowe et al. 2017). RL techniques are used in 10 articles, where the most used method is Q-Learning, and 7 papers apply RL approaches in a multi-agent context. Finally, only 1 paper adopts a Genetic Algorithm-based RL (GARL) approach.

Table 2 Technical information about the selected works. In the third column, we report the RL methodology used
Table 3 Performance measures used by the authors to evaluate the proposed approaches (second column) and challenges they address (third column)

5.2.2 RQ5: How was the RL problem formalized (i.e., type of state/action space, type of transition model, and type of dataset used)?

This research question takes a technical point of view, which we think may help practitioners get an overview of the environments considered by the authors in developing the proposed methods. In Table 2, we summarize the information related to problem formulation that we found in the selected papers. For each paper, we point out whether the state and action spaces are continuous or discrete and whether the transition model is deterministic or stochastic. Finally, we provide information on the dataset used in the experiments, outlining whether real-world or synthetic data are used. It is important to note that not all papers explicitly provide this information. Thus, we mark with “*” all information inferred from reading the article. On the other hand, “N/A” specifies that the information available was not enough to infer the required data.

In the selected papers, most of the state and action spaces are discrete. Indeed, only 9 approaches use a continuous state space (third column of Table 2), and 6 use a continuous action space (fourth column of Table 2). Regarding the transition model, we can see that the model is stochastic in most cases where the information is available. In De Gracia et al. (2015), “Det.” and “St.” are both reported because the authors test the proposed methodology on both models. Finally, in the last column, we note that most of the experiments are performed on synthetic datasets. In fact, only 9 papers use real-world data, 6 of which combine them with synthetic data (“R + S” in the table), while 2 others use the real data to generate larger datasets from them (“S (R based)” in the table). Only Venkataswamy et al. (2023) test the proposed approach on both dataset types (“R, S” in the table).

5.2.3 RQ6: Which evaluation metrics were used to assess the performance?

This research question aims to provide an overview of the performance measures chosen by the authors to evaluate the proposed approaches in the 35 selected papers. In the second column of Table 3, we report information about the metrics found in the articles, which are also indicated in the in-depth analysis of each paper in Sect. 5.3. As we can see in Table 3, the performance measures vary widely depending on the application domain and the goal of the method proposed in each paper. For example, reward is used as a metric in 9 articles but is computed differently depending on the context. Concerning electric vehicles, in Sultanuddin et al. (2023), the reward corresponds to a penalty function considering the cost of charging and a departure incentive. Instead, in wastewater treatment plants (WWTPs), Chen et al. (2021) use a reward function that takes into account the operational cost, consisting of multiple components, such as energy cost and biogas price, and several indicators, like the energy consumed by the aeration and sludge treatment processes and GHG emissions. Another performance measure common to multiple application domains is, for example, energy consumption. Indeed, it is used in contexts such as water resources management (Emamjomehzadeh et al. 2023), WWTPs (Chen et al. 2021), data centers (Shaw et al. 2022), and AVs (Sacco et al. 2021). Even approaches related to the same application domain may differ in terms of performance measures depending on their objective. Considering, for example, the water resources context, both Emamjomehzadeh et al. (2023) and Skardi et al. (2020) evaluate their proposed approaches using resource level and nitrate concentration. However, in (Emamjomehzadeh et al. 2023), energy consumption and GHG emissions are also considered, while in (Skardi et al. 2020), resource allocation is used.

5.2.4 RQ7: What were the challenges addressed?

This research question aims to offer an overview of the issues that the authors have tackled within the 35 selected papers. In the third column of Table 3, we summarize information about the challenges addressed in the articles, which are also indicated in the in-depth analysis of each paper in Sect. 5.3. As with the performance measures, we can see in Table 3 that the challenges faced vary greatly depending on the application context and the goal of the method proposed in each paper. As an example, considering the domain of electric vehicles, Sultanuddin et al. (2023) address several challenges, like avoiding network energy overload at peak times, considering the uncertainty of driving patterns, and managing large state spaces. On the other hand, in addition to the challenge related to dimensionality, Zhang et al. (2021a) also address issues related to coordination and collaboration among agents, the competitiveness of charging demands, and the joint optimization of multiple objective functions. However, although not explicitly stated by the authors, the challenge that unites these papers is the development of approaches capable of adapting to changes in a dynamic environment and of managing the uncertainty associated with the environment, which, in many cases, arises from the use of renewable resources, whose stochastic and intermittent nature adds further complexity to the problem.

5.3 Analysis of single papers (grouped by application domain)

In this section, we group the 35 main papers by application domain and analyze each paper, answering research questions RQ4, RQ5, RQ6, and RQ7. This provides the reader interested in a specific application domain with in-depth knowledge of the main features of these papers. Notice that in answering RQ5, we use the information available in Table 2 and report in the text a “(*)” for all information inferred from reading the article.

5.3.1 Electric vehicles, Batteries, Energy

The transportation system is characterized by an increasing presence of EVs due to their eco-friendly features. Sultanuddin et al. (2023) propose a DDQN-based approach that provides a smart, scalable charging strategy for EV fleets, ensuring that all cars have sufficient charge for their trips without exceeding the maximum energy threshold of the power grid. The charging management system combines information on the current state of the network and vehicles with historical data, being able to schedule charging at least 24 hours in advance. In developing the proposed approach, the authors consider an environment with discrete actions and a stochastic transition model. The experimental evaluation is performed on a synthetic dataset using as metrics the reward, the voltage levels, the load curves, and the charging/discharging curves. The rapid growth in the popularity of EVs subjects the power grid infrastructure to challenges, such as preventing grid overload at peak times. Moreover, the authors address issues related to driving pattern uncertainty and the handling of large state spaces.

Zhang et al. (2021a) propose a framework for charging recommendations based on MARL, called Multi-Agent Spatio-Temporal Reinforcement Learning (MASTER). By leveraging a multi-agent actor-critic framework with Centralized Training and Decentralized Execution (CTDE), the proposed approach increases the collaboration and cooperation among agents, and it can make use of information about possible future charging competition through a delayed access strategy. The framework is further extended to multiple critics for addressing multi-objective optimization. MASTER works in environments characterized by discrete actions (*) and a deterministic transition model (*), and it has been tested on a real-world dataset. To evaluate its performance, the Mean Charging Wait Time (MCWT), Mean Charging Price (MCP), Total Saving Fee (TSF), and Charging Failure Rate (CFR) are used as performance measures. In the development of the proposed charging recommendation approach, the authors face several challenges, such as dealing with large state and action spaces, coordination and cooperation among agents in a large-scale system, the potential competitiveness of future charging requests, and the joint optimization of multiple objectives.

5.3.2 IoT

Recent years have seen rapid advances in IoT technology, enabling the development of smart services such as smart cities, buildings, and oceans. Regarding smart cities, Ajao and Apeh (2023) consider the Industrial Internet of Things and present a framework for addressing edge computing vulnerabilities. Indeed, edge computing security threats endanger the sustainable functioning of urban infrastructure through various attacks, such as Man-in-the-Middle and denial of service. In particular, to tackle authentication and privacy violation problems, this work proposes a secure framework modeled as a Petri Net, namely Secure Trust-Aware Philosopher Privacy and Authentication (STAPPA), on which a Distributed Authorization Algorithm is implemented. Moreover, a GARL approach is developed to optimize the network during learning, detect anomalies, and optimize routing. This work considers an environment characterized by discrete state and action spaces (*) and a stochastic transition model. The authors test the proposed approach on a synthetic dataset and assess the anomaly detection performance using the popular detection accuracy, recall, precision, specificity, and F-measure metrics. Ajao and Apeh (2023) deal with security challenges, in particular authentication and privacy violation problems.

Zhang et al. (2021b) propose an IoT-based Smart Green Energy (IoT-SGE) management system, enabled by DRL, for improving the energy management of power grids. The proposed approach is able to balance power availability and demand by keeping grid states steady, thus reducing power wastage. In developing IoT-SGE, the authors consider an environment with discrete states (*), continuous actions (*), and a deterministic transition model (*). The proposed approach has been evaluated on a synthetic dataset using operational logs, power wastage and requirement, and average failure ratio as metrics. The authors address an energy sustainability issue; in particular, they aim to manage energy requirements and allocate smart power systems.

In the context of smart ocean systems, Han et al. (2020) present an analytical model to evaluate the performance of an Internet of Underwater Things network with energy harvesting capabilities. The goal of this work is the maximization of IoT node throughput by optimally selecting the window size. To this aim, the authors propose an RL approach and leverage the Branch and Bound method to solve the optimization problem by autonomously adapting random access parameters through interaction with the network environment. For a realistic scenario, a MARL approach is proposed to deal with the lack of network information. In this case, random access parameters are autonomously adapted by using a distributed Multi-Armed Bandit (MAB)-based algorithm for each node. The environment considered in this work is characterized by deterministic actions (*) and a stochastic transition model. The authors test the proposed approach on a synthetic dataset, evaluating its performance in channel access regulation in relation to the number of ready nodes per time slot and throughput. Finally, this work addresses a fairness issue due to spatial uncertainty in underwater acoustic communication, which the authors deal with by formalizing an optimization problem for maximizing the throughput of the IoT network nodes.

5.3.3 Water resources

Water resource management is a key aspect of sustainable development, but it usually does not include social aspects. Emamjomehzadeh et al. (2023) propose a novel urban water metabolism model that combines urban metabolism with the Water, Energy, and Food (WEF) nexus (Radini et al. 2021) and can thus consider the interconnections among water, energy, food, material, and GHG emissions. Moreover, this work proposes a physical-behavioral model that relates the proposed approach to a MARL agent-based model, neither fully cooperative nor fully competitive, developed using Q-Learning. In this case, the only technical information available concerns the use of a synthetic dataset. The proposed approach is evaluated in terms of water table level, nitrate density, energy usage, and GHG emissions. Considering water resource management challenges related to sustainability, the authors aim to model and manage the WEF nexus for Integrated Water Resource Management in an urban area, taking into account stakeholders’ characteristics.

Skardi et al. (2020) propose, instead, an approach for quantifying and including social attachments in water and wastewater allocation tasks. This work proposes a paired physical-behavioral model, and the authors leverage Q-Learning to include social and behavioral aspects in the decision-making process. Specifically, they use the approach proposed by Bazzan et al. (2011) to integrate Social Analysis into Q-Learning, and they choose between individual or social behavior through the use of specific reward functions. In developing the proposed method, both a deterministic and a stochastic transition model are considered. Tests are performed on a dataset that combines real-world and synthetic data, and the performance evaluation is conducted considering the water and treated wastewater allocated to the agents, water and groundwater levels, and the concentration of nitrates to measure groundwater quality. Using Social Network Analysis, the authors tackle a key challenge in common resource management, i.e., the cooperation among agents. Also, they aim to quantify and include social attachments in water resource management.

5.3.4 Emissions/pollution

The development of WWTPs has a positive impact on environmental protection by reducing pollution but, at the same time, they consume resources and produce GHG emissions as well as residual sludge. With this in mind, Chen et al. (2021) propose an approach based on MADDPG to control Dissolved Oxygen (DO) and chemical dosage at once and improve sustainability accordingly. Specifically, the proposed approach uses two agents, one to control DO and one to control chemical dosage, whose reward functions are designed based on life cycle cost and on various Life Cycle Assessment mid-point indicators, respectively. The proposed approach is developed considering an environment with continuous state and action spaces and tested on a synthetic dataset. To evaluate the training process, the reward and the Q-values determined by the trained critic networks are used as metrics, while to analyze the variation of the influents and control parameters, the authors leverage the influents (COD, TN, TP, and NH\(_3\)-N), inflow rate, DO, and dosage values. Finally, energy consumption, cost, Eutrophication Potential (EP), and GHG emissions are used to assess the impact of the proposed approach. Since WWTPs reduce contaminants and environmental pollution but, at the same time, consume resources and produce GHG emissions as well as residual sludge, the authors seek to optimize their impact on environmental sustainability.

Intelligent fleet management is crucial in mitigating direct GHG emissions in open-pit mining operations. In this context, Huo et al. (2023) propose a MARL-based dispatching system for reducing GHG emissions. To this aim, this work presents an environment for haulage simulation that integrates a component for the real-time computation of GHG emissions. Then, Q-Learning is leveraged to improve fleet productivity and reduce trucks’ emissions by decreasing their waiting time. In the development of the proposed approach, an environment characterized by discrete state (*) and action spaces is considered. Tests are performed on a synthetic dataset, and productivity, number of operational mistakes, GHG emissions, and time spent in queue are used as evaluation metrics. In this work, the authors tackle operational randomness and uncertainties in fleet management for reducing haul trucks’ GHG emissions in open-pit mining operations.

5.3.5 Agriculture

In the context of sustainable agriculture, one of the key aspects of food security is crop yield prediction. Elavarasan and Durairaj Vincent (2020) tackle this problem by using a DRL approach, specifically a Deep Recurrent Q-Network (DRQN) (Hausknecht and Stone 2015) model, which consists of a Recurrent Neural Network (RNN) (Rumelhart et al. 1986) on top of a DQN. The proposed approach sequentially stacks the RNN layers, feeds the network with pre-trained parameters, and adds a linear layer to map the RNN output into Q-values. The Q-Learning network builds a crop yield prediction environment as a ‘yield prediction game’ that leverages both parametric feature combinations and thresholds useful in agricultural production. The authors consider an environment with discrete states (*) and test their approach on a dataset combining real-world and synthetic data, evaluating the performance by using the following metrics: Determination Coefficient (R2), Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Median Absolute Error (MedAE), Mean Squared Logarithmic Error (MSLE), Mean Absolute Percentage Error (MAPE), Probability Density Function (PDF), Explained Variance Score, and accuracy. Finally, Elavarasan and Durairaj Vincent (2020) address issues related to the application of Deep Learning (DL) methods to crop yield prediction for increasing food production. Specifically, the authors tackle the incapability of DL approaches to directly map raw data, linearly or non-linearly, to crop yield values, and the strong dependence of their effectiveness on the quality of the features extracted from the data.

5.3.6 Data, energy

Data centers are among the largest consumers of energy. Shaw et al. (2022) propose an RL-based Virtual Machine (VM) consolidation algorithm named Advanced Reinforcement Learning Consolidation Agent (ARLCA), whose aim is to simultaneously improve energy efficiency and service delivery guarantees. In this work, a global resource manager constantly monitors the state of the system and identifies hosts that may become overloaded as resource demand changes over time. The proposed approach rebalances the VM distribution and avoids the rapid overloading of hosts while ensuring efficient operation. This work presents two implementations of ARLCA based on two RL methods, i.e., Q-Learning and SARSA, and it tests two different approaches to balance the exploration-exploitation tradeoff, namely \(\epsilon\)-greedy and softmax. Finally, the authors leverage the Potential-Based Reward Shaping (Ng et al. 1999) technique to include domain knowledge in the reward structure and speed up the learning process. ARLCA works in an environment with discrete state and action spaces (*) and a stochastic transition model. Its performance is evaluated on a synthetic (real-world-based) dataset. To evaluate the proposed VM consolidation algorithms, energy consumption, Service Level Agreement Violations (SLAV), number of migrations, and Energy Service Level Agreement Violations (ESV) are used as performance measures. In this work, the authors tackle a key challenge for cloud computing services, namely energy awareness. Further, they also face the slow convergence to the optimal policy of conventional RL algorithms.
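
For readers unfamiliar with the two exploration strategies and the reward shaping technique mentioned above, the sketch below gives generic textbook versions of them; it is an illustration of the standard techniques, not the ARLCA implementation, and all parameter values are arbitrary.

```python
# Generic sketch of epsilon-greedy and softmax action selection, and of a
# potential-based shaping term in the style of Ng et al. (1999). This is an
# illustration of the standard techniques, not the ARLCA implementation.
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                          # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

def shaped_reward(reward, potential_s, potential_s_next, gamma=0.99):
    """Potential-based reward shaping: r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * potential_s_next - potential_s

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q), softmax_action(q), shaped_reward(1.0, 0.3, 0.8))
```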

Renewable energy Aware Resource management (RARE), a DRL approach for job scheduling in a green data center, is presented in Venkataswamy et al. (2023). This work proposes a customized actor-critic method in which the authors use three Deep Neural Networks (DNNs): the encoder, the actor, and the critic. The encoder summarizes information about the state of the environment into a compact representation, used as input for both the actor and the critic. The actor returns the probability of choosing each scheduling action, while the critic estimates, for each action, the total expected value achieved by starting in the current state and applying that action. Moreover, since DRL requires a significant amount of interaction with the environment to explore it and then to adapt a randomly initialized DNN policy, the authors leverage an offline learning algorithm, namely Behavioral Cloning, to learn a policy based on existing heuristic policy data used as prior experience. In particular, the actor network is trained to imitate the action selection process of the data within the replay memory. In developing RARE, the authors consider an environment characterized by discrete states (*) and actions, and they test the performance on both synthetic and real-world datasets by using the total job economic value as a metric. In this work, the authors tackle several challenges related to the application of RL techniques to the context of green data centers. The first issue relates to the environment: the dynamics of green data center environments make the scheduling process difficult, as it has to consider and manage the intermittent and variable nature of renewable energy sources. Moreover, the lack of uniformity in the environments makes it challenging to compare different approaches. The second challenge concerns the absence of discussion regarding the effect of system design choices (e.g., the planning horizon size), which makes it hard to clarify the reasons for the better performance of RL schedulers over heuristic policies. Furthermore, the authors discuss the employment of RL schedulers as black boxes, without considering different configurations, such as the size of the neural network, which could lead to improved performance. Finally, the last challenge highlights that existing RL schedulers do not focus on learning from and improving available heuristic policies.
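
The behavioral cloning step described above can be illustrated, in a strongly simplified form, with the sketch below: a small actor network is trained with a supervised cross-entropy loss to imitate the actions of a heuristic policy stored in a replay memory. Network sizes, names, and data are invented placeholders and do not reflect the RARE architecture (which also includes an encoder and a critic).

```python
# Simplified sketch of offline behavioral cloning: an actor is trained to
# imitate a heuristic scheduling policy. All sizes and data are illustrative
# placeholders, not the RARE implementation.
import torch
import torch.nn as nn

state_dim, n_actions = 32, 8                     # illustrative dimensions

actor = nn.Sequential(                           # stands in for encoder + actor
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, n_actions),                    # logits over scheduling actions
)
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def behavioral_cloning_step(states, heuristic_actions):
    """One supervised update pushing the actor towards the heuristic's choices."""
    logits = actor(states)                        # (batch, n_actions)
    loss = loss_fn(logits, heuristic_actions)     # heuristic_actions: integer ids
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative batch sampled from a (hypothetical) replay memory of heuristic data
states = torch.randn(128, state_dim)
actions = torch.randint(0, n_actions, (128,))
print(behavioral_cloning_step(states, actions))
```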

5.3.7 Urban traffic and transportation

In recent years, the traffic congestion level has increased significantly with a consequent negative impact on the environment. Ounoughi et al. (2022) present EcoLight, an approach for controlling traffic signals based on DRL, which aims to reduce noise pollution, CO\(_2\) emissions, and fuel consumption. The proposed method combines the Sequence to Sequence Long Short Term Memory (SeqtoSeq-LSTM) prediction model with the DQN algorithm. SeqtoSeq-LSTM is used to forecast the traffic noise level that is part of the traffic information given as input to the DQN to determine the action to perform. EcoLight works in environments with discrete actions (*) and has been tested on a real-world dataset. The performance of EcoLight is evaluated by using the MSE, MAE, noise levels, CO\(_2\) emission, and fuel consumption as metrics. In this work, the authors tackle the issue of developing a control method that considers not only mobility and current traffic conditions but also integrates sustainability and proactivity.

On the other hand, Alizadeh Shabestray and Abdulhai (2019) present Multimodal iNtelligent Deep (MiND), a DRL-based traffic signal controller that considers both regular vehicles and public transit and leverages sensor information, like occupancy, position, and speed, to optimize the flow of people through an intersection by using DQN. In developing MiND, the authors consider an environment characterized by discrete states (*) and actions and a stochastic transition model, and they test the proposed approach on a synthetic dataset. To assess the performance of the proposed approach, the following measures are used: average intersection travel time, average in-queue time, average network travel time, and weighted average intersection person travel time. In this work, the authors have to fulfill some important requirements to develop a real-time adaptive traffic signal controller. Indeed, the controller has to consider both regular vehicle and public transit traffic and leverage sensor data on vehicle speed, position, and occupancy; moreover, the decision-making process should be fast.

Aziz et al. (2018) present an RL-based approach to control traffic signals in connected vehicle environments for reducing travel delays and GHG emissions. The proposed method, the R-Markov Average Reward Technique (RMART), leverages congestion information sharing among neighbor signal controllers and a multi-reward structure that can dynamically adapt the reward function according to the level of congestion at intersections. The considered environment presents discrete state (*) and action spaces (*) and a stochastic (*) transition model. The authors test RMART on a synthetic dataset, and to evaluate its performance they use as metrics the average delay, stopped delay, number of stops, and network-wide delay, while to assess the performance from a sustainability point of view they leverage emissions, i.e., CO, CO\(_2\), NOX, VOC, and PM10. Finally, this work deals with the traffic signal control problem to reduce travel delays and GHG emissions by addressing the following issues: the sharing of congestion information among neighbor signal controllers and the dynamic adaptation of the reward function on the basis of the congestion level.

Reducing the number of drivers who commute in search of car parking in urban centers has a positive impact on environmental sustainability. In this context, Khalid et al. (2023) propose a Long-range Autonomous Valet Parking framework that optimizes the path planning of AVs to minimize distance while serving all users by picking them up and dropping them off at their required spots. The authors propose two learning-based solutions: a Double-Layer Ant Colony Optimization (DL-ACO) and a DQN-based algorithm. DL-ACO can be applied in new or unfamiliar environments, while DQN can be used in familiar environments to make efficient and fast decisions since it is pre-trainable. The DL-ACO approach determines the most efficient path between pairs of spots and subsequently establishes the optimal order in which users can be served. To deal with dynamic environments, a DQN-based algorithm is proposed in which the agent learns to solve the task by interacting with the environment, using experience replay memory and a target network. The proposed techniques aim to improve the carpool and parking experience while reducing the congestion rate. In this work, the environment considered is characterized by discrete states (*) and actions (*), and a deterministic (*) transition model. The proposed approach is tested on a synthetic dataset, and execution time, reward, planned path, and distance are used as performance measures. In this work, the authors deal with path planning problems in dynamic environments while ensuring the quality of experience for each user, optimizing the order of user pick-up and drop-off, and finally minimizing the overall distance.

5.3.8 Buildings

Buildings are interesting from a DR and Demand Side Management point of view. In this context, Kathirgamanathan et al. (2021) leverage a DRL algorithm, namely Soft Actor-Critic (SAC), with the aim of automating energy management and harnessing energy flexibility by controlling the cooling set point in a commercial building environment. In developing the proposed approach, the authors regard an environment with continuous states, and they evaluate the performance on a dataset that combines real-world and synthetic data, using as evaluation metrics the energy purchased, energy cost, discomfort, total reward, temperature evolution, and power demand. Kathirgamanathan et al. (2021) tackle the application of DRL methods to automate DR without the need for a building-specific model, as well as their robustness to different operating environments and their scalability. Moreover, the authors point out that the lack of well-established environments makes it challenging to compare RL algorithms across different buildings.
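
For reference, SAC is an off-policy actor-critic method that maximizes an entropy-regularized return,

\[
J(\pi) = \sum_{t} \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\Big[\, r(s_t,a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big) \Big],
\]

where \(\mathcal{H}\) denotes the policy entropy and \(\alpha\) trades off reward and exploration; this entropy term contributes to the robustness across operating conditions that the authors seek for DR control.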

De Gracia et al. (2015) instead consider Thermal Energy Storage techniques, and in particular latent heat storage, to maximize energy savings by leveraging a Ventilated Double Skin Facade (VDSF) with Phase Change Material (PCM) used as a cold energy storage system. By using an RL approach, i.e., SARSA(\(\lambda\)), the authors control the VDSF to optimally schedule the solidification of the PCM through mechanical ventilation during nighttime and the release of the stored cold into the indoor environment at peak demand times, considering weather and indoor conditions. The environment considered in this work presents discrete states (*), discrete actions (*), and a deterministic transition model. Moreover, the proposed approach is evaluated on a synthetic dataset considering electrical energy savings. This work aims to maximize energy savings by considering both the benefit of the VDSF and the energy used in the solidification process. Therefore, it is crucial to determine the best time for the charging process that solidifies the PCM and stores coldness.
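
A minimal tabular SARSA(\(\lambda\)) sketch with accumulating eligibility traces is given below; the state and action encodings (hour-of-day bins and a {ventilate, idle, release} action set) are illustrative placeholders and not the formulation of De Gracia et al. (2015).

```python
import numpy as np

n_states, n_actions = 24, 3              # placeholder: hour-of-day bins x {ventilate, idle, release}
alpha, gamma, lam, eps = 0.1, 0.95, 0.9, 0.1
Q = np.zeros((n_states, n_actions))      # action-value table
E = np.zeros_like(Q)                     # eligibility traces

def eps_greedy(s):
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_lambda_step(s, a, r, s_next):
    """One on-policy SARSA(lambda) update with accumulating traces."""
    a_next = eps_greedy(s_next)
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]   # TD error
    E[s, a] += 1.0                       # accumulate the trace of the visited pair
    Q[:] += alpha * delta * E            # propagate the error along all traces
    E[:] *= gamma * lam                  # decay the traces
    return a_next
```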

5.3.9 Manufacturing

Manufacturing industries are among the largest energy consumers, so it is crucial to develop approaches that make them more energy efficient. In this regard, Wang and Wang (2022) tackle the Energy-Aware Distributed Hybrid Flow-shop Scheduling Problem (EADHFSP). The goal of this work is to simultaneously minimize two conflicting objectives: makespan and Total Energy Consumption. To this aim, the authors formulate a mixed-integer linear programming model of the EADHFSP and combine a Cooperative Memetic Algorithm with an RL-based agent to solve the problem. Two heuristics are combined to initialize the population with diverse solutions, and an improvement scheme then refines solutions using the operator selected by a policy agent, while solution selection relies on a decomposition strategy that balances convergence and diversity. The environment considered in this work is characterized by discrete state (*) and action (*) spaces and a deterministic (*) transition model, and the performance of the presented approach is tested on a synthetic dataset using the Overall Nondominated Vector Generation, the C metric, the hypervolume, and D1\(_R\) as evaluation metrics. This work addresses the EADHFSP with the minimization of makespan and total energy consumption, a challenging problem due to the simultaneous optimization of two conflicting objectives.
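
The improvement scheme, in which an agent picks the refinement operator to apply to a candidate schedule, can be sketched as a simple value-based selection rule over operators. The operator names, the stateless (bandit-style) update, and the improvement-based reward below are illustrative assumptions, not the actual design of Wang and Wang (2022).

```python
import random

# Hypothetical local-search operators for a distributed flow-shop schedule
operators = ["swap_jobs", "shift_job", "reassign_factory", "adjust_speed"]
Q = {op: 0.0 for op in operators}        # value estimate per operator
alpha, eps = 0.2, 0.15

def select_operator():
    """Epsilon-greedy choice of the refinement operator to apply."""
    if random.random() < eps:
        return random.choice(operators)
    return max(Q, key=Q.get)

def update_operator_value(op, improvement):
    """Reward the operator with the (hypothetical) improvement it produced,
    e.g. a weighted reduction of makespan and total energy consumption."""
    Q[op] += alpha * (improvement - Q[op])
```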

Leng et al. (2021) focus on Printed Circuit Board (PCB) manufacturing and propose a Loosely-Coupled Deep Reinforcement Learning (LCDRL) model for energy-efficient order acceptance decisions. The authors leverage DL, specifically a Convolutional Neural Network (LeCun 1989), to obtain an accurate prediction of the production cost, makespan, and carbon consumption of each order from historical labeled order data. The proposed approach then combines the forecasted data with order features to decide whether to accept an order and to determine the optimal acceptance sequence by using an RL approach based on Q-Learning. The authors regard an environment with discrete actions (*) and a stochastic transition model, and they test the proposed method on a synthetic dataset derived from real-world data. As performance measures, the metrics MSE, MSLE, RMSE, and R\(^2\) are used to evaluate the prediction accuracy of LCDRL, while the performance of the approach is assessed in terms of unit profit, total profit, and acceptance rate. This work tackles the problem of order acceptance in PCB manufacturing to achieve energy efficiency, reduce carbon emissions, and improve material usage. Two critical aspects of PCB manufacturing are demand uncertainty and order customization, which can lead to different profits, energy consumption, and carbon emissions; both factors have to be considered in production planning under production constraints.

5.3.10 Mobile and wireless communication

Sustainable energy infrastructures need high-quality communication systems that connect user facilities and power plants and enable information exchange. In this context, Liu et al. (2021) propose the use of a 6G network and Intelligent Reflective Surface (IRS) technology to create a wireless networking platform, and they suggest a DRL method to optimize the phase shift of the IRS and therefore improve the communication quality. By combining the 6G network with IRS technology, the authors provide high-quality coverage while gaining energy-saving benefits. In particular, this work proposes the application of the Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al. 2016) algorithm to configure the IRS phase shift and enhance system coverage. The authors consider an environment characterized by continuous state and action spaces. The performance of the proposed approach is assessed on a synthetic dataset with two reflection units, using as metrics the achievable rate, which measures the service quality, and the transmission power. Developing sustainable energy infrastructure is challenging from several points of view: Liu et al. (2021) tackle the need for an effective, globally covering communication system based on IRS technology, whose phase-shift configuration is itself challenging.
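
Since DDPG recurs in several of the reviewed works (also in Feng et al. 2023 and Bouhamed et al. 2020 below), we recall its two defining updates: the deterministic policy gradient used to train the actor \(\mu(s\mid\theta^\mu)\) against the critic \(Q(s,a\mid\theta^Q)\), and the soft update of the target networks,

\[
\nabla_{\theta^\mu} J \approx \mathbb{E}_{s\sim\rho}\Big[\nabla_a Q(s,a\mid\theta^Q)\big|_{a=\mu(s\mid\theta^\mu)}\;\nabla_{\theta^\mu}\mu(s\mid\theta^\mu)\Big],
\qquad
\theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta',
\]

with \(\tau \ll 1\). The continuous action produced by the actor is what allows quantities such as IRS phase shifts (and, later, sensor trajectories) to be adjusted without discretization.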

In the context of two-tier urban Heterogeneous Networks (HetNets), Miozzo et al. (2015) model the Small Cell (SC) network as a decentralized multi-agent system. The authors’ goal consists of improving system performance and the self-sustainability of the SCs in terms of energy consumption. To this aim, they leverage the distributed Q-Learning algorithm so that every agent learns an appropriate Radio Resource Management (RRM) policy. Miozzo et al. (2015) is extended in Miozzo et al. (2017), where the algorithm is first trained offline to compute the Q-values used to initialize the Q-tables of the SCs employed in the online method. In both approaches, the environment presents discrete states (*) and actions (*) and the dataset used is synthetic. In both works, the authors evaluate the proposed approaches in terms of network performance, by using the throughput gain and traffic drop rate, and in terms of energy performance, by using the energy efficiency and the energy efficiency improvement. Moreover, Miozzo et al. (2015) analyze the behavior of the HetNet considering traffic demand, harvested energy, battery level, policy, and normalized load at the macro Base Station (BS). Also, the authors consider as performance metrics the total amount of energy spent by the system, the average load, the average cell load for the macro BS, the battery outage, and Jain’s fairness index to assess the Quality of Service (QoS) improvement. Finally, Miozzo et al. (2017) assess the computed policy by leveraging the switch-off rate as a performance measure and use the battery level to analyze the convergence of the online algorithm and to evaluate the excess energy over the storage capacity. Both works address the problem of introducing energy harvesting into the computation of sleeping strategies to achieve energy efficiency, which is challenging due to the irregular and intermittent nature of renewable energy sources.
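
The offline-initialization idea can be sketched as follows: a Q-table is learned on simulated traces and then copied into each small cell, which keeps refining it online with the standard Q-Learning update. The toy environment, the state and action dimensions, and the reward below are placeholders used only to make the sketch self-contained; they are not the model of Miozzo et al. (2017).

```python
import numpy as np

n_states, n_actions = 64, 2          # placeholder: e.g. (battery, traffic) bins x {sleep, active}
alpha, gamma, eps = 0.1, 0.9, 0.1

class ToyCellEnv:
    """Toy stand-in for a small-cell environment with random dynamics."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
    def reset(self):
        self.t = 0
        return int(self.rng.integers(n_states))
    def step(self, action):
        self.t += 1
        s_next = int(self.rng.integers(n_states))
        reward = float(self.rng.normal())          # placeholder reward
        return s_next, reward, self.t >= 50

def q_learning(Q, env, episodes):
    """Tabular Q-Learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q

# Offline phase: learn on a simulator, then seed each SC with the resulting table
Q_offline = q_learning(np.zeros((n_states, n_actions)), ToyCellEnv(seed=0), episodes=500)
# Online phase: each SC continues learning from the pretrained values instead of zeros
Q_online = q_learning(Q_offline.copy(), ToyCellEnv(seed=1), episodes=50)
```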

Giri and Majumder (2022), instead, leverage a Deep Q-Learning algorithm to optimize resource allocation in energy-harvesting cognitive radio networks, where primary users’ networks share channel resources with secondary users and nodes can harvest energy from the environment, such as solar or wind. The proposed approach addresses the dynamic allocation of resources to achieve optimal network capacity and throughput, considering QoS, energy constraints, and interference limitations. Moreover, the authors utilize both linear and non-linear energy-harvesting models, proposing a novel reward function that incorporates the non-linear model. The proposed approach works in environments characterized by continuous states (*), discrete actions (*), and a stochastic transition model (*), and it has been tested on a synthetic dataset using reward, capacity, network lifetime, and average delay as performance measures. Giri and Majumder (2022) address the limitations of Q-Learning-based allocation methods, thus allowing the approach to deal with high-dimensional problems, improve convergence, and efficiently harness the collected energy to meet the network’s QoS requirements.

Internet traffic has increased in recent years, and in the development of next-generation networks it is important to address the QoS issue sustainably. In this context, Al-Jawad et al. (2021) propose an RL-based algorithm, named Reinforcement lEarning-based Dynamic rOuting (REDO), to solve routing problems in a Software Defined Network (SDN) environment. The proposed approach leverages Q-Learning to handle traffic flows by determining the most appropriate routing strategy among a set of conventional routing algorithms, with the aim of maximizing the number of flows meeting the Service Level Agreement in terms of throughput, packet loss, and rejection rate. In developing REDO, the authors consider an environment with discrete state (*) and action spaces (*). The performance of the proposed approach is evaluated on a synthetic dataset in terms of throughput, packet loss, rejected flows, PSNR, and Mean Opinion Score (MOS). In the development of next-generation networks like SDN, Al-Jawad et al. (2021) address the problem of providing QoS sustainably through the solution of a traffic flow routing problem.

5.3.11 Electric energy

One way to increase environmental sustainability is to improve the energy efficiency of smart hubs. To this aim, Sheikhi et al. (2016) present the Smart Energy Hub framework, which models distinct energy infrastructures in a unified way. The authors’ goal consists of optimizing the electrical and natural gas consumption of a residential customer through the use of Q-Learning. Moreover, to improve and support information management among users and utility service providers, the proposed framework leverages Cloud Computing based systems. In this case, the only technical information available concerns the use of a stochastic transition model and a synthetic dataset. To evaluate the performance of the proposed approach, the metrics used are the storage charge level, the operational cost, and the primary energy involved. As regards dynamic load management in smart hubs, the authors tackle two issues. The first is related to energy system parameters, which are often assumed to be constant but in practice can vary over time or be stochastic. The second is related to the conventional smart grid architecture, which has several reported issues, including exposure to cyber-attacks, single points of failure, limited memory and storage capacity in the energy management system, and difficulties in implementing real-time early warning systems due to limited energy and bandwidth resources.

In the context of smart energy networks, Harrold et al. (2022) consider a microgrid environment and leverage DRL to control a battery for energy arbitrage and for an increased use of renewable energies, namely solar and wind energy. Specifically, the authors apply the Rainbow Deep Q-Network (Hessel et al. 2018) algorithm and augment the agent’s information with values of demand, Renewable Energy Source (RES) generation, and energy price predicted by an Artificial Neural Network. In this work, the environment considered is characterized by continuous states (*) and discrete actions. The authors test the proposed approach on a dataset that combines real-world and synthetic data and assess the prediction accuracy using the MAPE. Also, they evaluate the performance of the proposed approach through energy cost savings, relative savings, episodic rewards, and the value distribution. This work tackles the problem of controlling an Energy Storage System in a microgrid with its own demand, RES generation, and dynamic energy pricing, in order to perform energy arbitrage and improve the use of RES, leading to reduced energy costs. Finally, the authors point out that the limited availability of data requires an efficient algorithm training procedure.
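
A minimal sketch of this state-augmentation step is shown below, assuming a hypothetical `forecaster` object (standing in for the ANN mentioned above) and placeholder observation fields; it does not reproduce the actual interface of Harrold et al. (2022).

```python
import numpy as np

def augment_observation(obs, forecaster, horizon=12):
    """Append forecast demand, renewable generation, and price for the next
    `horizon` steps to the raw microgrid observation, so that the value-based
    agent can condition its battery actions on the predictions."""
    forecast = forecaster.predict(obs, horizon)   # hypothetical forecaster API
    return np.concatenate([
        np.asarray(obs, dtype=float),   # e.g. battery level, current demand, RES, price
        forecast["demand"],
        forecast["renewables"],
        forecast["price"],
    ])
```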

A key aspect of sustainability and cost-effectiveness in grid operation is optimal energy dispatch. Jendoubi and Bouffard (2022) address a multi-dimensional power dispatch problem within a power system by leveraging MARL, specifically the MADDPG algorithm. The proposed control framework adopts CTDE to improve the coordination among dispatchable units without the need for communication, thus mitigating data privacy and communication issues. The environment considered in developing the presented approach has continuous actions and a stochastic transition model (*). The dataset used to evaluate the performance is synthetic, and the proposed method is evaluated in terms of the annual total cost, the variation of the daily operation cost, photovoltaic (PV) production, aggregated demand, the amount of power to be charged/discharged, the amount of power provided by a diesel generator, the amount of electricity delivered by the electricity provider, the difference in the amount of electricity delivered by the electricity provider between two consecutive time steps, and the Peak-to-Average Ratio (PAR). The authors address the energy dispatch aspects related to the development of control strategies for distributed energy resources in grid operation, with the aim of simultaneously reducing costs and delays and allowing local coordination among energy resources.
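
For reference, the centralized-critic gradient underlying MADDPG-style CTDE is

\[
\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\mathbf{x},\mathbf{a}\sim\mathcal{D}}\Big[\nabla_{\theta_i}\mu_i(o_i)\;\nabla_{a_i} Q_i^{\mu}\big(\mathbf{x}, a_1,\dots,a_N\big)\big|_{a_i=\mu_i(o_i)}\Big],
\]

where the critic \(Q_i^{\mu}\) observes the joint state \(\mathbf{x}\) and all agents’ actions during training, while each dispatchable unit \(i\) executes its policy \(\mu_i\) using only its local observation \(o_i\); this is what allows coordination without communication at execution time.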

5.3.12 Energy

In recent years, international trade and container handling at port terminals have increased greatly. Improving sustainability in port operations closely relates to the energy consumption at Automated Container Terminals, where Automatic Stacking Cranes (ASCs) are used to load, unload, and pile containers. In this context, Gao et al. (2023) propose a digital twin-based approach for container yard management. Specifically, this work focuses on determining the optimal allocation of container tasks and the scheduling of ASCs to reduce the energy consumption of ASCs while maintaining efficient loading and unloading operations. The proposed approach leverages a virtual container yard to simulate the operating plan and a mixed-integer programming model to optimize the scheduling problem while taking energy consumption into account. Finally, the authors use the Q-Learning algorithm to determine the optimal scheduling plan and minimize energy consumption. The environment considered in this work presents discrete actions (*) and a stochastic (*) transition model. The performance of the proposed approach is evaluated on a real-world dataset using the working and non-working energy consumption and the ratio between them as metrics. To improve the sustainability of port operations, this work addresses the problem of optimizing container yard operations to minimize energy consumption. Indeed, several factors can introduce randomness and uncertainty into these operations, and an incorrect distribution of tasks can lead to suboptimal utilization of ASCs.

5.3.13 Wireless sensor network

Regarding embedded systems powered by a renewable energy source, such as an Energy Harvesting Wireless Sensor Node (EHWSN), Hsu et al. (2014) present a method called Reinforcement Learning-based throughput on-demand provisioning dynamic power management (RLTDPM). By leveraging the Q-Learning algorithm, the proposed approach allows the EHWSN to adapt its operational duty cycle to satisfy both the energy neutrality condition and the throughput on-demand (ToD) requirement, ensuring perpetual operation. In developing RLTDPM, the authors regard an environment characterized by discrete states (*) and actions (*) and evaluate the performance on a synthetic dataset by considering the residual battery energy (RBE), the exercised duty cycle (EDC), the offset to the required ToD (OTRT), and the ToD achievability. In this work, the authors address the problem of simultaneously achieving two mutually conflicting goals, i.e., satisfying the ToD requirement and reducing power consumption.

Energy-Harvesting Wireless Sensor Networks (WSNs) are widely employed in problems characterized by energy-constrained operations. In particular, Chen et al. (2016) focus on Solar-Powered Wireless Sensor Networks (SPWSNs) and present an RL-based Sleep Scheduling for Coverage algorithm to improve the sustainability of SPWSN operations. The proposed approach leverages a precedence operator in the group formation algorithm to prioritize sensors in sparsely covered areas, ensuring the desired coverage distribution. Then, the authors propose a multi-sensor cooperation Q-Learning group model to properly choose the nodes’ working modes by leveraging the developed learning and action selection strategies. The whole group learns the sleep schedule by changing the role of the active node. The environment considered in this work presents discrete state (*) and action (*) spaces and a stochastic transition model. The proposed approach is tested on a dataset that combines real-world and synthetic data, and its performance is evaluated in terms of energy balancing between group members (using the potential energy of nodes as a metric), network lifetime, area coverage ratio, the number of residual alive nodes versus the network lifetime, and the recharging cycle. In this work, the authors tackle a sleep scheduling problem to simultaneously achieve the desired area coverage and the energy balance between group nodes needed to extend the network lifetime.

On the other hand, Feng et al. (2023) propose an RL-based approach to maximize data throughput in self-sustainable WSNs. The authors consider a Mobile Sensor (MS) that collects and transmits data to a fixed sink while moving within the network and harvesting energy from the environment. By leveraging DDPG, the MS can determine the optimal trajectory to optimize the energy harvesting performance and data transmission while dealing with unknown energy supply dynamics. The environment considered in this work is characterized by continuous states and actions and a stochastic transition model (*). Moreover, the performance of the proposed approach is assessed on a synthetic dataset considering as evaluation metrics the distribution of the ratio between the expected per-slot harvested energy and the MS-to-sink distance within the network, the moving trajectories of the MS, reward, actor loss, battery level, accuracy, moving steps, convergence, and training time. The authors tackle two main challenges concerning the optimization of the MS’s trajectory to maximize data throughput. The first relates to the lack of energy-related information, such as the placement of the energy sources, the future energy harvesting potential, and statistical parameters like the average energy harvesting rate. The second consists of the tradeoff between energy harvesting and data transmission: moving closer to energy sources allows the MS to harvest more energy, but it may penalize data transmission because the distance between the MS and the sink may increase.

5.3.14 Autonomous vehicles

In the last decade, Unmanned Aerial Vehicles (UAVs), i.e., drones, have been used in various scenarios, such as rapid disaster response, Search-And-Rescue, and environmental monitoring, where humans are unable to operate in a timely and efficient manner, for example, due to the presence of physical obstacles. Bouhamed et al. (2020) consider the application of UAVs as mobile data collection units in delay-tolerant WSNs. The authors propose a mechanism that exploits two RL techniques, namely the DDPG and Q-Learning algorithms. The proposed approach uses DDPG to determine the best trajectory for a UAV to reach the target destination while avoiding obstacles in the environment. Q-Learning, on the other hand, is used to schedule the best order in which to visit the nodes to minimize the time needed to collect data within a predefined time limit. In this work, the environment presents continuous state (*) and action spaces for the DDPG-based part of the approach and discrete states (*) and actions (*) for the Q-Learning-based one. The proposed mechanism is tested on a synthetic dataset and, to evaluate its obstacle avoidance and scheduling performance, the authors analyze the path followed by the UAV, the reward collected, the UAV’s battery level, and the completion time of the tour against the ground unit transmission power. This work addresses issues related to the limited battery capacity of UAVs and the challenges of navigating in obstacle-prone environments to enable communication between the UAV and low transmission power sensors.

Sacco et al. (2021) propose a MARL approach based on the actor-critic framework to tackle the task offloading problem for UAV swarms in edge computing environments, with the aim of simultaneously reducing task completion time and improving energy efficiency. The proposed approach determines a distributed decision strategy through the collaboration among the system’s mobile nodes, which share information about the overall system state. This information is then used by the agents to decide whether to compute a task locally or offload it to the edge cloud; in the latter case, the proposed technique chooses the best transmission technology between Wi-Fi access points and the mobile network. In developing the proposed approach, the environment considered presents continuous states (*) and discrete actions (*), and the dataset used for testing combines real-world and synthetic data. The performance of the presented technique is assessed in terms of task completion time and utility against a varying number of agents and average node-antenna distance. Then, the authors evaluate the energy consumption necessary to complete the task by varying the average node-antenna distance and the computing workload, and assess the task completion time against the average computing workload. In addition, the cumulative distribution function (CDF) and the evolution of the utility through episodes are considered to analyze the variability of performance among nodes and the convergence, respectively. Finally, in this work, the authors tackle the problem of reducing the task completion time of UAV swarms by offloading tasks to the edge cloud while improving energy efficiency.

In the context of autonomous driving, Gu et al. (2023) tackle the application of RL methods focusing on energy-saving and environmentally friendly driving strategies within a cooperative adaptive cruise control platoon. More precisely, the goal of this work consists of training platoon member vehicles to react effectively when the leading vehicle faces a severe collision. The authors leverage the Policy Gradient algorithm to train an RL agent that minimizes the energy consumption of inter-vehicle communication for decision-making while avoiding collisions or minimizing the resulting damage. To this aim, two different loss functions are used, i.e., a collision loss and an energy loss. Moreover, a specifically designed reward function both ensures the vehicle’s safety and accounts for the fuel consumption resulting from the action performed by the vehicle. This work considers an environment characterized by continuous states (*), discrete actions, and a stochastic transition model (*). The proposed approach has been tested on a synthetic dataset using energy loss, collision loss, reward, and lane changes as metrics. A key challenge of green autonomous driving addressed in this work is the development of effective strategies that respond to environmental observations by automatically generating appropriate control signals.
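
For reference, the vanilla Policy Gradient update estimates

\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, G_t\Big],
\]

where \(G_t\) is the return from step \(t\); in a setting like that of Gu et al. (2023), a reward combining the collision and energy terms would enter the update through \(G_t\).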

6 Discussion

The analysis of the literature performed in this work shows that most of the works on RL for environmental sustainability concern the energy application domain, followed by urban traffic and transportation. The main RL technique used in the reviewed manuscripts is Q-Learning. Concerning the 35 selected articles, we observe that most of the papers deal with energy-related issues, and about half of them leverage DRL approaches, such as DQN and DDQN. In developing the proposed methods, the authors mainly consider domains with discrete state and action spaces and stochastic transition models, using synthetic datasets to evaluate the performance.

Problems related to environmental sustainability were traditionally tackled with optimization techniques, in which the concept of adaptability has to be introduced explicitly. In contrast, one of the strengths of RL is its natural way of dealing with adaptability to changing or different environments, a crucial feature in environmental sustainability problems, since in this context the agent has to handle variations in operating conditions due to, for example, changes in resource availability or weather conditions. For instance, Chen et al. (2016) introduce an RL-based Sleep Scheduling for Coverage (RLSSC) approach to ensure sustainable time-slotted operations in solar-powered wireless sensor networks. This algorithm is compared to LEACH (Heinzelman et al. 2002), a highly energy-efficient hierarchical routing protocol in which the node chosen to be active in the current round is ineligible for selection in the subsequent round, and to a random algorithm that determines the active nodes within a group at random. Among the various aspects considered, a crucial criterion for evaluating algorithm effectiveness lies in maintaining equilibrium in energy levels, as significant disparities in residual energy arise when a node receives an energy supplement. RLSSC initially exhibits fluctuations but eventually converges through iterative learning, showing only slight oscillations in response to the varying solar strength throughout the day. Moreover, the proposed approach demonstrates real-time energy balancing among sensor nodes. In contrast, non-RL-based methods lack the capacity to adapt to the dynamic environment. Another aspect to consider is the network lifetime, where RLSSC excels in adapting to the uncertainties associated with the harvesting time and the amount of acquired energy. This adaptability enables RLSSC to dynamically adjust its scheme in real time, effectively extending the overall network lifetime. This is only one of several examples showing that RL can provide a strong advantage in solving problems related to environmental sustainability because of its natural capability to deal with uncertainty and adaptation in sequential decision-making.

However, we identify several open problems in the application of RL techniques to environmental sustainability. These concern scalability, data efficiency, and the necessity of dealing with large data volumes, which often poses cost challenges. In future developments, it is crucial to improve pre-training methods that allow the generation of initial policies by simulation and leverage knowledge acquired by solving related tasks. RL methods are also sensitive to the reward function; therefore, reward engineering is important to avoid a negative impact on performance. Moreover, in dealing with environmental sustainability problems in specific contexts like IoT, it is particularly important to account for computational limitations and to optimize the computational complexity of the method. Finally, we note that most of the approaches involve single-agent systems. Extending the proposed approaches to the multi-agent context would allow the cooperative computation of optimal policies accounting for common performance objectives, thus improving shared resource management and environmental sustainability.

7 Conclusions

This review focuses on the application of RL techniques to address environmental sustainability challenges, a topic of increasing interest in the international scientific community. We have examined several contexts where RL techniques have been recently used to enhance environmental sustainability, offering practitioners insights into state-of-the-art methodologies across diverse application domains. RL has found practical application in environmental sustainability because the inherent uncertainty of this domain poses challenges to strategy learning and adaptation that can be naturally tackled by RL. The review of the literature performed in this survey has identified the most common applications of RL in environmental sustainability and the most popular methods used to address these challenges in the last two decades. We have first provided a quantitative analysis of the state-of-the-art related to the application of RL in environmental sustainability and then analyzed the use of these techniques, focusing on sustainability concerns. In particular, we have provided an overview of the application domains of the proposed RL techniques and the approaches used for environmental sustainability issues. Moreover, we have narrowed our attention to 35 selected papers and provided technical information on the formalization of the RL problem, the performance measures adopted for evaluation, and the challenges addressed.