1 Introduction

We have recently seen astounding achievements by reinforcement learning (RL) agents. In games like Go, Chess, Shogi, and Atari, RL agents have outperformed human players (Silver et al., 2016, 2017; Schrittwieser et al., 2020). In the real-time strategy game StarCraft II, the RL agent AlphaStar ranked in the top 0.2% of human players as of August 2019 (Vinyals et al., 2019a, 2019b). In poker, RL agents have beaten human professionals (Brown & Sandholm, 2017). Many of these advancements were achieved by the RL research community leveraging neural networks (NNs) (Mnih et al., 2013, 2015).

Although the research community has achieved many incredible feats, there are still unsolved challenges. One of these challenges is the incomprehensibility of RL agents. In high-stakes domains like healthcare, autonomous driving, criminal justice, and finance, using uninterpretable artificial intelligence (AI) systems is unacceptable. For example, Lapuschkin et al. (2019) demonstrate that a classifier trained on the PASCAL visual object classes dataset (Everingham et al., 2010) could use a watermark on an image to decide the image’s label. In another example, the Correctional Offender Management Profiling for Alternative Sanctions system used in the United States to assess potential recidivism risk has been accused by ProPublica of being racially biased (Angwin et al., 2016; Larson et al., 2016). When it comes to laws, the launch of the European Union’s General Data Protection Regulation introduces a right to explanation of automated decisions for individuals (Sovrano et al., 2020). All these examples demonstrate problems with the use of AI systems, and as a result, using RL and machine learning (ML) in general is becoming more complicated. Many of these problems become even more pronounced when using NNs. For example, NNs’ predictions can change based on modifications to images that are imperceptible to human eyes (Szegedy et al., 2014). Furthermore, Nguyen et al. (2015) demonstrate that NNs can confidently assign labels to observations that are unrecognizable to humans.

These aforementioned difficulties have caused a renewed interest in explainable artificial intelligence (XAI) (Guidotti et al., 2019; Arrieta et al., 2020; Burkart & Huber, 2021; Ras et al., 2022; Minh et al., 2022). Likewise, this has resulted in a new emerging subfield, explainable reinforcement learning (XRL). XRL is a research field focusing specifically on explaining RL agents, whereas XAI covers many forms of learning, such as unsupervised and supervised learning. In supervised learning, we assume observations are independent and identically distributed, and the goal is empirical risk minimization with an immediate response. In contrast, the agent in RL learns to maximize the return with rewards as the responses, which are not necessarily provided immediately. Hence, the agent needs to consider the short-term and long-term consequences in addition to the immediate response when learning to make decisions. Accordingly, we must develop new methods to explain these RL-specific characteristics, which explanation methods from supervised learning cannot address.

Researchers have published numerous literature reviews on XRL in response to the new challenges of explaining RL agents. However, because the field is developing fast, many recent XRL studies are not covered in these reviews. Moreover, a unified view of the field that structures and organizes these XRL reviews is missing. This systematic literature review provides a unified view of the XRL field. Furthermore, we aim to help stakeholders (e.g., RL researchers and practitioners) become acquainted with state-of-the-art XRL methods and find research gaps. Lastly, we seek to help stakeholders find a suitable method to answer their questions. For example, which method should a stakeholder apply if they want to know, “how can I get the agent to do _?” We achieve these goals by first finding XRL studies through a systematic search and selection process. Afterward, we summarize existing literature reviews, structure the XRL studies into a new taxonomy, and outline what kind of stakeholder questions they can answer. Next, we provide a detailed view of the state-of-the-art XRL methods by closely examining the taxonomy and its methods. Finally, we look at XRL research trends, recommend XRL methods for different stakeholder questions, and propose future directions for XRL based on the reviewed studies.

This systematic literature review is structured as follows. First, we describe the research method used to conduct this systematic literature review in Sect. 2. Next, Sect. 3 describes the background on RL and XAI needed to understand this systematic literature review. In addition, in the same section, we outline some related research fields. Then, Sect. 4 summarizes existing XRL literature reviews and shows how our systematic literature review differs. Section 5 overviews XRL by providing a taxonomy that categorizes the different XRL methods. In the same section, we show different explanation types and RL explainability characteristics, which describe stakeholder questions. Afterward, Sects. 6, 7 and 8 review XRL methods by following the taxonomy. Based on the reviewed methods, we look at XRL trends in Sect. 9.1, recommend XRL methods in Sect. 9.2, and discuss future directions for the XRL research field in Sect. 9.3. We conclude this systematic literature review in Sect. 10. Finally, Appendix A gives a concise summary of the reviewed methods with various details. To summarize, our contributions are:

  • A summary of existing XRL reviews and their contributions.

  • A new taxonomy reflecting the large body of XRL studies, divided into (1) interpretable agent, (2) intrinsic explainability, and (3) post hoc explainability. Furthermore, the taxonomy organizes studies based on how explanations are conveyed: (1) via generation, (2) via representation, or (3) via inspection.

  • An overview of which explanation types and RL explainability characteristics the different taxonomy categories provide.

  • A comprehensive look at 189 XRL studies found using a systematic approach, with a concise overview in Appendix A. For each study, the appendix details the scope, the focus, the experimentation environment(s) or task(s) (or both), whether it performs a user study, and whether the code has been open sourced.

  • An overview of the trends in XRL, recommendations for XRL methods, and future directions to address current challenges based on the reviewed studies.

2 Research method

This section outlines our systematic approach to identifying, evaluating, and reporting studies on XRL methods. To avoid bias in literature selection and make it reproducible and complete, we chose to do the review systematically. This systematic literature review was carried out by partially following the guidelines by Kitchenham et al. (2020). We describe the research questions in Sect. 2.1. Section 2.2 describes the study selection process with the overall process depicted in Fig. 1.

2.1 Research questions

This systematic literature review aims to answer the following research questions:

  • What are the existing XRL literature reviews and their contributions?

  • What are the state-of-the-art XRL methods, and how can we organize them?

  • What kind of stakeholder questions can these methods answer (e.g., “how does the agent work?” and “why did the agent do _?”)?

  • How are the methods evaluated (i.e., were user studies performed, and in what domain or task (or both) are the methods evaluated)?

  • What are the research trends in XRL?

2.2 Study selection process

Fig. 1: The study selection process was performed in three steps: identification, screening, and inclusion. In the identification step, we searched five different databases and removed duplicates automatically. Next, we screened the studies found using a two-stage process and removed studies using pre-defined selection criteria. Finally, we added relevant studies and performed forward and backward searches on them to add additional studies. The figure structure is adapted from the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement (Page et al., 2021)

We started by finding related work (see Sect. 4) and used those as a basis for constructing our search string. Next, we created the following search string: (“reinforcement learning” AND (“explanation” OR “explainability” OR “explainable” OR “XAI” OR “explainable AI” OR “interpretable” OR “transparency” OR “transparent” OR “understandable” OR “interpret” OR “black box”)) OR “XRL”.
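For concreteness, the sketch below shows how the search string decomposes into its component terms. It is purely illustrative and not part of the review's tooling; the variable names are our own.

```python
# Illustrative sketch: assembling the boolean search string from its component terms.
explainability_terms = [
    "explanation", "explainability", "explainable", "XAI", "explainable AI",
    "interpretable", "transparency", "transparent", "understandable",
    "interpret", "black box",
]

quoted_terms = " OR ".join(f'"{term}"' for term in explainability_terms)
search_string = f'("reinforcement learning" AND ({quoted_terms})) OR "XRL"'
print(search_string)
```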

Using the search string, we searched the title and abstract (when available) in the following electronic databases: (1) ACM Digital Library, (2) DBLP, (3) IEEE Xplore, (4) ScienceDirect, and (5) Web of Science. Since the databases overlap, we removed duplicates automatically using Paperpile (LLC, Cambridge, MA). Our search is limited to studies published after 2017 and before July 2022, with a few exceptions. We chose 2017 since not many XRL studies existed before this year, and other reviews already cover them. Moreover, 2017 was the year the Defense Advanced Research Projects Agency (DARPA) launched its XAI program (Gunning & Aha, 2019).

The author conducted the selection process by following the method Selection process for lone researchers (Kitchenham et al., 2020, Page 318). Specifically, we applied the test-retest approach, where studies are reassessed later to check if they still fit the research questions and selection criteria. When uncertain, studies were discussed with a third party. To select studies, we used the following selection criteria:

  1. The study focuses on explainability in RL. Specifically, we omit studies where explainability is a by-product rather than the driving motivation.

  2. The study does not focus on multi-agent RL.

  3. The study is peer-reviewed.

  4. The study is written in English.

Studies were selected in two stages using these selection criteria. In the first stage, we screened the title and abstract for relevance. In the second stage, we screened the full text to decide on inclusion based on the same selection criteria. We included 136 studies after the two screening stages. By forward and backward searching those 136 included studies, we found 53 additional relevant studies. In total, 189 relevant studies on the XRL topic are included. Figure 2 depicts the included studies distributed by year. The number of studies included suggests increasing interest in XRL. The first study selection process on October 13, 2021, found 121 relevant studies. However, to keep this review updated on state-of-the-art XRL methods and reviews, the entire study selection process was reperformed on July 24, 2022, resulting in 183 studies. As the term “XRL” was not included in the original searches, a new search on the term “XRL” was performed on July 6, 2023, resulting in 189 total studies.

Fig. 2: The number of studies reviewed, distributed by the year published. The count for 2022 does not include the entire year

3 Background

This section provides the necessary background to understand the literature review’s content. We give a general overview of RL in Sect. 3.1 and XAI in Sect. 3.2. Finally, we overview some research fields related to XRL in Sect. 3.3.

3.1 Reinforcement learning

RL (Sutton & Barto, 2018) is a subfield of ML and is also known by its less popular names: approximate dynamic programming (DP) and neuro-DP (Bertsekas & Tsitsiklis, 1996). DP in these names signifies the importance of DP (Bellman, 1952, 1966) as the foundation of RL. RL is a framework for constructing intelligent agents that learn to make decisions through interactions with the environment rather than via instructions. In RL, feedback on decisions is provided through rewards, which in psychology is known as reinforcement. This differs from supervised learning because feedback is not necessarily given for every decision made by the agent. As a result, decisions in RL have short-term and long-term consequences in addition to immediate consequences. Moreover, in supervised learning, the observations are independent and identically distributed, which is not true for RL. The RL framework is built on the reward hypothesis, which states that we can formulate the learning goal as maximizing the expected cumulative reward, thus focusing on a sum instead of a single quantity. The expected cumulative reward is known as the expected return.

This section provides the RL background needed to understand the rest of this review. First, we formally define the Markov decision process (MDP) in Sect. 3.1.1. Then, in Sect. 3.1.2, the RL problem is defined, which is the goal of RL.

3.1.1 Markov decision process

An MDP formalizes the sequential decision-making problem mathematically. In an MDP, the actions affect each other, and feedback is given via rewards that are potentially not supplied for every action taken. As a result, the agent in the decision-making problem must consider both immediate and future rewards. Formally, an MDP is defined as a tuple \(\langle \mathcal{S},\mathcal{A},p,r,\gamma \rangle\) where \(\mathcal{S}\) is a finite set of states, \(\mathcal{A}\) is a finite set of actions and \(\gamma \in [0,1]\) is the discount factor. The transition function \(p(s'|s,a)\) is a conditional probability distribution that defines the dynamics of the MDP, where \(s',s\in \mathcal{S}\) and \(a\in \mathcal{A}(s)\). In the state \(s\), the available actions are indicated by the set \({\mathcal {A}}(s)\). In the MDP, we assume states contain complete information. Furthermore, we assume that the probability of transitioning from \(s\) to \(s'\) depends solely on \(s\) and not the entire history, thus satisfying the Markov property. The reward function \(r(s,a)\) provides the reward of taking an action \(a\in \mathcal{A}(s)\) in the state \(s\in \mathcal{S}\), and can optionally rely on the next state \(s'\). The reward \(R\in \mathcal{R}\) is bounded by \(\pm R_\text {max}\). All possible rewards are denoted by the set \({\mathcal {R}}\), which is a finite subset of \(\mathbb {R}\). In sum, all of these elements together form an MDP.
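As a minimal illustration of this formalism (not tied to any reviewed method), the sketch below represents a finite MDP as a Python container. The type and field names are our own choices.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Hashable

State = Hashable
Action = Hashable

@dataclass(frozen=True)
class MDP:
    """Minimal finite MDP mirroring the tuple <S, A, p, r, gamma> defined above."""
    states: FrozenSet[State]
    actions: Callable[[State], FrozenSet[Action]]               # A(s): actions available in s
    transition: Callable[[State, Action], Dict[State, float]]   # p(s' | s, a) as a distribution
    reward: Callable[[State, Action], float]                    # r(s, a), bounded by +/- R_max
    gamma: float                                                 # discount factor in [0, 1]
```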

A policy \(\pi\) is a mapping from a state \(s\in \mathcal{S}\) to an action \(a\in \mathcal{A}(s)\). In the stochastic case, the policy yields a probability distribution over all actions. Like the transition function in the MDP, the policy is modeled such that it is only conditioned on the current state and not the whole history of states and actions. A trajectory is a sequence of states and actions defined by \(\tau =(s_0,a_0,s_1,a_1,s_2,a_2,\dots )\) where \(s_0\sim p_0\) and \(p_0\) denotes the start state distribution. We choose or sample the action from a policy \(\pi\) and sample the next state from the transition function. Thus, we can create trajectories with access to these two functions.
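To make the trajectory definition concrete, the following sketch samples a finite trajectory given a stochastic policy, a start state distribution, and the transition function, using the hypothetical MDP container from the previous sketch.

```python
import random

def rollout(mdp, policy, start_dist, horizon):
    """Sample tau = (s_0, a_0, s_1, a_1, ...) of length `horizon` (illustrative sketch).

    `policy(s)` and `start_dist` are dicts mapping actions/states to probabilities,
    i.e., pi(a|s) and p_0(s).
    """
    def sample(dist):
        outcomes, probs = zip(*dist.items())
        return random.choices(outcomes, weights=probs, k=1)[0]

    s = sample(start_dist)                     # s_0 ~ p_0
    trajectory = []
    for _ in range(horizon):
        a = sample(policy(s))                  # a_t ~ pi(. | s_t)
        trajectory.append((s, a))
        s = sample(mdp.transition(s, a))       # s_{t+1} ~ p(. | s_t, a_t)
    return trajectory
```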

3.1.2 Problem

The RL problem is about discovering a policy \(\pi\) that maximizes the expected return (Achiam, 2018). Assume that we have an MDP and a policy \(\pi\) as defined earlier. Then, the probability of a T-step trajectory \(\tau\) conditioned on the policy \(\pi\) is

$$\begin{aligned} p(\tau |\pi )=p_0(s_0)\prod ^{T-1}_{t=0}p(s_{t+1}|s_t,a_t)\pi (a_t|s_t)\quad \text {where}\quad s_0\sim p_0. \end{aligned}$$
(1)

Under the policy \(\pi\), the probability of taking action \(a_t\) given the state \(s_t\) is denoted \(\pi (a_t | s_t)\).

We define the infinite-horizon discounted return over a trajectory \(\tau\) by

$$\begin{aligned} r(\tau )=\sum ^\infty _{t=0}\gamma ^t r(s_t,a_t) \le \sum _{t=0}^\infty \gamma ^tR_\text {max}=\frac{R_\text {max}}{1-\gamma }, \end{aligned}$$
(2)

where \(\gamma <1\). The infinite-horizon discounted return is used for several reasons (Russell & Norvig, 2020): (1) based on empirical results, humans and animals prefer rewards as soon as possible, (2) in a financial setting, it is better to invest now than later, (3) the uncertainty of rewards increases as time passes, and (4) it is mathematically convenient as shown in Eq. (2).
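As a small numerical illustration (our own, not from a reviewed study), the sketch below computes a truncated version of the discounted return in Eq. (2) and shows how the bound \(R_\text {max}/(1-\gamma )\) behaves.

```python
def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t over a finite reward sequence (a truncation of Eq. (2))."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With every reward equal to R_max = 1 and gamma = 0.99, the return approaches
# R_max / (1 - gamma) = 100, matching the bound in Eq. (2).
print(discounted_return([1.0] * 10_000, gamma=0.99))
```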

Based on the trajectory probability and the return, we express the expected return given an MDP and a policy \(\pi\) by

$$\begin{aligned} J(\pi ) = \int _\tau p(\tau |\pi )r(\tau ) d\tau = \mathbb {E}_{\tau \sim \pi }[r(\tau )]. \end{aligned}$$
(3)

Finally, we define the RL problem by

$$\begin{aligned} \pi _*= \arg \max _\pi J(\pi ). \end{aligned}$$
(4)

That is, finding the optimal policy \(\pi _*\) that maximizes the expected return.
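In practice, \(J(\pi )\) in Eq. (3) is rarely computed exactly; a common approach is to estimate it with Monte Carlo rollouts. The sketch below (our own illustration, reusing the hypothetical `rollout` and `discounted_return` helpers from the earlier sketches) shows such an estimate; finding \(\pi _*\) in Eq. (4) then amounts to searching over policies guided by these estimates.

```python
def estimate_expected_return(mdp, policy, start_dist, horizon, episodes=1000):
    """Monte Carlo estimate of J(pi) = E_{tau ~ pi}[r(tau)] from sampled rollouts."""
    total = 0.0
    for _ in range(episodes):
        trajectory = rollout(mdp, policy, start_dist, horizon)
        rewards = [mdp.reward(s, a) for s, a in trajectory]
        total += discounted_return(rewards, mdp.gamma)
    return total / episodes
```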

3.2 Explainable artificial intelligence

XAI defines an AI that can be understood by a human, including how it works, its strengths and weaknesses, and the behavior it will exhibit in unseen situations. A black box is the opposite, where the system’s internal mechanisms are either incomprehensible or not accessible to a human. The term XAI was popularized by DARPA when they launched the XAI program in May 2017. The word explainable was chosen to signify that an XAI system actively explains to increase a human’s understanding of it. Furthermore, they use the word explainable to emphasize the interest in the human psychology of explanation. XAI has been a research interest since the early 1970s, with expert systems like MYCIN (Buchanan & Shortliffe, 1984) and GUIDON (Clancey, 1987). However, increasingly widespread use of and interest in AI have renewed the attention on XAI, especially since the success of deep NNs in the ImageNet 2012 challenge (Krizhevsky et al., 2012). The recent interest in XAI has quickly created a tremendous amount of new research.

We organized the section as follows. First, Sect. 3.2.1 loosely discusses XAI terminologies. Then, Sect. 3.2.2 explains why we need explainability. Afterward, Sect. 3.2.3 describes the different stakeholders that consume explanations. Next, Sect. 3.2.4 introduces some explanation properties. Lastly, Sect. 3.2.5 overviews explanation evaluations.

3.2.1 Terminologies

Interpretability and explainability are often used interchangeably in the literature (Gilpin et al., 2018; Arrieta et al., 2020). This section loosely discusses them since there is no consensus on a single definition. According to the dictionary (Merriam-Webster, 2022), the word interpret means “to explain or tell the meaning of” or “present in understandable terms”. In the context of XAI, Doshi-Velez and Kim (2017) define interpretability as “the ability to explain or to present in understandable terms to a human”. The human is what we define as the stakeholder, which we elaborate on in Sect. 3.2.3. Murdoch et al. (2019) define interpretable ML as “the extraction of relevant knowledge from an ML model concerning relationships either contained in data or learned by the model”. Lipton (2018), in contrast, suggests that interpretability refers to several ideas and is not limited to one concept. According to Gilpin et al. (2018), explainability differs from interpretability: an interpretable system is not necessarily explainable, whereas an explainable system is interpretable. They define an explainable system as a system that: (1) can justify its decisions, (2) is interactable, and (3) is auditable. In this review, we follow Gilpin et al. (2018) and distinguish that interpretability is passive while explainability is active. When we want to refer to both terms, we write explainability. Thus, we think of both interpretable RL and explainable RL when we talk about XRL.

The technical definition of an explanation remains elusive. According to Gilpin et al. (2022), explanations are objects created due to their functional roles, stakeholders (referred to as the audience in the study), and capabilities. The functional role refers to why stakeholders want or need explanations. The stakeholder is the receiver of the explanation, also known as the explainee. Capabilities are about the AI system’s logical thinking process and its degree of access to the process.

3.2.2 Explainability needs

Explainability needs aim to answer, on a high level, why we need XAI in the first place, unlike stakeholder questions (i.e., the specific questions a stakeholder wants answered, for example, “how can I get the agent to do _?”). We need explainability because the deployment cost is not included in the AI system’s learning objective (Doshi-Velez & Kim, 2017; Lipton, 2018). When the AI system is learning, it tries to optimize the test predictive performance in supervised learning or the return in RL. However, the test predictive performance and the return might not capture the real-world deployment costs because these costs are difficult or impossible to write down formally in mathematical terms. For example, when the RL agent moves from the training environment to deployment, we want robustness to distributional shift. Still, as in the supervised setting, this cannot be easily encoded mathematically. The problem at hand might also require a flexible approximator that is not interpretable. Furthermore, ensuring the objectives are sound by auditing all possible situations is infeasible. The literature has defined several reasons for explainability needs (Doshi-Velez & Kim, 2017; Lipton, 2018; Arrieta et al., 2020; Burkart & Huber, 2021). We list some examples here:

Trust The concept of trust is difficult to define and has been defined differently by different researchers across disciplines (Simpson, 2012; Robbins, 2016). One way to understand trust is whether a stakeholder is willing to delegate the decision-making to the AI system. Thus, if a stakeholder is inclined to let the AI system decide on its behalf, then it trusts the system. Also, trust can be a stakeholder’s confidence that the system will behave as intended.

New insight This need is about the ability to extract knowledge from the AI system to gain a new understanding of the problem at hand. We create the system not necessarily to make decisions but to gain novel insight into the domain.

Making adjustments The idea of changing an AI system encompasses correcting and improving it. Quantities such as accuracy and return indicate the system’s performance, but they are of little help for finding, fixing, and improving issues in the system. Hence, knowing how the system works and its strengths and weaknesses is required to find bugs, fix them, determine when the system might fail, and improve it.

Fairness and being ethical These two needs are related to ensuring that the AI system does not make decisions that, for example, discriminate based on skin color or gender, and that it complies with ethical standards (Goodman & Flaxman, 2017).

Apart from the aforementioned reasons, there are other reasons like effective human and AI collaboration (Hayes & Shah, 2017), privacy (Arrieta et al., 2020), and accountability (Doshi-Velez et al., 2017) that motivate the need for explainability.

3.2.3 Stakeholders

When we discuss explainability, we should reason about it in relation to an audience and their need for explanation (Kirsch, 2017). This signifies that XAI is not an isolated field concerning only ML researchers, but an interdisciplinary field that involves, for instance, human-computer interaction. Some explanations might be helpful and understandable for AI researchers, yet not helpful or even understandable for the AI system’s end-users, because these two groups differ in their explainability goals and expertise. In short, to talk meaningfully about explainability, it should be in the context of a specific stakeholder, such as developers, domain experts, or end-users.

There have been several works in the literature discussing and proposing stakeholder frameworks for ML explainability (Weller, 2017; Preece et al., 2018; Tomsett et al., 2018; Ribera & Lapedriza, 2019; Hohman et al., 2019; Mohseni et al., 2021; Langer et al., 2021). The stakeholder frameworks differ between studies, but there are generally two ways to group stakeholders: based on their role or their expertise (Suresh et al., 2021). In role-based frameworks, stakeholders are grouped by their roles, and their explainability needs align with those roles. In expertise-based frameworks, stakeholders are grouped by their expertise, and their explainability needs follow from that expertise.

3.2.4 Explanation properties

Researchers have proposed different explanation properties with the growing research on XAI (Lipton, 2018; Murdoch et al., 2019; Murphy et al., 2023, Chapter 33.3). Depending on the situation, we need different explanation properties. In this section, we overview some of these properties:

Fidelity and faithfulness Fidelity describes the extent to which an explanation accurately explains the model (Robnik-Sikonja & Bohanec, 2018; Guidotti et al., 2019; Jacovi & Goldberg, 2020). For example, in agent distillation methods, how well a distilled model explains the original model can be measured through accuracy, defined as the number of correct predictions divided by the total number of predictions. Faithfulness also expresses the accuracy of an explanation. Jacovi and Goldberg (2020) state that the term faithfulness often differs between studies and is used inconsistently. Murphy et al. (2023) (Page 1076) define fidelity and faithfulness together. They discuss a measure of faithfulness in terms of how often a distilled model provides the same outputs as the original model, which is the same way the other aforementioned studies use the term fidelity. Similarly, Robnik-Sikonja and Bohanec (2018) use these two terms in the same context. This shows that there is no clear distinction between the two terms in the literature.
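As a functionally grounded example of such a measure (a sketch under our own assumptions, not a definition taken from any particular reviewed study), fidelity can be computed as the agreement rate between a distilled policy and the original agent over a set of sampled states:

```python
def fidelity(original_policy, distilled_policy, states):
    """Fraction of sampled states on which the distilled model reproduces the
    original agent's action choice (an accuracy-style fidelity score)."""
    matches = sum(original_policy(s) == distilled_policy(s) for s in states)
    return matches / len(states)
```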

Completeness This indicates whether an explanation conveys all factors relevant to the decision-making process.

Sparsity It refers to the notion of an explanation being small and compact, which is important since it is easier to understand explanations with fewer components to inspect.

Actionability It refers to restricting an explanation’s content to components that a stakeholder can act upon or adjust.

A more exhaustive overview of different properties can be found in the aforementioned book and studies.

3.2.5 Explanation evaluation

Evaluating explanations is difficult since there is no single mathematical definition of an explanation. Moreover, we must evaluate explanations by considering the explainability needs, the task, the stakeholders, and constraints such as time and attention. For instance, given two explanations and two tasks, a stakeholder might find the first explanation more helpful for the first task but not for the second task, which is explained better by the second explanation. The dependence on the overall setup makes explanation evaluation difficult, emphasizing the importance of evaluating explanations using the intended setup. Researchers have proposed explanation evaluation taxonomies in the literature (Doshi-Velez & Kim, 2017; Mohseni et al., 2021). Doshi-Velez and Kim (2017) proposed dividing explanation evaluation into three levels:

Functionally grounded evaluation This evaluation type involves evaluating explanations computationally, such as measuring explanations’ fidelity or sparsity. This type of evaluation involves no stakeholders and is cheap but does not evaluate explanations on the intended setup.

Human grounded evaluation It involves evaluating explanations using stakeholders but with simplified tasks, for example, participants recruited via Amazon Mechanical Turk completing tasks in games. On the one hand, the evaluation does not include the intended stakeholders and tasks, giving only a partial picture. On the other hand, this evaluation form allows for a larger user pool and more feedback with fewer resources.

Application grounded evaluation It is the most accurate evaluation but also the most expensive since the evaluation involves the intended stakeholders and tasks. For instance, we can use medical doctors in medical diagnosis tasks to test explanations.

3.3 Related research fields

This systematic literature review investigates RL studies focusing explicitly on explainability. There exist other interesting research fields within RL that strive for goals similar to explainability but achieve them in a different way, including human-in-the-loop RL and safe RL. We do not discuss works from these research areas to retain a focused scope on explainability in accordance with our selection criteria. Instead, we briefly describe them and point to more in-depth resources on these topics for further reading.

Human-in-the-loop RL includes studies where a human oracle provides the agent with feedback in real-time. With human-in-the-loop RL, it is possible to align RL agents’ behavior with humans’ mental models of it. In turn, this increases the predictability of and trust in RL agents, which are goals similar to those of XRL. Studies within human-in-the-loop RL range from reward function specification (III & Sadigh, 2022) to exploration (Arakawa et al., 2018). One of the challenges in this field is how feedback from the human oracle should be modeled. In our review, we only included studies where human-in-the-loop is explicitly used for explainability (Fukuchi et al., 2017a, 2017b, 2022; Bewley & Lécué 2022; Cruz & Igarashi, 2021; Tabrez et al., 2019). For further reading, there are many human-in-the-loop RL surveys (Wirth et al., 2017; Li et al., 2019a; Cruz & Igarashi, 2020).

Safe RL aims to learn policies that perform well but at the same time ensure specified safety constraints are respected in training and deployment despite uncertainty. Safe RL is about making sure a policy avoids visiting states that are considered unsafe (Hans et al., 2008). Also, it is about making sure the policy can reach any state from the states it visits so that a negative outcome can be amended (Moldovan & Abbeel, 2012). Surveys that comprehensively cover this topic for further reading include García and Fernández (2015) and Gu et al. (2022).

4 Related work

The success of RL and the recent increasing interest in XAI have resulted in many XRL literature reviews. In this section, we give an overview of previous XRL literature reviews. Furthermore, Table 1 provides a detailed overview of these literature reviews’ contributions and the number of studies they cover. Numerous relevant literature reviews exist for XAI (Arrieta et al., 2020; Burkart & Huber, 2021; Ras et al., 2022; Minh et al., 2022), but we only cover literature reviews focusing on RL.

As far as we know, Puiutta and Veith (2020) published the first XRL literature review. They provide an overview and categorization of XRL methods based on an existing XAI taxonomy (Arrieta et al., 2020). Their discussion points out that the connection between stakeholders and explanations is often not considered, and they suggest that more studies should focus on interdisciplinary work to alleviate this issue. Similarly, Heuillet et al. (2021) adapt an existing XAI taxonomy to categorize XRL techniques. Wells and Bednarz (2021) take a systematic approach to their literature review and follow the methodology by Kitchenham et al. (2009). Their systematic literature review focuses on answering two questions regarding XRL methods: first, what XRL methods exist in the literature, and second, what are their limitations? In contrast to the previous studies, Alharin et al. (2020) propose a novel taxonomy for categorizing XRL methods. Additionally, they describe the taxonomy’s categories in terms of different method properties.

Glanois et al. (2022) focus on interpretable RL. Several reasons motivate their review: (1) the need for interpretability, (2) the increasing number of studies on interpretable RL, and (3) the limited number of studies reviewed by the previously mentioned literature reviews. They propose a new interpretable RL taxonomy and provide more thorough coverage of interpretable RL methods than Puiutta and Veith (2020), Alharin et al. (2020), and Heuillet et al. (2021). They focus mainly on studies published in the past ten years.

The reviews by Puiutta and Veith (2020), Heuillet et al. (2021), and Wells and Bednarz (2021) provide a deep dive into XRL methods, but the scope is limited. As a result, Milani et al. (2022) propose a more extensive and newer literature review on XRL techniques. Additionally, they propose a new taxonomy for XRL methods. Building on the knowledge of the previous literature reviews, Krajna et al. (2022) introduce a new taxonomy for XRL techniques and explore XRL for the multi-agent setting. Vouros (2022) comprehensively reviews XRL methods, concentrating on the deep reinforcement learning (DRL) counterpart. He describes each reviewed XRL method thoroughly, detailing the motivation, assumptions, technical details, evaluation, and more. Finally, Hickling et al. (2022) introduce another XRL review and describe XRL methods and two existing XRL literature reviews (Wells & Bednarz, 2021; Vouros, 2022).

Unlike the other literature reviews, Dazeley et al. (2021a) go beyond reviewing existing XRL methods and concentrate on Broad-XAI. They define Broad-XAI as combining and integrating explanation strategies into an individual explanation that satisfies a stakeholder’s need. Contrasting all previous studies, Zelvelder et al. (2021) review RL application domains and the degree to which XRL is investigated in those domains. Sakai and Nagai (2022) introduce a literature review on explainable autonomous robots, a field related to XRL. Lastly, Sado et al. (2023) describe methods that focus on explaining autonomous robots and agents, which overlap with XRL.

Our work differs in several ways compared to previous surveys and reviews:

  • We propose a novel taxonomy from the perspective of the reviewed studies. Our taxonomy accommodates the large spectrum of XRL methods and has the finesse needed to compare and discuss categories of methods and methods within a category. We believe the categorization of methods in previous works makes doing these comparisons and discussions more challenging. First, in previous works, methods within a category can produce explanations using different mechanisms. For instance, both agent distillation and policy summarization methods produce global explanations but use different strategies. Second, they can express different types of information. For example, feature importance is mostly limited to where the agent looks, while textual justifications allow for explanations with richer semantics. Third, they can produce explanations that answer different stakeholder questions. For example, agent distillation methods can answer specific why questions, but policy summarization methods cannot. Fourth, they can convey explanations in different ways. Our taxonomy takes these points into account. Considering these differences from ours, we believe our taxonomy with finer divisions makes discussions and comparisons easier. We illustrate these issues below.

    Puiutta and Veith (2020), Heuillet et al. (2021) and Hickling et al. (2022) use taxonomies from XAI and do not propose an XRL-specific taxonomy. Wells and Bednarz (2021) propose a new taxonomy, but some of their categories, like Visualization and Policy Summarization, are expansive. For example, the category Policy Summarization includes methods commonly known as policy summarization (Amir & Amir, 2018; Lage et al., 2019b) in the literature. Yet, it also includes agent distillation methods (Verma et al., 2018) and methods aimed at human-robot collaboration (Hayes & Shah, 2017). Alharin et al. (2020) introduce a new taxonomy, but it makes it difficult for the reader to compare categories: categories on the same level range from Computer Vision and Natural Language to Decision Trees and Summarization. The former denote large research fields, while the latter are a machine learning model and an XRL technique. Similarly, the Feature contribution and visual methods category in Krajna et al. (2022) would be easier to discuss using finer divisions instead of as a single category. Glanois et al. (2022) include many useful studies that can promote interpretability, but they do not provide a taxonomy for XRL. In Milani et al. (2022), the subcategory Directly Generate Explanations contains textual justification methods (Ehsan et al., 2018; Wang et al., 2019b; Hayes & Shah, 2017), feature importance methods that require a specific architecture (Goel et al., 2018; Mott et al., 2019), and agnostic feature importance methods that do not involve training an agent (Greydanus et al., 2018; Shi et al., 2022). Vouros (2022) introduces a taxonomy specific to explainable deep RL that consists of the broad categories: solving the (1) Model Inspection, (2) Policy Explanation, (3) Objectives Explanation, and (4) Outcome Explanation problems. These categories could be extended to enable more nuanced comparisons and discussions, but in their current form they are very broad. For instance, the category solving the Policy Explanation problem contains both policy summarization (Amir & Amir, 2018; Huang et al., 2018) and agent distillation methods (Verma et al., 2018; Hüyük et al., 2021).

  • Our literature review is the only systematic one besides Wells and Bednarz (2021) that aims for exhaustive and comprehensive searching for literature explicitly related to XRL. However, they cover less than a fifth of the number of studies compared to ours.

  • Beyond reviewing XRL studies, we extensively summarize existing XRL surveys and a systematic review. This includes outlining their contributions and what challenges they consider currently unsolved in XRL. We believe this enables us to provide a broader view of the XRL field.

  • We divide the category that is often known as post hoc explainability (as seen in Puiutta & Veith 2020; Heuillet et al., 2021; Arrieta et al., 2020) into two categories, post hoc explainability and intrinsic explainability. We believe such a division is important when stakeholders decide on a method to use, as the methods have different use cases and requirements. While the categories overlap, they also have some significant differences, such as performance impact, agent design and training, access to the agent’s internal logic and the environment, and applicability. For instance, only post hoc explainability methods can be used if the stakeholder does not want to train or fine-tune an agent. We discuss these differences later in detail when we present the taxonomy.

  • We describe the explanation types and RL explainability characteristics that the different XRL method categories satisfy. This is an important aspect since stakeholders are the consumers of explanations. Furthermore, our review outlines how explanations are communicated, that is, whether they are conveyed via generation, representation, or inspection. These two aspects, which previous reviews lack, make it easier for stakeholders to find methods suited to their use case.

  • We outline trends in XRL and recommend methods based on stakeholder questions (e.g., “how does the agent work?” and “why did the agent do _?”). This is in addition to future directions that other surveys and reviews focus on.

  • All reviewed methods are concisely summarized in the appendix. This includes each method’s motivation, what it explains, and how it is evaluated.

Table 1 Summary of XRL literature reviews

5 Explainable reinforcement learning

Fig. 3: The MDP of the agent interacting with the environment (abbreviated as env), unrolled. One highlighted part of the figure illustrates comprehending the immediate reasons for the agent’s action, in other words, what explanation methods from supervised learning can explain. The other highlighted parts depict what we want to understand in addition: the sequential nature of RL, including, for instance, the short-term and long-term consequences of actions, and the environment as a way to comprehend the agent (Color figure online)

Learning via interaction, potentially delayed feedback via rewards, and short-term and long-term consequences set RL apart from supervised learning. To gain an in-depth understanding of the agent’s decision-making process, RL explainability needs to solve new challenges in addition to those from supervised learning. We illustrate this explainability difference in Fig. 3. These new challenges include understanding the short-term and long-term consequences of the agent’s behavior and not just the immediate reasons; understanding the agent’s learning objective based on how the environment assigns rewards; and, in the event of a distributional shift, understanding how changes in the starting state distribution and transition function affect the agent. Section 5.1 describes how we organize and classify XRL studies. Section 5.2 describes explanation types and RL explainability characteristics, indicating the types of stakeholder questions groups of methods can answer.

5.1 Taxonomy

Our taxonomy was constructed through several iterations and changed numerous times based on (1) existing XAI taxonomies for supervised learning (Arrieta et al., 2020; Guidotti et al., 2019; Burkart & Huber, 2021; Minh et al., 2022; Ras et al., 2022), (2) previously proposed taxonomies for XRL, and (3) studies from the searches. As a result, our taxonomy is a product of studies from our searches and previous taxonomies. We divide XRL methods into three categories: (1) interpretable agent (IA), (2) intrinsic explainability (IE), and (3) post hoc explainability (PHE), as seen in Fig. 4. IA refers to the agent being readily comprehensible and providing an understanding of the underlying learned relationships. These methods achieve inherent interpretability by representing the agent with a simple function approximator. IE describes methods modifying the RL system before training to make it explainable. PHE is similar to IE but endows the RL system with explainability without modifying it. The methods within PHE aim to extract information about the agent and its behavior after training. Although the categories IE and PHE overlap, we divide the methods into two categories for several reasons:

Performance impact The performance of RL agents is often positively affected or unchanged for methods in the IE category. Mott et al. (2019) show improved performance compared to models without attention bottlenecks. Likewise, Cultrera et al. (2020) show that adding attention leads to superior performance in addition to increased explainability. Tang et al. (2020) display better performance and generalization. Pan et al. (2019) indicate better data efficiency with their method. Other methods like Kim and Canny (2017) demonstrate competitive performance, but not significantly better. Similarly, Lin et al. (2021) do not show performance degradation and, in some cases, even perform better. Methods from other subcategories also show better performance. For example, Wang et al. (2021a) demonstrate that their method both converges faster and obtains higher episodic reward compared to policies without modifications. Similar results are exhibited in Chen et al. (2022). Likewise, Kim et al. (2018) display better performance in comparison to other state-of-the-art methods. In Fukuchi et al. (2017a, 2017b), the explanation mechanism not only explains but also improves learning. In summary, IE methods often maintain or improve performance, while methods in PHE do not affect the agent’s performance.

Agent architecture and training algorithm For methods in IE, the agent might require a specific neural network architecture, for example, Mott et al. (2019) with their model that has a soft top-down attention mechanism, Lin et al. (2021) with their two-part agent, or Yang et al. (2019) with their variational autoencoder modified agent. Many methods in PHE have no such requirement. Nevertheless, some methods in PHE require specific prerequisites such as differentiability, but the requirements are less strict than for IE methods.

In IE, the agent’s performance shown in studies is linked to the specific RL algorithm tested. Thus, the performance of methods in IE is uncertain on untested RL algorithms. PHE methods do not have such a concern since the training algorithm is detached from the explanation algorithm.

Training the agent For methods in IE, agents are trained from scratch or fine-tuned. Thus, a pre-trained agent cannot be explained unless it is modified and trained from the beginning or fine-tuned. This is disadvantageous if the performance of the already trained agent is satisfactory, and the stakeholder only wants to debug it for final verification. For PHE methods, the agent is not changed if training is involved. For example, training a distilled agent does not involve changing the original agent.

Applicability Methods from PHE can be applied to IE agents if certain prerequisites like functions being differentiable are satisfied. For example, a model with decomposed Q-values can be distilled into a decision tree or explained using feature importance methods.

Flexible agent access Many methods from PHE do not require access to the agent’s internal logic. For example, the methods from the agent distillation category only require the state and the agent’s corresponding action or Q-values. Another example is the important state and transition category in PHE, where many of the methods only require access to the agent’s output in the form of Q-values, but not the internal logic. However, since the category is large, this does not apply to all methods in PHE. For instance, many feature importance methods require access to gradients while other perturbation-based feature importance methods only need to be able to probe the agent and receive its output.

Environment access Fewer PHE methods need access to the agent’s internal logic compared to IE methods. But, in turn, greater access to the environment is required. Many of the PHE methods require access to input–output tuples that are assumed to be obtainable by simulating the agent in the environment or via a preexisting dataset. Although IE agents also necessitate this access, this happens during training for IE methods versus after training for PHE methods.

Our top-level categorization is similar to previous studies in XAI (Lipton, 2018; Murdoch et al., 2019; Du et al., 2020; Arrieta et al., 2020). However, the subcategories in this taxonomy are new and designed to accommodate the reviewed XRL studies. We go into details on these in the coming sections as follows. First, Sect. 6 details the methods that fall into the IA category. Next, Sect. 7 describes the methods in the IE category. Finally, Sect. 8 outlines the PHE method category.

Besides the taxonomy mentioned above, we classify the methods by their scope and focus in Section Appendix A. First, we define the method as global if it reveals the overall behavior of the agent, making it possible for the stakeholder to understand the agent’s behavior in multiple states. In contrast, a local method only provides the logic behind the decision-making process that generalizes to a few states. We distinguish between two types of local scope: (1) methods explaining the short-term and long-term consequences, and (2) methods explaining using only the immediate context. Second, we classify methods by whether they try to: (1) solve the XRL problem, (2) solve RL specific problems (e.g., sample efficiency and generalization), and (3) solve application problems (e.g., applying XRL in healthcare, autonomous driving or some other domain).

Fig. 4: Taxonomy of XRL methods. The categories do not sum to 189 studies because some studies span multiple categories

5.2 Stakeholder questions: explanation types and RL explainability characteristics

Stakeholders have different questions they want to ask to satisfy their needs, and different XRL techniques provide different explanations. Some techniques might produce explanations that answer several questions, while others only answer one. This section first outlines six common explanation types used to answer stakeholders’ questions (Lim et al., 2009; Mohseni et al., 2021). These explanation types are:

How does the agent work? A how explanation aims to give an all-inclusive answer to how the agent works and impart an understanding of its global behavior.

What did/will the agent do? A what explanation describes what the agent has done or will do. This is a descriptive explanation of the agent’s behavior based on the history or predicted future.

Why did the agent do_? A why explanation justifies why the agent took a specific action.

Why did the agent not do _? The why not explanation describes why the agent did not choose a specific action, for instance, the stakeholder’s anticipated action. This explanation type is also known as a contrastive explanation.

What would the agent do if _ happens? A what if explanation explains hypothetical questions of how the agent would behave in a specific situation. This type of explanation is known as a counterfactual explanation.

How can I get the agent to do _, given the current state? A how to explanation answers what changes are needed to get the agent to do a specific action. This explanation type is also known as a counterfactual explanation.

In addition to explanation types, we identify RL explainability characteristics, that is, whether the explanation produced includes information about short-term and long-term consequences, uses model information to explain, or both. We add this extra information since the explanation types do not provide the nuance needed to differentiate between, for instance, different why explanations. The short-term and long-term consequences characteristic describes whether the explanation informs by referring to future outcomes (e.g., what happens a few time-steps into the future or the result at the end of an episode). Model information refers to whether methods leverage the model (i.e., the transition and reward function) to explain the agent’s behavior. We outline the explanation types and RL explainability characteristics for IA and the categories of IE and PHE in Tables 2, 3 and 4. The goal is to indicate what kind of stakeholder questions each category of methods can answer, hence making it more straightforward to find suitable methods to answer a particular question.

6 Interpretable agent

Fig. 5: The interpretable agent approach. The explanation is the agent itself, communicated via its representation

The interpretable agent (IA) category consists of agents innately understandable to humans. These methods do not require modifications to be interpretable. Instead, IA methods achieve interpretability by carefully choosing simple function approximators to represent the agent. The interpretable agent category aims to capture methods that are mainly motivated by interpretability. While there are many methods that support interpretability (Nikou et al., 2021; Illanes et al., 2020), interpretability is not their main motivation and is rather a by-product (Burkart & Huber, 2021). We do this in line with the selection criteria to keep this literature review focused on interpretability in RL.

The resulting explanation from these methods is the agent’s representation, as illustrated in Fig. 5. Suppose that an agent is represented using a decision tree; then the decision tree itself is also the explanation. Their functional form allows stakeholders to inspect and understand them out of the box. Thus, the explanation is faithful to the policy being explained since it is the policy itself. Also, with their functional form come inductive biases that bring advantages to generalization (Trivedi et al., 2021; Jiang & Luo, 2019). For instance, Jiang and Luo (2019) point out that relational inductive bias can help the policy generalize better than DRL policies that are not understandable. Furthermore, based on experiments in environments with symbolic representations, these methods offer competitive performance compared to their neural network counterparts (Silva et al., 2020; Qiu & Zhu, 2022; Trivedi et al., 2021).

Although these methods have many advantages, more complex environments may require more flexible function approximators such as neural networks. Even if these methods can obtain high-performing policies in complex environments, decision trees might get too deep and rule lists too long, making them difficult to understand. Moreover, all of these methods are tested in environments where the state is low-dimensional with interpretable features. In environments where this is not the case, applying these inherently interpretable methods is not straightforward, for example, in environments using visual inputs like Atari (Bellemare et al., 2013). In these more complex environments, manual feature engineering is one possibility. However, one of the reasons why deep learning performs so well is its ability to automatically extract features. Another approach is to use deep learning to extract features, but then the feature extraction part remains a black box.

This section provides an overview of these methods organized by their functional form, as depicted in Fig. 6. Table 2 indicates the explanation types and the RL explainability characteristics that methods in IA can provide. The IA category does not explain the short-term and long-term consequences of actions since the MDP formalism only requires the agent to be reactive: the agent only needs to consider the current state and output an action. Consequently, by only inspecting the agent, we do not understand how the past or future (or both) affect the action choices at decision time. Additional mechanisms are needed to cover the RL explainability characteristics for agents in the IA category.

Fig. 6: Taxonomy of the interpretable agent category

Table 2 High-level overview of explanation types and RL explainability characteristics provided by IA methods

6.1 Rule-based

The rule-based category presents methods where rules are used to express agents. The rules can be simple if-then conditionals or more complicated rules, for instance, incorporating fuzzy logic. “IF cart=right slope AND speed=high right then accelerate=positive” is an example of a rule learned by Hein et al. (2017b) in the mountain car environment to control the cart.
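To illustrate the flavor of such rule-based policies (a hand-written toy example with made-up thresholds, not the fuzzy rules actually learned by Hein et al. (2017b)), a mountain car controller could look as follows:

```python
def mountain_car_rule_policy(position, velocity):
    """Toy if-then policy in the spirit of the quoted rule (illustrative only)."""
    if position > -0.5 and velocity > 0.02:   # "cart = right slope AND speed = high right"
        return +1.0                            # accelerate positively
    if velocity < -0.02:
        return -1.0                            # push left to build momentum
    return 0.0                                 # otherwise coast
```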

Hein et al. (2017b) describe the fuzzy particle swarm RL (FPSRL) method that focuses on industrial applications and interpretability. They represent policies using fuzzy rules and use a model-based approach to learn these rules. A model of the environment is learned and leveraged to avoid situations that could be dangerous when exploring during learning.

Real-world data usually includes numerous features, many of which might be unhelpful or redundant. Consequently, the resulting agent from FPSRL can be difficult to interpret since it uses all the features in all of its rules. In response to this difficulty, Hein et al. (2018a) describe the fuzzy genetic programming RL (FGPRL) method. To overcome this problem, FGPRL includes mechanisms to automatically select features, choose compact rules, and optimize policy parameters at the same time. Besides introducing the new method, they also improve FPSRL by extending it with a new feature selection method, making it possible to apply FPSRL to industrial applications where states are high-dimensional.

Huang et al. (2020) present the interpretable fuzzy RL (IFRL) framework that uses the actor-critic architecture to learn policies represented as if-then rules. The rules produced by the framework are interpretable and allow stakeholders to add prior knowledge. Their method is motivated by previous methods’ limitations. These limitations include specifying the policy structure beforehand and the inability to optimize policies before episodes end (Hein et al., 2017b, 2018b; Verma et al., 2018). Likmeta et al. (2020) introduce a rule-based policy for autonomous driving using RL. They sample the parameters from distributions that they optimize using gradient descent. Their work is based on the policy gradient with parameter-based exploration method (Sehnke et al., 2008). The method shifts exploration to the parameters to accommodate deterministic policies. In addition, it relaxes the differentiability requirement.

6.2 Mathematical expression

Physics has expressions that define complex phenomena in simple and compact mathematical expressions. Leveraging the same idea, several studies present approaches that represent RL policies as mathematical expressions to obtain interpretable agents. The equation \(a=\frac{0.62}{\log (s_2)}\) exemplifies a simple policy produced by the method proposed by Landajuela et al. (2021) to control the cart in mountain car.
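For illustration, evaluating such a symbolic policy amounts to evaluating the expression on the current state features. The sketch below implements the quoted example; the guard against a zero denominator is our own addition.

```python
import math

def symbolic_policy(s2):
    """Evaluate the example expression a = 0.62 / log(s_2) for one state feature s_2 > 0."""
    denominator = math.log(s2)
    return 0.62 / denominator if denominator != 0.0 else 0.0
```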

In a similar vein to their work on fuzzy rules (Hein et al., 2017b, 2018a), Hein et al. (2018b) present a model-based batch RL method that represents policies as mathematical expressions trained using genetic programming. Their work is motivated by interpretability, real-world applications, and difficulties with design choices regarding fuzzy rules. To find mathematical expressions representing value functions, Kubalík et al. (2021) use symbolic regression and genetic programming. Specifically, they describe three algorithms: symbolic value iteration, symbolic policy iteration, and a direct solution of the Bellman equation.

Kubalík et al. (2021)’s approach needs a model, and Hein et al. (2018b)’s approach results in lower performance than NN policies. Accordingly, Landajuela et al. (2021) present the deep symbolic policy method, which leverages neural-guided search to discover policies represented as mathematical expressions. The method consists of a policy generator and a policy evaluator. The generator is a recurrent NN that produces policies, and the evaluator assesses them and provides feedback to train the generator.

6.3 Logic-based

This category introduces methods that use logic expressions to represent the RL agent. Focusing on generalization and explainability in RL, Jiang and Luo (2019) present the neural logic RL (NLRL) framework. The framework works with the policy gradient, where states, actions, and policies are expressed in first-order logic. It takes advantage of differentiable inductive logic programming (DILP) (Evans & Grefenstette, 2018) to learn interpretable and generalizable policies. Zhang et al. (2021b) present the off-policy differentiable logic RL (OPDLRL) framework. Their method tackles the execution efficiency, stability, and scalability issues of integrating DILP with DRL. OPDLRL solves the execution efficiency problem by using approximate inference and off-policy training. They employ maximum entropy RL to make the learning process stable. Lastly, they integrate hierarchical RL into the framework to make DILP scalable. The resulting framework resolves problems of combining DILP and DRL and yields interpretable policies. In another approach towards logic-based agents, Gorji et al. (2021) apply a supervised learning method, the Tsetlin machine (TM) (Granmo, 2018), to RL using a customized value iteration algorithm. Kimura et al. (2021) introduce a new method that learns interpretable rules as the policy using logical neural networks (Riegel et al., 2020). Using a semantic parser, the method first parses textual observations into first-order logical facts. Afterward, these facts are fed into a logical neural network to learn rules.

6.4 Tree-based

The tree-based category outlines approaches that represent agents using a tree-based representation. Tree-based models such as decision trees (DTs) are considered interpretable, provided they are shallow and their splits are simple. However, unlike NNs, they cannot be trained online with continuous optimization and are therefore trained offline. Responding to these considerations, Silva et al. (2020) introduce the differentiable DTs (DDTs) approach, which learns DTs using online optimization. They extend Suárez and Lutsko (1999)’s work by highlighting and fixing two problems that hinder interpretability: (1) how splits are performed and (2) how many features each split uses. Besides dealing with these two disadvantages concerning interpretability, they also provide a theoretical analysis of DDTs.
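The sketch below shows a single soft (differentiable) decision node, roughly in the spirit of DDTs: a sigmoid split mixes two leaf action distributions, so the split weights, threshold, and leaves can all be trained with gradient-based RL. This is our toy illustration, not Silva et al.’s exact formulation, which stacks such nodes into a full tree.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class SoftDecisionNode:
        """One soft decision node: a sigmoid split mixing two leaf action distributions."""
        def __init__(self, n_features, n_actions, rng):
            self.w = rng.normal(size=n_features)          # which features the split looks at
            self.b = 0.0                                  # split threshold
            self.left_leaf = rng.normal(size=n_actions)   # leaf logits when routed left
            self.right_leaf = rng.normal(size=n_actions)  # leaf logits when routed right

        def action_probs(self, state):
            p_right = sigmoid(self.w @ state + self.b)    # soft routing probability
            logits = p_right * self.right_leaf + (1 - p_right) * self.left_leaf
            exp = np.exp(logits - logits.max())
            return exp / exp.sum()

    rng = np.random.default_rng(0)
    node = SoftDecisionNode(n_features=4, n_actions=2, rng=rng)
    print(node.action_probs(np.array([0.1, -0.2, 0.05, 0.3])))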

DDTs (Silva et al., 2020) cannot take advantage of function approximators like NNs, and the internal representations in the nodes cannot be substituted. Other works on DTs for RL, such as VIPER (Bastani et al., 2018) and MoËT (Vasic et al., 2022), use imitation learning. Consequently, Topin et al. (2021) propose the iterative bounding MDP (IBMDP). The IBMDP extends the MDP formalism by wrapping around it and adding bounds for state features and additional actions. The key property is that a policy learned for the IBMDP corresponds to a decision tree policy for the underlying MDP. Thus, if we learn a neural network policy for the IBMDP, a corresponding decision tree policy can be extracted for the MDP. The same work shows how existing RL algorithms can be modified to solve the IBMDP.

6.5 Program-based

The program-based category introduces methods that represent policies structured in domain-specific languages. Trivedi et al. (2021) apply a variational autoencoder (Kingma & Welling, 2014) to learn a latent program space. They train the variational autoencoder to reconstruct randomly produced programs so that policies with similar behavior are close to each other in the latent space. After learning the latent program space, they find the agent’s policy by maximizing the return using the cross-entropy method, as sketched below. Previous work on program-based policies relies on premade program templates (Verma et al., 2018, 2019). Since their method produces programs without templates, Trivedi et al. (2021) argue that it yields more flexible policies. Qiu and Zhu (2022) propose a method to train program-based policies using the policy gradient by relaxing the differentiability requirement. Their method learns the architecture and parameters of the policy simultaneously, taking advantage of progress in the neural architecture search literature. Similar to Trivedi et al. (2021), they avoid fidelity and faithfulness issues since no imitation learning is used. Unlike Trivedi et al. (2021), they do not need to learn a latent program space from a premade dataset of programs. Cao et al. (2022) propose a domain-specific language synthesis method that combines the benefits of imperative and declarative programming. With their method, they can synthesize hierarchical cause-effect logic programs with good generalization and interpretability. They compare their method with various baselines in the MiniGrid environment, showing that it has better learning ability, generalization, and interpretability.
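The following sketch shows the generic cross-entropy-method search over a latent space that the above description refers to. The function decode_and_evaluate is a hypothetical stand-in: in the actual setup it would decode a latent vector into a program and return the program’s episodic return, whereas here it is a toy objective.

    import numpy as np

    # Sketch of cross-entropy-method (CEM) search over a learned latent program space.
    # `decode_and_evaluate` is a placeholder for "decode latent vector -> program -> return".

    def decode_and_evaluate(z):
        # Toy objective: pretend the best "program" lies at a fixed latent point.
        target = np.linspace(-1.0, 1.0, z.shape[0])
        return -np.sum((z - target) ** 2)

    def cem_search(dim=8, pop=64, elite=8, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        mu, sigma = np.zeros(dim), np.ones(dim)
        for _ in range(iters):
            candidates = rng.normal(mu, sigma, size=(pop, dim))
            scores = np.array([decode_and_evaluate(z) for z in candidates])
            elites = candidates[np.argsort(scores)[-elite:]]   # keep the best candidates
            mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
        return mu

    best_latent = cem_search()
    print(np.round(best_latent, 2))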

7 Intrinsic explainability

Fig. 7 In the intrinsic explainability approach, methods modify the agent or model (or both) to enable the RL system to become understandable and able to produce explanations

Intrinsic explainability (IE) describes methods that modify the agent or model (or both) to make the RL system explainable. When we say model in this context, we refer to the transition and reward function. Consider, for instance, a method that reduces the state space before training so that the agent operates in the reduced state space. The agent becomes easier to comprehend because the stakeholder needs to inspect fewer situations to gain a global understanding of the behavior. Alternatively, if the agent is represented as a NN, a method might modify the NN architecture, for example by adding an attention module, so the agent can produce saliency map explanations during the forward pass.

The methods change the agent to endow it with the ability to generate explanations. Figure 7 illustrates these examples of approaches where methods transform the agent or model (or both) into their explainable counterpart. IE methods apply the modifications before they train the agent. Consequently, the modifications enabling explainability are tied to the agent and its training and affect the agent’s performance. We divide methods into categories based on how they represent and communicate the explanations. A complete overview of all IE categories is illustrated in Fig. 8.

Table 3 indicates explanation types and RL explainability characteristics provided by the different IE categories. As we can see, the methods in the subcategories produce diverse explanation types and can explain sequential information that methods from the IA category cannot. By offering this table, we hope it becomes easier for stakeholders to find a suitable method for their task.

Fig. 8 Taxonomy of the intrinsic explainability approach. We separate the category based on how the explanation is conveyed: (1) via generation, (2) via representation, and (3) via inspection. The categories do not sum to 59 studies because some span multiple categories

Table 3 High-level overview of categories of methods in the IE based on their explanation types and RL explainability characteristics

7.1 Explanation via generation

This section describes IE methods that modify the agent to generate an object explicitly representing the explanation. The object can be a saliency map, a textual response, or some other explanatory object given to the stakeholder as the explanation. For instance, the explanation “the car stops because the light is red” (Ben-Younes et al., 2022) in autonomous driving illustrates the representation of explanations communicated by methods in this category.

7.1.1 Feature importance

In this section, we overview methods that modify the agent so it can explain by highlighting important features using saliency maps. Saliency maps are defined for most methods in this section as highlighting task-relevant information in the input. Nevertheless, there are some exceptions to how saliency maps are defined, which we outline at the end of this section. The modifications to the agent involve changing the agent’s NN architecture. These methods mainly aim to answer the why question by pointing out features affecting the agent’s behavior.

Kim and Canny (2017) propose an explainable self-driving agent represented using a modified convolutional NN architecture. The agent produces explanations in the form of saliency maps. After generating the saliency maps, clustering and filtering are used to make the explanations concise. These saliency maps emphasize important input parts that causally impact the agent’s behavior. In a similar line of work, Cultrera et al. (2020) introduce an end-to-end model for autonomous driving that can explain its decisions using saliency maps. In contrast to Kim and Canny (2017)’s approach, theirs does not involve post-processing. Also working on driving agents, Bao et al. (2021) introduce the deep reinforced accident anticipation with visual explanation (DRIVE) model. Existing traffic accident anticipation systems lack methods to create visual explanations. In response, DRIVE was created to produce visual explanations in the context of accident anticipation. DRIVE merges two kinds of attention by leveraging the dynamic attention fusion method proposed by the authors. Combining these attentions improves accident anticipation and yields better saliency maps.

Goel et al. (2018) propose the motion-oriented RL (MOREL) method. Their method is motivated by the need for more sample-efficient and explainable systems. In addition, they point out the disadvantage of requiring hand-crafted templates by a previous approach (Iyer et al., 2018). MOREL works by first learning a representation that can be used to find and segment objects in inputs. The representation is later utilized to train the policy. As a result, learning a high-performing policy requires fewer environmental interactions. Moreover, the learned representation makes creating saliency and optical flow maps possible. The saliency map emphasizes the agent’s confidence that objects exist at given locations. At the same time, the motion of the objects is captured by the optical flow map. Mott et al. (2019) propose a new method that uses soft attention to create saliency map explanations. The explanations generated by their system aim to focus on features impacting the agent’s behavior both in the present and future. According to the authors, compared to existing saliency methods for RL (Zahavy et al., 2017; Greydanus et al., 2018), explanations created by their method are easier to understand. Using the asynchronous advantage actor-critic (A3C) algorithm, Itaya et al. (2021) train convolutional NN architectures with two attention modules. The built-in attention modules enable interpretation from two perspectives by explaining the control and state value separately. According to the results, the policy performs better with the attention modules and, at the same time, facilitates explainability. Aiming to capture the input content that causally affects the output, Dai et al. (2022c) introduce a module named conceptual embedding that they integrate into DRL agents. The conceptual embedding extracts concepts by compressing the high-dimensional state into a compact representation. After extracting concepts, importance values are assigned to them via perturbation to explain the agent. In this work, they assume that there is a causal relationship from observation to action. Thus, they can explain the cause and effect between concepts and actions.
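To illustrate the common mechanism behind these built-in attention approaches, the sketch below (our toy example, not the exact architecture of any cited work) computes soft spatial attention over a grid of convolutional features: the attention weights from the forward pass double as a saliency map, so no separate post hoc procedure is needed.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention_forward(feature_map, query):
        """feature_map: (H, W, C) conv features; query: (C,) learned query vector."""
        H, W, C = feature_map.shape
        flat = feature_map.reshape(H * W, C)
        scores = flat @ query / np.sqrt(C)       # attention logits per spatial location
        weights = softmax(scores)                # (H*W,) weights over locations
        context = weights @ flat                 # (C,) attended feature vector for the policy
        saliency_map = weights.reshape(H, W)     # show this to the stakeholder
        return context, saliency_map

    rng = np.random.default_rng(0)
    features = rng.normal(size=(7, 7, 16))
    query = rng.normal(size=16)
    context, saliency = attention_forward(features, query)
    print(saliency.shape, float(saliency.sum()))  # (7, 7) and ~1.0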

Integrating attention modules into an agent’s architecture can hamper its performance (Nikulin et al., 2019). Instead of proposing a new modified architecture, Nikulin et al. (2019) describe a new module that can be inserted into a convolutional NN agent. Through experiments, they demonstrate that the new module provides explainability without degrading the agent’s performance.

Visual inputs contain many features, and RL agents must distill inputs to obtain the relevant ones. However, trying to extract them by brute force can affect training and explainability negatively. To resolve this issue, Zhang et al. (2021c) propose to divide the decision-making process into two parts: first finding the task-relevant features and then using those features to make decisions. To explain decisions, they describe the temporal-adaptive feature attention algorithm, which quantifies the importance of the features. Similarly, Wei et al. (2022) introduce an attention-based feature selection approach that identifies important features and then assesses their importance. Liu et al. (2022) present the adaptive region scoring (ARS) module, which is motivated by how humans process visual data. Their method is incorporated into an agent by modifying the feature extractor, which provides explainability.

In addition to explainability, several other reasons motivate many studies. Focused on generalization, Tang et al. (2020) use neuroevolution to train agents with self-attention, where the self-attention module can be used to explain the agent’s decision-making. Also motivated by generalization, Zambaldi et al. (2019) propose an agent NN architecture with relational reasoning that is likewise used for explainability purposes. Josef and Degani (2020) introduce a DRL agent with built-in attention to provide explainability in the context of safe unmanned ground vehicle navigation in rough terrain. Kim et al. (2022) describe integrating attention with risk-sensitive agents to yield explainability. After reviewing several XRL studies, they argue that they are the first to combine saliency maps with risk-sensitive agents. Finally, Wang et al. (2022) work on incorporating saliency map explanations into agents’ exploration strategies to explore more effectively.

Below, we concisely list how saliency is defined for each method:

  • Goel et al. (2018) provide two different saliency maps, one highlighting objects and another highlighting flow information for each moving object.

  • Mott et al. (2019) highlight important task-relevant information in visual inputs. From their method, we can get two different saliency maps, one on where the agent looks and the other on what the agent looks at.

  • Zambaldi et al. (2019) produce saliency maps that show what different entities in the input space attend to, which shows the relationship between the entities.

  • Bao et al. (2021) produce two different saliency maps, one highlighting the most salient objects, while the other focuses on risky regions in traffic accident anticipation. These two are merged via weighted sum to create a single saliency map that improves traffic accident anticipation.

  • Dai et al. (2022c) highlight relevant concepts using perturbation. Concepts are found via a layer termed concept embedding that compresses the observation.

  • Kim and Canny (2017), Nikulin et al. (2019), Cultrera et al. (2020), Josef and Degani (2020), Tang et al. (2020), Itaya et al. (2021), Zhang et al. (2021c), Kim et al. (2022), Liu et al. (2022), Wang et al. (2022), Wei et al. (2022) highlight task-relevant input features.

Methods in this category provide explanations that are easy to convey, as long as the stakeholder understands the features. These explanations can be used to confirm whether an agent is looking at “reasonable” features rather than spurious ones. In addition, explanations are generated during the forward pass, and thus, they do not require much additional computational power. On the downside, for visual inputs, which most methods in this category focus on, the methods are mostly limited to where the agent looks. Ideally, a stakeholder would like to know not only where the agent is looking but also what it is looking at. For example, is it the car, the car’s color, or the edges of the car triggering the agent’s response? These ambiguities make it difficult to understand the explanation. Furthermore, it has been shown that it is hard for humans to detect spurious signals, even if the saliency explanations can show them (Adebayo et al., 2022). Compared to post hoc saliency methods, the explanations of these methods are faithful since they are used in decision-making, although this claim has been contested; for example, attention is not the same as explanation (Jain & Wallace, 2019; Wiegreffe & Pinter, 2019).

7.1.2 Intended behavior

The intended behavior category describes methods enabling the agent to inform the stakeholder about planned actions several steps into the future. For example, the explanation “I will go left” (Fukuchi et al., 2017b) by a robot in a human-robot collaboration task. Knowing the planned actions makes it possible for the stakeholder to anticipate the agent’s behavior. These methods thus answer the what question, with sequential information embedded in the explanation.

Focusing on human-robot collaboration, Hayes and Shah (2017) introduce a method to answer questions like “When do you do _?”, “What do you do when _?” and “Why didn’t you do _?”. Their approach consists of parsing the stakeholder’s queries by matching them with pre-made templates, finding states matching the queries, and generating natural language explanations of the matching states. According to Fukuchi et al. (2017b), Hayes and Shah (2017)’s approach has three limitations: (1) it needs manual engineering, (2) it assumes that the policy will not change, and (3) it only uses the immediate context to explain actions. To resolve these issues, Fukuchi et al. (2017b) introduce the instruction-based behavior explanation (IBE) method that uses interactive RL. In this framework, an agent gets instructions from an expert. They assume that the agent followed the instructions if it received high rewards in the episode. The instructions can speed up the agent’s learning and are saved for later use by the agent to explain its actions over a short term. They use clustering to explain situations with saved instructions. However, if the policy parameters get updated, this method will not work and needs to be revised. In response to this limitation, Fukuchi et al. (2017a) propose using a supervised learning approach instead of clustering to translate from state to explanation. Extending these two studies, Fukuchi et al. (2022) focus on the connection between the agent and the stakeholder. More specifically, they focus on the communication divergence that may arise between them due to different goals.

Leveraging probabilistic graphical models, Chen et al. (2022) and Wang et al. (2021a) propose an end-to-end driving system. To explain the driving system, they output a semantic mask that provides a bird’s-eye-view of road conditions, objects in the car’s surroundings, and routing information. The semantic mask shows the car’s perception, comprehension of the driving situation, and planned driving route. The planned route gives the stakeholder an understanding of the vehicle’s short-term behavior.

For human-robot collaboration, the type of explanation offered by methods in this category is useful since it reveals the agent’s intent. The main use for these explanations is during real-time collaboration and when the main interest is in the agent’s near-future behavior. On the downside, depending on the task and how far into the future the explanations reach, they may have limited usefulness. For these methods to be useful in real-time situations, the explanations must be sparse and fast to produce.

7.1.3 Textual justification

Textual justification methods enable the agent to provide textual explanations in natural language to the stakeholder. For example, the explanation “The car slows down because it is preparing to turn to the road.” from Kim et al. (2018) explains the behavior of a driving policy. Although textual response explanations can answer a variety of questions, the existing methods in this category mainly respond to the why and why not questions.

To create textual explanations for driving agents, Kim et al. (2018) introduce a new method by extending Kim and Canny (2017)’s work. They create faithful explanations for the driving policy rather than rationalizations that mimic how a human spectator would explain an action. To achieve this, they utilize visual explanations to produce textual justifications. The visual explanation is produced by an attention model represented as a feed-forward neural network that outputs importance values. This neural network is given state features and the previous hidden state from the LSTM model that represents the policy. The textual explanation is generated by a separate neural network, in this case, an LSTM model. In the same work, they create a new dataset, Berkeley deep drive-X, that partially enriches the Berkeley deep drive dataset (Xu et al., 2017) with textual justifications. Focusing on the same application, Ben-Younes et al. (2022) describe a new method to create textual justifications. This method differs from the previously mentioned approach in two ways. First, they use a different approach to create faithful explanations. Second, they focus on generating explanations in the online setting. Besides these works focusing on autonomous driving, Wang et al. (2019b) describe a new method to create faithful textual justifications by leveraging attention.

Cruz and Igarashi (2021) propose interactive explanations using templates in natural language. Utilizing these interactive explanations, stakeholders can find and fix bugs. In addition, stakeholders can make the agent’s behavior align with their preferences. In short, they propose actionable and interactive explanations that are more than just explanatory.

Textual explanations can be easier to understand for a larger group of stakeholders than other types of explanations. Depending on the design of the textual explanations, stakeholders do not need to understand the inner workings of the agent. However, the cost of textual explanation is higher, since some form of human intervention is often needed. If a dataset exists that can be used for explanations, it is often limited to certain domains. For example, Berkeley deep drive-X can only be used for driving environments.

7.1.4 Important states and transitions

The methods within this category explain the agent by pinpointing important states and transitions encountered during learning or after. These states and transitions can be situations where an alternative action can significantly affect the agent’s learning or future outcome (or both). The goal of these methods is to align the agent’s and stakeholder’s mental models through examples of situations. As these situations communicate diverse agent behavior and provide a global overview, the aim is to answer the how and what questions.

According to Dao et al. (2018), Zahavy et al. (2017)’s approach requires manual feature engineering, and Greydanus et al. (2018) provide local explanations that do not give insights into the training process (information about these methods is given in Sects. 8.1.1 and 8.2.1). Motivated by these shortcomings, Dao et al. (2018) describe DRL-Monitor. DRL-Monitor saves important transitions the agent encounters during learning, which can later be analyzed to gain insights. The approach extends the sparse Bayesian RL (SBRL) (Lee, 2017) method but, unlike SBRL, does not require feature engineering. On the downside, as pointed out by Dao et al. (2021), DRL-Monitor saves too many transitions, which makes it costly to use. To overcome this limitation, Dao et al. (2021) present a new approach to balance the information retained and the number of transitions saved. Accordingly, fewer transitions are saved, making it less laborious to analyze them. In a similar line of work, Mishra et al. (2018) introduce Visual-SBRL, which also aims to save important transitions. However, unlike DRL-Monitor, Visual-SBRL does feature engineering via an autoencoder.

The standard MDP formalism is extended into the lazy-MDP by Jacq et al. (2022). In the lazy-MDP, we have a policy trained to solve the standard MDP, called the default policy, and a lazy policy trained to solve the lazy-MDP. For every action, the agent can either delegate action selection to the default policy or use the lazy policy and incur a penalty. Thus, the lazy policy acts only when the action choice is critical enough that the outcome outweighs the penalty, offering a new way to identify critical states.
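The toy sketch below is our schematic reading of the delegate-or-act trade-off just described; the Q-values, penalty, and default policy are made up, and the actual lazy-MDP incorporates the penalty into the reward during training rather than at read-out time.

    import numpy as np

    # Schematic delegate-or-act choice: the lazy agent either hands control to a
    # default policy (no cost) or acts itself and pays a penalty. States where it
    # chooses to intervene can be read off as "critical" states.

    PENALTY = 0.5

    def lazy_action(state, q_lazy, default_policy):
        delegate_action = default_policy(state)
        q_delegate = q_lazy[state, delegate_action]           # value of delegating
        best_action = int(np.argmax(q_lazy[state]))
        q_intervene = q_lazy[state, best_action] - PENALTY    # value of acting, minus cost
        if q_intervene > q_delegate:
            return best_action, True      # critical state: worth paying the penalty
        return delegate_action, False

    q_lazy = np.array([[1.0, 1.1],        # state 0: actions nearly equivalent
                       [0.0, 2.0]])       # state 1: action 1 is much better
    default_policy = lambda s: 0
    for s in range(2):
        action, intervened = lazy_action(s, q_lazy, default_policy)
        print(f"state {s}: action={action}, intervened={intervened}")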

This form of explanation is useful if justifications for specific situations are not needed. It is valuable if the goal is to understand the agent’s behavior in general. The difficulty with these methods is to find states helpful to the stakeholder, which can differ based on their needs. Thus, the importance measure used to assess which states to save must be adapted to the situation. Another difficulty is whether looking at a state is enough for the stakeholder to understand why a state was picked, or if more information is needed.

7.1.5 Expected outcome

This section presents studies that aim to explain the short-term and long-term consequences of the agent’s decisions. The consequences can range from what the agent will encounter after choosing a specific action to how long it will take to reach the goal state as a result of that action. Additionally, the methods in this category can contrast the outcomes of different actions, thus answering why not questions.

A set of methods decomposes the reward into interpretable components, since finer details provide a better understanding (Erwig et al., 2018; Juozapaitis et al., 2019; Anderson et al., 2019). For example, Juozapaitis et al. (2019) show that in a gridworld environment, the reward can be decomposed into cliff, gold, monster, and treasure components. By using the decomposed reward rather than a single numerical value, the methods can, in turn, learn decomposed Q-values that are more meaningful than plain Q-values. These decomposed Q-values can explain decisions by pointing to the expected outcome and contrasting the consequences of various actions in a state. Although decomposed Q-values are more meaningful, Anderson et al. (2019) demonstrate that different situations require different explanations. Moreover, they show that reward decomposition and saliency maps complement each other. Focusing on safety in human-robot collaboration, Iucci et al. (2021) introduce a new method that integrates the reward decomposition method with Hayes and Shah (2017)’s method. When both methods are used together, the stakeholder’s trust increases because of better explainability. Likewise, Rietz et al. (2022) propose a new XRL method by extending the reward decomposition method. According to the authors, the reward decomposition method lacks a high-level overview and context. To resolve this issue, they integrate hierarchical RL with the reward decomposition method. Feit et al. (2022) focus on explaining deep RL for self-adaptive systems by combining two existing methods: reward decomposition (Juozapaitis et al., 2019) and interestingness elements (Sequeira et al., 2019). They argue that reward decomposition does not by itself surface states that are interesting for a stakeholder to understand, while the interestingness elements method provides interesting states but no details beyond that. Hence, combining these methods addresses both weaknesses. Terra et al. (2022) introduce a new method, both ends explanations for RL (BEERL). BEERL aims to explain both input features and output rewards. They reason that existing methods like saliency methods only explain input features, while reward decomposition (Juozapaitis et al., 2019) only considers rewards. BEERL therefore correlates feature importance with output rewards, giving stakeholders more comprehensive explanations.
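The sketch below illustrates the core explanation pattern of reward decomposition: each action carries a vector of per-component Q-values that sums to the usual scalar Q-value, and contrasting two actions component-wise yields a “why this action rather than that one” explanation. The component names follow the gridworld example above, while the numbers are invented for illustration.

    import numpy as np

    components = ["cliff", "gold", "monster", "treasure"]
    q_decomposed = {
        "left":  np.array([-2.0, 0.5, -0.1, 1.0]),
        "right": np.array([-0.2, 0.4, -0.8, 1.1]),
    }

    def scalar_q(action):
        # The component values sum to the ordinary scalar Q-value.
        return q_decomposed[action].sum()

    def contrast(chosen, alternative):
        """Per-component advantage of the chosen action over the alternative."""
        diff = q_decomposed[chosen] - q_decomposed[alternative]
        return sorted(zip(components, diff), key=lambda kv: -kv[1])

    best = max(q_decomposed, key=scalar_q)
    print("chosen:", best, "Q =", scalar_q(best))
    for name, delta in contrast(best, "left" if best == "right" else "right"):
        print(f"  {name:>8}: {delta:+.2f}")

In this toy example, the contrast shows that “right” is preferred mainly because it avoids the cliff component, which is exactly the kind of contrastive answer these methods aim for.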

Building upon the idea of contrasting outcomes in reward decomposition, Lin et al. (2021) propose a technique where they first construct interpretable features and then use them to predict Q-values. They construct a two-part agent, which they coined the embedded self-prediction model. The first part predicts the expected discounted cumulative features, and the second part uses these aggregated features to predict Q-values. By contrasting the expected cumulative features of the actions, their method can generate contrastive and minimal sufficient explanations (Erwig et al., 2018; Juozapaitis et al., 2019). Instead of explaining the result of an action, Yau et al. (2020) explain the time it takes for the agent to reach an episode’s end. They present an approach that estimates the expected discounted number of state visits from a state to explain the policy, motivated by the observation that goal-oriented explanations are associated with 70% of daily life explanations. Because this information cannot be extracted from the Q-function, additional information is saved during policy learning to create these goal-oriented explanations. Focusing on autonomous driving, Pan et al. (2019) introduce the semantic predictive control framework. Their method forecasts the evolution of features to explain to stakeholders the future outcome of actions.

Similar to the intended behavior category, methods in this category offer explanations explaining the future of a specific situation. In contrast, methods in this category offer more detailed explanations that justify actions in terms of future outcomes and not only short-term outcomes of actions. This is more helpful in situations where time is not a pressing matter and more detail is needed. To understand explanations presented by methods in this category, more domain knowledge is needed, since there is often reference to rewards and engineered input features.

7.1.6 Generative modeling

This category consists of approaches to understanding the agent using generative models. These generative models can, for example, be variational autoencoders and generative adversarial networks. By utilizing the latent encoding of these generative models, methods in this category can create why not, what if, and how to explanations.

Yang et al. (2019) propose a new architecture, the action conditioned (AC)-\(\beta\) variational autoencoder. First, the method disentangles the latent space into interpretable dimensions. Then, the policy uses the interpretable dimensions to make decisions and reconstructs them by conditioning on the actions. The goal is to understand how the interpretable dimensions affect the actions by moving in the latent space and reconstructing them using the decoder. Rupprecht et al. (2020) propose a generative model similar to the variational autoencoder. The new model comes with a new loss function and aims to generate counterfactual states to comprehend the RL agent. First, they modify the evidence lower bound so that the agent interprets both the inputs and reconstructions similarly. In addition, they extend the reconstruction loss to concentrate more on the crucial input areas. Finally, they introduce a new method to generate counterfactual states that can be interesting and useful.

Olson et al. (2019, 2021) present a method using deep generative models to make counterfactual states. The counterfactual states are made by moving in the latent space and are then used by the policy to make decisions. Doing so makes it possible to ask what if questions and, in turn, understand the policy’s behavior in new states. The proposed architecture utilizes the adversarial autoencoder (Makhzani et al., 2015) and the Wasserstein autoencoder (Tolstikhin et al., 2018).
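The generic recipe shared by these methods is sketched below: encode a state, nudge one latent dimension, decode the counterfactual, and ask the policy what it would do there. The encoder, decoder, and policy are random linear stand-ins of our own, not the adversarial or Wasserstein autoencoders of the cited work.

    import numpy as np

    rng = np.random.default_rng(0)
    STATE_DIM, LATENT_DIM, N_ACTIONS = 6, 3, 2
    W_enc = rng.normal(size=(LATENT_DIM, STATE_DIM))
    W_dec = rng.normal(size=(STATE_DIM, LATENT_DIM))
    W_pol = rng.normal(size=(N_ACTIONS, STATE_DIM))

    encode = lambda s: W_enc @ s          # stand-in encoder
    decode = lambda z: W_dec @ z          # stand-in decoder
    policy = lambda s: int(np.argmax(W_pol @ s))   # stand-in trained policy

    state = rng.normal(size=STATE_DIM)
    z = encode(state)
    print("original action:", policy(decode(z)))

    for dim in range(LATENT_DIM):         # "what if" traversal along each latent axis
        z_cf = z.copy()
        z_cf[dim] += 2.0                  # step along one latent dimension
        print(f"latent dim {dim} shifted -> action {policy(decode(z_cf))}")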

One of the greatest challenges for methods in this category is generating realistic counterfactual states. Counterfactual states should be in-distribution, that is, actually plausible. The danger is that generated states are out-of-distribution, showing the agent’s behavior in unrealistic situations. With unrealistic counterfactual states, a stakeholder might trust the results less. On the bright side, these methods offer a quicker way to understand an agent’s behavior in interesting states without running numerous simulations to find them.

7.2 Explanation via representation

This section outlines methods that provide interpretability by communicating the agent’s representation, as in Sect. 6; for instance, a decision tree agent where the decision tree itself is the explanation. Providing interpretability via representation also includes, for example, how we express the state space, since reducing it can also alleviate interpretability problems. Compared to Sect. 7.1, the methods in this section do not produce explanations by explicitly generating an object.

7.2.1 State abstraction

The state abstraction category details methods making agents more interpretable by reducing the state space, typically by clustering states into abstract states. Reducing the state space reduces the number of situations we must consider when trying to understand the agent’s behavior. Thus, these methods do not directly generate explanations, but they still make the agent easier to interpret.

For environments where simulation costs are high, Bougie and Ichise (2020) argue that Verma et al. (2018)’s method might not be appropriate (information about the method is given in Sect. 8.2.2). Hence, they propose an approach that creates rules and then uses them to cluster states. They train the agent by leveraging the abstract states deduced from the rules. As a result, several Q-values can be modified simultaneously, increasing sample efficiency and interpretability. Akrour et al. (2021) introduce an approach using a mixture of experts represented with fuzzy logic and interpretable experts. Their agent chooses actions by comparing the current state with a list of abstract states, which has to be small to enable interpretability. The states representing each abstract state are chosen from interaction data to guarantee interpretability.
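A minimal sketch of the general state-abstraction idea follows: cluster raw states into a handful of abstract states and keep a small tabular Q-function over them, so a stakeholder only has to inspect a few rows to see the whole policy. The clustering here is plain nearest-centroid assignment on random data, not the rule-derived or decision-point abstractions of the cited methods.

    import numpy as np

    rng = np.random.default_rng(0)
    states = rng.normal(size=(500, 4))            # raw states
    K, N_ACTIONS = 5, 3

    # Simple k-means-style clustering (a few iterations suffice for illustration).
    centroids = states[rng.choice(len(states), K, replace=False)]
    for _ in range(10):
        labels = np.argmin(((states[:, None] - centroids) ** 2).sum(-1), axis=1)
        centroids = np.array([states[labels == k].mean(0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])

    abstract = lambda s: int(np.argmin(((centroids - s) ** 2).sum(-1)))

    # Only K x N_ACTIONS entries are needed to describe the (toy) policy globally.
    q_table = rng.normal(size=(K, N_ACTIONS))
    s = states[0]
    print("abstract state:", abstract(s),
          "greedy action:", int(np.argmax(q_table[abstract(s)])))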

Focusing on the time discretization problem in batch RL and the healthcare setting, Zhang et al. (2021a) propose to create abstract states by locating states they term decision points. Decision points are states deemed important, namely states where similar patients are given different treatments. They use batch data to determine these decision points, cluster them, and train the agent using the resulting abstract states.

With a reduced state space, it is easier for a stakeholder to understand an agent’s behavior. It might become easier to determine when certain actions are executed and what conditions trigger them. Also, since these reductions are used during the agent’s learning, they can speed up learning. However, even the reduced state space can be overwhelming and large in complex environments. Thus, methods from this category might work better if they are combined with methods from the important states and transitions category to select or highlight abstracted states for inspection.

7.2.2 Task decomposition

The task decomposition category contains methods that make the agent interpretable by breaking a task into smaller and more compact problems. Essentially, the methods in this category use a divide-and-conquer procedure to solve the XRL problem. A common RL approach to dividing a task into subtasks is through hierarchical RL. Hierarchical RL decomposes a task into a hierarchy of subtasks, where we can solve the parent task by solving the child tasks, using them as primitive actions (Hengst, 2010).

Motivated by lifelong learning, Shu et al. (2018) describe a new hierarchical RL framework. In this framework, the agent learns to act by recursively utilizing previously learned policies to train new policies that solve a problem. In addition, a stochastic temporal grammar model is used to keep track of the connections between the tasks. Finally, each task is labeled using human language to keep the framework interpretable. Likewise focusing on lifelong learning, Wu et al. (2020) introduce the model primitive hierarchical RL (MPHRL) framework. In MPHRL, they assume that substandard models of the world exist, which perform well in a specific area but are suboptimal outside it. Utilizing these world models, they decompose the task and learn several sub-policies. After training, a gating controller uses these sub-policies as a mixture of experts to act. Beyret et al. (2019) propose a new hierarchical RL method, named dot-to-dot, that focuses on solving robotic manipulation tasks. In this method, a high-level policy learns sub-goals and assigns the corresponding subtasks to sub-policies. The sub-policies try to maximize the return for the sub-goals, while the high-level policy tries to maximize the overall return. These subtasks are smaller and potentially more manageable. Thus, when stakeholders want to understand the agent, they can inspect the high-level policy without getting bogged down in the details of the sub-policies. Likewise, Ye and Yang (2021) propose a hierarchical RL approach named hierarchical policy learning with intrinsic-extrinsic modeling (HIEM) for object finding tasks. HIEM similarly employs a high-level policy and sub-policies to solve tasks. Gangopadhyay et al. (2022) introduce the hierarchical program-triggered RL (HPRL) framework, which focuses on autonomous driving. Similarly to the previous approaches, HPRL utilizes a high-level policy and sub-policies. Specific to HPRL, the high-level policy is represented as a structured program that can be inspected and can overrule sub-policies for safety.
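The common two-level control flow shared by these frameworks can be sketched as follows: a high-level policy selects a human-readable sub-goal, and a dedicated sub-policy produces primitive actions for it. The policies below are hard-coded toy functions of our own; only the structure, where the sub-goal labels themselves serve as a coarse explanation, is the point.

    def high_level_policy(state):
        """Pick a labeled sub-goal; the label itself is a coarse explanation."""
        return "reach_door" if state["distance_to_door"] > 0 else "open_door"

    def sub_policy(sub_goal, state):
        if sub_goal == "reach_door":
            return "step_forward"
        if sub_goal == "open_door":
            return "turn_handle"
        return "no_op"

    state = {"distance_to_door": 3}
    trace = []
    while state["distance_to_door"] >= 0:
        goal = high_level_policy(state)
        action = sub_policy(goal, state)
        trace.append((goal, action))
        state["distance_to_door"] -= 1

    for goal, action in trace:
        print(f"sub-goal={goal:<11} action={action}")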

Lyu et al. (2019) present a new framework, symbolic deep RL (SDRL). SDRL consists of a symbolic planner, a meta controller, and a controller. The symbolic planner does long-term planning, the controller learns policies to act, and the meta controller evaluates and bridges these components. Hasanbeig et al. (2021) describe DeepSynth, which aims to solve problems with sparse rewards and partial observability. DeepSynth learns a deterministic finite automaton (DFA) that keeps track of the tasks’ sequential dependencies. For each DFA state, there is a policy specializing in the corresponding task. The DFA can be inspected to gain insights into the decision-making process.

Breaking down a high-level action into smaller actions provides a stakeholder with a better understanding of how an agent behaves. However, these methods still do not shed light on the smaller decomposed actions and only answer why questions for the high-level actions. This is especially difficult in cases where the smaller actions taken do not match human intuition. Moreover, there is the difficulty of how to decompose an action and how these methods will scale to more complex environments.

7.2.3 Reward function

The reward function category includes various methods that leverage the reward function to understand the agent, for example, by explicitly representing the reward function in an interpretable format. Doing so helps stakeholders understand the agent’s goal by understanding the reward assignment. In addition, using the reward function, a stakeholder can model agent behavior and align it with their preferred behavior.

Tabrez et al. (2019) study human-robot collaboration where an accurate mental model of the task is crucial since it leads to safer and smoother teamwork. They describe the reward augmentation and repair through explanation (RARE) framework that aims to assist a stakeholder. More specifically, it assumes that the stakeholder has an internal reward function that the agent can estimate. If the reward function is wrong, the agent explains it to the stakeholder to help correct the reward function. Thus, their mental models will align, which improves teamwork.

To specify an agent’s behavior, Li et al. (2019b) employ formal methods to define an interpretable reward function. Similarly, Bautista-Montesano et al. (2020) describe a new approach that uses fuzzy logic to define the reward function in the context of autonomous driving. To learn an interpretable reward function represented using a tree-based model, Bewley and Lécué (2022) describe an approach using preference-based RL. They create and refine tree-based reward functions using human preferences over behaviors. Compared to the previous methods, this approach has the advantage of offering several ways to express the reward function. Bica et al. (2021) aim to learn what if explanations of expert behavior from batch data in terms of an interpretable reward function. To achieve this, they leverage counterfactual reasoning and use batch RL, since interactive learning is impossible in healthcare. The reward function is expressed as a linear function of the expected outcomes conditioned on history. A linear function is inherently interpretable and can be inspected to understand how experts from various organizations reason and value different outcomes when making decisions.

Reward functions can be difficult to specify (Abbeel & Ng, 2004), but having an interpretable reward function can increase the understanding of agent behavior. However, knowing the reward function and understanding how it works does not stop the agent from reward hacking and learning unwanted behavior. Hence, although having an interpretable reward function helps, it does not fully shed light on the behavior the agent will learn. There are exceptions to this where the reward function itself is closely tied to the learning process, for instance, Bewley and Lécué (2022). The main use case for methods in this category is when stakeholders want to modify the reward function and, in turn, change the agent’s behavior.

7.3 Explanation via inspection: exploratory analysis

The explanation via inspection category presents studies that propose new agent representations, making it easier to analyze agents. However, a stakeholder needs to inspect, analyze, and assess the explanations manually to extract insights from the agents. Compared to the methods we have already seen, the explanations produced by the methods in this category are more open-ended.

Using a modified NN architecture, Annasamy and Sycara (2019) describe a new approach using Q-learning and an autoencoder with key-value memory. The key-value memory can be analyzed using t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten & Hinton, 2008) to produce global explanations. However, t-SNE can be challenging to use, although often used in the context of explainability (Wattenberg et al., 2016). In addition to these global explanations, the method can produce local explanations using saliency maps.

Focusing on robot collision avoidance, Kuramoto et al. (2020) present a new NN architecture where the network’s hidden layers are easily visualizable to understand the agent. Tylkin et al. (2022) use neural circuit policies (NCPs) to represent agents in the flight domain. NCPs have few neurons, which makes analyzing the agent easier, for example by visualizing neuron activations and characterizing them using decision trees.

The open-endedness of these methods requires more human labor, making them better suited to situations where there is sufficient time to explore the explanations. Nevertheless, they are very useful in cases where a more detailed analysis of an agent’s inner workings is needed. In conclusion, these methods are suitable for situations where the neural network architecture itself needs to be more interpretable.

8 Post hoc explainability

Fig. 9 The post hoc explainability approach. To comprehend the agent’s decision-making process, we apply the explanation method to the agent or the model (or both) after training. In this context, the model refers to the transition and reward function

Post hoc explainability (PHE) consists of methods applied to uninterpretable agents or models (or both). The aim is to extract insight and produce explanations without changing the agent, as shown in Fig. 9. A method being post hoc is not the same as being model-agnostic. For example, some techniques in this category can only be applied to NNs. There are various reasons for using an uninterpretable agent. For example, a company has invested in an existing agent and is satisfied with its performance. However, they need to extract explanations to demonstrate that it is safe. Instead of starting anew, the company can use post hoc techniques. Moreover, situations requiring flexible function approximators might make it impossible to use interpretable agents. This section overviews the different categories of the post hoc explainability approach depicted in Fig. 10.

Table 4 describes which explanation types and RL explainability characteristics the different PHE categories can provide. Like the methods in Sect. 7, the methods in the PHE category create a diverse set of explanations. Using this table as a guide, stakeholders can select categories to satisfy their explainability needs and narrow them down to specific methods.

Fig. 10 Post hoc explainability taxonomy. We separate the category based on how the explanation is conveyed: (1) via generation, (2) via representation, and (3) via inspection. The categories do not sum up because some studies span multiple categories

Table 4 High-level overview of the categories in the PHE based on their explanation types and RL explainability characteristics

8.1 Explanation via generation

Similar to Sect. 7.1, this category describes methods that generate an object as the explanation. The explanation can be visual, textual, or some other format. For instance, the explanation can be “I inspect a part when the stock feed is on and I detect a part” (Hayes & Shah, 2017) in a robotic inspection task. Unlike the methods in Sect. 7.1, we can apply methods in this category to pre-trained agents without modifying them.

8.1.1 Feature importance

The following section introduces post hoc feature importance methods. On the one hand, unlike the methods in Sect. 7.1.1, the post hoc ones are not built into the agent architecture, making these techniques more flexible. On the other hand, they cannot positively affect the training of agents. Moreover, we need to ensure that these methods produce faithful explanations, as the explanation generation process is separate from the agent’s decision-making process. This category provides the same explanation as the methods presented in Sect. 7.1.1, namely the why explanation. Additionally, some techniques presented here, like Shapley additive explanations (SHAP) (Lundberg & Lee, 2017), can also give global explanations and thus provide the how explanation.

Zahavy et al. (2017) contribute several techniques to better understand an agent. Similar to Simonyan and Zisserman (2015), they produce saliency maps using a gradient-based backpropagation approach. However, gradient-based approaches can produce low-quality saliency maps. Therefore, Greydanus et al. (2018) introduce a perturbation-based approach to produce saliency maps instead. They perturb the input with a Gaussian blur and measure the impact on the output to determine the importance of the different input parts. Likewise, Iyer et al. (2018) propose a perturbation-based approach but focus on object-level importance. They do so by perturbing objects with the background color, where objects are found using template matching. Puri et al. (2020) notice that Greydanus et al. (2018) and Iyer et al. (2018) aggregate over all actions, causing a loss of detail. They introduce specific and relevant feature attribution (SARFA), which focuses on satisfying the two properties of specificity and relevance. Consequently, more concise and task-relevant saliency maps are produced, which have been shown to help chess players.
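The perturbation recipe can be sketched as follows, in the spirit of Greydanus et al. (2018): perturb a local patch of the input, measure how much the policy output changes, and record that change as the saliency of the patch. In this toy version the “blur” is crude local averaging and the policy a random linear stand-in rather than a trained DRL agent.

    import numpy as np

    rng = np.random.default_rng(0)
    H = W = 16
    frame = rng.random((H, W))
    weights = rng.normal(size=(H, W))
    policy_logit = lambda img: float((weights * img).sum())   # stand-in policy output

    def perturb(img, cy, cx, size=3):
        out = img.copy()
        y0, y1 = max(0, cy - size), min(H, cy + size + 1)
        x0, x1 = max(0, cx - size), min(W, cx + size + 1)
        out[y0:y1, x0:x1] = img[y0:y1, x0:x1].mean()          # crude local "blur"
        return out

    base = policy_logit(frame)
    saliency = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            # Saliency of a location = change in output when that region is perturbed.
            saliency[y, x] = abs(policy_logit(perturb(frame, y, x)) - base)

    print("most salient location:", np.unravel_index(saliency.argmax(), saliency.shape))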

The conservation property states that the importance scores at the input sum up to the output value (Bach et al., 2015). According to Huber et al. (2019), existing feature importance methods for RL do not satisfy this property. In response, they extend layer-wise relevance propagation (LRP) (Bach et al., 2015) to DQN and propose a new argmax rule to produce more concise explanations. They also extend the work to the dueling Q-network (Wang et al., 2016). In an extended work, Huber et al. (2021) combine this method with a global explanation method (see Sect. 8.1.4) to study the effect of combining explanations. Atrey et al. (2020) review saliency methods for RL and demonstrate that they do not necessarily explain causal relations. In the same work, they conclude that saliency maps should be treated as exploratory rather than explanatory.

Shi et al. (2022) highlight that applying Mott et al. (2019)’s method to pre-trained agents is impossible since the architecture cannot be modified. Instead, Shi et al. (2022) introduce a self-supervised interpretable network to generate explanations for agents whose architecture can no longer be changed. The method focuses on satisfying two properties, maximum behavior resemblance and minimum region retaining, to generate saliency maps with improved quality. In another study, Shi et al. (2021b) propose the temporal-spatial causal interpretation model, which focuses on understanding long-term behavior and temporal relationships. The model relies on Granger causality, which expresses that past causes affect future outcomes.

Besides saliency methods, several studies use SHAP to explain RL agents. SHAP uses a game theoretical approach to explain agents. Rizzo et al. (2019) use SHAP to explain a traffic signal control RL agent. Similarly, Jiang et al. (2022) apply SHAP to understand DRL driving agents. Wang et al. (2020) utilize SHAP in an automatic crane control task since perturbation-based saliency techniques are unsuitable for tabular data. Zhang et al. (2022) apply DeepSHAP to a DRL agent in a power system emergency control task. Liessner et al. (2021) introduce a new SHAP value representation for RL called the RL-SHAP diagram. They experimentally demonstrate the method on a longitudinal control task. Working on a lever manipulation task using a robotic manipulator, Remman and Lekkas (2021) use SHAP to explain RL agents. He et al. (2021) merge the class activation map (Zhou et al., 2016) and SHAP to create a new explanation method applied to a policy controlling an aerial vehicle. Besides the visual explanation, they complement it with textual information. Apart from these applications of Shapley values to RL, Beechey et al. (2023) present the first theoretical analysis of applying Shapley values to explain RL and show that previous uses are incorrect or incomplete.
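To give a sense of the quantity these studies compute, the sketch below is a simplified from-scratch Monte Carlo estimate of Shapley-style attributions for a toy value function: features absent from a coalition are replaced by a baseline of zeros (an assumption of ours), and marginal contributions are averaged over random feature orderings. It is not the SHAP library nor the exact estimators of the cited applications, and Beechey et al. (2023) discuss why naive uses of Shapley values in RL can be incomplete.

    import numpy as np

    rng = np.random.default_rng(0)
    N_FEATURES = 4
    W = rng.normal(size=N_FEATURES)
    q_value = lambda s: float(W @ s + 0.5 * s[0] * s[1])   # toy value function

    def shapley_estimate(state, baseline, samples=2000):
        phi = np.zeros(N_FEATURES)
        for _ in range(samples):
            order = rng.permutation(N_FEATURES)
            current = baseline.copy()
            prev_value = q_value(current)
            for feat in order:                        # add features one at a time
                current[feat] = state[feat]
                new_value = q_value(current)
                phi[feat] += new_value - prev_value   # marginal contribution
                prev_value = new_value
        return phi / samples

    state, baseline = rng.normal(size=N_FEATURES), np.zeros(N_FEATURES)
    phi = shapley_estimate(state, baseline)
    print("attributions:", np.round(phi, 3))
    # Efficiency check: attributions sum to the value gap between state and baseline.
    print("sum check:", round(phi.sum(), 3), "vs", round(q_value(state) - q_value(baseline), 3))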

Borrowing ideas from the supervised learning XAI literature, Weitkamp et al. (2018) and Joo and Kim (2019) apply gradient-weighted class activation mapping (Selvaraju et al., 2017) to agents trained to play Atari games. Similarly, Nie et al. (2019) use gradient-weighted class activation mapping and the deconvolutional network (Zeiler & Fergus, 2014) to interpret agents in a swarm robotic system individually. Also drawing from the supervised learning literature, Lim et al. (2021) apply deep learning important features (Shrikumar et al., 2017) to comprehend an agent trained to control blood glucose. Focusing on financial applications, Guan and Liu (2021) employ integrated gradients (Sundararajan et al., 2017) to explain an agent for portfolio management. Also working on portfolio management, Shi et al. (2021a) use the class activation map to understand portfolio allocation. Kim and Choi (2021) employ several saliency methods, deep Taylor decomposition (Montavon et al., 2017), relative attribution propagation (Nam et al., 2020), and guided backpropagation (Springenberg et al., 2015), to understand a deep visuomotor policy for robotic manipulation. To accommodate negative inputs and outputs, they modify the relevance propagation approach.

Pan et al. (2020) present the explainable generative adversarial imitation learning (xGAIL) framework that produces both local and global explanations. xGAIL aims to explain agents trained using GAIL (Ho & Ermon, 2016). They produce local explanations utilizing a perturbation-based method. At the same time, global explanations are produced by finding observations that maximize the probability of interesting actions.

Methods in this category are closely tied to their counterparts in the IE category. One crucial difference between the two is fidelity. In the IE category, we need to worry less about fidelity since the methods are built into the system and used during decision-making. For methods in this category, it is important to test fidelity since the methods are independent of the decision-making. Moreover, a previous study has shown that feature importance methods producing saliency maps are not always faithful to the model they explain (Adebayo et al., 2018). The methods in this category fit situations where stakeholders want answers to why questions using saliency maps but do not want to retrain the agent.

8.1.2 Agent behavior

This category contains methods to comprehend the agent by characterizing its behavior, for instance, by describing what the agent will do in various situations.

Focusing on human-robot collaboration, Hayes and Shah (2017) introduce several methods to answer stakeholder questions like “When do you do _?”, “What do you do when _?” and “Why didn’t you do _?”, as previously mentioned.

By distilling interaction data, Acharya et al. (2020) describe a method to create a conceptual model of agent behavior. This conceptual model conveys the agent’s strategies, the conditions for their execution, and their consequences. Stork et al. (2020) apply various distance measures to compare agents’ behaviors. Furthermore, the distance measures are used to find important states and characterize the relationship between reward and behavior.

Many RL explanation methods do not exploit the full MDP formalism and focus on the underlying function approximator. In response, Finkelstein et al. (2021) utilize the full MDP formalism to explain the gap between the agent’s behavior and the behavior that the stakeholder anticipated. To explain the gap, they apply abstraction and transformation methods that have previously been used to speed up policy learning.

The methods in this category use the agent’s behavior to explain it, but the mechanisms to generate these explanations vary greatly between methods, from Hayes and Shah (2017), who answer many types of questions about the agent’s behavior, to more specific answers, like the gap in stakeholders’ mental models (Finkelstein et al., 2021). One downside of some methods in this category is their scalability; for example, Hayes and Shah (2017) require query templates that need manual intervention. Methods in this category are suited for cases where stakeholders want to understand the agent’s behavior both locally and globally.

8.1.3 Textual justification

The method in this category aims to extract textual explanations expressed in natural language, similar to the methods in Sect. 7.1.3. For example, the explanation “Object ghost and dot have drawn attention of Pacman. The Pacman moves right to eat the dot in the lower right even she is approaching the ghost in the lower right” (Wang et al., 2019b) in the Ms. Pac-Man game. Ehsan et al. (2018) present a method to generate more human-like explanations by translating state-action pairs into natural language expressions. Their method first collects a dataset of state-action pairs and natural language explanations. Then, it uses supervised learning to learn to translate state-action pairs into explanations. According to the authors, this approach offers several advantages, such as fast explanation generation and easier interpretation. However, on the downside, the method focuses on explaining how a human would explain the situation and may not reflect the agent’s internal reasoning process.

Like its counterpart in the IE category, the method here offers explanations that are more human-like. Furthermore, it offers the flexibility for explanations to be semantically rich. On the downside, creating these explanations can be laborious, and they are only rationalizations rather than reflections of the agent’s internal reasoning.

8.1.4 Important states and transitions

The important states and transitions category contains techniques that explain the agent by showcasing important states and transitions. How the term important is defined depends on the particular method. The motivation for displaying important states and transitions is the simplicity of the resulting explanations. Inspecting the agent’s behavior in the whole state space is impossible; thus, a trade-off is to review how the agent behaves in a few critical situations. These methods aim for stakeholders to develop an accurate mental model of the agent’s behavior by seeing examples of it in a few situations. Consequently, stakeholders will be able to anticipate how the agent will behave in seen and unseen situations, thus gaining a global understanding. Besides imparting a global understanding, a few methods in this category try to explain critical situations in an episode after the fact.

Amir and Amir (2018) present an approach to generate a summary of the agent’s behavior by showing how it acts in important states. They select important states based on an importance measure that uses the agent’s Q-function. More specifically, the larger the gap between the best and worst actions’ Q-values, the higher the importance. In addition, they describe a method to avoid selecting redundant states. In a later work, they provide the entire conceptual framework of explaining agents using summaries (Amir et al., 2019). They detail the different components of the agent summarization approach, how to evaluate these methods, and how they are positioned within related work. The components consist of selecting states, the state representation, and the interface to communicate with stakeholders. In a similar line of work, Huang et al. (2018) and Watkins et al. (2021) present methods to select important states. They aim to help stakeholders build an accurate mental model of the agent and thus determine when it is appropriate to trust it. Similar to the other work, Karino et al. (2020) use the Q-function and its variance to select important states. However, in contrast to the others, they additionally explore how these important states can help speed up learning.
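
As a rough illustration of this kind of importance measure, the sketch below scores each state by the gap between its best and worst Q-values and greedily picks the top-k states while skipping near-duplicates. It is a minimal sketch under our own simplifying assumptions (a tabulated batch of states and Q-values, and a Euclidean distance check standing in for the authors’ diversity criterion), not the HIGHLIGHTS implementation.

```python
import numpy as np

def state_importance(q_values):
    # Importance as the gap between the best and worst actions' Q-values.
    return np.max(q_values) - np.min(q_values)

def summarize(states, q_table, k=5, min_distance=1.0):
    # Pick the k most important states, skipping states too close to already chosen ones.
    # `states` holds state feature vectors; `q_table[i]` holds the Q-values of states[i].
    order = np.argsort([-state_importance(q) for q in q_table])
    chosen = []
    for i in order:
        if all(np.linalg.norm(states[i] - states[j]) >= min_distance for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen

# Toy usage: 100 random 4-dimensional states with Q-values for 3 actions.
rng = np.random.default_rng(0)
states, q_table = rng.normal(size=(100, 4)), rng.normal(size=(100, 3))
print(summarize(states, q_table))
```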

Instead of exploiting the agent’s output, Huang et al. (2019) use an algorithmic teaching approach to generate summaries. They assume humans do inverse RL and select the states based on their usefulness for learning the reward function. Rather than inverse RL, Lage et al. (2019a, 2019b) propose choosing states that are most helpful for imitating the agent through imitation learning. They experimentally explore the imitation learning and inverse RL approaches. Their results demonstrate the importance of using an appropriate method based on the situation, as no method fits all. Like the aforementioned approaches, the aim is for the stakeholder to develop an accurate mental model of the agent. Sequeira and Gervasio (2020) describe a method that gathers interaction data and distills potentially interesting information from it, which they call interestingness elements (Sequeira et al., 2019). They use these interestingness elements to choose states and create agent summaries. Their results show that, on the one hand, more than one summarization approach is needed for a task to convey a complete understanding of agent behavior. On the other hand, overly complex explanations can affect stakeholders negatively.

Huber et al. (2021) propose an explanation method that produces both local and global explanations. The method achieves that by integrating a saliency method with the agent summarization technique (Huber et al., 2019; Amir & Amir, 2018). Their results demonstrate that, although saliency maps provide useful information, adding them to the agent summary did not significantly improve understanding in most situations.

Previously mentioned agent summarization methods are less suited when comparing two agents. According to Amitai and Amir (2022), agents with different performances can act similarly in important states. To compare two agents, they describe a new agent summarization technique named DISAGREEMENTS. They find states where two agents disagree via simulation. In a later work, Gajcin et al. (2021) argue that the DISAGREEMENTS method only conveys the difference between the agents’ performances, but the agents may also differ in their preferred strategies. Therefore, they propose a new method that showcases the agents’ gaps in performance and strategy preferences.
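
The core of such comparisons can be illustrated with a short sketch that rolls out one agent and logs the states where the two agents’ chosen actions differ. This is only the underlying idea under our own assumptions (a Gymnasium-style environment and deterministic action choices); the actual DISAGREEMENTS method additionally scores and filters these states.

```python
def disagreement_states(env, policy_a, policy_b, episodes=10):
    # Roll out policy_a and record observations where the two policies pick different actions.
    # `policy_a` and `policy_b` map an observation to an action; `env` follows the Gymnasium API.
    disagreements = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            a, b = policy_a(obs), policy_b(obs)
            if a != b:
                disagreements.append((obs, a, b))
            obs, reward, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
    return disagreements
```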

Frost et al. (2022) and Watkins et al. (2021) argue that seeing an agent’s behavior from training time might be less helpful in the case of a distributional shift. They present a method to find states that instead convey test time behavior. The method achieves this by first defining a prior distribution of test time states. Then it uses an exploration policy to find states matching the prior distribution. Afterward, it runs the original policy from these states to construct the agent summary. Thus, it avoids initializing at out-of-reach states.

Unlike the other methods, Gottesman et al. (2020) propose a framework for off-policy evaluation using an influence function. An influence function is a technique from robust statistics that, in this context, is used to determine the importance of transitions with respect to the policy parameters. The function answers what happens to the policy parameters if a transition is upweighted by an infinitesimal amount (Koh & Liang, 2017). The aim is to find important transitions in a batch of data using the influence function; these important transitions are then shown to an expert to validate the evaluation. Although not motivated by explainability, the influence function can be integrated into the agent summarization framework.
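
To give a feel for the influence-function computation, the toy sketch below applies the Koh and Liang (2017) formula to an ordinary least-squares objective, scoring each training point by its effect on a held-out evaluation loss. It is a generic regression illustration under our own assumptions, not the off-policy evaluation setup of Gottesman et al. (2020).

```python
import numpy as np

# Toy illustration of the influence-function idea (Koh & Liang, 2017) on a quadratic loss
# L(theta) = 0.5 * sum_i (x_i @ theta - y_i)^2: the effect of upweighting point i on a
# held-out evaluation loss is  -grad_eval(theta*)^T H^{-1} grad_i(theta*).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
x_eval, y_eval = rng.normal(size=3), rng.normal()

theta = np.linalg.lstsq(X, y, rcond=None)[0]           # fitted parameters
H = X.T @ X                                            # Hessian of the training loss
grads = X * (X @ theta - y)[:, None]                   # per-point gradients at theta
eval_grad = x_eval * (x_eval @ theta - y_eval)         # gradient of the evaluation loss

influence = -grads @ np.linalg.solve(H, eval_grad)     # one score per training point
print("most influential points:", np.argsort(np.abs(influence))[::-1][:5])
```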

Besides providing a global understanding of the agent, some methods try to explain agent behavior in an episode. Sakai et al. (2021) determine the sub-goals in episodes and use them to construct agent summaries that explain the agent’s behavior in those episodes. Instead of using sub-goals, Guo et al. (2021b) find the important transitions that affect the agent’s return in episodes.

Like its counterpart in IE, this category offers global explanations. However, the way states and transitions are found is detached from the agent’s learning. Accordingly, methods here do not affect the agent’s performance. As noted by Lage et al. (2019a, 2019b), no single method will fit all situations. Thus, the methods here complement rather than outcompete each other. As we have seen throughout this section, the methods themselves have different use cases, from understanding a single agent to comparing two agents. The downside of these methods is that a simulator or access to a buffer of data points is needed to find these states and transitions.

8.1.5 Expected outcome

In this section, we look at methods that explain the outcome of the agent’s behavior, for example, the long-term consequences of taking a specific action in terms of the states the agent will observe and the rewards it will receive.

van der Waa et al. (2018) introduce a method that constructs a policy, called the foil policy, based on the stakeholder’s question and the agent’s policy. The method explains the outcome of the agent’s actions not only in isolation but also in contrast with the foil policy. The explanation consists of outcomes in terms of actions that will be taken, states that will be encountered, and rewards that will be received. These are translated into human-understandable concepts, similar to Hayes and Shah (2017). To create explanations that can answer questions from the stakeholder through a mutually understandable vocabulary, Sreedharan et al. (2022) describe a method that leverages a locally approximated model. More specifically, the method explains by referring to the outcome of a specific action and contrasting it with the stakeholder’s suggestion, or by explaining that the suggested action cannot be executed. Unlike the other approaches, Davoodi and Komeili (2021) present a method that highlights features impacting risk, which conveys information about the outcome. They define risk in terms of states where an episode ends before the expected time or leads to failures.
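
A stripped-down version of such contrastive outcome explanations can be sketched as follows: branch the simulator at the state of interest, roll out the agent’s policy and the foil policy, and report the outcome of each. The sketch assumes the environment object is currently at the state of interest and can be deep-copied for branching; it omits the translation into human-understandable concepts that the methods above provide.

```python
import copy

def contrast_outcomes(env, obs, agent_policy, foil_policy, horizon=20):
    # Roll out two policies from the same state and compare the (undiscounted) rewards received.
    # Assumes `env` is currently at the state producing observation `obs` and can be deep-copied.
    outcomes = {}
    for name, policy in [("agent", agent_policy), ("foil", foil_policy)]:
        branch = copy.deepcopy(env)          # branch the simulator at the current state
        o, total_reward, visited = obs, 0.0, []
        for _ in range(horizon):
            o, reward, terminated, truncated, _ = branch.step(policy(o))
            total_reward += reward
            visited.append(o)
            if terminated or truncated:
                break
        outcomes[name] = (total_reward, visited)
    return outcomes
```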

Cruz et al. (2019) introduce the memory-based explainable RL (MXRL) method. This method explains an action by referring to the probability of reaching the goal and the time needed to reach it. They use gathered interaction data to compute these values. To improve the efficiency of this method, Cruz et al. (2021) propose the learning-based and introspection-based methods that extend MXRL. These two approaches were later extended by Portugal et al. (2022) to accommodate continuous state spaces.
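
The memory-based idea can be illustrated with a short sketch that tallies, for every state-action pair in an interaction memory, how often the episode eventually reaches the goal and how many steps that takes. This is our own simplified tabular rendering of the idea, not the exact MXRL formulation.

```python
from collections import defaultdict

def success_statistics(episodes, goal_state):
    # Estimate, per (state, action) pair, the probability of reaching the goal and the
    # average number of steps needed. `episodes` is a list of lists of (state, action) pairs.
    counts = defaultdict(lambda: [0, 0, 0])   # visits, successes, summed steps to goal
    for episode in episodes:
        reached = episode[-1][0] == goal_state
        for t, (state, action) in enumerate(episode):
            stats = counts[(state, action)]
            stats[0] += 1
            if reached:
                stats[1] += 1
                stats[2] += len(episode) - 1 - t
    return {key: (succ / visits, steps / succ if succ else float("inf"))
            for key, (visits, succ, steps) in counts.items()}

# Toy usage: one episode reaching the goal and one that does not.
episodes = [[("s0", "right"), ("s1", "right"), ("goal", "stay")],
            [("s0", "left"), ("s2", "left")]]
print(success_statistics(episodes, "goal"))
```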

In contrast to similar methods in the IE category, the methods here use transition data to explain what they expect the agent to do. This requires that the agent can be simulated in the environment or that data for analyzing the agent already exists. The methods here are more flexible than their counterparts in the IE category, as no modification of the agent is required. If a stakeholder wants to understand a pretrained agent’s long-term behavior from a given state, methods from this category can be chosen.

8.2 Explanation via representation

Like Sects. 6 and 7.2, the methods in this category explain by referring to the representation rather than generating objects as explanations. The representation ranges from a Markov chain expressing the agent’s behavior in the state space to a simplified alternative representation of the agent. Unlike the other categories, the representation is extracted from the agent after training. Thus, the explanation is not necessarily the agent itself.

8.2.1 State abstraction

State abstraction methods cluster states by employing various similarity measures to reduce the state space complexity, which entails a trade-off between explanation fidelity and complexity. The reduction makes it possible to explain the agent’s behavior globally, providing an understanding of the overall agent-environment interaction dynamics. However, making a concise abstraction for large state spaces may be challenging. Nevertheless, a local explanation of short-term behavior is still more insightful than explaining a single state.

One of the first state abstraction methods for XRL was introduced by Zahavy et al. (2017). They present the semi aggregated MDP (SAMDP) that abstracts across states and actions. The SAMDP is an extension of the semi MDP and aggregated MDP and inherits both of their benefits. To overcome the need for human intervention in the SAMDP approach, Topin and Veloso (2019) present the abstract policy graph (APG) that builds a state space abstraction from interaction data. The APG represents the abstracted state space as a graph where the nodes are abstracted states and the edges are actions denoting transitions between them. The authors present the APG Gen method in the same work to build APGs. McCalmon et al. (2022) point out that the graphs produced by previous state abstraction methods are not interpretable beyond their structure. Moreover, with APG, for example, the graph can become prohibitively large in some situations, such as with stochastic policies. To overcome these hurdles, they describe the comprehensible abstracted policy summaries method. They make abstracted states interpretable by labeling them in human-understandable language. Focusing on visualizing the state space and value function, Nakamura and Shibuya (2020) introduce the RL mapper method that extends the mapper (Singh et al., 2007) method. RL mapper visualizes the state space and value function by utilizing topological data analysis.
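
The graph-style output of these methods can be sketched as follows: given interaction data and a user-supplied abstraction function, group concrete states into abstract states and record the action-labeled transitions between them with empirical probabilities. This is a minimal sketch of the representation, assuming the abstraction is given; APG Gen and the other methods above construct the abstraction itself.

```python
from collections import Counter, defaultdict

def abstract_policy_graph(transitions, abstraction):
    # Build a simple abstract policy graph from interaction data.
    # `transitions` is an iterable of (state, action, next_state);
    # `abstraction(state)` maps a concrete state to an abstract-state label.
    edges = defaultdict(Counter)
    for state, action, next_state in transitions:
        edges[abstraction(state)][(action, abstraction(next_state))] += 1
    # Normalize counts into transition probabilities per abstract state.
    graph = {}
    for node, counter in edges.items():
        total = sum(counter.values())
        graph[node] = {edge: count / total for edge, count in counter.items()}
    return graph

# Toy usage with a grid world abstracted into "left half" / "right half".
transitions = [((0, 1), "right", (0, 2)), ((0, 2), "right", (0, 3)), ((0, 3), "up", (1, 3))]
print(abstract_policy_graph(transitions, lambda s: "left" if s[1] < 2 else "right"))
```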

To express a recurrent NN policy as a Moore machine, Koul et al. (2019) introduce quantized bottleneck network (QBN) insertion. A QBN is an autoencoder with a discretized latent space; it can be used to discretize the input and memory of the policy so that the policy can be represented as a Moore machine. To reduce the size of the Moore machines produced, they employ standard Moore machine minimization techniques to translate them into minimal equivalent Moore machines. However, Danesh et al. (2021) notice that these standard minimization techniques make the resulting Moore machines hard to interpret because they do not consider state semantics. To resolve this issue and effectively reduce these Moore machines, Danesh et al. (2021) describe reductions that do not negatively affect interpretability.
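
A minimal sketch of the quantized-bottleneck idea is shown below: an autoencoder whose latent code is rounded to a few discrete levels, with a straight-through estimator so that gradients still flow through the rounding. The architecture, sizes, and training loop are our own illustrative choices, not the configuration used by Koul et al. (2019).

```python
import torch
import torch.nn as nn

class QuantizedBottleneck(nn.Module):
    # Minimal quantized-bottleneck autoencoder: the latent code is rounded to discrete levels,
    # and gradients pass through the rounding via a straight-through estimator.
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.Tanh(),
                                     nn.Linear(64, latent_dim), nn.Tanh())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(),
                                     nn.Linear(64, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        z_q = torch.round(z)                      # discretize to {-1, 0, 1}
        z_q = z + (z_q - z).detach()              # straight-through gradient
        return self.decoder(z_q), z_q

# Train to reconstruct hidden states gathered from the recurrent policy (random stand-ins here).
model = QuantizedBottleneck(input_dim=32, latent_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
hidden_states = torch.randn(256, 32)
for _ in range(5):
    recon, _ = model(hidden_states)
    loss = nn.functional.mse_loss(recon, hidden_states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```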

While the other methods expect a trained agent, Bewley et al. (2022) introduce a method applicable to agents under training. They construct the state abstraction using interaction data and an information-theoretic divergence measure and express the abstract state space as a Markov chain. As the agent learns, the Markov chain will change; thus, several Markov chains are constructed, each assigned to a time window. The method can also be used to compare several policies.

The state abstraction methods offer comprehensive global explanations by showing groups of states and the transitions between them. The difficulty for these methods is interpreting what the abstract states represent. Showing examples of states in an abstract state might not convey enough nuanced information, and it might not be apparent why it is natural for the policy to consider them similar. Also, McCalmon et al. (2022)’s approach to creating textual summarizations can be labor-intensive. Another issue is choosing the right number of abstract states and which heuristic to use for that choice. Too few abstract states can hide information from stakeholders, while too many can overwhelm them.

8.2.2 Agent distillation

The agent distillation category is a collection of methods that try to explain the agent by simplifying its decision-making logic. Specifically, methods in this category do so by treating the agent as an expert and using imitation learning to learn a distilled agent that is easier to interpret. The goal of the distilled agent is to imitate the original agent as well as possible, that is, to have high fidelity. However, beyond having high fidelity, it is also essential to consider where the distilled agent imitates the expert well, since rarely visited states might be less critical.

We often consider decision trees to be interpretable since it is possible to follow the entire reasoning process behind a decision. Furthermore, if the decision tree is small, it is even globally interpretable instead of only locally. Many methods use decision trees for imitation, frequently with modifications to solve previous limitations and adapt to new use cases. We refer to them collectively as tree-based agent distillation methods. Liu et al. (2018) introduce the linear model u-tree (LMUT) that extends the u-tree method with linear models in the leaf nodes for increased flexibility. LMUT aims to approximate the Q-function of the agent. Coppens et al. (2019) describe the soft decision tree method (Frosst & Hinton, 2017) applied to the Mario AI benchmark (Karakovskiy & Togelius, 2012). The expert we try to imitate often supplies both the action and Q-values. Bastani et al. (2018) utilize this fact to improve the dataset aggregation (DAgger) algorithm and propose Q-DAgger, which results in less complex distilled agents. Q-DAgger improves DAgger by prioritizing and sampling state-action pairs based on the Q-values. In addition, they propose the verifiability via iterative policy extraction (VIPER) method to extract tree-based agents leveraging Q-DAgger and show how these extracted tree-based agents can be verified. VIPER has been very influential for methods in this category, and several other methods extend and improve upon it (Schmidt et al., 2021; Jayawardana et al., 2021; Zhu et al., 2021; Jhunjhunwala et al., 2020). Roth et al. (2021) propose an agent distillation method that extends VIPER (Bastani et al., 2018). They argue that previous tree-based methods are uninterpretable (Frosst & Hinton, 2017; Gupta et al., 2015) and that no domain-specific tree modifications have previously been proposed. Specifically, after extracting a decision tree policy from a DRL policy, the decision tree policy is improved by modifying it, such as adding or changing nodes. These modifications focus on finding and fixing unwanted behavior of the policy in navigation tasks, such as oscillating action selection. Besides VIPER and methods that extend it, numerous other studies utilize tree-based imitation agents to enable interpretability (Gjærum et al., 2021a, 2021b; Ghosh et al., 2021; Dhebar et al., 2022; Vasic et al., 2022; Dai et al., 2022b; Bewley & Lawry, 2021), with some focusing on first transforming the input to an interpretable counterpart (Bewley et al., 2020; Sieusahai & Guzdial, 2021; Liu et al., 2021).
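
To make the Q-value-guided distillation idea concrete, the sketch below fits a decision tree to the expert’s greedy actions and weights each state by the gap between its best and worst Q-values, so that high-stakes states dominate the fit. It is a single-batch illustration of the core idea, without the iterative data aggregation and verification that VIPER adds, and all data here are random stand-ins.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def distill_tree(states, q_values, max_depth=4):
    # Fit a decision tree that imitates the expert's greedy policy, weighting each state
    # by the gap between its best and worst Q-values (more is at stake in high-gap states).
    # `states` has shape (n, d); `q_values` has shape (n, n_actions).
    actions = np.argmax(q_values, axis=1)                  # expert labels
    weights = q_values.max(axis=1) - q_values.min(axis=1)  # Q-gap as sample weight
    tree = DecisionTreeClassifier(max_depth=max_depth)
    tree.fit(states, actions, sample_weight=weights)
    return tree

# Toy usage with random data standing in for an expert agent's rollouts.
rng = np.random.default_rng(0)
states, q_values = rng.normal(size=(500, 6)), rng.normal(size=(500, 3))
tree = distill_tree(states, q_values)
print("fidelity:", tree.score(states, np.argmax(q_values, axis=1)))
```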

Motivated by the fact that many agents are hard to understand and verify, Verma et al. (2018) propose the programmatically interpretable RL (PIRL) framework. PIRL makes it possible to learn agents represented as programs, using an expert agent to guide the learning process. In a later work, Verma et al. (2019) describe imitation-projected programmatic RL (PROPEL), a new method to learn program-based agents. PIRL and PROPEL were later extended by Larsen and Schmidt (2021) to accommodate a different program space. Finally, many of these methods are summarized in Bastani et al. (2020).

Rules, such as if-then statements, are another function representation used to express distilled agents. Nageshrao et al. (2019), for example, seek to obtain a distilled rule-based agent leveraging fuzzy logic trained using the evolving Takagi-Sugeno method (Angelov & Filev, 2004). Another approach proposed by Soares et al. (2021) first clusters states before distilling the agent, thus reducing the complexity of the resulting distilled agent. Skirzynski et al. (2021) seek to improve human decision-making by first distilling an agent into simple rules. These simple rules are converted into flowcharts that can assist humans in making better decisions. Honda and Hagiwara (2022) express states and actions using first-order logic and extract a distilled rule-based agent.

Driven by the human aspect of explainability, Madumal et al. (2020) describe the action influence model that builds on the structural causal model (Halpern & Pearl, 2005) by extending it with actions. They construct the graph’s structure beforehand and learn the actions’ causal effects during training. Using the graph to explain and to create hypothetical scenarios, they generate why and contrastive explanations. Also focusing on the human aspect of explainability, Mitsopoulos et al. (2021) describe utilizing cognitive models to understand agent behavior.

Focusing on traffic signal control and explainability, Ault et al. (2020) describe the regulatable precedence function as a representation for the distilled agent. A regulatable precedence function is a function that is monotonic in the state variables. They introduce several modifications of the DQN approach to express and learn regulatable precedence function agents. Working on the same problem, Wollenstein-Betech et al. (2020) utilize knowledge compilation techniques to comprehend DRL agents. Zhang et al. (2020a) present an agent distillation technique built upon the evolutionary feature synthesis regression algorithm (Arnaldo et al., 2015).

Hüyük et al. (2021) focus on understanding expert decision-making behavior rather than the behavior of an agent. To that end, they describe the model-based Bayesian method for interpretable policy learning (INTERPOLE). INTERPOLE approximates decision dynamics and boundaries and aims to satisfy three characteristics: (1) inherently interpretable, (2) partial observability accommodation, and (3) completely offline operation, which are needed in the healthcare setting. Focusing on the same problem, Pace et al. (2022) introduce the policy extraction through decision trees (POETREE) framework that builds probabilistic tree policies with recurrent structure. POETREE is designed to handle partial observability and offline training for the same reason as INTERPOLE.

Xie et al. (2022) use adversarial inverse RL to distill the reward function, representing the discriminator with a logistic regression model. The resulting reward function provides a global explanation due to its simple functional form. After training, the function can be analyzed to understand how the agent values different situations.
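
The interpretability benefit of a logistic-regression discriminator can be seen in the following sketch: after labeling expert and policy-generated transitions, the discriminator’s logit serves as a linear reward estimate whose coefficients can be read off directly. This shows only a single discriminator fit on synthetic features, not the alternating adversarial training used by Xie et al. (2022).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Label expert transitions 1 and policy-generated transitions 0; the discriminator's logit
# then serves as an interpretable (linear) reward estimate over the transition features.
rng = np.random.default_rng(0)
expert_features = rng.normal(loc=1.0, size=(300, 5))
policy_features = rng.normal(loc=0.0, size=(300, 5))
X = np.vstack([expert_features, policy_features])
y = np.concatenate([np.ones(300), np.zeros(300)])

discriminator = LogisticRegression().fit(X, y)
reward = discriminator.decision_function(X)      # logit = log D - log(1 - D)
print("per-feature reward weights:", discriminator.coef_.round(2))
```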

The agent distillation category offers comprehensive global explanations that explain the whole decision-making process. However, since they only distill the input–output relation of the original agent, they might not explain the true underlying decision-making process. Instead, they give a plausible explanation that disregards the original agent’s internal logic. Another issue is the complexity of the distilled agent. If the distilled agent is too complex, such as a decision tree with large depth, it may not be as useful to a stakeholder. Thus, fidelity and accuracy often need to be traded off against complexity in these models.

8.3 Explanation via inspection

The explanation via inspection category introduces methods applied to RL agents after training to extract understanding. For example, a new user interface dashboard that lets a stakeholder freely explore different scenarios to understand the agent or analyze the agent using various dimension reduction techniques. Like Sect. 7.3, the explanations extracted are open-ended and require human analysis to extract insight. However, in contrast to Sect. 7.3, the methods do not modify the agent or propose a custom architecture to extract explanations more easily.

8.3.1 Exploratory analysis

The exploratory analysis category contains various methods to extract knowledge about the agent’s behavior. The methods range from dimension reduction techniques to applying several existing XAI methods to extract insight.

Sequeira et al. (2019) propose a new approach to understanding the task and agent behavior by analyzing interaction data and deriving various insights, for example, how often the agent encounters different states or how often it executes the same action in a state. The interaction data is gathered from the agent’s past interactions with the environment. The method’s analysis is independent of the environment and the underlying RL algorithm. Similarly, Ullauri et al. (2022) use interaction data to understand the agent. Their method is also model agnostic, like the previously mentioned one.

Druce et al. (2019) present two metrics that can be used to understand the agent’s generalization ability. They also present a method to understand how agents will behave in modified states via state interventions conditioned on the current state. The metrics and state intervention information are communicated in a new user interface described in the same study. Hilton et al. (2020) utilize XAI methods to understand an agent that they trained explicitly in the CoinRun (Cobbe et al., 2019) environment. These XAI methods include several feature importance methods and a dimension reduction technique. Similarly utilizing dimension reduction, Agrawal and McComb (2022) seek to understand the exploration process of agents tasked with designing cyber-physical systems. They state that information about the exploration process can be leveraged to choose algorithms for designing these systems.

Løver et al. (2021) apply several XAI methods to understand how a docking agent trained using DRL works. Specifically, they use (1) SHAP, (2) local interpretable model-agnostic explanations (LIME) (Ribeiro et al., 2016), and (3) LMT. Focusing on a heating, ventilation, and air conditioning energy controller, Kotevska et al. (2020) introduce a comprehensive framework to understand such agents trained using DRL. The framework uses existing XAI methods to extract local (i.e., LIME) and global explanations (i.e., PDP (Friedman, 2001) and ICE (Goldstein et al., 2015)). Moreover, it gathers interaction data of the controller. These two sources of information are analyzed and visualized to understand the agent. Dai et al. (2022a) aim to understand how domain randomization affects DRL agents in simulated robotics tasks. To gain insight, they also apply XAI methods, test the agent in different environments, and use out-of-distribution generalization tests. Pankiewicz and Kowalczyk (2022) propose understanding RL agents using a combination of techniques, including Integrated Gradients (Sundararajan et al., 2017), analysis of variance, hypothesis tests, and examining correlations on state-action data generated by the policy.
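
In the same spirit as these studies, a model-agnostic XAI method can be pointed directly at a policy’s action probabilities, treating the policy as a black-box function of the observation. The sketch below does this with KernelSHAP; the `policy_action_probs` function is a hypothetical stand-in for a trained policy, and the setup is ours rather than any of the pipelines above.

```python
import numpy as np
import shap  # pip install shap

def policy_action_probs(observations):
    # Hypothetical stand-in for a trained policy's action probabilities (replace with the real model).
    logits = observations @ np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]])
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

# Explain the policy with KernelSHAP, using reference observations as the background set.
background = np.random.default_rng(0).normal(size=(20, 3))
explainer = shap.KernelExplainer(policy_action_probs, background)
shap_values = explainer.shap_values(background[:3])   # per-action feature attributions
print(np.asarray(shap_values).shape)
```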

Russell and Santos (2019) aim to better understand the reward function an agent tries to optimize. They use LIME to create local explanations and decision trees to imitate the reward function to obtain global explanations. Michaud et al. (2020) likewise use local explanations, specifically saliency maps, to understand the reward function and how well it aligns with the stakeholder’s preferred behavior. Similarly utilizing saliency maps, Guo et al. (2021a) seek to understand the relationship between human and machine attention. To accomplish that, they ask two questions. First, “how similar are the visual representations learned by RL agents and humans when performing the same task?”. And second, “how do similarities and differences in these learned representations explain RL agents’ performance on these tasks?”.

This category is a collection of methods that can be used to explain RL agents. They are a mix of methods from the supervised XAI literature and offer a view of how methods together can support understanding RL agents. To aggregate insights from several methods, human intervention is needed. Furthermore, they do not explain short-term and long-term consequences, as they use methods that are mainly designed for supervised learning.

8.3.2 Visual analytics

Visual analytics systems provide interactive visualizations and analysis tools to better understand RL agents. They aim to help stakeholders better understand agents through insights into the agent’s behavior and its internal representation, including how they change during training. Various sources of information are gathered and processed to create these visualizations, for instance, how the return per episode evolves or which actions are executed in a state throughout training. Several visual analytics systems have been created for unsupervised learning (e.g., GANViz (Wang et al., 2018)) and supervised learning (e.g., CNNVis (Liu et al., 2017), RNNVis (Ming et al., 2017), and LSTMVis (Strobelt et al., 2018)). This section overviews the visual analytics methods designed explicitly for RL.

Wang et al. (2019a) describe DQNViz, the first visual analytics system designed specifically for RL. DQNViz aims to help developers understand, debug, and improve DQN models. Their system is designed and evaluated together with deep learning experts. It provides different views into a DQN model, such as how the training evolves (e.g., Q-value changes throughout training) or how the DQN performs in a single episode (e.g., action distribution per episode). Likewise, Seng et al. (2021) introduce a visual analytics system made to understand DQN models, but it differs by providing insights into aspects not covered by DQNViz. Jaunet et al. (2020) point out that previously proposed visual analytics systems cannot interpret agents with memory, which are designed for environments with partial observability. They therefore propose DRLViz, a new visual analytics system focusing on understanding agents with memory and analyzing the memory in detail, such as understanding its role. The system was created with the help of experts and evaluated in the ViZDoom environment (Kempka et al., 2016). Another visual analytics system focusing on a different aspect of the agent is DynamicsExplorer (He et al., 2020). DynamicsExplorer aims to understand how trained agents are affected by distribution shifts in the environment. They test DynamicsExplorer in the marble maze game, a robotics control task (van Baar et al., 2019).

In contrast to the previously mentioned systems, Wang et al. (2021b) introduce DRLIVE, which focuses on being applicable to all RNN-based models. Furthermore, it seeks to be applicable to multiple game settings rather than a few selected ones. Besides these systems, there are other visual analytics systems that aim to help experts better understand RL agents (Mishra et al., 2022; Cheng et al., 2022).

Visual analytics systems provide comprehensive tools for a stakeholder to analyze an agent. However, they are often tailored to specific agents such as the DQN or specific environments like Atari games. They are also more suited for users with in-depth domain knowledge and RL knowledge as explanations are more open-ended and technical. Thus, they need human analysis to draw insights. Nevertheless, visual analytics provide comprehensive explanations that clarify all parts of an agent. This is especially useful for debugging and verification before deployment.

9 Discussion

In this section, we give a high-level analysis of the trends within XRL and recommend methods that have stood the test of time for practitioners to use when explaining RL agents. We call these methods foundational, as they have inspired many subsequent works through extensions and serve as baselines in many experiments. Finally, we look at some future directions that forthcoming XRL work should focus on.

9.1 Trends

Lately, XRL has focused on the PHE category, as seen in Table 5. Less work is done within the IE category, likely because it is more challenging. For example, adding a built-in explainability mechanism is undesirable if it negatively affects the agent’s performance, an issue that can be avoided by using PHE methods. From the perspective of this literature review, the IA category is receiving the least attention. However, this is because we only include studies mainly driven by interpretability, not those driven by, for instance, generalization or sample efficiency.

We observe that feature importance methods from both IE and PHE are the most researched XRL methods. Other trending XRL methods are agent distillation methods that aim to explain the agent via distilled models such as decision trees. Aside from those three categories of methods, the categories expected outcome in IE and important states and transitions in PHE are popular in XRL research. State abstraction methods and visual analytics systems in PHE are less popular than the previously mentioned categories but still important. Although exploratory analysis in PHE looks popular, it is much more diverse and is not one of the trending categories within XRL. Overall, the works within XRL are diverse; the focus is spread across areas like feature importance, important states and transitions, agent distillation, expected outcome, state abstraction, and visual analytics.

Table 5 Overview of the number of studies published each year for each category

9.2 Recommendations

This section recommends methods that are suited for different stakeholder questions. There is no single method suitable for all stakeholders’ needs. Each method has its use cases, strengths, and weaknesses. For instance, feature importance methods are limited to explaining where an agent looks in the input space, but are easy to convey to stakeholders. They supply us with the ability to answer one specific stakeholder question, namely, “why did the agent do _?” regarding where the agent is looking. Methods that can answer stakeholder questions can be found via Tables 2, 3 and 4. Our recommendations here are based on whether a method has stood the test of time. Methods that have been extended several times are more robust and have consistently been shown to work in accordance with their experiments. One could also argue they are considered more useful by researchers, since more resources have been devoted to studying them. For example, this is apparent in Sect. 8.1.5, where most studies extend or take inspiration (or both) from Juozapaitis et al. (2019). Newly published studies are more likely to have stronger experimental results but have not been independently verified by other researchers.

Methods from the important states and transitions, state abstraction, and agent distillation categories can be used to answer how questions. For example, HIGHLIGHTS (Amir & Amir, 2018) answers how questions and is a popular method that many studies have extended. Stakeholders can use Amitai and Amir (2022)’s method if they want to compare two different agents. If stakeholders need more detailed how explanations, VIPER (Bastani et al., 2018) from the agent distillation category can be used and serves as a baseline for many studies. Likewise, the interpretable agent category can provide detailed how explanations, such as the methods of Silva et al. (2020) and Trivedi et al. (2021). Visual analytics systems like Wang et al. (2019a) offer comprehensive insights into an agent and can be utilized to answer how questions and many other questions. However, visual analytics systems should be reserved for situations where comprehensive explanations are needed, as they are more open-ended. Furthermore, when using visual analytics systems, stakeholders need expertise in RL, as many technical terms are used in these explanations.

When it comes to human-robot collaboration that requires an agent to describe what it will do in real time, methods from the intended behavior category are suitable. For example, Fukuchi et al. (2017a, 2017b) provide explanations that are easy to digest but lack detail. For more comprehensive explanations, state abstraction methods like those of Topin and Veloso (2019) and McCalmon et al. (2022) offer explanations that describe what the agent will do via a Markov chain. Another popular method answering what questions is Hayes and Shah (2017)’s method, which others take inspiration from. Their method is more suitable when the stakeholder wants answers to questions such as “when do you do _?” and “what do you do when _?” in natural language.

The why question is answered by many methods with varying detail. The methods in the feature importance categories of both IE and PHE answer why questions. These methods range from those developed with RL in mind to others adapted from the supervised learning XAI literature. Puri et al. (2020)’s method is specifically designed for RL and addresses weaknesses of many previous feature importance methods. If a stakeholder wants to understand what and where the agent is looking, Mott et al. (2019)’s method can be used. Juozapaitis et al. (2019)’s method can be used if stakeholders want explanations focusing on the reward instead of the input. Like many of the other questions, methods from the interpretable agent category can provide detailed why explanations. The same applies to the agent distillation category, for example, Verma et al. (2018) and Bastani et al. (2018). Methods from the interpretable agent, agent distillation, and feature importance categories are unable to answer why questions concerning long-term consequences. Thus, if stakeholders want explanations containing sequential information, the reward decomposition method by Juozapaitis et al. (2019) or Yau et al. (2020)’s method, which also focuses on the outcome, should be utilized.

The why not questions can be answered by the expected outcome category by contrasting the outcomes of actions. As an example, Juozapaitis et al. (2019)’s method answers why not questions. Other methods, like Yau et al. (2020)’s, also answer why not questions but use time steps instead of rewards. Depending on the functional form of the distilled agent, the agent distillation category can answer these questions. For example, using VIPER (Bastani et al., 2018), a stakeholder can traverse the decision tree to answer a why not question. Additionally, methods from the interpretable agent category can be used.

The what if and how to questions are closely connected. Generative modeling approaches like those of Olson et al. (2021) and Rupprecht et al. (2020) are two ways to answer these counterfactual questions. Madumal et al. (2020)’s method is another approach to answering counterfactual questions via causality. There are also tree-based methods, like those of Liu et al. (2018) and Bastani et al. (2018), that answer these questions by inspecting paths in the decision trees. Similarly, as with most questions, methods from the interpretable agent category can be utilized.

9.3 Future directions

We have seen numerous XRL studies throughout this review. However, some problems remain open. Here, we highlight essential and fruitful avenues for future studies. More specifically, we highlight five directions: (1) state the intent, (2) more research on interpretable agents in the context of XRL, (3) focus on RL specific aspects, (4) satisfying explanation properties, and (5) better evaluation.

Intent

Numerous reviewed studies briefly explain why we need explainability. However, future studies should also state what kind of explainability needs their method specifically aims to satisfy (e.g., debugging or extracting novel insight from the domain). Moreover, they should describe the intended stakeholders and what kind of stakeholder questions the method seeks to answer (e.g., “how can I get the agent to do _?”). Finally, many studies evaluate their methods in toy environments or games, even though the methods might be suitable or intended for other tasks. We urge researchers to state the intent to make it easier for stakeholders to find suitable methods. Accordingly, we believe XRL studies will see broader adoption by stakeholders other than researchers.

Interpretable agent

As pointed out in the supervised learning literature (Burkart & Huber, 2021; Rudin, 2019), are black box agents required, or can we design inherently interpretable agents? For already deployed agents, post hoc explainability is desirable. Nevertheless, interpretable agents are still crucial to XRL since they truly reflect an agent’s behavior rather than producing merely plausible explanations. Moreover, they have several advantages, such as being sample efficient and generalizing better, yet they are less researched in the context of XRL. Therefore, we recommend that future research on XRL focus more on interpretable agents. While much research from other related research areas can fit into this category, we believe more XRL specific research with evaluations explicitly targeting interpretability is needed.

RL specific aspects

Many reviewed RL explanation methods borrow ideas from supervised learning (e.g., saliency methods) and explain the actions using only the immediate context. However, RL differs from supervised learning, and fewer studies try to explain characteristics unique to sequential decision-making. Also, RL can be model-free or model-based. If we use model-based RL, how can we explain the model? For example, why does the model predict \(s^\prime\) as the future state and not some other state \({\hat{s}}\)? How can we create an explanation that coherently explains the agent and the model simultaneously? Furthermore, in the case of partial observability, a policy depends on the history and not just the current state. How can we explain the agent to stakeholders when it is no longer just reactive but also has an internal state? Based on these open problems, we recommend that future studies focus on developing methods that leverage and explain these unique characteristics so that we can fully grasp the reasoning process of these agents.

Explanation properties

Several explanation properties are outlined in Chapter 33 of Murphy et al. (2023). When proposing new XRL methods, more focus should be placed on covering various explanation properties because different situations require different properties. For instance, few reviewed studies have explicitly focused on explaining time-critical decisions. In time-critical decision-making, we need explanations that are generated quickly, are easily understandable, and do not necessarily cover all the reasons for a decision. In contrast, we might want complete explanations in situations without time constraints. In short, we should examine which explanation properties existing methods fulfill and work towards covering others to accommodate the different situations and use cases that various stakeholders have.

XRL evaluation

XAI and XRL are large ecosystems with many components. Apart from the explanation method and the agent, there are other elements like the need for explainability, stakeholders, and explanation properties. A single evaluation without specifying the method’s setup is unsatisfactory. Additionally, developing a holistic evaluation that covers all aspects is impossible. As pointed out by previous studies (Puiutta & Veith, 2020; Heuillet et al., 2021) and Appendix A, most studies evaluate methods using functionally-grounded evaluations. However, functionally-grounded evaluations do not consider the setup. We must carry out evaluations with respect to the task and the stakeholder rather than evaluating without specifications. Although costly, we recommend doing more human-grounded evaluations in future studies and, if possible, application-grounded evaluations.

Besides the aforementioned issues on evaluation, an equally important problem is the lack of standardized user studies. User studies without some standardization make it challenging to compare different studies. Thus, when researchers develop new methods, it is hard to know how the different methods compare and which to use. Consequently, we propose future studies to work toward more comparable standardized user studies.

10 Conclusions

We have systematically searched five electronic databases and reviewed 189 state-of-the-art XRL studies published within the last five years. Moreover, we have systematically obtained ten existing XRL literature reviews and compared them to this review, showing how this review is systematic, more comprehensive, and more up to date. This review proposed a new taxonomy that reflects the XRL studies reviewed and divides methods into three main categories: (1) interpretable agent, (2) intrinsic explainability, and (3) post hoc explainability. The taxonomy also organizes the studies based on how the explanations are communicated to stakeholders: (1) via generation, (2) via representation, or (3) via inspection. Each included study was outlined, had its details extracted, and was organized into the taxonomy. Additionally, we gave an overview of which stakeholder questions can be satisfied by the different taxonomy categories, for example, whether a category of methods can answer questions like “how does the agent work?” and “why did the agent do _?”. Afterward, we outlined trends in XRL and made recommendations for XRL methods based on stakeholder questions. Finally, this review highlighted five future directions in this fast-growing field that tackle challenges hindering broader RL adoption. We intend to unify the XRL field with this review. Moreover, we hope this review can be a resource that helps stakeholders become acquainted with state-of-the-art XRL and find suitable methods to answer their questions. Lastly, we seek to help researchers find research gaps with this review.