
Causal explanation for reinforcement learning: quantifying state and temporal importance

Abstract

Explainability plays an increasingly important role in machine learning. Because reinforcement learning (RL) involves interactions between states and actions over time, it is more challenging to explain an RL policy than a supervised learning model. Furthermore, humans view the world through a causal lens and thus prefer causal explanations over associational ones. Therefore, in this paper, we develop a causal explanation mechanism that quantifies the causal importance of states on actions and how such importance evolves over time. We also demonstrate the advantages of our mechanism over state-of-the-art associational methods for RL policy explanation through a series of simulation studies, including crop irrigation, Blackjack, collision avoidance, and lunar lander.

Data availability

The data used in simulations can be generated by the code in the supplementary file.

Code availability

The code is available in the supplementary file.

References

1. Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540

  2. Bryson AE (1975) Applied optimal control: optimization, estimation and control. CRC Press, Boca Raton

3. Byrne RM (2019) Counterfactuals in explainable artificial intelligence (XAI): Evidence from human reasoning. In: IJCAI, pp 6276–6282

  4. Chattopadhyay A, Manupriya P, Sarkar A, Balasubramanian VN (2019) Neural network attributions: A causal perspective. In: International Conference on Machine Learning, PMLR, pp 981–990

  5. Datta A, Sen S, Zick Y (2016) Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In: 2016 IEEE symposium on security and privacy (SP), IEEE, pp 598–617

  6. Gawlikowski J, Tassi CRN, Ali M, Lee J, Humt M, Feng J, Kruspe A, Triebel R, Jung P, Roscher R, et al. (2021) A survey of uncertainty in deep neural networks. arXiv preprint arXiv:2107.03342

  7. Glymour M, Pearl J, Jewell NP (2016) Causal inference in statistics: A primer. John Wiley & Sons, Hoboken

  8. Greydanus S, Koul A, Dodge J, Fern A (2018) Visualizing and understanding atari agents. In: International Conference on Machine Learning, PMLR, pp 1792–1801

  9. Heuillet A, Couthouis F, Díaz-Rodríguez N (2021) Explainability in deep reinforcement learning. Knowledge-Based Systems 214:106685

  10. Hilton D (2007) Causal explanation: From social perception to knowledge-based causal attribution

  11. Hoyer P, Janzing D, Mooij JM, Peters J, Schölkopf B (2008) Nonlinear causal discovery with additive noise models. Advances in neural information processing systems 21:689–696

  12. Iyer R, Li Y, Li H, Lewis M, Sundar R, Sycara K (2018) Transparency and explanation in deep reinforcement learning neural networks. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp 144–150

  13. Jaiswal A, AbdAlmageed W, Wu Y, Natarajan P (2018) Bidirectional conditional generative adversarial networks. In: Asian Conference on Computer Vision, Springer, pp 216–232

  14. Juozapaitis Z, Koul A, Fern A, Erwig M, Doshi-Velez F (2019) Explainable reinforcement learning via reward decomposition. In: IJCAI/ECAI Workshop on Explainable Artificial Intelligence

  15. Kalainathan D, Goudet O (2019) Causal discovery toolbox: Uncover causal relationships in python. arXiv preprint arXiv:1903.02278

  16. Lopez-Paz D, Nishihara R, Chintala S, Scholkopf B, Bottou L (2017) Discovering causal signals in images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6979–6987

  17. Lundberg S, Lee SI (2017) A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874

  18. Madumal P, Miller T, Sonenberg L, Vetere F (2020) Explainable reinforcement learning through a causal lens. Proceedings of the AAAI Conference on Artificial Intelligence 34:2493–2500

  19. Miller T (2019) Explanation in artificial intelligence: Insights from the social sciences. Artificial intelligence 267:1–38

  20. Mott A, Zoran D, Chrzanowski M, Wierstra D, Rezende DJ (2019) Towards interpretable reinforcement learning using attention augmented agents. arXiv preprint arXiv:1906.02500

  21. Olson ML, Khanna R, Neal L, Li F, Wong WK (2021) Counterfactual state explanations for reinforcement learning agents via generative deep learning. Artificial Intelligence 295:103455

  22. Pearl J (2009) Causality. Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, https://books.google.com/books?id=f4nuexsNVZIC

  23. Peters J, Mooij JM, Janzing D, Schölkopf B (2014) Causal discovery with continuous additive noise models

  24. Puiutta E, Veith E (2020) Explainable reinforcement learning: A survey. In: International cross-domain conference for machine learning and knowledge extraction, Springer, pp 77–95

  25. Puri N, Verma S, Gupta P, Kayastha D, Deshmukh S, Krishnamurthy B, Singh S (2019) Explain your move: Understanding agent actions using specific and relevant feature attribution. arXiv preprint arXiv:1912.12191

26. Ribeiro MT, Singh S, Guestrin C (2016) "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1135–1144

  27. Schwab P, Karlen W (2019) Cxplain: Causal explanations for model interpretation under uncertainty. arXiv preprint arXiv:1910.12336

  28. Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A, Jordan M (2006) A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7(10)

  29. Simonyan K, Vedaldi A, Zisserman A (2013) Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034

  30. Simonyan K, Vedaldi A, Zisserman A (2014) Deep inside convolutional networks: Visualising image classification models and saliency maps

  31. Spirtes P, Glymour CN, Scheines R, Heckerman D (2000) Causation, prediction, and search. MIT press, Cambridge

  32. Sundararajan M, Taly A, Yan Q (2017) Axiomatic attribution for deep networks. In: International Conference on Machine Learning, PMLR, pp 3319–3328

  33. Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press, Cambridge

  34. Van Hasselt H, Guez A, Silver D (2016) Deep reinforcement learning with double q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 30

  35. Verma A, Murali V, Singh R, Kohli P, Chaudhuri S (2018) Programmatically interpretable reinforcement learning. In: International Conference on Machine Learning, PMLR, pp 5045–5054

  36. Wells L, Bednarz T (2021) Explainable ai and reinforcement learning–a systematic review of current approaches and trends. Frontiers in artificial intelligence 4:550030

37. Williams J, Jones C, Kiniry J, Spanel DA (1989) The EPIC crop growth model. Transactions of the ASAE 32(2):497–511

  38. Yang M, Liu F, Chen Z, Shen X, Hao J, Wang J (2021) Causalvae: Disentangled representation learning via neural structural causal models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9593–9602

  39. Zhang K, Zhu S, Kalander M, Ng I, Ye J, Chen Z, Pan L (2021) gcastle: A python toolbox for causal discovery. arXiv preprint arXiv:2111.15155

Funding

The work was partially supported by NSF through grants USDA-020-67021-32855, IIS-1838207, CNS 1901218, and OIA-2134901.

Author information

Corresponding author

Correspondence to Xiaoxiao Wang.

Ethics declarations

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Additional experiments and details

In this section, we provide additional details regarding the crop irrigation problem, the collision avoidance problem, and the Blackjack experiments. Furthermore, we describe our results on an additional testing environment, Lunar Lander.

Fig. 11 The policy we use for the blackjack game. The blue line shows the decision boundary

Fig. 12 The skeleton of the cascading SCM for a 5-step blackjack game

All experiments were conducted on a machine with 8 NVIDIA RTX A5000 GPUs, dual AMD EPYC 7662 CPUs, and 256 GB of RAM.

A.1 Crop irrigation

This section contains details of the crop irrigation experiment.

System dynamics

$$\begin{aligned} \text{Precipitation} &= U(0,1)\\ \text{SolarRadiation} &= U(0,1)\\ \text{Humidity} &= 0.3 \cdot \text{Humidity}_\text{prev} + 0.7 \cdot \text{Precipitation}\\ \text{CropWeight} &= \text{CropWeight}_\text{prev} + 0.07 \cdot \big(1-(0.4 \cdot \text{Humidity} + 0.6 \cdot \text{Irrigation} - \text{Radiation}^2)^2\big) + 0.03 \cdot U(0,1) \end{aligned}$$

The change in CropWeight at each step is determined by humidity, irrigation, and radiation, and maximum growth is achieved when \(0.4 \cdot \text {Humidity} + 0.6 \cdot \text {Irrigation} = \text {Radiation}^2\). An additional exogenous variable is also included in the change of CropWeight; it can be regarded as unobserved confounders that affect growth but are not included in the system dynamics, such as CO\(_2\) concentration or temperature.
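For concreteness, the following is a minimal Python sketch of one simulation step under these dynamics; the function and variable names are our own illustration, not the code in the supplementary file.

```python
import numpy as np

def crop_step(crop_weight_prev, humidity_prev, irrigation, rng=np.random.default_rng()):
    """One step of the crop irrigation dynamics described above (illustrative only)."""
    precipitation = rng.uniform(0, 1)
    radiation = rng.uniform(0, 1)                    # SolarRadiation ~ U(0, 1)
    humidity = 0.3 * humidity_prev + 0.7 * precipitation
    # Growth peaks when 0.4*Humidity + 0.6*Irrigation == Radiation**2.
    growth = 0.07 * (1 - (0.4 * humidity + 0.6 * irrigation - radiation**2) ** 2)
    crop_weight = crop_weight_prev + growth + 0.03 * rng.uniform(0, 1)  # exogenous noise
    return crop_weight, humidity, precipitation, radiation
```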

Policy

$$\text{Irrigation} = \big(\text{Radiation}^2 - 0.4 \cdot \text{Humidity}\big) \cdot \big(1.6 \cdot \text{CropWeight} + 0.2\big) / 0.6$$

The policy we use is a suboptimal policy that multiplies the optimal policy by an additional coefficient \(1.6 \cdot \text {CropWeight} + 0.2\). This causes the irrigation value to be less than optimal when CropWeight is below 0.5 and greater than optimal otherwise.
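In code, this suboptimal policy can be written as follows (again an illustrative sketch, not the supplementary implementation):

```python
def irrigation_policy(crop_weight, humidity, radiation):
    """Suboptimal policy: the optimal irrigation scaled by (1.6*CropWeight + 0.2)."""
    optimal = (radiation**2 - 0.4 * humidity) / 0.6
    return optimal * (1.6 * crop_weight + 0.2)
```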

Training

We use a neural network to learn the causal functions in the SCM. The network has three fully-connected layers, each with a hidden size of four. We use Adam with a learning rate of \(3 \times 10^{-5}\) as the optimizer. The training dataset consists of 1000 trajectories (10000 samples) and the network is trained for 50 epochs.
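A minimal PyTorch sketch of one such structural-function network and its training loop is shown below; the layer sizes, optimizer, and learning rate follow one plausible reading of the description above, while the class and function names are our own and data loading is omitted.

```python
import torch
import torch.nn as nn

class StructuralFunction(nn.Module):
    """Approximates one structural equation f(parents) -> child in the SCM."""
    def __init__(self, n_parents):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_parents, 4), nn.ReLU(),   # three fully-connected layers,
            nn.Linear(4, 4), nn.ReLU(),           # hidden size four
            nn.Linear(4, 1),
        )

    def forward(self, parents):
        return self.net(parents)

def fit(model, parents, child, epochs=50, lr=3e-5):
    """Full-batch training with Adam and MSE loss (schematic)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(parents), child)
        loss.backward()
        opt.step()
    return model
```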

Perturbation

The perturbation value \(\delta \) used in the intervention is 0.1 of the range of each feature.
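To illustrate how such a perturbation is used, the sketch below computes an action-based importance score by intervening on one feature, propagating the intervention through the learned structural functions, and measuring the resulting change in the policy's action, normalized by \(\delta \). This is a schematic outline of the idea, not the exact implementation of Eq. (3); the `intervene` and `policy` callables are placeholders for the user's own code.

```python
import numpy as np

def action_importance(intervene, policy, state, feature, delta):
    """Schematic action-based causal importance of one state feature.

    `intervene(state, feature, value)` is assumed to return the state implied
    by the learned structural equations under do(feature = value), i.e., the
    perturbation is propagated to downstream features; `policy` maps a state
    to an action.
    """
    new_state = intervene(state, feature, state[feature] + delta)
    return np.abs(policy(new_state) - policy(state)) / delta
```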

A.2 Blackjack

This section contains details and additional figures for the blackjack simulation.

System dynamics

This simulation uses the blackjack environment in OpenAI Gym [1]. The goal is to draw cards such that the sum is close to 21 but never exceeds it. Jack, queen, and king have a value of 10; an ace can count as either 1 or 11, and it is called "usable" when it can count as 11 without the hand exceeding 21. We assume the deck is infinite, or equivalently that each card is drawn with replacement.

In each game, the dealer starts with one shown card and one face-down card, while the player starts with two shown cards. The game ends if the player's hand exceeds 21, at which point the player loses. If the player chooses to stick, the dealer reveals the face-down card and draws cards until the dealer's sum is 17 or higher. The player wins if the player's sum is closer to 21 or the dealer goes bust.

Policy

We trained the agent using on-policy Monte-Carlo control. Figure 11 shows the policy and the decision boundary.

SCM structure

We assume the blackjack game has a causal structure as shown in Fig. 7. Additionally, Fig. 12 shows the 5-step cascading SCM we used to test the temporal importance.

Training

We use a neural network to learn the causal functions in the SCM. The network has three fully-connected layers and each layer has a hidden size of four. We use Adam with a learning rate of \(3 \times 10^{-5}\) as the optimizer. The training dataset consists of 50000 trajectories (\(\sim \)76000 samples) and the network is trained for 50 epochs.

Perturbation

Since blackjack has a discrete state space, for numerical features “hand” and “dealer”, we use a perturbation value \(\delta = 1\). For the boolean feature “ace”, we flip its value as the perturbation.
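In code, the discrete perturbation can be written as follows (an illustrative sketch; the state layout follows the hand/dealer/ace features described above):

```python
def perturb_blackjack(state, feature):
    """Perturb one feature of the blackjack state (hand, dealer, usable ace)."""
    hand, dealer, usable_ace = state
    if feature == "hand":
        return (hand + 1, dealer, usable_ace)     # delta = 1 for numerical features
    if feature == "dealer":
        return (hand, dealer + 1, usable_ace)
    return (hand, dealer, not usable_ace)         # flip the boolean "ace" feature
```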

A.3 Collision avoidance problem

We use the collision avoidance problem to further illustrate that our causal method can find a more meaningful importance vector than the saliency map, i.e., better identify which state feature is more impactful to decision-making.

System dynamics

The state \(\textbf{S}_t\) includes the distance from the start \(X_t\), the distance to the end \(D_t\), and the velocity \(V_t\) of the car, i.e., \(\textbf{S}_t := [V_t, X_t, D_t]\), where \(V_t \le v_{\max }\) and \(v_{\max }\) is the maximum speed of the car. The action \(A_t\) is the car’s acceleration, which is bounded \(\vert A_t \vert \le e_{\max }\). The state transition is defined as follows:

$$\begin{aligned} V_{t+1} &:= V_t + A_t\Delta t\\ X_{t+1} &:= X_t + V_t\Delta t + \tfrac{1}{2}A_t\Delta t^2\\ D_{t+1} &:= X_\text{goal} - X_{t+1} \end{aligned}$$

The objective of the RL problem is to find a policy \(\pi \) to minimize the traveling time under the condition that the final velocity is zero at the endpoint (collision avoidance).
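A direct Python translation of these transition equations is given below (our own sketch; clipping is one way the velocity and acceleration bounds might be enforced):

```python
def transition(v, x, d, a, dt, x_goal, v_max, e_max):
    """One step of the collision avoidance dynamics (illustrative sketch)."""
    a = max(-e_max, min(e_max, a))      # enforce |A_t| <= e_max
    v_next = min(v + a * dt, v_max)     # enforce V_t <= v_max
    x_next = x + v * dt + 0.5 * a * dt**2
    d_next = x_goal - x_next
    return v_next, x_next, d_next
```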

Policy

The RL agent learns the optimal control policy, also known as bang-bang control (optimal under certain technical conditions), defined in Eq. (7).
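Since Eq. (7) is not reproduced in this appendix, the sketch below shows a generic minimum-time bang-bang rule for this double-integrator setting, not necessarily the paper's exact policy: accelerate at full \(e_{\max }\) until the remaining distance equals the braking distance, then decelerate at full \(-e_{\max }\).

```python
def bang_bang_policy(v, d, e_max):
    """Generic minimum-time bang-bang rule (an assumption, not the paper's Eq. (7))."""
    braking_distance = v**2 / (2 * e_max)   # distance needed to stop from speed v
    return -e_max if d <= braking_distance else e_max
```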

SCM structure

We use Fig. 5b as the SCM skeleton and use linear regression to learn the structural equations as the entire dynamics are linear.

Perturbation

The perturbation value \(\delta \) used in the intervention is 0.1 after normalization.

A.4 Lunar lander

System dynamics

The lunar lander problem is a simulation environment provided by OpenAI Gym [1]. The goal is to control a rocket to land on the pad at the center of the surface while conserving fuel. The state is an 8-dimensional vector containing the horizontal and vertical coordinates, the horizontal and vertical speed, the angle, the angular speed, and whether the left/right leg is in contact with the ground.

The four possible actions are to fire one of the three engines (main, left, or right) or to do nothing.

The landing pad location is always at (0, 0). The rocket always starts upright at the same height and position but has a random initial acceleration. The shape of the ground is also randomly generated, but the area around the landing pad is guaranteed to be flat.

Policy

We train our RL policy using DQN [34].

Fig. 13 The causal structure of the lunar lander, including the previous state and action. There should also be edges from each feature to the action at its time step, e.g., from x_pos_prev to a_prev or from x_pos to a; these edges are omitted from the graph for simplicity

SCM structure

We use Fig. 13 as the skeleton of the SCM. The structural functions are learned with linear regression using 100 trajectories (\(\sim \)25000 samples).
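A sketch of how such data could be collected and the structural functions fitted with linear regression is shown below. The environment id, the classic (pre-0.26) Gym API, the placeholder policy, and the use of all previous-step features plus the action as parents are assumptions; the exact parent sets come from Fig. 13.

```python
import gym
import numpy as np
from sklearn.linear_model import LinearRegression

env = gym.make("LunarLander-v2")                # assumed environment id
policy = lambda obs: env.action_space.sample()  # placeholder for the trained DQN policy

prev, curr = [], []
for _ in range(100):                            # 100 trajectories (~25000 samples)
    obs, done = env.reset(), False
    while not done:
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        prev.append(np.append(obs, action))     # previous-step features + action
        curr.append(next_obs)
        obs = next_obs

# One linear structural function per current-step feature, regressed on a
# stand-in parent set (all previous-step features and the action).
X, Y = np.array(prev), np.array(curr)
models = [LinearRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]
```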

Fig. 14 A lunar lander trajectory instance used to evaluate our algorithm and the corresponding causal importance vector. The “freefall phase” spans roughly steps 0–70, the “adjusting phase” steps 70–170, and the “touchdown phase” from about step 170 to the end

Evaluation

Figure 14 shows a trajectory of the agent interacting with the lunar lander environment and the corresponding causal importance using our mechanism. We notice that our mechanism discovers three importance peaks, and we explain this as the agent’s decision-making during the landing process consisting of three phases: a “free fall phase”, in which the agent mainly falls straight and slightly adjusts its angle to negate the initial momentum; an “adjusting phase”, in which the agent mostly fires the main engine to reduce the Y-velocity; and a “touchdown phase”, during which the lander is touching the ground and the agent is performing final adjustments to stabilize its angle and speed. Figures 15a, 15b and 15c show our causal importance vector during each of the three phases. We notice that during the “free fall phase”, features such as angle, angular velocity and x-velocity are more important since the agent needs to rotate to negate the initial x-velocity. However, as the rocket approaches the ground during the “adjusting phase”, we find an increase in importance for y-velocity since a high vertical velocity is more dangerous to control when the rocket is closer to the ground. In the last “touchdown phase”, a large x-position and x-velocity importance can be observed as a change in those features is highly likely to cause the lander to fail to land inside the designated landing zone. Since the lander is already touching the ground, it will take much more effort for the agent to adjust compared to when the lander is still high in the air.

Fig. 15 The importance vectors on lunar lander calculated using our method and a comparison with the saliency map method. The solid bars in the first three panels represent the importance of the current-step features, and the shaded bars are for the previous-step features

The results are similar to those of saliency-based algorithms [8], and Fig. 16 shows the difference in the importance vector between our algorithm and the saliency-based algorithm. Note that differences only occur for the positions and the angle, because the other features have no causal paths to the action beyond the direct connection; for those features, the intervention operation is equivalent to the conditioning operation. The position and angle features have an additional causal path through the legs, which causes the difference. Notably, our method assigns higher importance to the angle, which we interpret as the landing angle being crucial and actively managed by the agent.

Fig. 16 Difference between our method and the saliency map method for current-step features

We are also able to compute the importance of the features in the previous steps, and Fig. 14c and the shaded bars in Fig. 15 represent such importance vectors. The previous-step importances are rather similar to those of the current-step features since the size of the time step is comparatively small. However, our algorithm captures that during the “adjusting phase”, the previous-step importance for the angle is in general higher than the current-step importance, as changing the previous angle may have a cascading effect on the trajectory and is especially important to the agent when it is actively adjusting the angle.

Appendix B: Sensitivity analysis

This section performs a sensitivity analysis on how the perturbation amount affects the result of our explanation.

For action-based importance, too small a perturbation may not yield a meaningful result: depending on the environment and the policy, an overly small perturbation may fail to trigger a noticeable change in the action, producing zero importance. This differs from the case of genuinely zero importance, in which the policy disregards the feature when making decisions. In our experiments, we use 0.01 of the range of each continuous feature and the smallest unit for discrete features.

In general, using different perturbation amounts \(\delta \) on the same state in the same SCM may result in different importance vectors, and vectors calculated with different \(\delta \) cannot be meaningfully compared. However, if importance values computed with different \(\delta \) need to be on a comparable scale, we suggest finding the highest importance across all features and all time steps and normalizing all results by that number. Appendix B.2 contains an example comparing the importance score with and without this normalization.
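A small sketch of the suggested normalization, assuming the importance scores are stored in a time-by-feature array:

```python
import numpy as np

def normalize_importance(importance):
    """Divide by the largest importance across all features and time steps.

    `importance` is a (T, F) array of scores; after normalization the peak
    value is 1 regardless of the perturbation amount used.
    """
    peak = np.max(np.abs(importance))
    return importance / peak if peak > 0 else importance
```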

Fig. 17 The importance vector of \(\textbf{S}^{(1)}\) from both our method and the saliency map method with respect to the perturbation amount

Fig. 18 Sensitivity analysis on the collision avoidance problem

Fig. 19 Sensitivity analysis on the lunar lander environment

B.1 One-step MDP

As we demonstrated in the example of one-step MDP in Fig. 3 and Table 1, our importance vector will sometimes be affected by the perturbation amount. For this experiment, we use Fig. 3 as the skeleton and the following settings. The constants are

$$\begin{aligned} c_1 = 1, c_2 = -2, c_3 = 3, c_{12} = 2, c_p = -1 \end{aligned}$$

We use unit Gaussian distributions for the exogenous variables, and the sampled values are

$$\begin{aligned} u_1 = 0.50, u_2 = -0.14, u_3 = 0.65, u_p = 1.52, u_a = -0.23 \end{aligned}$$

The state value and the corresponding action are then

$$\begin{aligned} \textbf{s}^{(1)} = 0.50, \textbf{s}^{(2)} = 0.86, \textbf{s}^{(3)} = -0.88, v_p = 1.52, a = 3.83 \end{aligned}$$

The result of running our method and the saliency map method on the feature \(\textbf{S}^{(1)}\) is shown in Fig. 17. As in Table 1, our algorithm's importance is linear w.r.t. \(\delta \), while the saliency map result is constant. The increased importance comes from the causal link \(\textbf{S}^{(1)} \rightarrow \textbf{S}^{(2)} \rightarrow A\), which also introduces the linear relationship.

Fig. 20 Sensitivity analysis on the Blackjack environment

B.2 Collision avoidance

Figure 18 shows the importance vector of \(X_t\) in the collision avoidance problem, where lines of different colors correspond to different perturbation amounts. Note that, similar to the result shown in Fig. 6b, the importance of \(D_t\) is the same as that of \(X_t\), and \(X_{t-1}\) is the same but shifted by one time step. Other features have negligible importance.

Using different perturbation amounts has two effects: 1) the number of steps with non-zero importance increases as \(\delta \) increases, since a larger \(\delta \) causes states farther from the decision boundary to cross the boundary after the perturbation; 2) the peak importance value decreases. Since we use action-based importance and the action is essentially binary, the difference in importance comes solely from the normalization by \(\delta \) (the denominator in Eq. (3)). If this is undesirable, one remedy is to normalize the result by the highest importance across all features and time steps. The normalized result is shown in Fig. 18b, in which the peak value is one regardless of \(\delta \).

Fig. 21 The skeleton of the SCM of the one-step MDP

B.3 Lunar lander

Figure 19 shows the sensitivity analysis on lunar lander, where lines of different colors correspond to different perturbation amounts. Binary features, including the left and right legs, are not included. The general trend of the result is the same across perturbation amounts, while the value and the exact shape of the curve vary slightly with different \(\delta \); our result is therefore robust w.r.t. \(\delta \).

B.4 Blackjack

Figure 20 shows the sensitivity analysis for blackjack, with lines of different colors representing different perturbation amounts. The binary feature ace is not included. In blackjack, since the smallest legal perturbation amount is one and the range of the values is at most 21, increasing \(\delta \) has a much larger effect on the result. However, the general shape of the curves remains similar, indicating the robustness of our method.

Appendix C: Action-based importance versus Q-value-based importance

This section discusses the comparison between the action-based importance method and the Q-value-based importance method. It demonstrates that the Q-value-based method sometimes fails to reflect the features in the state that the policy relies on.

Consider a one-step MDP with the SCM shown in Fig. 21, where the state \(\textbf{S}=[S_1, S_2]\), \(S_i \in [-1,1]\), \(i=1,2\), and the action \(a \in [-1,1]\). The reward is defined as \(R(\textbf{S},a)= 100\times S_2+ a \times S_1\). Under this setting, the optimal policy is:

$$A = {\left\{ \begin{array}{ll} -1 & S_1 < 0\\ 1 & \text{otherwise} \end{array}\right. }$$

Intuitively, the policy selects the minimum value in the action space when \(S_1\) is negative, and the maximum value otherwise.

The action-based importance method correctly identifies \(S_1\) as more important, as the policy depends only on \(S_1\). However, the Q-value-based method produces a different result. In a one-step MDP, the Q-function is the same as the reward function; since the coefficient of \(S_2\) in the Q (reward) function is larger, the Q-value-based method finds \(S_2\) more important, which differs from the feature the policy actually relies on.
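The contrast can be checked numerically with a few lines; the finite-difference scheme below is our own schematic stand-in for the two importance definitions, evaluated at a state near the policy's decision boundary, and is not the paper's exact procedure.

```python
import numpy as np

def reward(s1, s2, a):
    return 100 * s2 + a * s1             # R(S, a) = 100*S2 + a*S1

def policy(s1, s2):
    return -1.0 if s1 < 0 else 1.0       # optimal policy depends only on S1

s1, s2, delta = -0.05, 0.5, 0.1          # state chosen near the decision boundary

# Action-based importance: change in the action after perturbing each feature.
imp_action = [abs(policy(s1 + delta, s2) - policy(s1, s2)) / delta,
              abs(policy(s1, s2 + delta) - policy(s1, s2)) / delta]

# Q-value-based importance: change in Q (= reward in a one-step MDP) under the
# same action after perturbing each feature.
a = policy(s1, s2)
imp_q = [abs(reward(s1 + delta, s2, a) - reward(s1, s2, a)) / delta,
         abs(reward(s1, s2 + delta, a) - reward(s1, s2, a)) / delta]

print(imp_action)   # [20.0, 0.0]: only S1 can change the action
print(imp_q)        # [1.0, 100.0]: S2 dominates even though the policy ignores it
```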

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, X., Meng, F., Liu, X. et al. Causal explanation for reinforcement learning: quantifying state and temporal importance. Appl Intell 53, 22546–22564 (2023). https://doi.org/10.1007/s10489-023-04649-7
