
Causal explanation for reinforcement learning: quantifying state and temporal importance

Abstract

Explainability plays an increasingly important role in machine learning. Because reinforcement learning (RL) involves interactions between states and actions over time, it is more challenging to explain an RL policy than a supervised learning model. Furthermore, humans view the world through a causal lens and thus prefer causal explanations over associational ones. Therefore, in this paper, we develop a causal explanation mechanism that quantifies the causal importance of states on actions and how such importance evolves over time. We also demonstrate the advantages of our mechanism over state-of-the-art associational methods for RL policy explanation through a series of simulation studies, including crop irrigation, Blackjack, collision avoidance, and lunar lander.

Data availability

The data used in simulations can be generated by the code in the supplementary file.

Code availability

The code is available in the supplementary file.

References

1. Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540

  2. Bryson AE (1975) Applied optimal control: optimization, estimation and control. CRC Press, Boca Raton

3. Byrne RM (2019) Counterfactuals in explainable artificial intelligence (XAI): Evidence from human reasoning. In: IJCAI, pp 6276–6282

  4. Chattopadhyay A, Manupriya P, Sarkar A, Balasubramanian VN (2019) Neural network attributions: A causal perspective. In: International Conference on Machine Learning, PMLR, pp 981–990

  5. Datta A, Sen S, Zick Y (2016) Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In: 2016 IEEE symposium on security and privacy (SP), IEEE, pp 598–617

  6. Gawlikowski J, Tassi CRN, Ali M, Lee J, Humt M, Feng J, Kruspe A, Triebel R, Jung P, Roscher R, et al. (2021) A survey of uncertainty in deep neural networks. arXiv preprint arXiv:2107.03342

  7. Glymour M, Pearl J, Jewell NP (2016) Causal inference in statistics: A primer. John Wiley & Sons, Hoboken

  8. Greydanus S, Koul A, Dodge J, Fern A (2018) Visualizing and understanding atari agents. In: International Conference on Machine Learning, PMLR, pp 1792–1801

  9. Heuillet A, Couthouis F, Díaz-Rodríguez N (2021) Explainability in deep reinforcement learning. Knowledge-Based Systems 214:106685

  10. Hilton D (2007) Causal explanation: From social perception to knowledge-based causal attribution

  11. Hoyer P, Janzing D, Mooij JM, Peters J, Schölkopf B (2008) Nonlinear causal discovery with additive noise models. Advances in neural information processing systems 21:689–696

  12. Iyer R, Li Y, Li H, Lewis M, Sundar R, Sycara K (2018) Transparency and explanation in deep reinforcement learning neural networks. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp 144–150

  13. Jaiswal A, AbdAlmageed W, Wu Y, Natarajan P (2018) Bidirectional conditional generative adversarial networks. In: Asian Conference on Computer Vision, Springer, pp 216–232

  14. Juozapaitis Z, Koul A, Fern A, Erwig M, Doshi-Velez F (2019) Explainable reinforcement learning via reward decomposition. In: IJCAI/ECAI Workshop on Explainable Artificial Intelligence

  15. Kalainathan D, Goudet O (2019) Causal discovery toolbox: Uncover causal relationships in python. arXiv preprint arXiv:1903.02278

  16. Lopez-Paz D, Nishihara R, Chintala S, Scholkopf B, Bottou L (2017) Discovering causal signals in images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6979–6987

  17. Lundberg S, Lee SI (2017) A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874

  18. Madumal P, Miller T, Sonenberg L, Vetere F (2020) Explainable reinforcement learning through a causal lens. Proceedings of the AAAI Conference on Artificial Intelligence 34:2493–2500

  19. Miller T (2019) Explanation in artificial intelligence: Insights from the social sciences. Artificial intelligence 267:1–38

  20. Mott A, Zoran D, Chrzanowski M, Wierstra D, Rezende DJ (2019) Towards interpretable reinforcement learning using attention augmented agents. arXiv preprint arXiv:1906.02500

  21. Olson ML, Khanna R, Neal L, Li F, Wong WK (2021) Counterfactual state explanations for reinforcement learning agents via generative deep learning. Artificial Intelligence 295:103455

  22. Pearl J (2009) Causality. Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, https://books.google.com/books?id=f4nuexsNVZIC

  23. Peters J, Mooij JM, Janzing D, Schölkopf B (2014) Causal discovery with continuous additive noise models

  24. Puiutta E, Veith E (2020) Explainable reinforcement learning: A survey. In: International cross-domain conference for machine learning and knowledge extraction, Springer, pp 77–95

  25. Puri N, Verma S, Gupta P, Kayastha D, Deshmukh S, Krishnamurthy B, Singh S (2019) Explain your move: Understanding agent actions using specific and relevant feature attribution. arXiv preprint arXiv:1912.12191

26. Ribeiro MT, Singh S, Guestrin C (2016) "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1135–1144

  27. Schwab P, Karlen W (2019) Cxplain: Causal explanations for model interpretation under uncertainty. arXiv preprint arXiv:1910.12336

  28. Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A, Jordan M (2006) A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7(10)

  29. Simonyan K, Vedaldi A, Zisserman A (2013) Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034

  30. Simonyan K, Vedaldi A, Zisserman A (2014) Deep inside convolutional networks: Visualising image classification models and saliency maps

  31. Spirtes P, Glymour CN, Scheines R, Heckerman D (2000) Causation, prediction, and search. MIT press, Cambridge

  32. Sundararajan M, Taly A, Yan Q (2017) Axiomatic attribution for deep networks. In: International Conference on Machine Learning, PMLR, pp 3319–3328

  33. Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press, Cambridge

  34. Van Hasselt H, Guez A, Silver D (2016) Deep reinforcement learning with double q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 30

  35. Verma A, Murali V, Singh R, Kohli P, Chaudhuri S (2018) Programmatically interpretable reinforcement learning. In: International Conference on Machine Learning, PMLR, pp 5045–5054

  36. Wells L, Bednarz T (2021) Explainable ai and reinforcement learning–a systematic review of current approaches and trends. Frontiers in artificial intelligence 4:550030

37. Williams J, Jones C, Kiniry J, Spanel DA (1989) The EPIC crop growth model. Transactions of the ASAE 32(2):497–511

  38. Yang M, Liu F, Chen Z, Shen X, Hao J, Wang J (2021) Causalvae: Disentangled representation learning via neural structural causal models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9593–9602

  39. Zhang K, Zhu S, Kalander M, Ng I, Ye J, Chen Z, Pan L (2021) gcastle: A python toolbox for causal discovery. arXiv preprint arXiv:2111.15155

Funding

The work was partially supported by NSF through grants USDA-020-67021-32855, IIS-1838207, CNS 1901218, and OIA-2134901.

Author information

Corresponding author

Correspondence to Xiaoxiao Wang.

Ethics declarations

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Additional experiments and details

In this section, we provide additional details regarding the crop irrigation problem, the collision avoidance problem, and the Blackjack experiments. Furthermore, we describe our results on an additional testing environment, Lunar Lander.

Fig. 11 The policy we use for the blackjack game. The blue line shows the decision boundary

Fig. 12 The skeleton of the cascading SCM for a 5-step blackjack game

All experiments were conducted on a machine with 8 NVIDIA RTX A5000 GPUs, dual AMD EPYC 7662 CPUs, and 256 GB of RAM.

A.1 Crop irrigation

This section contains details of the crop irrigation experiment.

System dynamics

$$\begin{aligned} \text{Precipitation} &= U(0,1)\\ \text{SolarRadiation} &= U(0,1)\\ \text{Humidity} &= 0.3 \cdot \text{Humidity}_\text{prev} + 0.7 \cdot \text{Precipitation}\\ \text{CropWeight} &= \text{CropWeight}_\text{prev} + 0.07 \cdot \big(1-(0.4 \cdot \text{Humidity} + 0.6 \cdot \text{Irrigation} - \text{Radiation}^2)^2\big) + 0.03 \cdot U(0,1) \end{aligned}$$

The change in CropWeight at each step is determined by humidity, irrigation, and radiation, and maximum growth is achieved when \(0.4 \cdot \text {Humidity} + 0.6 \cdot \text {Irrigation} = \text {Radiation}^2\). An additional exogenous variable is also included in the change of CropWeight; it can be regarded as unobserved confounders that affect growth but are not included in the system dynamics, such as CO\(_2\) concentration or temperature.
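For concreteness, the following is a minimal Python sketch of one simulation step under these dynamics; the function and variable names are our own illustration, not the code in the supplementary file.

```python
import numpy as np

def crop_step(crop_weight_prev, humidity_prev, irrigation, rng=np.random.default_rng()):
    """One step of the crop irrigation dynamics described above (illustrative only)."""
    precipitation = rng.uniform(0, 1)
    radiation = rng.uniform(0, 1)                    # SolarRadiation ~ U(0, 1)
    humidity = 0.3 * humidity_prev + 0.7 * precipitation
    # Growth peaks when 0.4*Humidity + 0.6*Irrigation == Radiation**2.
    growth = 0.07 * (1 - (0.4 * humidity + 0.6 * irrigation - radiation**2) ** 2)
    crop_weight = crop_weight_prev + growth + 0.03 * rng.uniform(0, 1)  # exogenous noise
    return crop_weight, humidity, precipitation, radiation
```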

Policy

$$\text{Irrigation} = \big(\text{Radiation}^2 - 0.4 \cdot \text{Humidity}\big) \cdot \big(1.6 \cdot \text{CropWeight} + 0.2\big) / 0.6$$

The policy we use is a suboptimal policy that multiplies the optimal policy by an additional coefficient \(1.6 \cdot \text {CropWeight} + 0.2\). This causes the irrigation value to be less than optimal when CropWeight is below 0.5 and greater than optimal otherwise.
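In code, this suboptimal policy can be written as follows (again an illustrative sketch, not the supplementary implementation):

```python
def irrigation_policy(crop_weight, humidity, radiation):
    """Suboptimal policy: the optimal irrigation scaled by (1.6*CropWeight + 0.2)."""
    optimal = (radiation**2 - 0.4 * humidity) / 0.6
    return optimal * (1.6 * crop_weight + 0.2)
```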

Training

We use a neural network to learn the causal functions in the SCM. The network has three fully-connected layers, each with a hidden size of four. We use Adam with a learning rate of \(3 \times 10^{-5}\) as the optimizer. The training dataset consists of 1000 trajectories (10000 samples) and the network is trained for 50 epochs.
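A minimal PyTorch sketch of one such structural-function network and its training loop is shown below; the layer sizes, optimizer, and learning rate follow one plausible reading of the description above, while the class and function names are our own and data loading is omitted.

```python
import torch
import torch.nn as nn

class StructuralFunction(nn.Module):
    """Approximates one structural equation f(parents) -> child in the SCM."""
    def __init__(self, n_parents):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_parents, 4), nn.ReLU(),   # three fully-connected layers,
            nn.Linear(4, 4), nn.ReLU(),           # hidden size four
            nn.Linear(4, 1),
        )

    def forward(self, parents):
        return self.net(parents)

def fit(model, parents, child, epochs=50, lr=3e-5):
    """Full-batch training with Adam and MSE loss (schematic)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(parents), child)
        loss.backward()
        opt.step()
    return model
```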

Perturbation

The perturbation value \(\delta \) used in the intervention is 0.1 of the range of each feature.
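To illustrate how such a perturbation is used, the sketch below computes an action-based importance score by intervening on one feature, propagating the intervention through the learned structural functions, and measuring the resulting change in the policy's action, normalized by \(\delta \). This is a schematic outline of the idea, not the exact implementation of Eq. (3); the `intervene` and `policy` callables are placeholders for the user's own code.

```python
import numpy as np

def action_importance(intervene, policy, state, feature, delta):
    """Schematic action-based causal importance of one state feature.

    `intervene(state, feature, value)` is assumed to return the state implied
    by the learned structural equations under do(feature = value), i.e., the
    perturbation is propagated to downstream features; `policy` maps a state
    to an action.
    """
    new_state = intervene(state, feature, state[feature] + delta)
    return np.abs(policy(new_state) - policy(state)) / delta
```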

A.2 Blackjack

This section contains details and additional figures for the blackjack simulation.

System dynamics

This simulation uses the blackjack environment in OpenAI Gym [1]. The goal is to draw cards such that the sum is close to 21 but never exceeds it. Jack, queen, and king have a value of 10; an ace can count as either 1 or 11, and it is called "usable" when it can count as 11 without the hand exceeding 21. We assume the deck is infinite, or equivalently that each card is drawn with replacement.

In each game, the dealer starts with one shown card and one face-down card, while the player starts with two shown cards. The game ends if the player's hand exceeds 21, at which point the player loses. If the player chooses to stick, the dealer reveals the face-down card and draws cards until the dealer's sum is 17 or higher. The player wins if the player's sum is closer to 21 or the dealer goes bust.

Policy

We trained the agent using on-policy Monte-Carlo control. Figure 11 shows the policy and the decision boundary.

SCM structure

We assume the blackjack game has a causal structure as shown in Fig. 7. Additionally, Fig. 12 shows the 5-step cascading SCM we used to test the temporal importance.

Training

We use a neural network to learn the causal functions in the SCM. The network has three fully-connected layers and each layer has a hidden size of four. We use Adam with a learning rate of \(3 \times 10^{-5}\) as the optimizer. The training dataset consists of 50000 trajectories (\(\sim \)76000 samples) and the network is trained for 50 epochs.

Perturbation

Since blackjack has a discrete state space, for numerical features “hand” and “dealer”, we use a perturbation value \(\delta = 1\). For the boolean feature “ace”, we flip its value as the perturbation.
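In code, the discrete perturbation can be written as follows (an illustrative sketch; the state layout follows the hand/dealer/ace features described above):

```python
def perturb_blackjack(state, feature):
    """Perturb one feature of the blackjack state (hand, dealer, usable ace)."""
    hand, dealer, usable_ace = state
    if feature == "hand":
        return (hand + 1, dealer, usable_ace)     # delta = 1 for numerical features
    if feature == "dealer":
        return (hand, dealer + 1, usable_ace)
    return (hand, dealer, not usable_ace)         # flip the boolean "ace" feature
```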

A.3 Collision avoidance problem

We use the collision avoidance problem to further illustrate that our causal method can find a more meaningful importance vector than the saliency map, i.e., better identify which state feature is more impactful to decision-making.

System dynamics

The state \(\textbf{S}_t\) includes the distance from the start \(X_t\), the distance to the end \(D_t\), and the velocity \(V_t\) of the car, i.e., \(\textbf{S}_t := [V_t, X_t, D_t]\), where \(V_t \le v_{\max }\) and \(v_{\max }\) is the maximum speed of the car. The action \(A_t\) is the car’s acceleration, which is bounded \(\vert A_t \vert \le e_{\max }\). The state transition is defined as follows:

$$\begin{aligned} V_{t+1} &:= V_t + A_t\Delta t\\ X_{t+1} &:= X_t + V_t\Delta t + \tfrac{1}{2}A_t\Delta t^2\\ D_{t+1} &:= X_\text{goal} - X_{t+1} \end{aligned}$$

The objective of the RL problem is to find a policy \(\pi \) to minimize the traveling time under the condition that the final velocity is zero at the endpoint (collision avoidance).
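A direct Python translation of these transition equations is given below (our own sketch; clipping is one way the velocity and acceleration bounds might be enforced):

```python
def transition(v, x, d, a, dt, x_goal, v_max, e_max):
    """One step of the collision avoidance dynamics (illustrative sketch)."""
    a = max(-e_max, min(e_max, a))      # enforce |A_t| <= e_max
    v_next = min(v + a * dt, v_max)     # enforce V_t <= v_max
    x_next = x + v * dt + 0.5 * a * dt**2
    d_next = x_goal - x_next
    return v_next, x_next, d_next
```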

Policy

The RL agent learns the optimal control policy, also known as bang-bang control (optimal under certain technical conditions), defined in Eq. (7).
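Since Eq. (7) is not reproduced in this appendix, the sketch below shows a generic minimum-time bang-bang rule for this double-integrator setting, not necessarily the paper's exact policy: accelerate at full \(e_{\max }\) until the remaining distance equals the braking distance, then decelerate at full \(-e_{\max }\).

```python
def bang_bang_policy(v, d, e_max):
    """Generic minimum-time bang-bang rule (an assumption, not the paper's Eq. (7))."""
    braking_distance = v**2 / (2 * e_max)   # distance needed to stop from speed v
    return -e_max if d <= braking_distance else e_max
```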

SCM structure

We use Fig. 5b as the SCM skeleton and use linear regression to learn the structural equations as the entire dynamics are linear.

Perturbation

The perturbation value \(\delta \) used in the intervention is 0.1 after normalization.

A.4 Lunar lander

System dynamics

The lunar lander problem is a simulation environment provided by OpenAI Gym [1]. The goal is to control a rocket to land on the pad at the center of the surface while conserving fuel. The state is an 8-dimensional vector containing the horizontal and vertical coordinates, the horizontal and vertical speed, the angle, the angular speed, and whether the left/right leg is in contact with the ground.

The four possible actions are to fire one of the three engines (main, left, or right) or to do nothing.

The landing pad location is always at (0, 0). The rocket always starts upright at the same height and position but has a random initial acceleration. The shape of the ground is also randomly generated, but the area around the landing pad is guaranteed to be flat.

Policy

We train our RL policy using DQN [34].

Fig. 13 The causal structure of the lunar lander, including the previous state and action. There should also be edges from each feature to the action at its time step, e.g., from x_pos_prev to a_prev or from x_pos to a; these edges are omitted from the graph for simplicity

SCM structure

We use Fig. 13 as the skeleton of the SCM. The structural functions are learned with linear regression using 100 trajectories (\(\sim \)25000 samples).
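A sketch of how such data could be collected and the structural functions fitted with linear regression is shown below. The environment id, the classic (pre-0.26) Gym API, the placeholder policy, and the use of all previous-step features plus the action as parents are assumptions; the exact parent sets come from Fig. 13.

```python
import gym
import numpy as np
from sklearn.linear_model import LinearRegression

env = gym.make("LunarLander-v2")                # assumed environment id
policy = lambda obs: env.action_space.sample()  # placeholder for the trained DQN policy

prev, curr = [], []
for _ in range(100):                            # 100 trajectories (~25000 samples)
    obs, done = env.reset(), False
    while not done:
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        prev.append(np.append(obs, action))     # previous-step features + action
        curr.append(next_obs)
        obs = next_obs

# One linear structural function per current-step feature, regressed on a
# stand-in parent set (all previous-step features and the action).
X, Y = np.array(prev), np.array(curr)
models = [LinearRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]
```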

Fig. 14 A lunar lander trajectory instance used to evaluate our algorithm and the corresponding causal importance vector. The “freefall phase” spans roughly steps 0–70, the “adjusting phase” steps 70–170, and the “touchdown phase” from about step 170 to the end

Evaluation

Figure 14 shows a trajectory of the agent interacting with the lunar lander environment and the corresponding causal importance using our mechanism. We notice that our mechanism discovers three importance peaks, and we explain this as the agent’s decision-making during the landing process consisting of three phases: a “free fall phase”, in which the agent mainly falls straight and slightly adjusts its angle to negate the initial momentum; an “adjusting phase”, in which the agent mostly fires the main engine to reduce the Y-velocity; and a “touchdown phase”, during which the lander is touching the ground and the agent is performing final adjustments to stabilize its angle and speed. Figures 15a, 15b and 15c show our causal importance vector during each of the three phases. We notice that during the “free fall phase”, features such as angle, angular velocity and x-velocity are more important since the agent needs to rotate to negate the initial x-velocity. However, as the rocket approaches the ground during the “adjusting phase”, we find an increase in importance for y-velocity since a high vertical velocity is more dangerous to control when the rocket is closer to the ground. In the last “touchdown phase”, a large x-position and x-velocity importance can be observed as a change in those features is highly likely to cause the lander to fail to land inside the designated landing zone. Since the lander is already touching the ground, it will take much more effort for the agent to adjust compared to when the lander is still high in the air.

Fig. 15 The importance vectors on lunar lander calculated using our method and a comparison with the saliency map method. The solid bars in the first three panels represent the importance of the current-step features, and the shaded bars are for the previous-step features

The results are similar to those of saliency-based algorithms [8], and Fig. 16 shows the difference in the importance vector between our algorithm and the saliency-based algorithm. Note that differences only occur for the positions and the angle, because the other features have no causal paths to the action beyond the direct connection; for those features, the intervention operation is equivalent to the conditioning operation. The position and angle features have an additional causal path through the legs, which causes the difference. Notably, our method assigns higher importance to the angle, which we interpret as the landing angle being crucial and actively managed by the agent.

Fig. 16 Difference between our method and the saliency map method for current-step features

We are also able to compute the importance of the features in the previous steps, and Fig. 14c and the shaded bars in Fig. 15 represent such importance vectors. The previous-step importances are rather similar to those of the current-step features since the size of the time step is comparatively small. However, our algorithm captures that during the “adjusting phase”, the previous-step importance for the angle is in general higher than the current-step importance, as changing the previous angle may have a cascading effect on the trajectory and is especially important to the agent when it is actively adjusting the angle.

Appendix B: Sensitivity analysis

This section performs a sensitivity analysis on how the perturbation amount affects the result of our explanation.

For action-based importance, too small a perturbation may not yield a meaningful result: depending on the environment and the policy, an overly small perturbation may fail to trigger a noticeable change in the action, producing zero importance. This differs from the case of genuinely zero importance, in which the policy disregards the feature when making decisions. In our experiments, we use 0.01 of the range of each continuous feature and the smallest unit for discrete features.

In general, using different perturbation amounts \(\delta \) on the same state in the same SCM may result in different importance vectors, and vectors calculated with different \(\delta \) cannot be meaningfully compared. However, if importance values computed with different \(\delta \) need to be on a comparable scale, we suggest finding the highest importance across all features and all time steps and normalizing all results by that number. Appendix B.2 contains an example comparing the importance score with and without this normalization.
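A small sketch of the suggested normalization, assuming the importance scores are stored in a time-by-feature array:

```python
import numpy as np

def normalize_importance(importance):
    """Divide by the largest importance across all features and time steps.

    `importance` is a (T, F) array of scores; after normalization the peak
    value is 1 regardless of the perturbation amount used.
    """
    peak = np.max(np.abs(importance))
    return importance / peak if peak > 0 else importance
```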

Fig. 17 The importance vector of \(\textbf{S}^{(1)}\) from both our method and the saliency map method with respect to the perturbation amount

Fig. 18 Sensitivity analysis on the collision avoidance problem

Fig. 19 Sensitivity analysis on the lunar lander environment

B.1 One-step MDP

As we demonstrated in the example of one-step MDP in Fig. 3 and Table 1, our importance vector will sometimes be affected by the perturbation amount. For this experiment, we use Fig. 3 as the skeleton and the following settings. The constants are

$$\begin{aligned} c_1 = 1, c_2 = -2, c_3 = 3, c_{12} = 2, c_p = -1 \end{aligned}$$

We use unit Gaussian distributions for the exogenous variables, and the sampled values are

$$\begin{aligned} u_1 = 0.50, u_2 = -0.14, u_3 = 0.65, u_p = 1.52, u_a = -0.23 \end{aligned}$$

The state value and the corresponding action are then

$$\begin{aligned} \textbf{s}^{(1)} = 0.50, \textbf{s}^{(2)} = 0.86, \textbf{s}^{(3)} = -0.88, v_p = 1.52, a = 3.83 \end{aligned}$$

The result of running our method and the saliency map method on the feature \(\textbf{S}^{(1)}\) is shown in Fig. 17. As in Table 1, our algorithm's importance is linear w.r.t. \(\delta \), while the saliency map result is constant. The increased importance comes from the causal link \(\textbf{S}^{(1)} \rightarrow \textbf{S}^{(2)} \rightarrow A\), which also introduces the linear relationship.

Fig. 20 Sensitivity analysis on the Blackjack environment

B.2 Collision avoidance

Figure 18 shows the importance vector of \(X_t\) in the collision avoidance problem, where lines of different colors correspond to different perturbation amounts. Note that, similar to the result shown in Fig. 6b, the importance of \(D_t\) is the same as that of \(X_t\), and \(X_{t-1}\) is the same but shifted by one time step. Other features have negligible importance.

Using different perturbation amounts has two effects: 1) the number of steps with non-zero importance increases as \(\delta \) increases, since a larger \(\delta \) causes states farther from the decision boundary to cross the boundary after the perturbation; 2) the peak importance value decreases. Since we use action-based importance and the action is essentially binary, the difference in importance comes solely from the normalization by \(\delta \) (the denominator in Eq. (3)). If this is undesirable, one remedy is to normalize the result by the highest importance across all features and time steps. The normalized result is shown in Fig. 18b, in which the peak value is one regardless of \(\delta \).

Fig. 21 The skeleton of the SCM of the one-step MDP

B.3 Lunar lander

Figure 19 shows the sensitivity analysis on lunar lander, where lines of different colors correspond to different perturbation amounts. Binary features, including the left and right legs, are not included. The general trend of the result is the same across perturbation amounts, while the value and the exact shape of the curve vary slightly with different \(\delta \); our result is therefore robust w.r.t. \(\delta \).

B.4 Blackjack

Figure 20 shows the sensitivity analysis for blackjack, with lines of different colors representing different perturbation amounts. The binary feature ace is not included. In blackjack, since the smallest legal perturbation amount is one and the range of the values is at most 21, increasing \(\delta \) has a much larger effect on the result. However, the general shape of the curves remains similar, indicating the robustness of our method.

Appendix C: Action-based importance versus Q-value-based importance

This section discusses the comparison between the action-based importance method and the Q-value-based importance method. It demonstrates that the Q-value-based method sometimes fails to reflect the features in the state that the policy relies on.

Consider a one-step MDP with the SCM shown in Fig. 21, where the state \(\textbf{S}=[S_1, S_2]\), \(S_i \in [-1,1]\), \(i=1,2\), and the action \(a \in [-1,1]\). The reward is defined as \(R(\textbf{S},a)= 100\times S_2+ a \times S_1\). Under this setting, the optimal policy is:

$$A = {\left\{ \begin{array}{ll} -1 & S_1 < 0\\ 1 & \text{otherwise} \end{array}\right. }$$

Intuitively, the policy selects the minimum value in the action space when \(S_1\) is negative, and the maximum value otherwise.

The action-based importance method correctly identifies \(S_1\) as more important, as the policy depends only on \(S_1\). However, the Q-value-based method produces a different result. In a one-step MDP, the Q-function is the same as the reward function; since the coefficient of \(S_2\) in the Q (reward) function is larger, the Q-value-based method finds \(S_2\) more important, which differs from the feature the policy actually relies on.
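The contrast can be checked numerically with a few lines; the finite-difference scheme below is our own schematic stand-in for the two importance definitions, evaluated at a state near the policy's decision boundary, and is not the paper's exact procedure.

```python
import numpy as np

def reward(s1, s2, a):
    return 100 * s2 + a * s1             # R(S, a) = 100*S2 + a*S1

def policy(s1, s2):
    return -1.0 if s1 < 0 else 1.0       # optimal policy depends only on S1

s1, s2, delta = -0.05, 0.5, 0.1          # state chosen near the decision boundary

# Action-based importance: change in the action after perturbing each feature.
imp_action = [abs(policy(s1 + delta, s2) - policy(s1, s2)) / delta,
              abs(policy(s1, s2 + delta) - policy(s1, s2)) / delta]

# Q-value-based importance: change in Q (= reward in a one-step MDP) under the
# same action after perturbing each feature.
a = policy(s1, s2)
imp_q = [abs(reward(s1 + delta, s2, a) - reward(s1, s2, a)) / delta,
         abs(reward(s1, s2 + delta, a) - reward(s1, s2, a)) / delta]

print(imp_action)   # [20.0, 0.0]: only S1 can change the action
print(imp_q)        # [1.0, 100.0]: S2 dominates even though the policy ignores it
```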

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, X., Meng, F., Liu, X. et al. Causal explanation for reinforcement learning: quantifying state and temporal importance. Appl Intell 53, 22546–22564 (2023). https://doi.org/10.1007/s10489-023-04649-7
