A survey on interpretable reinforcement learning

Published in Machine Learning


Although deep reinforcement learning has become a promising machine learning approach for sequential decision-making problems, it is still not mature enough for high-stakes domains such as autonomous driving or medical applications. In such contexts, a learned policy needs, for instance, to be interpretable, so that it can be inspected before any deployment (e.g., for safety and verifiability reasons). This survey provides an overview of various approaches to achieving higher interpretability in reinforcement learning (RL). To that end, we distinguish between interpretability (as an intrinsic property of a model) and explainability (as a post-hoc operation) and discuss them in the context of RL, with an emphasis on the former notion. In particular, we argue that interpretable RL may embrace different facets: interpretable inputs, interpretable (transition/reward) models, and interpretable decision-making. Based on this scheme, we summarize and analyze recent work related to interpretable RL, with an emphasis on papers published in the past 10 years. We also briefly discuss some related research areas and point to some promising research directions, notably related to the recent development of foundation models (e.g., large language models, RL from human feedback).


  1. See Puterman (1994) or Bertsekas and Tsitsiklis (1996) for a more complete discussion.

  2. A functional understanding “relies on an appreciation for functions, goals, and purpose” while a mechanistic understanding “relies on an appreciation of parts, processes, and proximate causal mechanisms” (Páez, 2019).

  3. Some objectual understanding is particularly beneficial when considering legal accountability and public responsibility.

  4. Recall that, in contrast to propositional logic (i.e., Boolean vector representation), first-order logic (FOL) describes the world in terms of objects, predicates (i.e., relations between objects), and functions (i.e., objects defined from other objects).

  5. PRMs may be understood as “relational” extensions of “propositional” Bayesian networks.

  6. Relational Dynamic Influence Diagram Language (RDDL; Sanner, 2011), which extends dynamic Bayesian networks (DBNs) with state-dependent rewards aggregated over objects, is able to model parallel effects. In contrast, Probabilistic Planning Domain Definition Language (PPDDL; Younes & Littman, 2004) employs action-transition-based rewards and models correlated effects. Note that Guestrin et al. (2003) assume static representations, which are unfit for real-world dynamics or relational environments such as Blocks World.

  7. Physical theories are a typical example of this practice, where laws (such as the laws of motion) are reused across instantiations and scenes with various primitive entities.

  8. A common assumption in contemporary cognitive science is that these representations have to emerge in strong dependence on the actions and goals of the agent (enacted) and the environment (situated).

  9. In contrast, the work in RL+SP mentioned in Sect. 4.2 does not assume similar high-level (HL) domain knowledge, and aims to learn the mapping from the low-level domain to high-level symbols.

  10. For instance, Serafini and d’Avila Garcez (2016) use an FOL-based loss function to constrain the learned semantic representations to be logically consistent.


  • Adjodah, D., Klinger, T., & Joseph, J. (2018). Symbolic relation networks for reinforcement learning. In NeurIPS workshop on representation learning.

  • Agnew, W., & Domingos, P. (2018). Unsupervised object-level deep reinforcement learning. In NeurIPS workshop on deep RL.

  • Akrour, R., Tateo, D., & Peters, J. (2019). Towards reinforcement learning of human readable policies. In Workshop on deep continuous-discrete machine learning.

  • Aksaray, D., Jones, A., Kong, Z., et al. (2016). Q-Learning for robust satisfaction of signal temporal logic specifications. In CDC.

  • Alharin, A., Doan, T. N., & Sartipi, M. (2020). Reinforcement learning interpretation methods: A survey. IEEE Access, 8, 171058–171077.

  • Alshiekh, M., Bloem, R., Ehlers, R., et al. (2018). Safe reinforcement learning via shielding. In AAAI.

  • Amodei, D., Olah, C., Steinhardt, J., et al. (2016). Concrete Problems in AI Safety. arXiv:1606.06565

  • Ananny, M., & Crawford, K. (2018). Seeing without knowing: Limitations of the transparency ideal and its application to algorithmic accountability. New Media and Society, 20(3), 973–89.

  • Andersen, G., & Konidaris, G. (2017). Active exploration for learning symbolic representations. In NeurIPS.

  • Anderson, G., Verma, A., Dillig, I., et al. (2020). Neurosymbolic reinforcement learning with formally verified exploration. In NeurIPS.

  • Andreas, J., Klein, D., & Levine, S. (2017). Modular multitask reinforcement learning with policy sketches. In ICML.

  • Annasamy, R. M., & Sycara, K. (2019). Towards better interpretability in deep Q-networks. In AAAI.

  • Arnold, T., Kasenberg, D., & Scheutz, M. (2017). Value alignment or misalignment: What will keep systems accountable? In AAAI workshop.

  • Arora, S., & Doshi, P. (2018). A survey of inverse reinforcement learning: Challenges, methods and progress. arXiv:1806.06877

  • Atrey, A., Clary, K., & Jensen, D. (2020). Exploratory not explanatory: Counterfactual analysis of saliency maps for deep reinforcement learning. In ICLR.

  • Ault, J., Hanna, J. P., & Sharon, G. (2020). Learning an interpretable traffic signal control policy. In AAMAS.

  • Bader, S., & Hitzler, P. (2005). Dimensions of neural-symbolic integration: A structured survey. In We Will Show Them: Essays in Honour of Dov Gabbay.

  • Barredo Arrieta, A., Díaz-Rodríguez, N., Ser, J. D., et al. (2020). Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115.

  • Barto, A. G., & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems

  • Barwise, J. (1977). An introduction to first-order logic. Studies in Logic and the Foundations of Mathematics, 90, 5–46.

  • Bastani, O., Pu, Y., & Solar-Lezama, A. (2018). Verifiable reinforcement learning via policy extraction. In NeurIPS.

  • Battaglia, P., Pascanu, R., Lai, M., et al. (2016). Interaction networks for learning about objects, relations and physics. In NeurIPS.

  • Battaglia, P. W., Hamrick, J. B., Bapst, V., et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261

  • Bear, D., Fan, C., Mrowca, D., et al. (2020). Learning physical graph representations from visual scenes. In NeurIPS.

  • Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Athena Scientific.

  • Bewley, T., & Lawry, J. (2021). TripleTree: A versatile interpretable representation of black box agents and their environments. In AAAI.

  • Bewley, T., & Lécué, F. (2022). Interpretable preference-based reinforcement learning with tree-structured reward functions. In AAMAS.

  • Beyret, B., Shafti, A., & Faisal, A. A. (2019). Dot-to-dot: Explainable hierarchical reinforcement learning for robotic manipulation. In IROS.

  • Bommasani, R., Hudson, D. A., Adeli, E., et al. (2022). On the opportunities and risks of foundation models. arXiv:2108.07258

  • Bonnefon, J., Shariff, A., & Rahwan, I. (2019). The trolley, the bull bar, and why engineers should care about the ethics of autonomous cars [point of view]. Proceedings of the IEEE, 107(3), 502–4.

  • Boutilier, C., Dearden, R., & Goldszmidt, M. (2000). Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1–2), 49–107.

  • Brunelli, R. (2009). Template matching techniques in computer vision: Theory and practice. Wiley Publishing.

  • Brunner, G., Liu, Y., Pascual, D., et al. (2020). On identifiability in transformers. In ICLR.

  • Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model compression. In KDD.

  • Burke, M., Penkov, S., & Ramamoorthy, S. (2019). From explanation to synthesis: Compositional program induction for learning from demonstration. In RSS.

  • Camacho, A., Toro Icarte, R., Klassen, T. Q., et al. (2019). LTL and beyond: Formal languages for reward function specification in reinforcement learning. In IJCAI.

  • Cao, Y., Li, Z., Yang, T., et al. (2022). GALOIS: Boosting deep reinforcement learning via generalizable logic synthesis. In NeurIPS.

  • Casper, S., Davies, X., Shi, C., et al. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv:2307.15217

  • Chang, M. B., Ullman, T., Torralba, A., et al. (2017). A compositional object-based approach to learning physical dynamics. In ICLR.

  • Chari, S., Gruen, D. M., Seneviratne, O., et al. (2020). Directions for explainable knowledge-enabled systems. arXiv:2003.07523

  • Chen, J., Li, S. E., & Tomizuka, M. (2020). Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. In ICML workshop on AI for autonomous driving.

  • Cichosz, P., & Pawełczak, L. (2014). Imitation learning of car driving skills with decision trees and random forests. International Journal of Applied Mathematics and Computer Science, 24, 579–97.

  • Cimatti, A., Pistore, M., & Traverso, P. (2008). Automated planning. In Handbook of knowledge representation.

  • Cole, J., Lloyd, J., & Ng, K. S. (2003). Symbolic learning for adaptive agents. In Annual partner conference.

  • European Commission. (2019). Ethics guidelines for trustworthy AI.

  • Coppens, Y., Efthymiadis, K., Lenaerts, T., et al. (2019). Distilling deep reinforcement learning policies in soft decision trees. In IJCAI workshop on XAI.

  • Corazza, J., Gavran, I., & Neider, D. (2022). Reinforcement learning with stochastic reward machines. In AAAI.

  • Cranmer, M., Sanchez Gonzalez, A., Battaglia, P., et al. (2020). Discovering symbolic models from deep learning with inductive biases. In NeurIPS.

  • Crawford, K., Dobbe, R., Dryer, T., et al. (2016). AI Now Report. Tech. rep., AI Now Institute.

  • Cropper, A., Dumančić, S., & Muggleton, S. H. (2020). Turning 30: New ideas in inductive logic programming. In IJCAI.

  • Cruz, F., Dazeley, R., & Vamplew, P. (2019). Memory-based explainable reinforcement learning. In Advances in artificial intelligence.

  • Daly, A., Hagendorff, T., Li, H., et al. (2019). Artificial Intelligence, Governance and Ethics: Global Perspectives. SSRN Scholarly Paper: Chinese University of Hong Kong.

  • d’Avila Garcez, A., Dutra, A. R. R., & Alonso, E. (2018). Towards Symbolic Reinforcement Learning with Common Sense. arXiv:1804.08597

  • De Raedt, L., & Kimmig, A. (2015). Probabilistic (logic) programming concepts. Machine Learning, 100(1), 5–47.

  • Dean, T., & Kanazawa, K. (1990). A model for reasoning about persistence and causation. Computational Intelligence, 5(3), 142–150.

  • Degris, T., Sigaud, O., & Wuillemin, P. H. (2006). Learning the structure of factored Markov decision processes in reinforcement learning problems. In ICML.

  • Delfosse, Q., Shindo, H., Dhami, D., et al. (2023). Interpretable and explainable logical policies via neurally guided symbolic abstraction. In NeurIPS.

  • Demeester, T., Rocktäschel, T., & Riedel, S. (2016). Lifted rule injection for relation embeddings. In EMNLP.

  • Diligenti, M., Gori, M., & Saccà, C. (2017). Semantic-based regularization for learning and inference. Artificial Intelligence, 244, 143–65.

  • Diuk, C., Cohen, A., & Littman, M. L. (2008). An object-oriented representation for efficient reinforcement learning. In ICML.

  • Donadello, I., Serafini, L., & D’Avila Garcez, A. (2017). Logic tensor networks for semantic image interpretation. In IJCAI.

  • Dong, H., Mao, J., Lin, T., et al. (2019). Neural logic machines. In ICLR.

  • Doshi-Velez, F., Kortz, M., Budish, R., et al. (2019). Accountability of AI under the law: The role of explanation. arXiv:1711.01134

  • Dragan, A. D., Lee, K. C., & Srinivasa, S. S. (2013). Legibility and predictability of robot motion. In HRI.

  • Driessens, K., & Blockeel, H. (2001). Learning digger using hierarchical reinforcement learning for concurrent goals. In EWRL.

  • Driessens, K., Ramon, J., & Gartner, T. (2006). Graph kernels and Gaussian processes for relational reinforcement learning. Machine Learning

  • Dutra, A. R., & d’Avila Garcez, A. S. (2017). A Comparison between deep Q-networks and deep symbolic reinforcement learning. In CEUR workshop proceedings.

  • Dwork, C., Hardt, M., Pitassi, T., et al. (2012). Fairness through awareness. In ICTS.

  • Džeroski, S., De Raedt, L., & Blockeel, H. (1998). Relational reinforcement learning. In ICML.

  • Džeroski, S., De Raedt, L., & Driessens, K. (2001). Relational reinforcement learning. Machine Learning, 43(1), 7–52.

  • Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. JMLR, 6, 503–556.

  • Evans, R., & Grefenstette, E. (2018). Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61, 1–64.

  • Eysenbach, B., Salakhutdinov, R. R., & Levine, S. (2019). Search on the replay buffer: Bridging planning and reinforcement learning. In NeurIPS.

  • Finn, C., Goodfellow, I., & Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In NeurIPS.

  • Finn, C., & Levine, S. (2017). Deep visual foresight for planning robot motion. In ICRA.

  • Franca, M. V. M., Zaverucha, G., & Garcez, A. (2014). Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning, 94(1), 81–104.

  • Francois-Lavet, V., Bengio, Y., Precup, D., et al. (2019). Combined reinforcement learning via abstract representations. In AAAI.

  • Friedler, S. A., Scheidegger, C., & Venkatasubramanian, S. (2021). The (Im)possibility of fairness: Different value systems require different mechanisms for fair decision making. Communications of the ACM, 64(4), 136–143.

  • Friedman, D., Wettig, A., & Chen, D. (2023). Learning transformer programs. In NeurIPS.

  • Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In ICML.

  • Fukuchi, Y., Osawa, M., Yamakawa, H., et al. (2017). Autonomous self-explanation of behavior for interactive reinforcement learning agents. In International conference on human agent interaction.

  • Furelos-Blanco, D., Law, M., Jonsson, A., et al. (2021). Induction and exploitation of subgoal automata for reinforcement learning. JAIR, 70, 1031–1116.

  • Gaon, M., & Brafman, R. I. (2020). Reinforcement learning with non-Markovian rewards. In AAAI.

  • Garg, S., Bajpai, A., & Mausam. (2020). Symbolic network: Generalized neural policies for relational MDPs. arXiv:2002.07375

  • Garnelo, M., Arulkumaran, K., & Shanahan, M. (2016). Towards deep symbolic reinforcement learning. In NeurIPS workshop on DRL.

  • Gilmer, J., Schoenholz, S. S., Riley, P. F., et al. (2017). Neural message passing for quantum chemistry. In ICML.

  • Gilpin, L. H., Bau, D., Yuan, B. Z., et al. (2019). Explaining explanations: An overview of interpretability of machine learning. In DSAA.

  • Glaese, A., McAleese, N., Trebacz, M., et al. (2022). Improving alignment of dialogue agents via targeted human judgements. arXiv:2209.14375

  • Glanois, C., Jiang, Z., Feng, X., et al. (2022). Neuro-symbolic hierarchical rule induction. In ICML.

  • Goel, V., Weng, J., & Poupart, P. (2018). Unsupervised video object segmentation for deep reinforcement learning. In NeurIPS.

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

  • Greydanus, S., Koul, A., Dodge, J., et al. (2018). Visualizing and understanding Atari agents. In ICML.

  • Grzes, M., & Kudenko, D. (2008). Plan-based reward shaping for reinforcement learning. In International conference intelligent systems.

  • Guestrin, C., Koller, D., Gearhart, C., et al. (2003). Generalizing plans to new environments in relational MDPs. In IJCAI.

  • Gulwani, S., Polozov, O., & Singh, R. (2017). Program synthesis. Foundations and Trends in Programming Languages, 4(1–2), 1–119.

  • Gupta, P., Puri, N., Verma, S., et al. (2020). Explain your move: Understanding agent actions using focused feature saliency. In ICLR.

  • Gupta, U. D., Talvitie, E., & Bowling, M. (2015). Policy tree: Adaptive representation for policy gradient. In AAAI.

  • Haarnoja, T., Zhou, A., Abbeel, P., et al. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML.

  • Harnad, S. (1990). The symbol grounding problem. Physica D-Nonlinear Phenomena, 42, 335–346.

  • Hasanbeig, M., Kroening, D., & Abate, A. (2020). Deep reinforcement learning with temporal logics. In Formal modeling and analysis of timed systems.

  • Hayes, B., & Shah, J. A. (2017). Improving robot controller transparency through autonomous policy explanation. In International conference on HRI.

  • Hein, D., Hentschel, A., Runkler, T., et al. (2017). Particle swarm optimization for generating interpretable fuzzy reinforcement learning policies. Engineering Applications of AI, 65, 87–98.

  • Hein, D., Udluft, S., & Runkler, T. A. (2018). Interpretable policies for reinforcement learning by genetic programming. Engineering Applications of AI, 76, 158–169.

  • Hein, D., Udluft, S., & Runkler, T. A. (2019). Generating interpretable reinforcement learning policies using genetic programming. In GECCO.

  • Henderson, P., Islam, R., Bachman, P., et al. (2018). Deep reinforcement learning that matters. In AAAI.

  • Hengst, B. (2010). Hierarchical reinforcement learning. Encyclopedia of machine learning (pp. 495–502). Springer.

  • Heuillet, A., Couthouis, F., & Díaz-Rodríguez, N. (2021). Explainability in deep reinforcement learning. Knowledge-Based Systems, 214, 106685.

  • Higgins, I., Amos, D., Pfau, D., et al. (2018). Towards a definition of disentangled representations. arXiv:1812.02230

  • Horvitz, E., & Mulligan, D. (2015). Data, privacy, and the greater good. Science, 349(6245), 253–255.

  • Huang, S., Papernot, N., Goodfellow, I., et al. (2017). Adversarial attacks on neural network policies. In ICLR workshop.

  • Hussein, A., Gaber, M. M., Elyan, E., et al. (2017). Imitation learning: A survey of learning methods. ACM Computing Surveys, 50(2), 21:1–21:35.

  • Illanes, L., Yan, X., Icarte, R. T., et al. (2020). Symbolic plans as high-level instructions for reinforcement learning. In ICAPS.

  • Iyer, R., Li, Y., Li, H., et al. (2018). Transparency and explanation in deep reinforcement learning neural networks. In AIES.

  • Jain, S., & Wallace, B. C. (2019). Attention is not explanation. In NAACL.

  • Janisch, J., Pevný, T., & Lisý, V. (2021). Symbolic relational deep reinforcement learning based on graph neural networks. arXiv:2009.12462

  • Jia, R., Jin, M., Sun, K., et al. (2019). Advanced building control via deep reinforcement learning. In Energy Procedia.

  • Jiang, Y., Yang, F., Zhang, S., et al. (2018). Integrating task-motion planning with reinforcement learning for robust decision making in mobile robots. In ICAPS.

  • Jiang, Z., & Luo, S. (2019). Neural logic reinforcement learning. In ICML.

  • Jin, M., Ma, Z., Jin, K., et al. (2022). Creativity of AI: Automatic symbolic option discovery for facilitating deep reinforcement learning. In AAAI.

  • Juozapaitis, Z., Koul, A., Fern, A., et al. (2019). Explainable reinforcement learning via reward decomposition. In IJCAI/ECAI workshop on explainable artificial intelligence.

  • Kaiser, M., Otte, C., Runkler, T., et al. (2019). Interpretable dynamics models for data-efficient reinforcement learning. In ESANN.

  • Kansky, K., Silver, T., Mély, D. A., et al. (2017). Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In ICML.

  • Kasenberg, D., & Scheutz, M. (2017). Interpretable apprenticeship learning with temporal logic specifications. In CDC.

  • Kenny, E. M., Tucker, M., & Shah, J. (2023). Towards interpretable deep reinforcement learning with human-friendly prototypes. In ICLR.

  • Kim, J., & Bansal, M. (2020). Attentional bottleneck: Towards an interpretable deep driving network. In CVPR workshop.

  • Koller, D. (1999). Probabilistic relational models. In Inductive logic programming (pp. 3–13).

  • Konidaris, G., Kaelbling, L. P., & Lozano-Perez, T. (2014). Constructing symbolic representations for high-level planning. In AAAI.

  • Konidaris, G., Kaelbling, L. P., & Lozano-Perez, T. (2015). Symbol acquisition for probabilistic high-level planning. In IJCAI.

  • Konidaris, G., Kaelbling, L. P., & Lozano-Perez, T. (2018). From skills to symbols: Learning symbolic representations for abstract high-level planning. JAIR, 61, 215–289.

  • Koul, A., Greydanus, S., & Fern, A. (2019). Learning finite state representations of recurrent policy networks. In ICLR.

  • Kulick, J., Toussaint, M., Lang, T., et al. (2013). Active learning for teaching a robot grounded relational symbols. In IJCAI.

  • Kunapuli, G., Odom, P., Shavlik, J. W., et al. (2013). Guiding autonomous agents to better behaviors through human advice. In ICDM.

  • Kwon, M., Xie, S. M., Bullard, K., et al. (2023). Reward design with language models. In ICLR.

  • Lao, N., & Cohen, W. W. (2010). Relational retrieval using a combination of path-constrained random walks. In Machine learning.

  • Leonetti, M., Iocchi, L., & Stone, P. (2016). A synthesis of automated planning and reinforcement learning for efficient, robust decision-making. Artificial Intelligence, 241, 103–130.

  • Leslie, D. (2020). Understanding artificial intelligence ethics and safety: A guide for the responsible design and implementation of AI systems in the public sector. SSRN Electronic Journal

  • Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909

  • Li, X., Serlin, Z., Yang, G., et al. (2019). A formal methods approach to interpretable reinforcement learning for robotic planning. Science Robotics, 4(37), eaay6276.

  • Li, X., Vasile, C. I., & Belta, C. (2017a). Reinforcement learning with temporal logic rewards. In IROS.

  • Li, Y., Sycara, K., & Iyer, R. (2017b). Object-sensitive deep reinforcement learning. In Global conference on AI.

  • Li, Y., Tarlow, D., Brockschmidt, M., et al. (2017c). Gated graph sequence neural networks. In ICLR.

  • Likmeta, A., Metelli, A. M., Tirinzoni, A., et al. (2020). Combining reinforcement learning with rule-based controllers for transparent and general decision-making in autonomous driving. Robotics and Autonomous Systems, 131, 103568.

  • Lim, B. Y., Yang, Q., Abdul, A., et al. (2019). Why these explanations? Selecting intelligibility types for explanation goals. In IUI workshops.

  • Lipton, Z. C. (2017). The mythos of model interpretability. arXiv:1606.03490

  • Littman, M. L., Topcu, U., Fu, J., et al. (2017). Environment-independent task specifications via GLTL. arXiv:1704.04341

  • Liu, G., Schulte, O., Zhu, W., et al. (2018). Toward interpretable deep reinforcement learning with linear model U-trees. In ECML.

  • Liu, Y., Han, T., Ma, S., et al. (2023). Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology, 1(2), 100017.

  • Lo Piano, S. (2020). Ethical principles in machine learning and artificial intelligence: Cases from the field and possible ways forward. Humanities and Social Sciences Communications, 7(1), 1–7.

  • Lu, K., Zhang, S., Stone, P., et al. (2018). Robot representation and reasoning with knowledge from reinforcement learning. arXiv:1809.11074

  • Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In NeurIPS.

  • Lyu, D., Yang, F., Liu, B., et al. (2019). SDRL: Interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In AAAI.

  • Ma, Z., Zhuang, Y., Weng, P., et al. (2020). Interpretable reinforcement learning with neural symbolic logic. arXiv:2103.08228

  • Maclin, R., & Shavlik, J. W. (1996). Creating advice-taking reinforcement learners. Machine Learning, 22, 251–282.

  • Madumal, P., Miller, T., Sonenberg, L., et al. (2020a). Distal explanations for model-free explainable reinforcement learning. arXiv:2001.10284

  • Madumal, P., Miller, T., Sonenberg, L., et al. (2020b). Explainable reinforcement learning through a causal lens. In AAAI.

  • Maes, F., Fonteneau, R., Wehenkel, L., et al. (2012a). Policy search in a space of simple closed-form formulas: Towards interpretability of reinforcement learning. In Discovery science.

  • Maes, F., Wehenkel, L., & Ernst, D. (2012b). Automatic discovery of ranking formulas for playing with multi-armed bandits. In Recent advances in reinforcement learning.

  • Maes, P., Mataric, M. J., Meyer, J. A., et al. (1996). Learning to use selective attention and short-term memory in sequential tasks. In International conference on simulation of adaptive behavior.

  • Mania, H., Guy, A., & Recht, B. (2018). Simple random search of static linear policies is competitive for reinforcement learning. In NeurIPS.

  • Marom, O., & Rosman, B. (2018). Zero-shot transfer with deictic object-oriented representation in reinforcement learning. In NeurIPS.

  • Martínez, D., Alenyà, G., Torras, C., et al. (2016). Learning relational dynamics of stochastic domains for planning. In ICAPS.

  • Martínez, D., Alenyà, G., Ribeiro, T., et al. (2017). Relational reinforcement learning for planning with exogenous effects. Journal of Machine Learning Research, 18(78), 1–44.

  • Martínez, D., Alenyà, G., & Torras, C. (2017). Relational reinforcement learning with guided demonstrations. Artificial Intelligence, 247, 295–312.

  • Mehrabi, N., Morstatter, F., Saxena, N., et al. (2019). A survey on bias and fairness in machine learning. arXiv:1908.09635

  • Metzen, J. H. (2013). Learning graph-based representations for continuous reinforcement learning domains. In ECML.

  • Michels, J., Saxena, A., & Ng, A. Y. (2005). High speed obstacle avoidance using monocular vision and reinforcement learning. In ICML.

  • Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1–38.

  • Minervini, P., Demeester, T., Rocktäschel, T., et al. (2017). Adversarial sets for regularising neural link predictors. In UAI.

  • Mittelstadt, B., Russell, C., & Wachter, S. (2019). Explaining explanations in AI. In Conference on fairness, accountability, and transparency.

  • Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533.

  • Mohseni, S., Zarei, N., & Ragan, E. D. (2020). A multidisciplinary survey and framework for design and evaluation of explainable AI systems. arXiv:1811.11839

  • Molnar, C. (2019). Interpretable machine learning: A guide for making black box models explainable.

  • Morley, J., Floridi, L., Kinsey, L., et al. (2020). From what to how: An initial review of publicly available AI ethics tools, methods and research to translate principles into practices. Science and Engineering Ethics, 26(4), 2141–68.

  • Mott, A., Zoran, D., Chrzanowski, M., et al. (2019). Towards interpretable reinforcement learning using attention augmented agents. In NeurIPS.

  • Munzer, T., Piot, B., Geist, M., et al. (2015). Inverse reinforcement learning in relational domains. In IJCAI.

  • Nageshrao, S., Costa, B., & Filev, D. (2019). Interpretable approximation of a deep reinforcement learning agent as a set of if-then rules. In ICMLA.

  • Natarajan, S., Joshi, S., Tadepalli, P., et al. (2011). Imitation learning in relational domains: A functional-gradient boosting approach. In IJCAI.

  • Ng, A. Y., & Russell, S. (2000). Algorithms for inverse reinforcement learning. In ICML.

  • OpenAI, Akkaya, I., Andrychowicz, M., et al. (2019). Solving Rubik’s Cube with a Robot Hand. arXiv:1910.07113

  • OpenAI, Achiam, J., et al. (2023). GPT-4 technical report. arXiv:2303.08774

  • Osa, T., Pajarinen, J., Neumann, G., et al. (2018). Algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1–2), 1–179.

  • Pace, A., Chan, A., & van der Schaar, M. (2022). POETREE: Interpretable policy learning with adaptive decision trees. In ICLR.

  • Páez, A. (2019). The pragmatic turn in explainable artificial intelligence (XAI). Minds and Machines, 29(3), 441–459.

  • Paischer, F., Adler, T., Hofmarcher, M., et al. (2023). Semantic HELM: A human-readable memory for reinforcement learning. In NeurIPS.

  • Pasula, H. M., Zettlemoyer, L. S., & Kaelbling, L. P. (2007). Learning symbolic models of stochastic domains. In JAIR.

  • Payani, A., & Fekri, F. (2019a). Inductive logic programming via differentiable deep neural logic networks. arXiv:1906.03523

  • Payani, A., & Fekri, F. (2019b). Learning algorithms via neural logic networks. arXiv:1904.01554

  • Payani, A., & Fekri, F. (2020). Incorporating Relational Background Knowledge into Reinforcement Learning via Differentiable Inductive Logic Programming. arXiv:2003.10386

  • Penkov, S., & Ramamoorthy, S. (2019). Learning programmatically structured representations with perceptor gradients. In ICLR.

  • Plumb, G., Al-Shedivat, M., Cabrera, A. A., et al. (2020). Regularizing black-box models for improved interpretability. arXiv:1902.06787

  • Pomerleau, D. (1989). ALVINN: An autonomous land vehicle in a neural network. In NeurIPS.

  • Puiutta, E., & Veith, E. M. (2020). Explainable reinforcement learning: A survey. In LNCS.

  • Puterman, M. (1994). Markov decision processes: Discrete stochastic dynamic programming. Wiley.

  • Qiu, W., & Zhu, H. (2022). Programmatic reinforcement learning without oracles. In ICLR.

  • Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct preference optimization: Your language model is secretly a reward model. In NeurIPS.

  • Raji, I. D., Smart, A., White, R. N., et al. (2020). Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. arXiv:2001.00973

  • Ramesh, A., Pavlov, M., Goh, G., et al. (2021). Zero-shot text-to-image generation. arXiv:2102.12092

  • Randlov, J., & Alstrom, P. (1998). Learning to drive a bicycle using reinforcement learning and shaping. In ICML.

  • Redmon, J., Divvala, S., & Girshick, R., et al. (2016). You only look once: Unified, real-time object detection. In CVPR.

  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016a). Model-Agnostic Interpretability of Machine Learning. In ICML workshop on human interpretability in ML.

  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016b). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In KDD.

  • Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation extraction. In Human language technologies.

  • Rombach, R., Blattmann, A., & Lorenz, D., et al. (2022). High-resolution image synthesis with latent diffusion models. In CVPR.

  • Ross, S., Gordon, G. J., & Bagnell, J. A. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS.

  • Roth, A. M., Topin, N., & Jamshidi, P., et al. (2019). Conservative Q-Improvement: Reinforcement Learning for an Interpretable Decision-Tree Policy. arXiv:1907.01180

  • Rothkopf, C. A., & Dimitrakakis, C. (2011). Preference elicitation and inverse reinforcement learning. In ECML.

  • Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.

  • Rudin, C., & Carlson, D. (2019). The secrets of machine learning: ten things you wish you had known earlier to be more effective at data analysis. In Operations research & management science in the age of analytics (pp. 44–72).

  • Russell, S. (1998). Learning agents for uncertain environments. In COLT.

  • Rusu, A. A., Colmenarejo, S. G., Gülçehre, Ç., et al. (2016). Policy distillation. In ICLR.

  • Sanchez-Gonzalez, A., Heess, N., & Springenberg, J. T., et al. (2018). Graph networks as learnable physics engines for inference and control. In ICML.

  • Sanner, S. (2005). Simultaneous learning of structure and value in relational reinforcement learning. In ICML workshop on rich representations for RL.

  • Sanner, S. (2011). Relational dynamic influence diagram language (RDDL): Language description. In International planning competition.

  • Santoro, A., Raposo, D., Barrett, D. G. T., et al. (2017). A simple neural network module for relational reasoning. In NeurIPS.

  • Scarselli, F., Gori, M., Tsoi, A. C., et al. (2009). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61–80.

  • Scholz, J., Levihn, M., & Isbell, C. L., et al. (2014). A physics-based model prior for object-oriented MDPs. In ICML.

  • Schulman, J., Wolski, F., & Dhariwal, P., et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347

  • Sequeira, P., & Gervasio, M. (2020). Interestingness elements for explainable reinforcement learning: Understanding agents’ capabilities and limitations. Artificial Intelligence, 288, 103367.

  • Serafini, L., & d’Avila Garcez, A. (2016). Logic tensor networks: Deep learning and logical reasoning from data and knowledge. In CEUR workshop.

  • Shi, W., Huang, G., & Song, S., et al. (2020). Self-supervised discovering of interpretable features for reinforcement learning. arXiv:2003.07069

  • Shu, T., Xiong, C., & Socher, R. (2018). Hierarchical and interpretable skill acquisition in multi-task reinforcement learning. In ICLR.

  • Silva, A., & Gombolay, M. (2020). Neural-encoding Human Experts’ Domain Knowledge to Warm Start Reinforcement Learning. arXiv:1902.06007

  • Silva, A., Gombolay, M., & Killian, T., et al. (2020). Optimization methods for interpretable differentiable decision trees applied to reinforcement learning. In AISTATS.

  • Silver, D., Schrittwieser, J., Simonyan, K., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550, 354–359.

  • Singh, C., Askari, A., Caruana, R., et al. (2023). Augmenting interpretable models with large language models during training. Nature Communications, 14, 7913.

  • Slaney, J., & Thiébaux, S. (2001). Blocks world revisited. Artificial Intelligence, 125(1–2), 119–153.

  • Sridharan, M., Gelfond, M., Zhang, S., et al. (2019). REBA: A refinement-based architecture for knowledge representation and reasoning in robotics. JAIR, 65, 87–180.

  • Srinivasan, S., & Doshi-Velez, F. (2020). Interpretable batch IRL to extract clinician goals in ICU hypotension management. In AMIA joint summits on translational science.

  • Sun, S. H., Wu, T. L., & Lim, J. J. (2020). Program guided agent. In ICLR.

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

  • Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.

  • Swain, M. (2013). Knowledge Representation. In Encyclopedia of Systems Biology (pp. 1082–1084).

  • Tang, Y., Nguyen, D., & Ha, D. (2020). Neuroevolution of self-interpretable agents. In GECCO.

  • Tasse, G. N., James, S., & Rosman, B. (2020). A boolean task algebra for reinforcement learning. In NeurIPS.

  • Tasse, G. N., James, S., & Rosman, B. (2022). Generalisation in lifelong reinforcement learning through logical composition. In ICLR.

  • Todorov, E. (2009). Compositionality of optimal control laws. In NeurIPS.

  • Topin, N., & Veloso, M. (2019). Generation of policy-level explanations for reinforcement learning. In AAAI.

  • Topin, N., Milani, S., & Fang, F., et al. (2021). Iterative bounding MDPs: Learning interpretable policies via non-interpretable methods. In AAAI.

  • Toro Icarte, R., Klassen, T., & Valenzano, R., et al. (2018a). Using reward machines for high-level task specification and decomposition in reinforcement learning. In ICML.

  • Toro Icarte, R., Klassen, T. Q., & Valenzano, R., et al. (2018b). Teaching multiple tasks to an RL agent using LTL. In AAMAS.

  • Toro Icarte, R., Waldie, E., & Klassen, T., et al. (2019). Learning reward machines for partially observable reinforcement learning. In NeurIPS.

  • Torrey, L., & Taylor, M. E. (2013). Teaching on a budget: Agents advising agents in reinforcement learning. In AAMAS.

  • van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. JMLR, 9(86), 2579–2605.

  • van der Waa, J., van Diggelen, J., van den Bosch, K., et al. (2018). Contrastive explanations for reinforcement learning in terms of expected consequences. In IJCAI workshop on XAI.

  • van Otterlo, M. (2005). A survey of reinforcement learning in relational domains. Tech. rep., CTIT Technical Report Series.

  • van Otterlo, M. (2009). The logic of adaptive behavior: Knowledge representation and algorithms for adaptive sequential decision making under uncertainty in first-order and relational domains. IOS Press.

  • van Otterlo, M. (2012). Solving relational and first-order logical Markov decision processes: A survey. In M. Wiering & M. van Otterlo (Eds.), Reinforcement learning (Vol. 12, pp. 253–292). Berlin Heidelberg: Springer.

  • Vasic, M., Petrovic, A., & Wang, K., et al. (2019). MoET: Interpretable and verifiable reinforcement learning via mixture of expert trees. arXiv:1906.06717

  • Vaswani, A., Shazeer, N., & Parmar, N., et al. (2017). Attention is all you need. In NeurIPS.

  • Veerapaneni, R., Co-Reyes, J. D., & Chang, M., et al. (2020). Entity abstraction in visual model-based reinforcement learning. In CoRL.

  • Verma, A., Murali, V., & Singh, R., et al. (2018). Programmatically interpretable reinforcement learning. In ICML.

  • Verma, A., Le, H. M., & Yue, Y., et al. (2019). Imitation-projected programmatic reinforcement learning. In NeurIPS.

  • Vinyals, O., Ewalds, T., & Bartunov, S., et al. (2017). StarCraft II: A new challenge for reinforcement learning. arXiv:1708.04782

  • Vinyals, O., Babuschkin, I., Czarnecki, W. M., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354.

  • Viola, P., & Jones, M. (2001). Robust real-time object detection. In International journal of computer vision.

  • Walker, T., Shavlik, J., & Maclin, R. (2004). Relational reinforcement learning via sampling the space of first-order conjunctive features. In ICML workshop on relational reinforcement learning.

  • Walker, T., Torrey, L., & Shavlik, J., et al. (2008). Building relational world models for reinforcement learning. In LNCS.

  • Walsh, J. (2010). Efficient learning of relational models for sequential decision making. PhD thesis, Rutgers.

  • Wang, T., Liao, R., & Fidler, S. (2018). NerveNet: Learning Structured Policy with Graph Neural Networks. In ICLR.

  • Wang, W., & Pan, S. J. (2019). Integrating deep learning with logic fusion for information extraction. In AAAI.

  • Wang, Y., Mase, M., & Egi, M. (2020). Attribution-based salience method towards interpretable reinforcement learning. In Spring symposium on combining ML and knowledge engineering in practice.

  • Weng, P., Busa-Fekete, R., & Hüllermeier, E. (2013). Interactive Q-learning with ordinal rewards and unreliable tutor. In ECML workshop on RL with generalized feedback.

  • Whittlestone, J., Arulkumaran, K., & Crosby, M. (2021). The societal implications of deep reinforcement learning. JAIR, 70, 1003–1030.

  • Wiegreffe, S., & Pinter, Y. (2019). Attention is not not Explanation. In EMNLP.

  • Wiener, N. (1954). The human use of human beings. Houghton Mifflin.

  • Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP).

  • Wu, B., Gupta, J. K., & Kochenderfer, M. J. (2019a). Model primitive hierarchical lifelong reinforcement learning. In AAMAS.

  • Wu, M., Parbhoo, S., & Hughes, M. C., et al. (2019b). Optimizing for interpretability in deep neural networks with tree regularization. arXiv:1908.05254

  • Wu, Z., Geiger, A., & Potts, C., et al. (2023). Interpretability at scale: Identifying causal mechanisms in alpaca. In NeurIPS.

  • Xu, J., Zhang, Z., & Friedman, T., et al. (2018). A semantic loss function for deep learning with symbolic knowledge. In ICML.

  • Xu, Z., Gavran, I., & Ahmad, Y., et al. (2020). Joint inference of reward machines and policies for reinforcement learning. In ICAPS.

  • Yang, F., Yang, Z., & Cohen, W. W. (2017). Differentiable learning of logical rules for knowledge base reasoning. In NeurIPS.

  • Yang, F., Lyu, D., Liu, B., et al. (2018a). PEORL: Integrating symbolic planning and hierarchical reinforcement learning for robust decision-making. In IJCAI.

  • Yang, Y., & Song, L. (2019). Learn to explain efficiently via neural logic inductive learning. In ICLR.

  • Yang, Y., Morillo, I. G., & Hospedales, T. M. (2018b). Deep neural decision trees. In ICML workshop on human interpretability in ML.

  • Younes, L. (2004). PPDDL1.0: The language for the probabilistic part of IPC-4.

  • Yu, H., Shen, Z., & Miao, C., et al. (2018). Building ethics into artificial intelligence. In IJCAI.

  • Zahavy, T., Ben-Zrihem, N., & Mannor, S. (2016). Graying the black box: Understanding DQNs. In ICML.

  • Zambaldi, V., Raposo, D., & Santoro, A., et al. (2019). Deep reinforcement learning with relational inductive biases. In ICLR.

  • Zhang, A., Sukhbaatar, S., & Lerer, A., et al. (2018a). Composable planning with attributes. In ICML.

  • Zhang, C., Vinyals, O., & Munos, R., et al. (2018b). A Study on Overfitting in Deep Reinforcement Learning. arXiv:1804.06893

  • Zhang, H., Gao, Z., & Zhou, Y., et al. (2019). Faster and Safer Training by Embedding High-Level Knowledge into Deep Reinforcement Learning. arXiv:1910.09986

  • Zhang, S., & Sridharan, M. (2020). A Survey of Knowledge-based Sequential Decision Making under Uncertainty. arXiv:2008.08548

  • Zhang, Y., Lee, J. D., & Jordan, M. I. (2016). L1-regularized neural networks are improperly learnable in polynomial time. In ICML.

  • Zhu, G., Huang, Z., & Zhang, C. (2018). Object-oriented dynamics predictor. In NeurIPS.

  • Zhu, G., Wang, J., & Ren, Z., et al. (2020). Object-oriented dynamics learning through multi-level abstraction. In AAAI.

  • Zhu, H., Magill, S., & Xiong, Z., et al. (2019). An inductive synthesis framework for verifiable reinforcement learning. In ACM SIGPLAN conference on PLDI.

  • Zimmer, M., Viappiani, P., & Weng, P. (2014). Teacher-student framework: A reinforcement learning approach. In AAMAS workshop on autonomous robots and multirobot systems.

  • Zimmer, M., Feng, X., & Glanois, C., et al. (2021). Differentiable logic machines. arXiv:2102.11529


This research work was funded by Huawei Technology Ltd.

Author information

All the authors participated in the initial discussions to define the scope of the survey and find the relevant papers. Once defined, the first three authors wrote the major part of the survey. DL and TY helped with improving earlier versions of this manuscript.

Corresponding author

Correspondence to Paul Weng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Editor: Bo Liu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Glanois, C., Weng, P., Zimmer, M. et al. A survey on interpretable reinforcement learning. Mach Learn (2024).
