Abstract
Despite the fact that deep reinforcement learning (RL) has surpassed human-level performance in various tasks, it still faces several fundamental challenges. First, most RL methods require extensive data from exploration of the environment to achieve satisfactory performance. Second, the use of neural networks in RL makes it hard to interpret the internals of the system in a way that humans can understand. To address these two challenges, we propose a framework that enables an RL agent to reason over its exploration process and distill high-level knowledge for effectively guiding its future explorations. Specifically, we propose a novel RL algorithm that learns high-level knowledge in the form of a finite reward automaton by using the L* learning algorithm. We prove that in episodic RL, a finite reward automaton can express any non-Markovian bounded reward function with finitely many reward values and can approximate any non-Markovian bounded reward function (with infinitely many reward values) to arbitrary precision. We also provide a lower bound on the episode length such that the proposed RL approach almost surely converges to an optimal policy in the limit. We test this approach on two RL environments with non-Markovian reward functions, choosing a variety of tasks with increasing complexity for each environment. We compare our algorithm with state-of-the-art RL algorithms for non-Markovian reward functions, such as Joint Inference of Reward Machines and Policies for RL (JIRP), Learning Reward Machines (LRM), and Proximal Policy Optimization (PPO2). Our results show that our algorithm converges to an optimal policy faster than the baseline methods.
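To make the notion of a finite reward automaton concrete, the following minimal Python sketch (not the authors' implementation; the class, the 'coffee'/'office' labels, and the reward values are illustrative assumptions) shows how such an automaton assigns rewards to sequences of high-level events rather than to individual states, which is what lets it capture non-Markovian rewards.

```python
# Illustrative sketch of a finite reward automaton (hypothetical example,
# not the authors' code): rewards depend on the history of observed labels.

class FiniteRewardAutomaton:
    def __init__(self, states, initial_state, transitions, rewards):
        # transitions: (state, label) -> next state
        # rewards:     (state, label) -> reward emitted on that transition
        self.states = states
        self.initial_state = initial_state
        self.transitions = transitions
        self.rewards = rewards

    def run(self, label_sequence):
        """Return the cumulative reward of an episode's label sequence."""
        state, total = self.initial_state, 0.0
        for label in label_sequence:
            total += self.rewards.get((state, label), 0.0)
            state = self.transitions.get((state, label), state)
        return total


# Hypothetical task: reward 1 only after observing 'coffee' and then 'office'.
fra = FiniteRewardAutomaton(
    states={"start", "has_coffee", "done"},
    initial_state="start",
    transitions={
        ("start", "coffee"): "has_coffee",
        ("has_coffee", "office"): "done",
    },
    rewards={("has_coffee", "office"): 1.0},
)

print(fra.run(["coffee", "office"]))   # 1.0: coffee picked up, then delivered
print(fra.run(["office", "coffee"]))   # 0.0: wrong order, no reward
```

This sketch only illustrates the reward semantics. In reward-machine-based RL (e.g., JIRP), the automaton state is typically paired with the environment state so that standard Q-learning applies over the product; the learning loop that actively infers the automaton with L* is not shown here.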
References
Aksaray, D., Jones, A., Kong, Z., Schwager, M., Belta, C.: Q-learning for robust satisfaction of signal temporal logic specifications. In: Proceedings of the IEEE CDC 2016, pp. 6565–6570 (2016)
Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: AAAI 2018 (2018)
Andreas, J., Klein, D., Levine, S.: Modular multitask reinforcement learning with policy sketches. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 166–175. JMLR.org (2017)
Angluin, D.: Learning regular sets from queries and counterexamples. Inf. Comput. 75(2), 87–106 (1987)
Baharisangari, N., Gaglione, J.R., Neider, D., Topcu, U., Xu, Z.: Uncertainty-aware signal temporal logic inference (2021). https://arxiv.org/abs/2105.11545
Bollig, B., Katoen, J.-P., Kern, C., Leucker, M., Neider, D., Piegdon, D.R.: The automata learning framework. In: Touili, T., Cook, B., Jackson, P. (eds.) CAV 2010. LNCS, vol. 6174, pp. 360–364. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14295-6_32
Bombara, G., Vasile, C.I., Penedo, F., Yasuoka, H., Belta, C.: A decision tree approach to data classification using signal temporal logic. In: Proceedings of the HSCC 2016, pp. 1–10 (2016)
Cai, M., Hasanbeig, M., Xiao, S., Abate, A., Kan, Z.: Modular deep reinforcement learning for continuous motion planning with temporal logic (2021)
Fu, J., Topcu, U.: Probably approximately correct MDP learning and control with temporal logic constraints. Robotics: Science and Systems. abs/1404.7073 (2014)
Furelos-Blanco, D., Law, M., Russo, A., Broda, K., Jonsson, A.: Induction of subgoal automata for reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3890–3897, April 2020. https://ojs.aaai.org/index.php/AAAI/article/view/5802
Gaglione, J.R., Neider, D., Roy, R., Topcu, U., Xu, Z.: Learning linear temporal properties from noisy data: a MaxSAT-based approach. In: ATVA 2021, Gold Coast, Australia, 18–22 October 2021. Lecture Notes in Computer Science. Springer (2021)
Gaon, M., Brafman, R.: Reinforcement learning with non-Markovian rewards. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3980–3987, April 2020. https://ojs.aaai.org/index.php/AAAI/article/view/5814
Holzinger, A., Malle, B., Saranti, A., Pfeifer, B.: Towards multi-modal causability with graph neural networks enabling information fusion for explainable AI. Inf. Fus. 71, 28–37 (2021). https://www.sciencedirect.com/science/article/pii/S1566253521000142
Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation, 3rd edn. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (2006)
Hoxha, B., Dokhanchi, A., Fainekos, G.: Mining parametric temporal logic properties in model-based design for cyber-physical systems. Int. J. Softw. Tools Technol. Transfer 20(1), 79–93 (2017). https://doi.org/10.1007/s10009-017-0447-4
Toro Icarte, R., Waldie, E., Klassen, T., Valenzano, R., Castro, M., McIlraith, S.: Learning reward machines for partially observable reinforcement learning. In: NeurIPS 2019 (2019)
Toro Icarte, R., Klassen, T.Q., Valenzano, R.A., McIlraith, S.A.: Using reward machines for high-level task specification and decomposition in reinforcement learning. In: ICML 2018, Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018, pp. 2112–2121 (2018)
Kong, Z., Jones, A., Belta, C.: Temporal logics for learning and detection of anomalous behavior. IEEE Trans. Autom. Control 62(3), 1210–1222 (2017)
Li, X., Vasile, C.-I., Belta, C.: Reinforcement learning with temporal logic rewards. In: Proceedings of the IROS 2017, September 2017, pp. 3834–3839 (2017)
Neider, D., Gaglione, J.R., Gavran, I., Topcu, U., Wu, B., Xu, Z.: Advice-guided reinforcement learning in a non-Markovian environment. In: AAAI 2021 (2021)
Neider, D., Gavran, I.: Learning linear temporal properties. In: Formal Methods in Computer Aided Design (FMCAD) 2018, pp. 1–10 (2018)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms (2017). https://arxiv.org/abs/1707.06347
Shah, A., Kamath, P., Shah, J.A., Li, S.: Bayesian inference of temporal task specifications from demonstrations. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) NeurIPS 2018, pp. 3808–3817. Curran Associates Inc. (2018). http://papers.nips.cc/paper/7637-bayesian-inference-of-temporal-task-specifications-from-demonstrations.pdf
Toro Icarte, R., Klassen, T.Q., Valenzano, R., McIlraith, S.A.: Teaching multiple tasks to an RL agent using LTL. In: AAMAS 2018, Richland, SC, pp. 452–461 (2018)
Vazquez-Chanlatte, M., Jha, S., Tiwari, A., Ho, M.K., Seshia, S.A.: Learning task specifications from demonstrations. In: Proceedings of the NeurIPS 2018, pp. 5372–5382 (2018)
Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3), 279–292 (1992). https://doi.org/10.1007/BF00992698
Wen, M., Papusha, I., Topcu, U.: Learning from demonstrations with high-level side information. In: Proceedings of the IJCAI 2017, pp. 3055–3061 (2017)
Wu, B., Lin, H.: Counterexample-guided permissive supervisor synthesis for probabilistic systems through learning. In: American Control Conference (ACC) 2015, pp. 2894–2899. IEEE (2015)
Wu, B., Zhang, X., Lin, H.: Permissive supervisor synthesis for Markov decision processes through learning. IEEE Trans. Autom. Control 64(8), 3332–3338 (2018)
Zhang, X., Wu, B., Lin, H.: Supervisor synthesis of POMDP based on automata learning. Automatica (2021, to appear). https://arxiv.org/abs/1703.08262
Xu, Z., Birtwistle, M., Belta, C., Julius, A.: A temporal logic inference approach for model discrimination. IEEE Life Sci. Lett. 2(3), 19–22 (2016)
Xu, Z., Belta, C., Julius, A.: Temporal logic inference with prior information: an application to robot arm movements. In: Proceedings of the Analysis and Design of Hybrid Systems, vol. 48, no. 27, Atlanta, GA, USA, October 2015, pp. 141–146 (2015)
Xu, Z., et al.: Joint inference of reward machines and policies for reinforcement learning. In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 30, pp. 590–598 (2020)
Xu, Z., Julius, A.: Census signal temporal logic inference for multiagent group behavior analysis. IEEE Trans. Autom. Sci. Eng. 15(1), 264–277 (2018)
Xu, Z., Ornik, M., Julius, A.A., Topcu, U.: Information-guided temporal logic inference with prior knowledge. In: Proceedings of the 2019 American Control Conference (ACC), pp. 1891–1897. IEEE (2019). https://arxiv.org/abs/1811.08846
Xu, Z., Saha, S., Hu, B., Mishra, S., Julius, A.: Advisory temporal logic inference and controller design for semiautonomous robots. IEEE Trans. Autom. Sci. Eng. 16, 1–19 (2018)
Xu, Z., Topcu, U.: Transfer of temporal logic formulas in reinforcement learning. In: Proceedings of the IJCAI 2019, pp. 4010–4018 (2019). https://doi.org/10.24963/ijcai.2019/557
Acknowledgment
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0032, ARL W911NF2020132, ARL ACC-APG-RTP W911NF, NSF 1646522, and Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - grant no. 434592664. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.
Copyright information
© 2021 IFIP International Federation for Information Processing
Cite this paper
Xu, Z., Wu, B., Ojha, A., Neider, D., Topcu, U. (2021). Active Finite Reward Automaton Inference and Reinforcement Learning Using Queries and Counterexamples. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2021. Lecture Notes in Computer Science, vol. 12844. Springer, Cham. https://doi.org/10.1007/978-3-030-84060-0_8