Active Finite Reward Automaton Inference and Reinforcement Learning Using Queries and Counterexamples

Part of the Lecture Notes in Computer Science book series (LNISA, volume 12844)

Abstract

Despite the fact that deep reinforcement learning (RL) has surpassed human-level performance in various tasks, it still faces several fundamental challenges. First, most RL methods require large amounts of data from exploration of the environment to achieve satisfactory performance. Second, the use of neural networks in RL makes it hard to interpret the internals of the system in a way that humans can understand. To address these two challenges, we propose a framework that enables an RL agent to reason over its exploration process and distill high-level knowledge for effectively guiding its future explorations. Specifically, we propose a novel RL algorithm that learns high-level knowledge in the form of a finite reward automaton using the L* learning algorithm. We prove that in episodic RL, a finite reward automaton can express any non-Markovian bounded reward function with finitely many reward values and can approximate any non-Markovian bounded reward function (with infinitely many reward values) to arbitrary precision. We also provide a lower bound on the episode length such that the proposed RL approach almost surely converges to an optimal policy in the limit. We evaluate this approach on two RL environments with non-Markovian reward functions, choosing a variety of tasks with increasing complexity for each environment. We compare our algorithm with state-of-the-art RL algorithms for non-Markovian reward functions, such as Joint Inference of Reward Machines and Policies (JIRP), Learning Reward Machines (LRM), and Proximal Policy Optimization (PPO2). Our results show that our algorithm converges to an optimal policy faster than the baseline methods.
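To make the core idea concrete, the minimal Python sketch below illustrates how a finite reward automaton, once inferred, can guide Q-learning: the automaton runs in lockstep with the environment, and Q-values are learned over the product of environment and automaton states, which makes the otherwise non-Markovian reward Markovian again. This is an illustrative sketch only, not the authors' implementation; the RewardAutomaton class, the environment interface (reset, step returning a high-level event label, and an actions list), and all hyperparameters are assumptions. The full algorithm additionally infers the automaton itself via L*-style membership and equivalence queries, answered from observed traces and counterexamples.

import random
from collections import defaultdict

class RewardAutomaton:
    """Finite reward automaton (illustrative): finitely many states,
    transitions over a finite alphabet of high-level event labels, and a
    reward emitted on each transition, which encodes a non-Markovian
    reward over label sequences."""
    def __init__(self, init_state, delta, rho):
        self.init_state = init_state
        self.delta = delta  # (automaton state, label) -> next automaton state
        self.rho = rho      # (automaton state, label) -> scalar reward

    def step(self, v, label):
        return self.delta[(v, label)], self.rho[(v, label)]

def q_learning_with_automaton(env, fra, episodes=1000,
                              alpha=0.1, gamma=0.9, eps=0.1):
    # Learn Q-values on the product of environment state and automaton
    # state: conditioning on the automaton state restores the Markov
    # property for the reward signal.
    Q = defaultdict(float)
    for _ in range(episodes):
        s, v = env.reset(), fra.init_state
        done = False
        while not done:
            if random.random() < eps:                       # epsilon-greedy
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, v, x)])
            s2, label, done = env.step(a)  # label: event observed this step
            v2, r = fra.step(v, label)     # reward comes from the automaton
            target = 0.0 if done else max(Q[(s2, v2, x)] for x in env.actions)
            Q[(s, v, a)] += alpha * (r + gamma * target - Q[(s, v, a)])
            s, v = s2, v2
    return Q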

References

  1. Aksaray, D., Jones, A., Kong, Z., Schwager, M., Belta, C.: Q-learning for robust satisfaction of signal temporal logic specifications. In: IEEE Conference on Decision and Control (CDC), pp. 6565–6570 (2016)

  2. Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: AAAI 2018 (2018)

  3. Andreas, J., Klein, D., Levine, S.: Modular multitask reinforcement learning with policy sketches. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 166–175. JMLR.org (2017)

  4. Angluin, D.: Learning regular sets from queries and counterexamples. Inf. Comput. 75(2), 87–106 (1987)

  5. Baharisangari, N., Gaglione, J.R., Neider, D., Topcu, U., Xu, Z.: Uncertainty-aware signal temporal logic inference (2021). https://arxiv.org/abs/2105.11545

  6. Bollig, B., Katoen, J.-P., Kern, C., Leucker, M., Neider, D., Piegdon, D.R.: The automata learning framework. In: Touili, T., Cook, B., Jackson, P. (eds.) CAV 2010. LNCS, vol. 6174, pp. 360–364. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14295-6_32

  7. Bombara, G., Vasile, C.I., Penedo, F., Yasuoka, H., Belta, C.: A decision tree approach to data classification using signal temporal logic. In: Proceedings of the HSCC 2016, pp. 1–10 (2016)

  8. Cai, M., Hasanbeig, M., Xiao, S., Abate, A., Kan, Z.: Modular deep reinforcement learning for continuous motion planning with temporal logic (2021)

  9. Fu, J., Topcu, U.: Probably approximately correct MDP learning and control with temporal logic constraints. In: Robotics: Science and Systems (2014). https://arxiv.org/abs/1404.7073

  10. Furelos-Blanco, D., Law, M., Russo, A., Broda, K., Jonsson, A.: Induction of subgoal automata for reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3890–3897, April 2020. https://ojs.aaai.org/index.php/AAAI/article/view/5802

  11. Gaglione, J.R., Neider, D., Roy, R., Topcu, U., Xu, Z.: Learning linear temporal properties from noisy data: a MaxSAT-based approach. In: ATVA 2021, Gold Coast, Australia, 18–22 October 2021. Lecture Notes in Computer Science. Springer (2021)

  12. Gaon, M., Brafman, R.: Reinforcement learning with non-Markovian rewards. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3980–3987, April 2020. https://ojs.aaai.org/index.php/AAAI/article/view/5814

  13. Holzinger, A., Malle, B., Saranti, A., Pfeifer, B.: Towards multi-modal causability with graph neural networks enabling information fusion for explainable AI. Inf. Fus. 71, 28–37 (2021). https://www.sciencedirect.com/science/article/pii/S1566253521000142

  14. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation, 3rd edn. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (2006)

  15. Hoxha, B., Dokhanchi, A., Fainekos, G.: Mining parametric temporal logic properties in model-based design for cyber-physical systems. Int. J. Softw. Tools Technol. Transfer 20(1), 79–93 (2017). https://doi.org/10.1007/s10009-017-0447-4

  16. Toro Icarte, R., Waldie, E., Klassen, T., Valenzano, R., Castro, M., McIlraith, S.: Learning reward machines for partially observable reinforcement learning. In: NeurIPS 2019 (2019)

  17. Toro Icarte, R., Klassen, T.Q., Valenzano, R.A., McIlraith, S.A.: Using reward machines for high-level task specification and decomposition in reinforcement learning. In: ICML 2018, Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018, pp. 2112–2121 (2018)

  18. Kong, Z., Jones, A., Belta, C.: Temporal logics for learning and detection of anomalous behavior. IEEE TAC 62(3), 1210–1222 (2017)

  19. Li, X., Vasile, C.-I., Belta, C.: Reinforcement learning with temporal logic rewards. In: Proceedings of the IROS 2017, September 2017, pp. 3834–3839 (2017)

  20. Neider, D., Gaglione, J.R., Gavran, I., Topcu, U., Wu, B., Xu, Z.: Advice-guided reinforcement learning in a non-Markovian environment. In: AAAI 2021 (2021)

  21. Neider, D., Gavran, I.: Learning linear temporal properties. In: Formal Methods in Computer Aided Design (FMCAD) 2018, pp. 1–10 (2018)

  22. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms (2017). https://arxiv.org/abs/1707.06347

  23. Shah, A., Kamath, P., Shah, J.A., Li, S.: Bayesian inference of temporal task specifications from demonstrations. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) NeurIPS 2018, pp. 3808–3817. Curran Associates, Inc. (2018). http://papers.nips.cc/paper/7637-bayesian-inference-of-temporal-task-specifications-from-demonstrations.pdf

  24. Toro Icarte, R., Klassen, T.Q., Valenzano, R., McIlraith, S.A.: Teaching multiple tasks to an RL agent using LTL. In: AAMAS 2018, Richland, SC, pp. 452–461 (2018)

  25. Vazquez-Chanlatte, M., Jha, S., Tiwari, A., Ho, M.K., Seshia, S.A.: Learning task specifications from demonstrations. In: Proceedings of the NeurIPS 2018, pp. 5372–5382 (2018)

  26. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3), 279–292 (1992). https://doi.org/10.1007/BF00992698

  27. Wen, M., Papusha, I., Topcu, U.: Learning from demonstrations with high-level side information. In: Proceedings of the IJCAI 2017, pp. 3055–3061 (2017)

  28. Wu, B., Lin, H.: Counterexample-guided permissive supervisor synthesis for probabilistic systems through learning. In: American Control Conference (ACC) 2015, pp. 2894–2899. IEEE (2015)

  29. Wu, B., Zhang, X., Lin, H.: Permissive supervisor synthesis for Markov decision processes through learning. IEEE Trans. Autom. Control 64(8), 3332–3338 (2018)

  30. Zhang, X., Wu, B., Lin, H.: Supervisor synthesis of POMDP based on automata learning. Automatica (to appear, 2021). https://arxiv.org/abs/1703.08262

  31. Xu, Z., Birtwistle, M., Belta, C., Julius, A.: A temporal logic inference approach for model discrimination. IEEE Life Sci. Lett. 2(3), 19–22 (2016)

  32. Xu, Z., Belta, C., Julius, A.: Temporal logic inference with prior information: an application to robot arm movements. In: Proceedings of the Analysis and Design of Hybrid Systems, vol. 48, no. 27, Atlanta, GA, USA, October 2015, pp. 141–146 (2015)

  33. Xu, Z., et al.: Joint inference of reward machines and policies for reinforcement learning. In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 30, pp. 590–598 (2020)

  34. Xu, Z., Julius, A.: Census signal temporal logic inference for multiagent group behavior analysis. IEEE Trans. Autom. Sci. Eng. 15(1), 264–277 (2018)

  35. Xu, Z., Ornik, M., Julius, A.A., Topcu, U.: Information-guided temporal logic inference with prior knowledge. In: Proceedings of the 2019 American Control Conference (ACC), pp. 1891–1897. IEEE (2019). https://arxiv.org/abs/1811.08846

  36. Xu, Z., Saha, S., Hu, B., Mishra, S., Julius, A.: Advisory temporal logic inference and controller design for semiautonomous robots. IEEE Trans. Autom. Sci. Eng. 16, 1–19 (2018)

  37. Xu, Z., Topcu, U.: Transfer of temporal logic formulas in reinforcement learning. In: IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, July 2019, pp. 4010–4018 (2019). https://doi.org/10.24963/ijcai.2019/557

Acknowledgment

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0032, ARL W911NF2020132, ARL ACC-APG-RTP W911NF, NSF 1646522, and Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - grant no. 434592664. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.

Author information

Corresponding author

Correspondence to Zhe Xu.

Copyright information

© 2021 IFIP International Federation for Information Processing

About this paper

Cite this paper

Xu, Z., Wu, B., Ojha, A., Neider, D., Topcu, U. (2021). Active Finite Reward Automaton Inference and Reinforcement Learning Using Queries and Counterexamples. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2021. Lecture Notes in Computer Science, vol. 12844. Springer, Cham. https://doi.org/10.1007/978-3-030-84060-0_8

  • DOI: https://doi.org/10.1007/978-3-030-84060-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-84059-4

  • Online ISBN: 978-3-030-84060-0

  • eBook Packages: Computer Science, Computer Science (R0)