Abstract
Reinforcement learning faces an important challenge in partially observable environments with long-term dependencies. To learn in an ambiguous environment, an agent has to keep previous perceptions in a memory. Earlier memory-based approaches use a fixed rule to determine what to keep in the memory, which limits them to certain problems. In this study, we follow the idea of giving control of the memory to the agent by allowing it to take memory-changing actions, making the agent more adaptive to the dynamics of an environment. Further, we formalize an intrinsic motivation to support this learning mechanism, which guides the agent to memorize distinctive events and enables it to disambiguate its state in the environment. Our overall approach is tested and analyzed on several partially observable tasks with long-term dependencies. The experiments show a clear improvement in learning performance compared to other memory-based methods.
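Only the abstract is available here, so the following is a minimal illustrative sketch rather than the paper's algorithm: the toy T-maze environment, the single-slot memory, and the count-based rarity bonus are all assumptions chosen to mirror the idea described above (an agent that takes an explicit memory-changing action, with an intrinsic reward for memorizing rarely seen, hence distinctive, observations).

```python
import random
from collections import defaultdict

class TMaze:
    """Toy T-maze: a signal at the start tells the agent which arm of the
    final junction pays off; every corridor cell looks identical (aliased)."""

    def __init__(self, length=4):
        self.length = length

    def reset(self):
        self.goal = random.randint(0, 1)  # which arm is rewarded this episode
        self.pos = 0
        return self._obs()

    def _obs(self):
        if self.pos == 0:
            return ('signal', self.goal)  # distinctive, seen once per episode
        if self.pos < self.length:
            return ('corridor',)          # aliased observation
        return ('junction',)

    def step(self, action):
        if self.pos < self.length:        # corridor: any action moves forward
            self.pos += 1
            return self._obs(), 0.0, False
        # junction: action 0 or 1 picks an arm and the episode ends
        return self._obs(), (4.0 if action == self.goal else -0.1), True

ACTIONS = [0, 1, 'mem']  # two environment actions plus one memory-changing action

def train(episodes=5000, alpha=0.1, gamma=0.95, beta=0.5, seed=0):
    """Tabular Q-learning over (observation, memory) pairs.  Taking 'mem'
    writes the current observation into a one-slot memory and earns a
    count-based intrinsic bonus, largest for rarely seen observations."""
    random.seed(seed)
    env, Q, counts = TMaze(), defaultdict(float), defaultdict(int)
    for ep in range(episodes):
        eps = max(0.05, 1.0 - ep / (0.8 * episodes))  # decaying exploration
        obs, mem, done = env.reset(), None, False
        while not done:
            s = (obs, mem)
            a = (random.choice(ACTIONS) if random.random() < eps
                 else max(ACTIONS, key=lambda x: Q[(s, x)]))
            if a == 'mem':
                counts[obs] += 1
                bonus = beta / counts[obs] ** 0.5  # rarity (distinctiveness) bonus
                mem, env_a = obs, 0                # memorize, then default move
            else:
                bonus, env_a = 0.0, a
            obs, r, done = env.step(env_a)
            s2 = (obs, mem)
            target = r + bonus + (0.0 if done else
                                  gamma * max(Q[(s2, x)] for x in ACTIONS))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

def evaluate(Q, episodes=200, seed=1):
    """Success rate of the greedy policy (reaching the rewarded arm)."""
    random.seed(seed)
    env, wins = TMaze(), 0
    for _ in range(episodes):
        obs, mem, done = env.reset(), None, False
        while not done:
            a = max(ACTIONS, key=lambda x: Q[((obs, mem), x)])
            if a == 'mem':
                mem, a = obs, 0
            obs, r, done = env.step(a)
        wins += r > 0
    return wins / episodes
```

Because the bonus decays with the visit count, it mainly shapes early exploration toward storing the distinctive signal; the asymptotic policy is driven by the extrinsic reward, which already favors remembering the signal since it disambiguates the junction state.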
Acknowledgements
This work is supported by the Scientific and Technological Research Council of Turkey under Grant No. 120E427. The authors would also like to thank Hüseyin Aydın, Erkin Çilden and Faruk Polat for their support.
Data availability
The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.
About this article
Cite this article
Demir, A. Learning what to memorize: Using intrinsic motivation to form useful memory in partially observable reinforcement learning. Appl Intell 53, 19074–19092 (2023). https://doi.org/10.1007/s10489-022-04328-z