Lucid dreaming for experience replay: refreshing past states with the current policy

Abstract

Experience replay (ER) improves the data efficiency of off-policy reinforcement learning (RL) algorithms by allowing an agent to store and reuse its past experiences in a replay buffer. While many techniques have been proposed to enhance ER by biasing how experiences are sampled from the buffer, thus far they have not considered strategies for refreshing experiences inside the buffer. In this work, we introduce Lucid Dreaming for Experience Replay (LiDER), a conceptually new framework that allows replay experiences to be refreshed by leveraging the agent’s current policy. LiDER consists of three steps: First, LiDER moves an agent back to a past state. Second, from that state, LiDER then lets the agent execute a sequence of actions by following its current policy—as if the agent were “dreaming” about the past and can try out different behaviors to encounter new experiences in the dream. Third, LiDER stores and reuses the new experience if it turned out better than what the agent previously experienced, i.e., to refresh its memories. LiDER is designed to be easily incorporated into off-policy, multi-worker RL algorithms that use ER; we present in this work a case study of applying LiDER to an actor–critic-based algorithm. Results show LiDER consistently improves performance over the baseline in six Atari 2600 games. Our open-source implementation of LiDER and the data used to generate all plots in this work are available at https://github.com/duyunshu/lucid-dreaming-for-exp-replay.
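The three steps above can be sketched as a single refresh iteration. Everything here is illustrative rather than the paper's implementation (which builds on A3C workers and separate buffers): `lider_refresh`, the `reset_to` method, and the toy environment are hypothetical names introduced only to make the idea concrete.

```python
import random

def lider_refresh(buffer, env, policy, horizon=10):
    """One LiDER refresh iteration (sketch). Revisits a stored state,
    rolls out the current policy from it, and keeps the new experience
    only if its return beats the stored one."""
    state, old_return = random.choice(buffer)  # step 1: go back to a past state
    env.reset_to(state)                        # assumes the simulator supports state resets
    new_return, s = 0.0, state
    for _ in range(horizon):                   # step 2: "dream" with the current policy
        s, r, done = env.step(policy(s))
        new_return += r
        if done:
            break
    if new_return > old_return:                # step 3: refresh only on improvement
        buffer.append((state, new_return))
    return new_return

# Toy illustration: a 1-D chain where each step right yields reward 1.
class ToyEnv:
    def reset_to(self, state):
        self.s = state
    def step(self, action):
        self.s += action
        return self.s, float(action), self.s >= 5

buffer = [(0, 0.0)]                            # one stored state with return 0
lider_refresh(buffer, ToyEnv(), lambda s: 1)   # current policy: always move right
```

After the call, the buffer holds a refreshed copy of the revisited state with the improved return.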



Notes

  1. A one-step reward r is usually stored instead of the cumulative return (e.g., Mnih et al. [27]). In this work, we follow Oh et al. [32] and store the Monte Carlo return \(G\); we fully describe the buffer structure in Sect. 3.

  2. The implementation of A3CTBSIL is open-sourced at https://github.com/gabrieledcjr/DeepRL. In de la Cruz Jr et al. [6], we also considered using demonstrations to improve A3CTBSIL; that variant is not the baseline used in this work.

  3. Note that while the A3C algorithm is on-policy, integrating A3C with SIL makes it an off-policy algorithm (as in Oh et al. [32]).

  4. Note that the performance in Montezuma’s Revenge differs between A3CTBSIL [6] and the original SIL algorithm [32]; see the discussion in “Appendix 4.”

  5. Note that the baseline A3CTBSIL represents the SampleD scenario, i.e., it always samples from buffer \(\mathcal{D}\).

  6. The data is publicly available at https://github.com/gabrieledcjr/atari_human_demo.

  7. The policy-based Go-Explore algorithm extends the earlier Go-Explore framework without a policy, presented in a pre-print [9]; that earlier framework also leverages the simulator reset feature.

  8. Ecoffet et al. [10] made a detailed comparison between the policy-based Go-Explore and DTSIL; we refer interested readers to Ecoffet et al. [10] for further details.

References

  1. Andrychowicz M, Wolski F, Ray A, Schneider J, Fong R, Welinder P, McGrew B, Tobin J, Pieter Abbeel O, Zaremba W (2017) Hindsight experience replay. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, Curran Associates, Inc., 30:5048–5058. https://proceedings.neurips.cc/paper/2017/file/453fadbd8a1a3af50a9df4df899537b5-Paper.pdf

  2. Bellemare M, Srinivasan S, Ostrovski G, Schaul T, Saxton D, Munos R (2016) Unifying count-based exploration and intrinsic motivation. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R (eds) Advances in neural information processing systems, Curran Associates, Inc., vol 29, pp 1471–1479. https://proceedings.neurips.cc/paper/2016/file/afda332245e2af431fb7b672a68b659d-Paper.pdf

  3. Bellemare MG, Naddaf Y, Veness J, Bowling M (2013) The arcade learning environment: an evaluation platform for general agents. J Artif Intell Res 47(1):253–279

  4. Chan H, Wu Y, Kiros J, Fidler S, Ba J (2019) ACTRCE: augmenting experience via teacher’s advice for multi-goal reinforcement learning. arXiv preprint arXiv:1902.04546

  5. de la Cruz GV, Du Y, Taylor ME (2019) Pre-training with non-expert human demonstration for deep reinforcement learning. Knowl Eng Rev 34:e10. https://doi.org/10.1017/S0269888919000055

  6. de la Cruz Jr GV, Du Y, Taylor ME (2019) Jointly pre-training with supervised, autoencoder, and value losses for deep reinforcement learning. In: Adaptive and learning agents workshop, AAMAS

  7. Dao G, Lee M (2019) Relevant experiences in replay buffer. In: 2019 IEEE symposium series on computational intelligence (SSCI), pp 94–101. https://doi.org/10.1109/SSCI44817.2019.9002745

  8. De Bruin T, Kober J, Tuyls K, Babuška R (2015) The importance of experience replay database composition in deep reinforcement learning. In: Deep reinforcement learning workshop, NIPS

  9. Ecoffet A, Huizinga J, Lehman J, Stanley KO, Clune J (2019) Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995

  10. Ecoffet A, Huizinga J, Lehman J, Stanley KO, Clune J (2020) First return then explore. arXiv preprint arXiv:2004.12919

  11. Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, Doron Y, Firoiu V, Harley T, Dunning I, Legg S, Kavukcuoglu K (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In: Proceedings of machine learning research, 80:1407–1416. http://proceedings.mlr.press/v80/espeholt18a.html

  12. Fedus W, Ramachandran P, Agarwal R, Bengio Y, Larochelle H, Rowland M, Dabney W (2020) Revisiting fundamentals of experience replay. In: Proceedings of the 37th international conference on machine learning, PMLR. https://proceedings.icml.cc/paper/2020/hash/5460b9ea1986ec386cb64df22dff37be-Abstract.html

  13. Florensa C, Held D, Wulfmeier M, Zhang M, Abbeel P (2017) Reverse curriculum generation for reinforcement learning. In: Levine S, Vanhoucke V, Goldberg K (eds) Proceedings of machine learning research, PMLR, 78:482–495. http://proceedings.mlr.press/v78/florensa17a.html

  14. Gangwani T, Liu Q, Peng J (2019) Learning self-imitating diverse policies. In: International conference on learning representations. https://openreview.net/forum?id=HyxzRsR9Y7

  15. Gruslys A, Dabney W, Azar MG, Piot B, Bellemare M, Munos R (2018) The reactor: a fast and sample-efficient actor-critic agent for reinforcement learning. In: International conference on learning representations. https://openreview.net/forum?id=rkHVZWZAZ

  16. Guo Y, Choi J, Moczulski M, Feng S, Bengio S, Norouzi M, Lee H (2020) Memory based trajectory-conditioned policies for learning from sparse rewards. In: Advances in neural information processing systems. https://papers.nips.cc/paper/2020/hash/2df45244f09369e16ea3f9117ca45157-Abstract.html

  17. He FS, Liu Y, Schwing AG, Peng J (2017) Learning to play in a day: faster deep reinforcement learning by optimality tightening. In: International conference on learning representations. https://openreview.net/forum?id=rJ8Je4clg

  18. Hester T, Vecerik M, Pietquin O, Lanctot M, Schaul T, Piot B, Horgan D, Quan J, Sendonaris A, Osband I, Dulac-Arnold G, Agapiou J, Leibo JZ, Gruslys A (2018) Deep Q-learning from demonstrations. In: Annual meeting of the association for the advancement of artificial intelligence (AAAI), New Orleans (USA)

  19. Horgan D, Quan J, Budden D, Barth-Maron G, Hessel M, van Hasselt H, Silver D (2018) Distributed prioritized experience replay. In: International conference on learning representations. https://openreview.net/forum?id=H1Dy---0Z

  20. Hosu IA, Rebedea T (2016) Playing Atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077

  21. Kapturowski S, Ostrovski G, Dabney W, Quan J, Munos R (2019) Recurrent experience replay in distributed reinforcement learning. In: International conference on learning representations. https://openreview.net/forum?id=r1lyTjAqYX

  22. Le L, Patterson A, White M (2018) Supervised autoencoders: improving generalization performance with unsupervised regularizers. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems, Curran Associates, Inc., 31:107–117. https://proceedings.neurips.cc/paper/2018/file/2a38a4a9316c49e5a833517c45d31070-Paper.pdf

  23. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2016) Continuous control with deep reinforcement learning. In: International conference on learning representations. https://openreview.net/forum?id=tX_O8O-8Zl

  24. Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn 8(3–4):293–321

  25. Liu R, Zou J (2018) The effects of memory replay in reinforcement learning. In: The 56th annual Allerton conference on communication, control, and computing, pp 478–485

  26. Mihalkova L, Mooney R (2006) Using active relocation to aid reinforcement learning. In: Proceedings of the 19th international FLAIRS conference (FLAIRS-2006), Melbourne Beach, FL, pp 580–585. http://www.cs.utexas.edu/users/ai-lab?mihalkova:flairs06

  27. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529

  28. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: Balcan MF, Weinberger KQ (eds) Proceedings of machine learning research, PMLR, New York, New York, USA, vol 48, pp 1928–1937. http://proceedings.mlr.press/v48/mniha16.html

  29. Munos R, Stepleton T, Harutyunyan A, Bellemare M (2016) Safe and efficient off-policy reinforcement learning. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R (eds) Advances in neural information processing systems, Curran Associates, Inc., vol 29, pp 1054–1062. https://proceedings.neurips.cc/paper/2016/file/c3992e9a68c5ae12bd18488bc579b30d-Paper.pdf

  30. Nair A, McGrew B, Andrychowicz M, Zaremba W, Abbeel P (2018) Overcoming exploration in reinforcement learning with demonstrations. In: 2018 IEEE international conference on robotics and automation (ICRA), pp 6292–6299. https://doi.org/10.1109/ICRA.2018.8463162

  31. Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of machine learning research, PMLR, Long Beach, California, USA, vol 97, pp 4851–4860. http://proceedings.mlr.press/v97/novati19a.html

  32. Oh J, Guo Y, Singh S, Lee H (2018) Self-imitation learning. In: Dy J, Krause A (eds) Proceedings of machine learning research, PMLR, Stockholmsmässan, Stockholm Sweden, vol 80, pp 3878–3887. http://proceedings.mlr.press/v80/oh18b.html

  33. Pohlen T, Piot B, Hester T, Azar MG, Horgan D, Budden D, Barth-Maron G, van Hasselt H, Quan J, Večerík M et al (2018) Observe and look further: achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593

  34. Resnick C, Raileanu R, Kapoor S, Peysakhovich A, Cho K, Bruna J (2018) Backplay: “Man muss immer umkehren”. In: Workshop on reinforcement learning in games, AAAI

  35. Ross S, Bagnell D (2010) Efficient reductions for imitation learning. In: Teh YW, Titterington M (eds) Proceedings of machine learning research, JMLR workshop and conference proceedings, Chia Laguna Resort, Sardinia, Italy, 9:661–668. http://proceedings.mlr.press/v9/ross10a.html

  36. Salimans T, Chen R (2018) Learning Montezuma’s Revenge from a single demonstration. arXiv preprint arXiv:1812.03381

  37. Schaul T, Quan J, Antonoglou I, Silver D (2016) Prioritized experience replay. In: International conference on learning representations. arXiv:1511.05952

  38. Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, Lockhart E, Hassabis D, Graepel T et al (2019) Mastering Atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265

  39. Sinha S, Song J, Garg A, Ermon S (2020) Experience replay with likelihood-free importance weights. arXiv preprint arXiv:2006.13169

  40. Sovrano F (2019) Combining experience replay with exploration by random network distillation. In: 2019 IEEE conference on games (CoG), pp 1–8. https://doi.org/10.1109/CIG.2019.8848046

  41. Stumbrys T, Erlacher D, Schredl M (2016) Effectiveness of motor practice in lucid dreams: a comparison with physical and mental practice. J Sports Sci 34:27–34

  42. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT Press, Cambridge

  43. Tang Y (2020) Self-imitation learning via generalized lower bound Q-learning. In: Advances in neural information processing systems, vol 33. https://papers.nips.cc/paper/2020/file/a0443c8c8c3372d662e9173c18faaa2c-Paper.pdf

  44. Tavakoli A, Levdik V, Islam R, Smith CM, Kormushev P (2018) Exploring restart distributions. arXiv preprint arXiv:1811.11298

  45. Wang Z, Bapst V, Heess NMO, Mnih V, Munos R, Kavukcuoglu K, de Freitas N (2017) Sample efficient actor-critic with experience replay. In: International conference on learning representations. https://openreview.net/pdf?id=HyM25Mqel

  46. Wawrzyński P (2009) Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Netw 22(10):1484–1497

  47. Zha D, Lai KH, Zhou K, Hu X (2019) Experience replay optimization. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence, IJCAI-19, pp 4243–4249. https://doi.org/10.24963/ijcai.2019/589

  48. Zhang S, Sutton RS (2017) A deeper look at experience replay. arXiv preprint arXiv:1712.01275

  49. Zhang X, Bharti SK, Ma Y, Singla A, Zhu X (2020) The teaching dimension of Q-learning. arXiv preprint arXiv:2006.09324


Acknowledgements

We thank Gabriel V. de la Cruz Jr. for helpful discussions; his open-source code at https://github.com/gabrieledcjr/DeepRL is used for training the behavior cloning models in this work. This research used resources of Kamiak, Washington State University’s high-performance computing cluster. Assefaw Gebremedhin is supported by the NSF award IIS-1553528. Part of this work has taken place in the Intelligent Robot Learning (IRL) Lab at the University of Alberta, which is supported in part by research grants from the Alberta Machine Intelligence Institute (Amii), CIFAR, and NSERC. Part of this work has taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (CPS-1739964, IIS-1724157, NRI-1925082), ONR (N00014-18-2243), FLI (RFP2-000), ARL, DARPA, Lockheed Martin, GM, and Bosch. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.

Author information

Corresponding author

Correspondence to Yunshu Du.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendices for “Lucid dreaming for experience replay: refreshing past states with the current policy”

We provide further details of our work in the following six appendices:

  • “Appendix 1” contains the implementation details of LiDER, including neural network architecture, hyperparameters, and computation resources used for all experiments.

  • “Appendix 2” presents the pseudocode for the A3C and SIL workers. Both follow the original work of Mnih et al. [28] and Oh et al. [32], respectively; we add them here for completeness.

  • “Appendix 3” provides detailed statistics of the one-tailed independent-samples t tests: 1) A3CTBSIL compared to LiDER, 2) A3CTBSIL compared to the three ablation studies of LiDER, 3) A3CTBSIL compared to the two extensions of LiDER, 4) LiDER compared to the three ablation studies of LiDER, and 5) LiDER compared to the two extensions of LiDER.

  • “Appendix 4” discusses the differences between the A3CTBSIL algorithm in de la Cruz Jr et al. [6] and the original SIL algorithm in Oh et al. [32] (as mentioned in Sect. 4.1).

  • “Appendix 5” presents the performance of the trained agents (TA) used in LiDER-TA.

  • “Appendix 6” details the pre-training process for obtaining the BC models used in LiDER-BC, including the statistics of the demonstration data collected by de la Cruz et al. [5], the network architecture, the hyperparameters used for pre-training, and the performance of the trained BC models.

Appendix 1: Implementation details

We use the same neural network architecture as in the original A3C algorithm [28] for all A3C, SIL, and refresher workers (the blue, orange, and green components in Fig. 1, respectively). The network consists of three convolutional layers and one fully connected layer, followed by two fully connected output branches: a policy output layer and a value output layer. Atari images are converted to grayscale and resized to 88\(\times\)88, with four frames stacked as the input.

We run each experiment for eight trials due to computation limitations. Each experiment uses one GPU (Tesla K80 or TITAN V), five CPU cores, and 40 GB of memory (each LiDER-OneBuffer experiment uses 64 GB of memory because the buffer size is doubled). The refresher worker runs on the GPU to generate data as quickly as possible; the A3C and SIL workers run in parallel on CPU cores. In all games, training proceeds at roughly 0.8 to 1 million steps per hour, so one trial of 50 million steps takes around 50 to 60 hours of wall-clock time.

The baseline A3CTBSIL is trained with 17 parallel workers: 16 A3C workers and 1 SIL worker. The RMSProp optimizer is used with a learning rate of 0.0007. We use \(t_{\mathrm{max}}=20\) for the n-step bootstrap \(Q^{(n)}\) (\(n \le t_{\mathrm{max}}\)). The SIL worker performs \(M=4\) SIL policy updates (Equation (2)) per step t with minibatch size 32 (i.e., \(32 \times 4 = 128\) total samples per step). Buffer \(\mathcal{D}\) has size \(10^5\). The SIL loss weight is \(\beta^{\mathrm{sil}}=0.5\).

LiDER is also trained with 17 parallel workers: 15 A3C workers, 1 SIL worker, and 1 refresher worker—we keep the total number of workers in A3CTBSIL and LiDER the same to ensure a fair performance comparison. The SIL worker in LiDER also uses a minibatch size of 32; samples are taken from buffers \(\mathcal{D}\) and \(\mathcal{R}\) as described in Sect. 3. All other parameters are identical to those of A3CTBSIL. We summarize the details of the network architecture and experiment parameters in Table 2.
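For reference, the reported hyperparameters can be collected into a single configuration; the dictionary layout and key names below are illustrative, not taken from the released code.

```python
# Hyperparameters as reported in the text (see also Table 2).
A3CTBSIL_CONFIG = dict(
    total_workers=17,        # A3CTBSIL: 16 A3C + 1 SIL; LiDER: 15 A3C + 1 SIL + 1 refresher
    optimizer="RMSProp",
    learning_rate=7e-4,
    t_max=20,                # n-step bootstrap horizon, n <= t_max
    sil_updates_per_step=4,  # M = 4 SIL policy updates per step
    sil_minibatch_size=32,   # 32 * 4 = 128 total samples per step
    buffer_size=10**5,       # size of buffer D
    sil_loss_weight=0.5,     # beta^sil
)
```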

Table 2 Hyperparameters for all experiments. We train each game for 50 million steps with a frame skip of 4, i.e., 200 million game frames were consumed for training

Appendix 2: Pseudocode for the A3C and SIL workers

(Pseudocode figures for the A3C worker and the SIL worker, following Mnih et al. [28] and Oh et al. [32], respectively.)

Appendix 3: One-tailed independent-samples t tests

We conducted one-tailed independent-samples t tests (equal variances not assumed) in all games to compare the differences in the mean episodic reward among all methods in this paper. For each game, we restored the best model checkpoint from each trial (eight trials per method) and executed the model in the game following a deterministic policy for 100 episodes (an episode ends when the agent loses all its lives) and recorded the reward per episode. This gives us 800 data points for each method in each game. We use a significance level \(\alpha =0.001\) for all tests.

First, we check the statistical significance of the baseline A3CTBSIL compared to LiDER (Sect. 4.1), the main framework proposed in this paper. We report the detailed statistics in Table 3. Results show that the mean episodic reward of LiDER is significantly higher than A3CTBSIL (\(p \ll 0.001\)) in all games.

Table 3 One-tailed independent-samples t test for the differences of the mean episodic reward between A3CTBSIL and LiDER. Equal variances are not assumed

Second, we compare A3CTBSIL to the three ablation studies, LiDER-AddAll, LiDER-OneBuffer, and LiDER-SampleR (Sect. 5). Table 4 shows that all ablations were helpful in Freeway and Montezuma’s Revenge, in which the mean episodic rewards of the ablations are significantly higher than the baseline (\(p \ll 0.001\)). LiDER-AddAll also performed significantly better than A3CTBSIL in all games (\(p \ll 0.001\)). LiDER-OneBuffer outperformed A3CTBSIL in Freeway and Montezuma’s Revenge (\(p \ll 0.001\)), but performed worse than A3CTBSIL in the other four games (\(p \ll 0.001\)). LiDER-SampleR outperformed A3CTBSIL in Ms. Pac-Man, Freeway, and Montezuma’s Revenge (\(p \ll 0.001\)), but under-performed A3CTBSIL in Gopher, NameThisGame, and Alien (\(p \ll 0.001\)).

Table 4 One-tailed independent-samples t test for the differences of the mean episodic reward between A3CTBSIL and LiDER-AddALL, between A3CTBSIL and LiDER-OneBuffer, and between A3CTBSIL and LiDER-SampleR. Equal variances are not assumed

Third, we compare A3CTBSIL to the two extensions, LiDER-BC and LiDER-TA (Sect. 6). Table 5 shows that the two extensions outperformed the baseline significantly in all games (\(p \ll 0.001\)).

Table 5 One-tailed independent-samples t test for the differences of the mean episodic reward between A3CTBSIL and LiDER-BC, and between A3CTBSIL and LiDER-TA. Equal variances are not assumed

Fourth, we check the statistical significance of LiDER compared to the three ablation studies, LiDER-AddAll, LiDER-OneBuffer, and LiDER-SampleR (Sect. 5). Results in Table 6 show that most of the ablations significantly under-performed LiDER (\(p \ll 0.001\)) in terms of the mean episodic reward, except in Gopher and NameThisGame, where LiDER-AddAll performs at the same level as LiDER (\(p > 0.001\)).

Table 6 One-tailed independent-samples t test for the differences of the mean episodic reward between LiDER and LiDER-AddAll, between LiDER and LiDER-OneBuffer, and between LiDER and LiDER-SampleR. Equal variances are not assumed. Methods in bold are not significant at level \(\alpha =0.001\)

Lastly, we compare LiDER to the two extensions, LiDER-TA and LiDER-BC (Sect. 6). Results in Table 7 show that LiDER-TA outperformed LiDER in all games (\(p \ll 0.001\)). LiDER-BC outperformed LiDER in Gopher, Alien, Ms. Pac-Man, and Montezuma’s Revenge. In Freeway, LiDER-BC performs at the same level as LiDER (\(p > 0.001\)), while in NameThisGame LiDER-BC performed worse than LiDER (\(p \ll 0.001\)).

Table 7 One-tailed independent-samples t test for the differences of the mean episodic reward between LiDER and LiDER-TA, and between LiDER and LiDER-BC. Equal variances are not assumed. Methods in bold are not significant at level \(\alpha =0.001\)

Appendix 4: Differences between A3CTBSIL and SIL

There is a performance difference in Montezuma’s Revenge between the A3CTBSIL algorithm (our previous work in de la Cruz Jr et al. [6], which is used as the baseline method in this article) and the original SIL algorithm (by Oh et al. [32]). The A3CTBSIL agent fails to achieve any reward while the SIL agent can achieve a score of 1100 (Table 5 in [32]).

We hypothesize that the difference is due to the different number of SIL updates (Equation (2)) that can be performed in A3CTBSIL and SIL; fewer SIL updates would decrease performance. In particular, Oh et al. [32] proposed to add the “Perform self-imitation learning” step in each A3C worker (Algorithm 1 of Oh et al. [32]). That is, when running with 16 A3C workers, the SIL agent is actually using 16 SIL workers to update the policy. However, A3CTBSIL has only one SIL worker, which means A3CTBSIL performs strictly fewer SIL updates than the original SIL algorithm, thus resulting in lower performance.

We empirically validate the above hypothesis with an experiment in the game of Ms. Pac-Man, modifying the A3CTBSIL algorithm from our previous work [6]: instead of performing a SIL update whenever the SIL worker can, we force the SIL worker to perform an update only at even global steps, which reduces the total number of SIL updates by half. We denote this experiment as A3CTBSIL-ReduceSIL.
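The gating rule amounts to a one-line check; `sil_update_allowed` is an illustrative name, not the function used in the released code.

```python
def sil_update_allowed(global_step, reduce_sil=False):
    """A3CTBSIL lets the SIL worker update whenever it can; the
    ReduceSIL variant permits updates only at even global steps,
    halving the total number of SIL updates."""
    return global_step % 2 == 0 if reduce_sil else True

# Over 1000 global steps, ReduceSIL permits exactly half the updates.
full = sum(sil_update_allowed(t) for t in range(1000))
half = sum(sil_update_allowed(t, reduce_sil=True) for t in range(1000))
```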

Figure 8 shows that A3CTBSIL-ReduceSIL under-performed A3CTBSIL, which provides preliminary evidence that the number of SIL updates is positively correlated with performance. More experiments will be performed in future work to further validate this correlation.

Fig. 8

A3CTBSIL-ReduceSIL compared to A3CTBSIL in the game of Ms. Pac-Man. The x-axis is the total number of environmental steps. The y-axis is the average testing score over five trials. We ran A3CTBSIL-ReduceSIL for five trials due to limited computing resources; for a fair comparison, we plot only the first five of the eight A3CTBSIL trials. Shaded regions show the standard deviation

Appendix 5: The performance of trained agents used in LiDER-TA

Section 6 shows that LiDER can leverage knowledge from a trained agent (TA). While the TA could come from any source, we use the best checkpoint of a fully trained LiDER agent. Table 8 shows the average performance of the TA used in each game. The score is estimated by executing the TA greedily in the game for 50 episodes. An episode ends when the agent loses all its lives.

Table 8 The performance of trained agents used in LiDER-TA, shown as the purple dotted line in Fig. 7. The score is estimated by executing the TA greedily in the game for 50 episodes

Appendix 6: Pre-training the behavior cloning model for LiDER-BC

In Sect. 6, we demonstrated that a BC model can be incorporated into LiDER to improve learning. The BC model is pre-trained using a publicly available human demonstration dataset. Dataset statistics are shown in Table 9.

Table 9 Demonstration size and quality, collected in de la Cruz et al. [5]. All games are limited to 20 min of demonstration time per episode

The BC model uses the same network architecture as the A3C algorithm [28], and pre-training a BC model for A3C requires a few more steps than the standard supervised approach in imitation learning (e.g., Ross and Bagnell [35]). A3C has two output layers: a policy output layer and a value output layer. The policy output is what the supervised classifier is normally trained for; the value output layer, however, is usually initialized randomly without being pre-trained. Our previous work [6] observed this inconsistency and leveraged demonstration data to also pre-train the value output layer. In particular, since the demonstration data contains the true return \(G\), we can obtain a value loss that is almost identical to A3C’s value loss \(L^{a3c}_{\mathrm{value}}\): instead of using the n-step bootstrap value \(Q^{(n)}\) to compute the advantage, the true return \(G\) is used.
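The substitution can be sketched as follows. The function name and the exact scaling (the 0.5 factor and the mean over the minibatch) are illustrative assumptions; the point is only that the target is the stored Monte Carlo return \(G\) rather than a bootstrapped value.

```python
def pretrain_value_loss(values, returns):
    """Value loss for pre-training the BC value head (sketch):
    the target is the true Monte Carlo return G from the demonstration
    rather than the n-step bootstrap value Q^(n) used by A3C online."""
    n = len(values)
    # squared-error regression of predicted values toward the true returns
    return 0.5 * sum((G - v) ** 2 for v, G in zip(values, returns)) / n
```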

Inspired by the supervised autoencoder (SAE) framework [22], our previous work [6] also blended in an unsupervised loss for pre-training. In SAE, an image reconstruction loss is incorporated with the supervised loss to help extract better feature representations and achieve better performance. A BC model pre-trained jointly with supervised, value, and unsupervised losses can lead to better performance after fine-tuning with RL, compared to pre-training with the supervised loss only.

Following this approach, we jointly pre-train the BC model for 50,000 steps with a minibatch size of 32. The Adam optimizer is used with a learning rate of 0.0005. After training, we test for 50 episodes by executing the model greedily in the game and record the average episodic reward (an episode ends when the agent loses all its lives). For each set of demonstration data, we train five models and use the one with the highest average episodic reward as the BC model in LiDER-BC. The performance of the trained BC models is presented in Table 10. All parameters are based on those from our previous work [6]; we summarize them in Table 11.

Table 10 The performance of behavior cloning models used in LiDER-BC, shown as the black dashed line in Fig. 7. The score is estimated by executing the BC greedily in the game for 50 episodes
Table 11 Hyperparameters for pre-training the behavior cloning (BC) model used in LiDER-BC


Cite this article

Du, Y., Warnell, G., Gebremedhin, A. et al. Lucid dreaming for experience replay: refreshing past states with the current policy. Neural Comput & Applic (2021). https://doi.org/10.1007/s00521-021-06104-5


Keywords

  • Deep reinforcement learning
  • Experience replay
  • Self-imitation learning
  • Behavior cloning