Efficient Exploration in Side-Scrolling Video Games with Trajectory Replay

  • I-Huan Chiang
  • Chung-Ming Huang
  • Nien-Hu Cheng
  • Hsin-Yu Liu
  • Shi-Chun Tsai


Deep reinforcement learning agents have outperformed human players in many games, such as the Atari 2600 games. For more complicated games, previous works proposed curiosity-driven exploration for learning. Nevertheless, such exploration generally requires substantial computational resources to train the agent. We design a method to assist our agent in exploring the environment. By utilizing previously learned experience more effectively, we develop a new memory replay mechanism consisting of two modules: a Trajectory Replay Module, which records the agent's movement trajectory using much less space, and a Trajectory Optimization Module, which formulates the state information as a reward. We evaluate our approach on two popular side-scrolling video games: Super Mario Bros and Sonic the Hedgehog. The experimental results show that our method helps the agent explore the environment efficiently, pass through various tough scenarios, and successfully reach the goal in most of the tested game levels with merely four workers and ordinary CPU computational resources for training. Demo videos are available for Super Mario Bros and Sonic the Hedgehog.
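The core idea above — recording only the agent's trajectory rather than full observations, and turning positional progress into a reward signal — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class and method names (`TrajectoryReplay`, `intrinsic_reward`) and the specific progress-based bonus are assumptions for exposition.

```python
from collections import deque


class TrajectoryReplay:
    """Hypothetical sketch of a trajectory-based replay buffer.

    Instead of storing full game frames, only per-step (x, y)
    coordinates are kept, which takes far less space than a
    conventional frame-based experience replay buffer.
    """

    def __init__(self, max_episodes=100):
        # Keep trajectories of the most recent episodes only.
        self.episodes = deque(maxlen=max_episodes)
        self.current = []      # trajectory of the ongoing episode
        self.best_x = 0.0      # farthest horizontal position seen so far

    def record(self, x, y):
        """Append the agent's current position to the ongoing trajectory."""
        self.current.append((x, y))

    def end_episode(self):
        """Archive the finished trajectory and start a fresh one."""
        self.episodes.append(self.current)
        self.current = []

    def intrinsic_reward(self, x):
        """Reward forward progress beyond the best recorded position.

        Returns a positive bonus only when the agent surpasses its
        previous best x-coordinate; repeated visits earn nothing,
        which discourages dithering in already-explored regions.
        """
        bonus = max(0.0, x - self.best_x)
        self.best_x = max(self.best_x, x)
        return bonus
```

In a side-scroller, where level progress is essentially monotone in the x-coordinate, such a position-derived bonus can be added to the environment reward at each step; the archived trajectories can then be mined (e.g., for self-imitation) without the memory cost of storing raw frames.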


Keywords: Deep reinforcement learning · Intrinsic reward · Self-imitation learning



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
