
Efficient Online Hyperparameter Adaptation for Deep Reinforcement Learning

  • Yinda Zhou
  • Weiming Liu
  • Bin Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11454)

Abstract

Deep Reinforcement Learning (DRL) has shown extraordinary performance on a variety of challenging learning tasks, especially games. It is well recognized that DRL is a highly dynamic and non-stationary optimization process even in static environments, and that its performance is notoriously sensitive to the hyperparameter configuration, including the learning rate, discount coefficient, step size, etc. The situation becomes more serious when DRL is conducted in a changing environment. Ideally, the hyperparameters would promptly self-adapt to the values best suited to the current learning state, rather than remaining a fixed set for the whole course of training, as in most previous work. In this paper, an efficient online hyperparameter adaptation method is presented, which improves the Population-based Training (PBT) method in the promptness of adaptation. A recombination operation inspired by genetic algorithms (GA) is introduced into the population adaptation to accelerate the convergence of the population towards better hyperparameter configurations. Experimental results show that in four test environments, the presented method achieves 92%, 70%, 2%, and 15% performance improvements over PBT.
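
The scheme described in the abstract is essentially PBT's exploit-and-explore loop augmented with a crossover step between well-performing configurations. Below is a minimal sketch of that idea in Python; all names (Worker, train_and_eval, recombine, perturb) and the dummy evaluation score are illustrative assumptions, not the authors' implementation.

import random

# Minimal sketch of population-based hyperparameter adaptation with a
# GA-inspired recombination step. Worker, train_and_eval, recombine and
# perturb are illustrative assumptions; the evaluation score is faked so
# the sketch runs end to end.

HYPERPARAM_RANGES = {
    "learning_rate": (1e-5, 1e-2),
    "discount": (0.90, 0.999),
}

class Worker:
    def __init__(self):
        # Each population member holds its own hyperparameter configuration.
        self.hparams = {k: random.uniform(*r) for k, r in HYPERPARAM_RANGES.items()}
        self.score = float("-inf")

    def train_and_eval(self):
        # Stand-in for one interval of DRL training followed by evaluation.
        self.score = (-abs(self.hparams["learning_rate"] - 1e-3)
                      - abs(self.hparams["discount"] - 0.99))

def recombine(parent_a, parent_b):
    # GA-inspired recombination: take each hyperparameter from one of two
    # well-performing parents, pulling the population toward good regions
    # faster than copy-and-perturb alone.
    return {k: random.choice((parent_a[k], parent_b[k])) for k in parent_a}

def perturb(hparams, factor=1.2):
    # Standard PBT-style explore: randomly scale each hyperparameter.
    return {k: v * random.choice((factor, 1.0 / factor)) for k, v in hparams.items()}

def pbt_with_recombination(pop_size=8, intervals=20):
    population = [Worker() for _ in range(pop_size)]
    quartile = max(2, pop_size // 4)
    for _ in range(intervals):
        for worker in population:
            worker.train_and_eval()
        ranked = sorted(population, key=lambda w: w.score, reverse=True)
        top, bottom = ranked[:quartile], ranked[-quartile:]
        for loser in bottom:
            # Exploit: rebuild a poor configuration by recombining two top
            # configurations, then explore by perturbing the result.
            pa, pb = random.sample(top, 2)
            loser.hparams = perturb(recombine(pa.hparams, pb.hparams))
    return max(population, key=lambda w: w.score).hparams

if __name__ == "__main__":
    print(pbt_with_recombination())

In a real DRL setting, train_and_eval would run one training interval of the underlying agent and return its evaluation score, and, as in standard PBT, replacing a poor worker's hyperparameters would be paired with copying the network weights of a well-performing parent.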

Keywords

Reinforcement learning · Hyperparameter adaptation · Game

Notes

Acknowledgment

This work is supported by the National Natural Science Foundation of China under Grant No. 61836011 and No. 61473271.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application Systems, University of Science and Technology of China, Hefei, China
