Abstract
In recent years, deep reinforcement learning (DRL) has accomplished a wide variety of tasks. However, when DRL is applied to tasks in real-world environments, designing an appropriate reward is difficult: reward signals obtained from actual hardware sensors may contain noise, misinterpretations, or failed observations. The learning instability caused by such unreliable signals remains an unsolved problem in DRL. In this work, we propose an approach that extends existing DRL models with a subtask that directly estimates the variance contained in the reward signal. The model feeds the feature map learned through this subtask in the critic network to the actor network, enabling stable learning that is robust to the effects of potential noise. Experiments in the Atari game domain with unstable reward signals show that our method stabilizes training convergence. We also discuss the extensibility of the model by visualizing feature maps. This approach has the potential to make DRL more practical for noisy, real-world scenarios.
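The abstract only sketches the architecture, so the snippet below is a minimal, hypothetical illustration of the general idea (not the authors' implementation): an actor-critic network with an auxiliary head that predicts the observed reward and its log-variance, trained with a heteroscedastic Gaussian negative log-likelihood. All class, function, and parameter names are our own assumptions.

```python
# Hypothetical sketch of a variance-estimating reward subtask attached to an
# actor-critic network; names and network shape are illustrative only.
import torch
import torch.nn as nn


class VarianceSubtaskActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Shared encoder; its feature map serves the critic, the variance
        # subtask, and the actor.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: V(s)
        # Subtask: predict the observed reward and its log-variance.
        self.reward_mu = nn.Linear(hidden, 1)
        self.reward_logvar = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        return (self.policy_head(h), self.value_head(h),
                self.reward_mu(h), self.reward_logvar(h))


def reward_uncertainty_loss(mu, logvar, reward):
    # Heteroscedastic Gaussian negative log-likelihood: the network can
    # discount noisy reward samples by predicting a larger variance for them.
    return (0.5 * torch.exp(-logvar) * (reward - mu) ** 2
            + 0.5 * logvar).mean()
```

In a full training loop this auxiliary loss would be added to the usual policy and value losses; the predicted variance could also be used to down-weight errors on transitions whose rewards look unreliable.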
Acknowledgment
This work was supported by JST, ACT-X Grant Number JPMJAX190I, Japan.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Suzuki, K., Ogata, T. (2020). Stable Deep Reinforcement Learning Method by Predicting Uncertainty in Rewards as a Subtask. In: Yang, H., Pasupa, K., Leung, A.C.S., Kwok, J.T., Chan, J.H., King, I. (eds.) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science, vol. 12533. Springer, Cham. https://doi.org/10.1007/978-3-030-63833-7_55
DOI: https://doi.org/10.1007/978-3-030-63833-7_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63832-0
Online ISBN: 978-3-030-63833-7