Information-Loss-Bounded Policy Optimization

Song, Yunlong

doi:10.1007/978-3-030-41188-6_8

Part of the book series: Studies in Computational Intelligence (SCI, volume 883)

Abstract

Proximal policy optimization (PPO) and trust-region policy optimization (TRPO) belong to the standard reinforcement learning toolbox. Notably, PPO can be viewed as transforming the constrained TRPO problem into an unconstrained one, either by turning the constraint into a penalty or by clipping the objective. This chapter studies an alternative reformulation in which the information loss is bounded using a novel transformation of the Kullback-Leibler (KL) divergence constraint. In contrast to PPO, the method does not require tuning the regularization parameter, which is known to be hard because of its sensitivity to reward scaling. The resulting algorithm, termed information-loss-bounded policy optimization (ILBPO), enjoys the benefits of first-order methods, being straightforward to implement using automatic differentiation, while maintaining the advantages of quasi-second-order methods. It performs competitively in simulated OpenAI MuJoCo environments and achieves robust performance on a real robotic task: swing-up and stabilization of the Furuta pendulum.
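To make the two PPO reformulations named in the abstract concrete, the following minimal NumPy sketch evaluates the KL-penalized and the clipped surrogate objectives on a toy batch with categorical policies. The function names, the toy numbers, and the coefficient `beta` are illustrative assumptions, not taken from the chapter, and the sketch does not reproduce ILBPO's own transformation of the KL constraint, which the abstract does not specify.

```python
import numpy as np

def kl_categorical(p_old, p_new):
    """Mean KL(p_old || p_new) over states for categorical action distributions."""
    return float(np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)))

def surrogate(ratio, adv):
    """Importance-weighted advantage, the objective shared by TRPO and PPO."""
    return float(np.mean(ratio * adv))

def ppo_penalty_objective(ratio, adv, kl, beta):
    """Unconstrained variant 1: the KL constraint becomes a penalty weighted by beta."""
    return surrogate(ratio, adv) - beta * kl

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Unconstrained variant 2: the probability ratio is clipped to [1 - eps, 1 + eps]."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))

# Toy batch (hypothetical numbers): two states, two actions each.
p_old = np.array([[0.5, 0.5], [0.5, 0.5]])
p_new = np.array([[0.6, 0.4], [0.4, 0.6]])
actions = np.array([0, 0])       # action taken in each state
adv = np.array([1.0, -1.0])      # advantage estimates

idx = np.arange(len(actions))
ratio = p_new[idx, actions] / p_old[idx, actions]
kl = kl_categorical(p_old, p_new)

penalty_value = ppo_penalty_objective(ratio, adv, kl, beta=1.0)
clip_value = ppo_clip_objective(ratio, adv, eps=0.1)

# Rescaling rewards (and hence advantages) by 10 rescales the surrogate term
# but leaves the KL term unchanged, so a fixed beta weighs the penalty differently.
penalty_scaled = ppo_penalty_objective(ratio, 10.0 * adv, kl, beta=1.0)
```

In this sketch, multiplying the advantages by a constant multiplies the surrogate term but not the KL term, so the same `beta` yields a different trade-off. This is one way to read the abstract's remark that the regularization parameter is sensitive to reward scaling.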

Author information

Authors and Affiliations

Technische Universität Darmstadt, Darmstadt, Germany
Yunlong Song

Corresponding author

Correspondence to Yunlong Song.

Editor information

Editors and Affiliations

Department of Computer Science, Technische Universität Darmstadt, Darmstadt, Germany
Boris Belousov
Department of Computer Science, Technische Universität Darmstadt, Darmstadt, Germany
Hany Abdulsamad
Department of Computer Science, Technische Universität Darmstadt, Darmstadt, Germany
Pascal Klink
Department of Computer Science, Technische Universität Darmstadt, Darmstadt, Germany
Simone Parisi
Department of Computer Science, Technische Universität Darmstadt, Darmstadt, Germany
Jan Peters

About this chapter

Cite this chapter

Song, Y. (2021). Information-Loss-Bounded Policy Optimization. In: Belousov, B., Abdulsamad, H., Klink, P., Parisi, S., Peters, J. (eds) Reinforcement Learning Algorithms: Analysis and Applications. Studies in Computational Intelligence, vol 883. Springer, Cham. https://doi.org/10.1007/978-3-030-41188-6_8

DOI: https://doi.org/10.1007/978-3-030-41188-6_8
Published: 03 January 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-41187-9
Online ISBN: 978-3-030-41188-6
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
