Abstract
Stochastic Value Gradient (SVG) methods underlie many recent achievements of model-based Reinforcement Learning agents in continuous state-action spaces. Despite their practical significance, many algorithm design choices still lack rigorous theoretical or empirical justification. In this work, we analyze one such design choice: the gradient estimator formula. We conduct our analysis on randomized Linear Quadratic Gaussian environments, allowing us to empirically assess gradient estimation quality relative to the actual SVG. Our results justify a widely used gradient estimator by showing it induces a favorable bias-variance tradeoff, which could explain the lower sample complexity of recent SVG methods.
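The full experimental protocol appears in the paper body; as a rough, hedged illustration of what "assessing gradient estimation quality relative to the actual SVG" can look like in code, the sketch below compares a sampled gradient estimate against an analytically computed reference gradient. The function name and the two metrics (cosine similarity and relative error) are illustrative assumptions, not necessarily the quantities reported in the paper.

```python
import numpy as np

def gradient_quality(estimate: np.ndarray, true_grad: np.ndarray) -> dict:
    """Compare a stochastic gradient estimate with an exact reference gradient.

    Cosine similarity captures how well the estimate points in the true ascent
    direction; relative error captures how far off its magnitude is. Both are
    illustrative proxies, not necessarily the paper's exact metrics.
    """
    est, ref = estimate.ravel(), true_grad.ravel()
    cosine = float(est @ ref / (np.linalg.norm(est) * np.linalg.norm(ref)))
    rel_error = float(np.linalg.norm(est - ref) / np.linalg.norm(ref))
    return {"cosine_similarity": cosine, "relative_error": rel_error}
```

In an LQG environment the reference gradient is available in closed form (see Note 7 below), so a comparison of this kind requires no further approximation.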
Notes
1. Our formula differs slightly from the original in that it considers a deterministic policy instead of a stochastic one.
2. We use the make_spd_matrix function (see the code sketch after these notes).
3. We use the scipy.signal.place_poles function (see the code sketch after these notes).
4. We use the same 10 random seeds for experiments across values of K.
5. We use seaborn.lineplot to produce the aggregated curves.
6. Learning rate of \(10^{-2}\), \(B=200\), and \(K=8\).
7. Recall from Sect. 3 that LQG allows us to compute the optimal policy analytically.
8. We found that the computation times for both estimators were equivalent.
9. We only clip the gradient norm at a maximum of 100 to avoid numerical errors.
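Notes 2 and 3 refer to specific library routines (scikit-learn's make_spd_matrix and SciPy's place_poles). As a minimal, hedged sketch of how a randomized LQG instance and a stabilizing initial policy could be assembled with those routines, consider the following; the function name, dimensions, and pole-placement range are illustrative assumptions and need not match the paper's exact construction.

```python
import numpy as np
from scipy.signal import place_poles
from sklearn.datasets import make_spd_matrix

def make_random_lqg(n_state: int, n_ctrl: int, seed: int = 0):
    """Sketch: a randomized LQG instance plus a stabilizing linear policy.

    Cost matrices are drawn with sklearn's make_spd_matrix (Note 2) and the
    initial feedback gain is obtained with scipy.signal.place_poles (Note 3).
    The exact construction in the paper may differ; this only illustrates the
    library calls the notes refer to.
    """
    rng = np.random.default_rng(seed)
    # Random linear dynamics x' = A x + B u + w.
    A = rng.normal(size=(n_state, n_state))
    B = rng.normal(size=(n_state, n_ctrl))
    # Random positive-definite state and control cost matrices.
    Q = make_spd_matrix(n_state, random_state=seed)
    R = make_spd_matrix(n_ctrl, random_state=seed + 1)
    # Place the closed-loop poles strictly inside the unit circle so that the
    # initial policy u = -K x stabilizes the discrete-time system.
    poles = rng.uniform(-0.9, 0.9, size=n_state)
    K = place_poles(A, B, poles).gain_matrix
    return A, B, Q, R, K

# Example usage (hypothetical dimensions): a 4-dimensional state, 2-dimensional control.
A, B, Q, R, K = make_random_lqg(n_state=4, n_ctrl=2, seed=42)
```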
Acknowledgments
This work was partly supported by CAPES grant 88887.339578/2019-00 (first author), FAPESP grant 2016/22900-1 (second author), and CNPq scholarship 307979/2018-0 (third author).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Lovatto, Â.G., Bueno, T.P., de Barros, L.N. (2021). Gradient Estimation in Model-Based Reinforcement Learning: A Study on Linear Quadratic Environments. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol. 13073. Springer, Cham. https://doi.org/10.1007/978-3-030-91702-9_3
DOI: https://doi.org/10.1007/978-3-030-91702-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91701-2
Online ISBN: 978-3-030-91702-9