Abstract
Offline reinforcement learning (RL) seeks to train agents in sequential decision-making tasks using only previously collected data, without directly interacting with the environment. As the agent tries to improve on the policy present in the dataset, it can introduce a distributional shift between the training data and the agent’s proposed policy, which can lead to poor performance. To avoid the agent assigning high values to out-of-distribution actions, successful offline RL requires some form of conservatism to be introduced. Here we present a model-free inference framework that encodes this conservatism in the prior belief of the value function: by carrying out policy evaluation with a pessimistic prior, we ensure that only actions directly supported by the offline dataset are modelled as having a high value. In contrast to other methods, we do not need to introduce heuristic policy constraints, value regularisation or uncertainty penalties to obtain successful offline RL policies in a toy environment. An additional consequence of our work is a principled quantification of Bayesian uncertainty in off-policy returns in model-free RL. While we present an implementation of this framework that verifies its behaviour in the exact inference setting with Gaussian processes on a toy problem, the scalability issues it suffers from remain the central avenue for further work. We discuss these limitations in more detail and consider future directions to improve the scalability of the framework beyond the vanilla Gaussian process implementation, proposing a path towards improving offline RL algorithms in a principled way.
References
An, G., Moon, S., Kim, J.-H., Song, H.O.: Uncertainty-based offline reinforcement learning with diversified Q-ensemble. In: Advances in Neural Information Processing Systems, vol. 34, pp. 7436–7447 (2021)
Bachtiger, P., et al.: Artificial intelligence, data sensors and interconnectivity: future opportunities for heart failure. Card. Fail. Rev. 6 (2020)
Brandfonbrener, D., Whitney, W., Ranganath, R., Bruna, J.: Offline RL without off-policy evaluation. In: Advances in Neural Information Processing Systems, vol. 34, pp. 4933–4946 (2021)
Burt, D.R., Ober, S.W., Garriga-Alonso, A., van der Wilk, M.: Understanding variational inference in function-space. arXiv preprint: arXiv:2011.09421 (2020)
Dasari, S., et al.: RoboNet: Large-scale multi-robot learning. In: Conference on Robot Learning, pp. 885–897. PMLR (2020)
Degris, T., White, M., Sutton, R.S.: Off-policy actor-critic. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012 (2012)
D’Angelo, F., Fortuin, V.: Repulsive deep ensembles are Bayesian. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 3451–3465. Curran Associates, Inc. (2021)
Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with Gaussian processes. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208 (2005)
Fujimoto, S., Gu, S.S.: A minimalist approach to offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 34, pp. 20132–20145 (2021)
Fujimoto, S., Meger, D., Precup, D.: Off-policy deep reinforcement learning without exploration. In: International Conference on Machine Learning, pp. 2052–2062. PMLR (2019)
Hafner, D., Tran, D., Lillicrap, T., Irpan, A., Davidson, J.: Noise contrastive priors for functional uncertainty. In: Uncertainty in Artificial Intelligence, pp. 905–914. PMLR (2020)
Hensman, J., Fusi, N., Lawrence, N.D.: Gaussian processes for big data. arXiv preprint: arXiv:1309.6835 (2013)
Huang, Z., Wu, J., Lv, C.: Efficient deep reinforcement learning with imitative expert priors for autonomous driving. IEEE Trans. Neural Netw. Learn. Syst. (2022)
Kalashnikov, D., et al.: Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, pp. 651–673. PMLR (2018)
Kendall, A., et al.: Learning to drive in a day. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 8248–8254. IEEE (2019)
Kidambi, R., Rajeswaran, A., Netrapalli, P., Joachims, T.: MOReL: model-based offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 33, pp. 21810–21823 (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint: arXiv:1412.6980 (2014)
Komorowski, M., Celi, L.A., Badawi, O., Gordon, A.C., Faisal, A.A.: The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 24(11), 1716–1720 (2018)
Kostrikov, I., Fergus, R., Tompson, J., Nachum, O.: Offline reinforcement learning with Fisher divergence critic regularization. In: International Conference on Machine Learning, pp. 5774–5783. PMLR (2021)
Kostrikov, I., Nair, A., Levine, S.: Offline reinforcement learning with implicit Q-learning. In: International Conference on Learning Representations (2022)
Kumar, A., Fu, J., Soh, M., Tucker, G., Levine, S.: Stabilizing off-policy Q-learning via bootstrapping error reduction. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Kumar, A., Zhou, A., Tucker, G., Levine, S.: Conservative Q-learning for offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1179–1191 (2020)
Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Ma, C., Hernández-Lobato, J.M.: Functional variational inference based on stochastic process generators. In: Advances in Neural Information Processing Systems, vol. 34, pp. 21795–21807 (2021)
Matsushima, T., Furuta, H., Matsuo, Y., Nachum, O., Gu, S.: Deployment-efficient reinforcement learning via model-based offline optimization. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net (2021)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Ober, S.W., Rasmussen, C.E., van der Wilk, M.: The promises and pitfalls of deep kernel learning. In: Uncertainty in Artificial Intelligence, pp. 1206–1216. PMLR (2021)
Osband, I., Aslanides, J., Cassirer, A.: Randomized prior functions for deep reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 31 (2018)
Ovadia, Y., et al.: Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
Shi, L., Li, G., Wei, Y., Chen, Y., Chi, Y.: Pessimistic Q-learning for offline reinforcement learning: towards optimal sample complexity. In: International Conference on Machine Learning, pp. 19967–20025. PMLR (2022)
Sinha, S., Mandlekar, A., Garg, A.: S4RL: surprisingly simple self-supervision for offline reinforcement learning in robotics. In: Conference on Robot Learning, pp. 907–917. PMLR (2022)
Sun, S., Zhang, G., Shi, J., Grosse, R.: Functional variational Bayesian neural networks. arXiv preprint: arXiv:1903.05779 (2019)
Titsias, M.: Variational learning of inducing variables in sparse Gaussian processes. In: Artificial Intelligence and Statistics, pp. 567–574. PMLR (2009)
Touati, A., Satija, H., Romoff, J., Pineau, J., Vincent, P.: Randomized value functions via multiplicative normalizing flows. In: Uncertainty in Artificial Intelligence, pp. 422–432. PMLR (2020)
Van Amersfoort, J., Smith, L., Jesson, A., Key, O., Gal, Y.: On feature collapse and deep kernel learning for single forward pass uncertainty. arXiv preprint: arXiv:2102.11409 (2021)
Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P.: Deep kernel learning. In: Artificial Intelligence and Statistics, pp. 370–378. PMLR (2016)
Xiao, T., Wang, D.: A general offline reinforcement learning framework for interactive recommendation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 4512–4520 (2021)
Yu, T., et al.: MOPO: model-based offline policy optimization. In: Advances in Neural Information Processing Systems, vol. 33, pp. 14129–14142 (2020)
Acknowledgments
FV was funded by the Department of Computing, Imperial College London. AF was supported by a UKRI Turing AI Fellowship (EP/V025499/1).
Appendices
Appendix 1
Prior mean and covariance derivation
The prior mean and covariance of the observed rewards can be deduced by writing \(R_i = \sum _k A_{ik}Q(x_i^{(k)})\), with the same definitions of A and x as given in Eqs. 4 and 5, and then considering \(\mathbb {E}(R_i)\) and \(\text {cov}(R_i,R_j)\).
For the mean, by linearity of expectation we have
\(\mathbb {E}(R_i) = \sum _k A_{ik}\,\mathbb {E}\bigl (Q(x_i^{(k)})\bigr ),\)
and for the covariance, by bilinearity,
\(\text {cov}(R_i,R_j) = \sum _{k,l} A_{ik}A_{jl}\,\text {cov}\bigl (Q(x_i^{(k)}),Q(x_j^{(l)})\bigr ) = \sum _{k,l} A_{ik}A_{jl}\,k_Q(x_i^{(k)},x_j^{(l)}),\)
as required.
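To make this construction concrete, the following sketch (our own illustration rather than the paper’s implementation; the RBF base kernel, the array shapes and the helper names are assumptions) builds the induced prior mean and covariance of the observed rewards from a base value kernel \(k_Q\) and the matrix A by unrolling the inputs \(x_i^{(k)}\).

```python
import numpy as np

def rbf_kernel(x1, x2, variance=1.0, lengthscale=0.25):
    """Illustrative base kernel k_Q on flattened inputs (an assumption for this sketch)."""
    sq_dist = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def reward_prior(A, x, mean_Q):
    """Prior mean and covariance of observed rewards R_i = sum_k A[i, k] * Q(x[i, k]).

    A:      (N, K) coefficient matrix
    x:      (N, K, D) array of inputs x_i^{(k)}
    mean_Q: callable returning the prior mean of Q at each flattened input
    """
    N, K, D = x.shape
    x_flat = x.reshape(N * K, D)               # unroll the inputs x_i^{(k)}
    m_flat = mean_Q(x_flat)                    # prior mean of Q, shape (N*K,)
    K_Q = rbf_kernel(x_flat, x_flat)           # base covariance, shape (N*K, N*K)
    A_block = np.zeros((N, N * K))
    for i in range(N):
        A_block[i, i * K:(i + 1) * K] = A[i]   # block layout matching the unrolled inputs
    mean_R = A_block @ m_flat                  # E[R_i] = sum_k A_ik E[Q(x_i^{(k)})]
    cov_R = A_block @ K_Q @ A_block.T          # cov(R_i, R_j) = sum_{k,l} A_ik A_jl k_Q(...)
    return mean_R, cov_R
```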
Validity of resulting kernel
To show that k is a valid kernel for any policy, we will deduce that the matrix with elements i, j given by \(k_{ij} = \sum _{k,l} A_{ik}A_{jl}\,k_Q(x_i^{(k)},x_j^{(l)})\)
is positive semidefinite (and, trivially, symmetric in i and j) for any given valid kernel \(k_Q\). This is equivalent to the statement that for any vector with elements \(v_i\), the dot product \(\sum _{i,j}v_i k_{ij} v_j \ge 0\) given that for any \(x'\) and \(v'\), \(\sum _{m,n}v'_mk_Q(x'_{m},x'_{n})v'_n \ge 0\).
Indeed,
\(\sum _{i,j} v_i k_{ij} v_j = \sum _{i,j,k,l} (v_i A_{ik})\,k_Q(x_i^{(k)},x_j^{(l)})\,(A_{jl} v_j) = \sum _{m,n} v'_m\,k_Q(x'_m,x'_n)\,v'_n \ge 0,\)
as required, where we have ‘unrolled’ the matrix \(v_i A_{ik}\) into the vector \(v'_m\), with \(x_i^{(k)}\) the corresponding \(x'_m\) values, and replaced summation over i, k with summation over m and, correspondingly, summation over j, l with summation over n.
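As a quick numerical sanity check of this argument (illustrative only; the random coefficients, inputs and RBF base kernel are placeholders), the sketch below assembles the induced matrix \(k_{ij}\) for an arbitrary A and confirms that its eigenvalues are non-negative.

```python
import numpy as np

# The induced kernel A K_Q A^T inherits positive semidefiniteness from the base kernel K_Q.
rng = np.random.default_rng(0)
N, K, D = 5, 3, 2
x = rng.normal(size=(N * K, D))                        # unrolled inputs x'_m
sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
K_Q = np.exp(-0.5 * sq_dist / 0.25 ** 2)               # a valid base kernel (RBF Gram matrix)

A = rng.normal(size=(N, K))                            # arbitrary coefficients
A_block = np.zeros((N, N * K))
for i in range(N):
    A_block[i, i * K:(i + 1) * K] = A[i]               # unrolled structure of v_i A_ik

K_R = A_block @ K_Q @ A_block.T                        # induced reward-space kernel matrix
print(np.linalg.eigvalsh(K_R).min() >= -1e-10)         # True: positive semidefinite
```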
Appendix 2
The GP employed has a prior mean equal to 0 and a base covariance for the latent (value) variables that factors into an RBF term in state space and a Kronecker delta in action space (as actions are discrete, we assume independence of the value across the different actions), analogously to Engel et al. [2005]:
\(k_Q\bigl ((s,a),(s',a')\bigr ) = \sigma _p^2 \exp \left( -\frac{\Vert s - s'\Vert ^2}{2l^2} \right) \delta _{a,a'},\)
where we chose \(\sigma _p^2 = 1\) and \(l=0.25\) for the experiments presented.
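For reference, a minimal sketch of this base covariance follows; the function name and argument layout are our own, while the RBF-times-Kronecker-delta form and the values \(\sigma _p^2 = 1\), \(l = 0.25\) come from the text.

```python
import numpy as np

def value_prior_kernel(s1, a1, s2, a2, sigma_p2=1.0, lengthscale=0.25):
    """Base covariance k_Q((s, a), (s', a')): RBF in state space times a
    Kronecker delta over the discrete actions, as described above."""
    rbf = sigma_p2 * np.exp(-np.sum((np.asarray(s1) - np.asarray(s2)) ** 2)
                            / (2.0 * lengthscale ** 2))
    return rbf if a1 == a2 else 0.0
```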
The DQN trained to produce Fig. 1b has two hidden layers of 256 neurons each, a batch size of 128, and the Adam optimiser [Kingma and Ba, 2014] with a learning rate of 0.001; it was trained for 100k gradient steps.
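A schematic of this baseline configuration is sketched below (hidden-layer widths, batch size, optimiser, learning rate and gradient-step count are from the text; the state and action dimensions and any training-loop details are placeholders).

```python
import torch
import torch.nn as nn

state_dim, n_actions = 1, 2   # placeholder dimensions for the toy environment

q_network = nn.Sequential(
    nn.Linear(state_dim, 256), nn.ReLU(),   # first hidden layer, 256 neurons
    nn.Linear(256, 256), nn.ReLU(),         # second hidden layer, 256 neurons
    nn.Linear(256, n_actions),              # one Q-value per discrete action
)
optimiser = torch.optim.Adam(q_network.parameters(), lr=1e-3)  # Adam, lr = 0.001

BATCH_SIZE = 128
NUM_GRADIENT_STEPS = 100_000
```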
Here we report the same results as in Figs. 1c and 1d, but over a wider range of the state space. Even outside the main region of interest, in regions where no data are present, the posterior value mean and standard deviation revert to the prior mean and standard deviation, i.e. a low value and a higher standard deviation, as is desirable for unsupported regions in the offline RL setting.