Abstract
Offline reinforcement learning (RL) seeks to train agents in sequential decision-making tasks using only previously collected data, without directly interacting with the environment. As the agent tries to improve on the policy present in the dataset, it can introduce a distributional shift between the training data and the agent’s proposed policy, which can lead to poor performance. To avoid the agent assigning high values to out-of-distribution actions, successful offline RL requires some form of conservatism to be introduced. Here we present a model-free inference framework that encodes this conservatism in the prior belief of the value function: by carrying out policy evaluation with a pessimistic prior, we ensure that only actions directly supported by the offline dataset are modelled as having a high value. In contrast to other methods, we do not need to introduce heuristic policy constraints, value regularisation or uncertainty penalties to obtain successful offline RL policies in a toy environment. An additional consequence of our work is a principled quantification of Bayesian uncertainty in off-policy returns in model-free RL. While we present an implementation of this framework that verifies its behaviour in the exact inference setting with Gaussian processes on a toy problem, the scalability issues it suffers from remain the central avenue for further work. We discuss these limitations in more detail and consider future directions to improve the scalability of the framework beyond the vanilla Gaussian process implementation, proposing a path towards improving offline RL algorithms in a principled way.
References
An, G., Moon, S., Kim, J.-H., Song, H.O.: Uncertainty-based offline reinforcement learning with diversified Q-ensemble. In: Advances in Neural Information Processing Systems, vol. 34, pp. 7436–7447 (2021)
Bachtiger, P., et al.: Artificial intelligence, data sensors and interconnectivity: future opportunities for heart failure. Card. Fail. Rev. 6 (2020)
Brandfonbrener, D., Whitney, W., Ranganath, R., Bruna, J.: Offline RL without off-policy evaluation. In: Advances in Neural Information Processing Systems, vol. 34, pp. 4933–4946 (2021)
Burt, D.R., Ober, S.W., Garriga-Alonso, A., van der Wilk, M.: Understanding variational inference in function-space. arXiv preprint: arXiv:2011.09421 (2020)
Dasari, S., et al.: RoboNet: Large-scale multi-robot learning. In: Conference on Robot Learning, pp. 885–897. PMLR (2020)
Degris, T., White, M., Sutton, R.S.: Off-policy actor-critic. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012 (2012)
D’Angelo, F., Fortuin, V.: Repulsive deep ensembles are Bayesian. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 3451–3465. Curran Associates, Inc. (2021)
Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with Gaussian processes. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208 (2005)
Fujimoto, S., Gu, S.S.: A minimalist approach to offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 34, pp. 20132–20145 (2021)
Fujimoto, S., Meger, D., Precup, D.: Off-policy deep reinforcement learning without exploration. In: International Conference on Machine Learning, pp. 2052–2062. PMLR (2019)
Hafner, D., Tran, D., Lillicrap, T., Irpan, A., Davidson, J.: Noise contrastive priors for functional uncertainty. In: Uncertainty in Artificial Intelligence, pp. 905–914. PMLR (2020)
Hensman, J., Fusi, N., Lawrence, N.D.: Gaussian processes for big data. arXiv preprint: arXiv:1309.6835 (2013)
Huang, Z., Wu, J., Lv, C.: Efficient deep reinforcement learning with imitative expert priors for autonomous driving. IEEE Trans. Neural Netw. Learn. Syst. (2022)
Kalashnikov, D., et al.: Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, pp. 651–673. PMLR (2018)
Kendall, A., et al.: Learning to drive in a day. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 8248–8254. IEEE (2019)
Kidambi, R., Rajeswaran, A., Netrapalli, P., Joachims, T.: MOReL: model-based offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 33, pp. 21810–21823 (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint: arXiv:1412.6980 (2014)
Komorowski, M., Celi, L.A., Badawi, O., Gordon, A.C., Faisal, A.A.: The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 24(11), 1716–1720 (2018)
Kostrikov, I., Fergus, R., Tompson, J., Nachum, O.: Offline reinforcement learning with Fisher divergence critic regularization. In: International Conference on Machine Learning, pp. 5774–5783. PMLR (2021)
Kostrikov, I., Nair, A., Levine, S.: Offline reinforcement learning with implicit Q-learning. In: International Conference on Learning Representations (2022)
Kumar, A., Fu, J., Soh, M., Tucker, G., Levine, S.: Stabilizing off-policy Q-learning via bootstrapping error reduction. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Kumar, A., Zhou, A., Tucker, G., Levine, S.: Conservative Q-learning for offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1179–1191 (2020)
Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Ma, C., Hernández-Lobato, J.M.: Functional variational inference based on stochastic process generators. In: Advances in Neural Information Processing Systems, vol. 34, pp. 21795–21807 (2021)
Matsushima, T., Furuta, H., Matsuo, Y., Nachum, O., Gu, S.: Deployment-efficient reinforcement learning via model-based offline optimization. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net (2021)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Ober, S.W., Rasmussen, C.E., van der Wilk, M.: The promises and pitfalls of deep kernel learning. In: Uncertainty in Artificial Intelligence, pp. 1206–1216. PMLR (2021)
Osband, I., Aslanides, J., Cassirer, A.: Randomized prior functions for deep reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 31 (2018)
Ovadia, Y., et al.: Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
Shi, L., Li, G., Wei, Y., Chen, Y., Chi, Y.: Pessimistic Q-learning for offline reinforcement learning: towards optimal sample complexity. In: International Conference on Machine Learning, pp. 19967–20025. PMLR (2022)
Sinha, S., Mandlekar, A., Garg, A.: S4RL: surprisingly simple self-supervision for offline reinforcement learning in robotics. In: Conference on Robot Learning, pp. 907–917. PMLR (2022)
Sun, S., Zhang, G., Shi, J., Grosse, R.: Functional variational Bayesian neural networks. arXiv preprint: arXiv:1903.05779 (2019)
Titsias, M.: Variational learning of inducing variables in sparse Gaussian processes. In: Artificial Intelligence and Statistics, pp. 567–574. PMLR (2009)
Touati, A., Satija, H., Romoff, J., Pineau, J., Vincent, P.: Randomized value functions via multiplicative normalizing flows. In: Uncertainty in Artificial Intelligence, pp. 422–432. PMLR (2020)
Van Amersfoort, J., Smith, L., Jesson, A., Key, O., Gal, Y.: On feature collapse and deep kernel learning for single forward pass uncertainty. arXiv preprint: arXiv:2102.11409 (2021)
Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P.: Deep kernel learning. In: Artificial Intelligence and Statistics, pp. 370–378. PMLR (2016)
Xiao, T., Wang, D.: A general offline reinforcement learning framework for interactive recommendation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 4512–4520 (2021)
Yu, T., et al.: MOPO: model-based offline policy optimization. In: Advances in Neural Information Processing Systems, vol. 33, pp. 14129–14142 (2020)
Acknowledgments
FV was funded by the Department of Computing, Imperial College London. AF was supported by a UKRI Turing AI Fellowship (EP/V025499/1).
Appendices
Appendix 1
Prior mean and covariance derivation
The prior mean and covariance of the observed rewards can be deduced by writing \(R_i = \sum _k A_{ik}Q(x_i^{(k)})\), with the same definitions of A and x as given in Eqs. 4 and 5, and then considering \(\mathbb {E}(R_i)\) and \(\text {cov}(R_i,R_j)\).
For the mean, by linearity of expectation we have
\(\mathbb {E}(R_i) = \sum _k A_{ik}\,\mathbb {E}\bigl (Q(x_i^{(k)})\bigr ),\)
and for the covariance, by bilinearity,
\(\text {cov}(R_i,R_j) = \sum _{k,l} A_{ik}A_{jl}\,\text {cov}\bigl (Q(x_i^{(k)}),Q(x_j^{(l)})\bigr ) = \sum _{k,l} A_{ik}A_{jl}\,k_Q(x_i^{(k)},x_j^{(l)}),\)
as required.
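To make this construction concrete, the following sketch (our own illustration rather than the paper’s implementation; the RBF base kernel, the array shapes and the helper names are assumptions) builds the induced prior mean and covariance of the observed rewards from a base value kernel \(k_Q\) and the matrix A by unrolling the inputs \(x_i^{(k)}\).

```python
import numpy as np

def rbf_kernel(x1, x2, variance=1.0, lengthscale=0.25):
    """Illustrative base kernel k_Q on flattened inputs (an assumption for this sketch)."""
    sq_dist = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def reward_prior(A, x, mean_Q):
    """Prior mean and covariance of observed rewards R_i = sum_k A[i, k] * Q(x[i, k]).

    A:      (N, K) coefficient matrix
    x:      (N, K, D) array of inputs x_i^{(k)}
    mean_Q: callable returning the prior mean of Q at each flattened input
    """
    N, K, D = x.shape
    x_flat = x.reshape(N * K, D)               # unroll the inputs x_i^{(k)}
    m_flat = mean_Q(x_flat)                    # prior mean of Q, shape (N*K,)
    K_Q = rbf_kernel(x_flat, x_flat)           # base covariance, shape (N*K, N*K)
    A_block = np.zeros((N, N * K))
    for i in range(N):
        A_block[i, i * K:(i + 1) * K] = A[i]   # block layout matching the unrolled inputs
    mean_R = A_block @ m_flat                  # E[R_i] = sum_k A_ik E[Q(x_i^{(k)})]
    cov_R = A_block @ K_Q @ A_block.T          # cov(R_i, R_j) = sum_{k,l} A_ik A_jl k_Q(...)
    return mean_R, cov_R
```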
Validity of resulting kernel
To show that k is a valid kernel for any policy, we will deduce that the matrix with elements i, j given by \(k_{ij} = \sum _{k,l} A_{ik}A_{jl}\,k_Q(x_i^{(k)},x_j^{(l)})\)
is positive semidefinite (and, trivially, symmetric in i and j) for any given valid kernel \(k_Q\). This is equivalent to the statement that for any vector with elements \(v_i\), the dot product \(\sum _{i,j}v_i k_{ij} v_j \ge 0\) given that for any \(x'\) and \(v'\), \(\sum _{m,n}v'_mk_Q(x'_{m},x'_{n})v'_n \ge 0\).
Indeed,
\(\sum _{i,j} v_i k_{ij} v_j = \sum _{i,j,k,l} (v_i A_{ik})\,k_Q(x_i^{(k)},x_j^{(l)})\,(A_{jl} v_j) = \sum _{m,n} v'_m\,k_Q(x'_m,x'_n)\,v'_n \ge 0,\)
as required, where we have ‘unrolled’ the matrix \(v_i A_{ik}\) into the vector \(v'_m\), with \(x_i^{(k)}\) the corresponding \(x'_m\) values, and replaced summation over i, k with summation over m and, correspondingly, summation over j, l with summation over n.
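As a quick numerical sanity check of this argument (illustrative only; the random coefficients, inputs and RBF base kernel are placeholders), the sketch below assembles the induced matrix \(k_{ij}\) for an arbitrary A and confirms that its eigenvalues are non-negative.

```python
import numpy as np

# The induced kernel A K_Q A^T inherits positive semidefiniteness from the base kernel K_Q.
rng = np.random.default_rng(0)
N, K, D = 5, 3, 2
x = rng.normal(size=(N * K, D))                        # unrolled inputs x'_m
sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
K_Q = np.exp(-0.5 * sq_dist / 0.25 ** 2)               # a valid base kernel (RBF Gram matrix)

A = rng.normal(size=(N, K))                            # arbitrary coefficients
A_block = np.zeros((N, N * K))
for i in range(N):
    A_block[i, i * K:(i + 1) * K] = A[i]               # unrolled structure of v_i A_ik

K_R = A_block @ K_Q @ A_block.T                        # induced reward-space kernel matrix
print(np.linalg.eigvalsh(K_R).min() >= -1e-10)         # True: positive semidefinite
```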
Appendix 2
The GP employed has a prior mean equal to 0 and a base covariance for the latent (value) variables that factors into an RBF term in state space and a Kronecker delta in action space (as actions are discrete, we assume independence of the value across the different actions), analogously to Engel et al. [2005]:
\(k_Q\bigl ((s,a),(s',a')\bigr ) = \sigma _p^2 \exp \left( -\frac{\Vert s - s'\Vert ^2}{2l^2} \right) \delta _{a,a'},\)
where we chose \(\sigma _p^2 = 1\) and \(l=0.25\) for the experiments presented.
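For reference, a minimal sketch of this base covariance follows; the function name and argument layout are our own, while the RBF-times-Kronecker-delta form and the values \(\sigma _p^2 = 1\), \(l = 0.25\) come from the text.

```python
import numpy as np

def value_prior_kernel(s1, a1, s2, a2, sigma_p2=1.0, lengthscale=0.25):
    """Base covariance k_Q((s, a), (s', a')): RBF in state space times a
    Kronecker delta over the discrete actions, as described above."""
    rbf = sigma_p2 * np.exp(-np.sum((np.asarray(s1) - np.asarray(s2)) ** 2)
                            / (2.0 * lengthscale ** 2))
    return rbf if a1 == a2 else 0.0
```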
The DQN trained to produce Fig. 1b has two hidden layers of 256 neurons each, a batch size of 128, and the Adam optimiser [Kingma and Ba, 2014] with a learning rate of 0.001; it was trained for 100k gradient steps.
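A schematic of this baseline configuration is sketched below (hidden-layer widths, batch size, optimiser, learning rate and gradient-step count are from the text; the state and action dimensions and any training-loop details are placeholders).

```python
import torch
import torch.nn as nn

state_dim, n_actions = 1, 2   # placeholder dimensions for the toy environment

q_network = nn.Sequential(
    nn.Linear(state_dim, 256), nn.ReLU(),   # first hidden layer, 256 neurons
    nn.Linear(256, 256), nn.ReLU(),         # second hidden layer, 256 neurons
    nn.Linear(256, n_actions),              # one Q-value per discrete action
)
optimiser = torch.optim.Adam(q_network.parameters(), lr=1e-3)  # Adam, lr = 0.001

BATCH_SIZE = 128
NUM_GRADIENT_STEPS = 100_000
```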
Here we report the same results as in Figs. 1c and 1d, but over a wider range of the state space. Even outside the main region of interest, in regions where no data are present, the posterior value mean and standard deviation revert to the prior mean and standard deviation, i.e. a low value and a higher standard deviation, as is desirable for unsupported regions in the offline RL setting.