
Towards Offline Reinforcement Learning with Pessimistic Value Priors

  • Conference paper
Epistemic Uncertainty in Artificial Intelligence (Epi UAI 2023)

Abstract

Offline reinforcement learning (RL) seeks to train agents in sequential decision-making tasks using only previously collected data, without directly interacting with the environment. As the agent tries to improve on the policy present in the dataset, it can introduce distributional shift between the training data and its proposed policy, which can lead to poor performance. To avoid the agent assigning high values to out-of-distribution actions, successful offline RL requires some form of conservatism to be introduced. Here we present a model-free inference framework that encodes this conservatism in the prior belief of the value function: by carrying out policy evaluation with a pessimistic prior, we ensure that only the actions directly supported by the offline dataset are modelled as having a high value. In contrast to other methods, we do not need to introduce heuristic policy constraints, value regularisation or uncertainty penalties to obtain successful offline RL policies in a toy environment. An additional consequence of our work is a principled quantification of Bayesian uncertainty in off-policy returns in model-free RL. While we present an implementation of this framework that verifies its behaviour in the exact-inference setting, with Gaussian processes on a toy problem, its scalability issues remain the central avenue for further work. We discuss these limitations in more detail and consider future directions for improving the scalability of this framework beyond the vanilla Gaussian process implementation, proposing a path towards improving offline RL algorithms in a principled way.


References

  • An, G., Moon, S., Kim, J.-H., Song, H.O.: Uncertainty-based offline reinforcement learning with diversified Q-ensemble. In: Advances in Neural Information Processing Systems, vol. 34, pp. 7436–7447 (2021)

  • Bachtiger, P., et al.: Artificial intelligence, data sensors and interconnectivity: future opportunities for heart failure. Card. Fail. Rev. 6 (2020)

  • Brandfonbrener, D., Whitney, W., Ranganath, R., Bruna, J.: Offline RL without off-policy evaluation. In: Advances in Neural Information Processing Systems, vol. 34, pp. 4933–4946 (2021)

  • Burt, D.R., Ober, S.W., Garriga-Alonso, A., van der Wilk, M.: Understanding variational inference in function-space. arXiv preprint arXiv:2011.09421 (2020)

  • Dasari, S., et al.: RoboNet: large-scale multi-robot learning. In: Conference on Robot Learning, pp. 885–897. PMLR (2020)

  • Degris, T., White, M., Sutton, R.S.: Off-policy actor-critic. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012 (2012)

  • D’Angelo, F., Fortuin, V.: Repulsive deep ensembles are Bayesian. In: Advances in Neural Information Processing Systems, vol. 34, pp. 3451–3465. Curran Associates, Inc. (2021)

  • Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with Gaussian processes. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208 (2005)

  • Fujimoto, S., Gu, S.S.: A minimalist approach to offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 34, pp. 20132–20145 (2021)

  • Fujimoto, S., Meger, D., Precup, D.: Off-policy deep reinforcement learning without exploration. In: International Conference on Machine Learning, pp. 2052–2062. PMLR (2019)

  • Hafner, D., Tran, D., Lillicrap, T., Irpan, A., Davidson, J.: Noise contrastive priors for functional uncertainty. In: Uncertainty in Artificial Intelligence, pp. 905–914. PMLR (2020)

  • Hensman, J., Fusi, N., Lawrence, N.D.: Gaussian processes for big data. arXiv preprint arXiv:1309.6835 (2013)

  • Huang, Z., Wu, J., Lv, C.: Efficient deep reinforcement learning with imitative expert priors for autonomous driving. IEEE Trans. Neural Netw. Learn. Syst. (2022)

  • Kalashnikov, D., et al.: Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, pp. 651–673. PMLR (2018)

  • Kendall, A., et al.: Learning to drive in a day. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 8248–8254. IEEE (2019)

  • Kidambi, R., Rajeswaran, A., Netrapalli, P., Joachims, T.: MOReL: model-based offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 33, pp. 21810–21823 (2020)

  • Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  • Komorowski, M., Celi, L.A., Badawi, O., Gordon, A.C., Faisal, A.A.: The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 24(11), 1716–1720 (2018)

  • Kostrikov, I., Fergus, R., Tompson, J., Nachum, O.: Offline reinforcement learning with Fisher divergence critic regularization. In: International Conference on Machine Learning, pp. 5774–5783. PMLR (2021)

  • Kostrikov, I., Nair, A., Levine, S.: Offline reinforcement learning with implicit Q-learning. In: International Conference on Learning Representations (2022)

  • Kumar, A., Fu, J., Soh, M., Tucker, G., Levine, S.: Stabilizing off-policy Q-learning via bootstrapping error reduction. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

  • Kumar, A., Zhou, A., Tucker, G., Levine, S.: Conservative Q-learning for offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1179–1191 (2020)

  • Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  • Ma, C., Hernández-Lobato, J.M.: Functional variational inference based on stochastic process generators. In: Advances in Neural Information Processing Systems, vol. 34, pp. 21795–21807 (2021)

  • Matsushima, T., Furuta, H., Matsuo, Y., Nachum, O., Gu, S.: Deployment-efficient reinforcement learning via model-based offline optimization. In: International Conference on Learning Representations, ICLR 2021 (2021)

  • Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  • Ober, S.W., Rasmussen, C.E., van der Wilk, M.: The promises and pitfalls of deep kernel learning. In: Uncertainty in Artificial Intelligence, pp. 1206–1216. PMLR (2021)

  • Osband, I., Aslanides, J., Cassirer, A.: Randomized prior functions for deep reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 31 (2018)

  • Ovadia, Y., et al.: Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

  • Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)

  • Shi, L., Li, G., Wei, Y., Chen, Y., Chi, Y.: Pessimistic Q-learning for offline reinforcement learning: towards optimal sample complexity. In: International Conference on Machine Learning, pp. 19967–20025. PMLR (2022)

  • Sinha, S., Mandlekar, A., Garg, A.: S4RL: surprisingly simple self-supervision for offline reinforcement learning in robotics. In: Conference on Robot Learning, pp. 907–917. PMLR (2022)

  • Sun, S., Zhang, G., Shi, J., Grosse, R.: Functional variational Bayesian neural networks. arXiv preprint arXiv:1903.05779 (2019)

  • Titsias, M.: Variational learning of inducing variables in sparse Gaussian processes. In: Artificial Intelligence and Statistics, pp. 567–574. PMLR (2009)

  • Touati, A., Satija, H., Romoff, J., Pineau, J., Vincent, P.: Randomized value functions via multiplicative normalizing flows. In: Uncertainty in Artificial Intelligence, pp. 422–432. PMLR (2020)

  • Van Amersfoort, J., Smith, L., Jesson, A., Key, O., Gal, Y.: On feature collapse and deep kernel learning for single forward pass uncertainty. arXiv preprint arXiv:2102.11409 (2021)

  • Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P.: Deep kernel learning. In: Artificial Intelligence and Statistics, pp. 370–378. PMLR (2016)

  • Xiao, T., Wang, D.: A general offline reinforcement learning framework for interactive recommendation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 4512–4520 (2021)

  • Yu, T., et al.: MOPO: model-based offline policy optimization. In: Advances in Neural Information Processing Systems, vol. 33, pp. 14129–14142 (2020)


Acknowledgments

FV was funded by the Department of Computing, Imperial College London. AF was supported by a UKRI Turing AI Fellowship (EP/V025499/1).

Author information

Correspondence to Filippo Valdettaro.

Appendices

Appendix 1

Prior mean and covariance derivation

The prior mean and covariance of the observed rewards can be deduced by writing \(R_i = \sum _k A_{ik}Q(x_i^{(k)})\), with \(A\) and \(x\) defined as in Eqs. 4 and 5, and then considering \(\mathbb {E}(R_i)\) and \(\text {cov}(R_i,R_j)\).

For the mean, we have

$$\begin{aligned} \mathbb {E}(R_i) &= \mathbb {E}\sum _k A_{ik}Q(x_i^{(k)}) \end{aligned}$$
(11)
$$\begin{aligned} & = \sum _k A_{ik}\mathbb {E}Q(x_i^{(k)}) \end{aligned}$$
(12)
$$\begin{aligned} & = 0, \end{aligned}$$
(13)

and for covariance

$$\begin{aligned} \text {cov}(R_i,R_j) &= \text {cov}\left( \sum _k A_{ik}Q(x_i^{(k)}),\sum _l A_{jl}Q(x_j^{(l)})\right) \end{aligned}$$
(14)
$$\begin{aligned} &= \sum _{k,l}A_{ik}A_{jl}\text {cov}(Q(x_i^{(k)}),Q(x_j^{(l)})) \end{aligned}$$
(15)
$$\begin{aligned} &= \sum _{k,l}A_{ik}A_{jl} k_Q(x_i^{(k)},x_j^{(l)}), \end{aligned}$$
(16)

as required.
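As a sanity check of this derivation, the sketch below (not the paper's code; the coefficient matrix \(A\) and the augmented inputs of Eqs. 4 and 5 are replaced by arbitrary stand-ins drawn from a shared set of points) samples \(Q\) from a zero-mean GP prior, forms \(R = AQ\), and compares the empirical mean and covariance of \(R\) against zero and \(A K_Q A^\top\).

```python
import numpy as np

# Illustrative check of the derivation above (not the paper's code).
# A and the shared augmented inputs X are arbitrary stand-ins for Eqs. 4 and 5.

def k_q(x1, x2, sigma_p=1.0, length=0.25):
    """RBF covariance over scalar inputs, used purely for illustration."""
    d = x1[:, None] - x2[None, :]
    return sigma_p**2 * np.exp(-d**2 / (2 * length**2))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=8)        # shared augmented input points x^{(k)}
A = rng.normal(size=(5, 8))               # arbitrary stand-in coefficient matrix

K_Q = k_q(X, X)                           # prior covariance of Q at those points
K_R = A @ K_Q @ A.T                       # cov(R_i, R_j) = sum_{k,l} A_ik A_jl k_Q(x^{(k)}, x^{(l)})

# Monte Carlo check: sample Q from its prior, map through A, compare moments.
Q_samples = rng.multivariate_normal(np.zeros(len(X)), K_Q, size=200_000)
R_samples = Q_samples @ A.T
print(np.abs(R_samples.mean(axis=0)).max())                 # ~0: matches the zero prior mean
print(np.abs(np.cov(R_samples, rowvar=False) - K_R).max())  # ~0: matches the derived covariance
```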

Validity of resulting kernel

To show that k is a valid kernel for any policy, we show that the matrix whose (i, j) entry is given by

$$\begin{aligned} k(\hat{x}_i,\hat{x}_j) = k_{ij} = \sum _{0\le k,l \le n} A_{ik}A_{jl} k_Q(x_{i}^{(k)},x_{j}^{(l)}) \end{aligned}$$
(17)

is positive semidefinite (and, trivially, symmetric in i and j) for any valid kernel \(k_Q\). Equivalently, we must show that the quadratic form \(\sum _{i,j}v_i k_{ij} v_j \ge 0\) for any vector with elements \(v_i\), given that \(\sum _{m,n}v'_mk_Q(x'_{m},x'_{n})v'_n \ge 0\) for any inputs \(x'\) and vector \(v'\).

$$\begin{aligned} \sum _{i,j}v_i k_{ij} v_j &= \sum _{i,j, k,l}v_i A_{ik}v_jA_{jl} k_Q(x_{i}^{(k)},x_{j}^{(l)}) \end{aligned}$$
(18)
$$\begin{aligned} &= \sum _{(i,j), (k,l)}(v_i A_{ik})(v_jA_{jl}) k_Q(x_{i}^{(k)},x_{j}^{(l)}) \end{aligned}$$
(19)
$$\begin{aligned} &= \sum _{m,n}v'_m v'_n k_Q(x'_{m},x'_{n}) \end{aligned}$$
(20)
$$\begin{aligned} &\ge 0 \end{aligned}$$
(21)

as required, where we have ‘unrolled’ the matrix with entries \(v_i A_{ik}\) into the vector \(v'_m\), with \(x_i^{(k)}\) the corresponding \(x'_m\) values, replacing summation over the index pair \((i,k)\) with summation over \(m\) and, correspondingly, summation over \((j,l)\) with summation over \(n\).
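The same argument can be checked numerically. In the simplified setting where all data points share the same augmented inputs (so that the induced matrix is exactly \(A K_Q A^\top\)), the sketch below, with arbitrary stand-ins for \(A\) and the inputs, confirms that the induced kernel matrix has no negative eigenvalues beyond round-off; the general case follows from the unrolling above.

```python
import numpy as np

# Numerical companion to the positive-semidefiniteness argument:
# v^T K_R v = (A^T v)^T K_Q (A^T v) >= 0, so K_R = A K_Q A^T has no
# negative eigenvalues whenever K_Q is a valid (PSD) kernel matrix.

def k_q(x1, x2, sigma_p=1.0, length=0.25):
    d = x1[:, None] - x2[None, :]
    return sigma_p**2 * np.exp(-d**2 / (2 * length**2))

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=8)        # stand-in augmented inputs
A = rng.normal(size=(5, 8))               # stand-in policy-dependent coefficients

K_R = A @ k_q(X, X) @ A.T
print(np.linalg.eigvalsh(K_R).min() >= -1e-9)   # True, up to numerical round-off
```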

Appendix 2

The GP employed has a prior mean of 0 and a base covariance over the latent (value) variables that factorises into an RBF term in state space and a Kronecker delta in action space (as actions are discrete, we assume independence of the value across the different actions), analogously to Engel et al. [2005]:

$$\begin{aligned} \text {cov}(Q(s_i,a_i),Q(s_j,a_j)) = k_Q((s_i,a_i),(s_j,a_j)) = \sigma _p^2\exp \left( -\frac{(s_i-s_j)^2}{2l^2}\right) \delta _{a_i a_j}, \end{aligned}$$
(22)

where we chose \(\sigma _p^2 = 1\) and \(l=0.25\) for the experiments presented.
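For concreteness, a minimal sketch of this covariance function with the hyperparameters above; the function name and scalar-state signature are illustrative choices, not the paper's implementation:

```python
import numpy as np

# Sketch of the base covariance of Eq. 22: an RBF term over (scalar) states
# multiplied by a Kronecker delta over discrete actions, with sigma_p^2 = 1
# and l = 0.25 as reported above.

def k_q(s_i, a_i, s_j, a_j, sigma_p=1.0, length=0.25):
    rbf = sigma_p**2 * np.exp(-(s_i - s_j)**2 / (2 * length**2))
    return rbf * (a_i == a_j)   # delta_{a_i a_j}: values of different actions are independent a priori

# Same state, same action -> prior variance sigma_p^2; different action -> 0.
print(k_q(0.3, 0, 0.3, 0))      # 1.0
print(k_q(0.3, 0, 0.3, 1))      # 0.0
```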

The DQN trained to produce Fig. 1b has two hidden layers of 256 neurons and was trained with a batch size of 128, using the Adam optimiser [Kingma and Ba, 2014] with a learning rate of 0.001, for 100k gradient steps.
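A rough PyTorch sketch of that configuration follows; only the hyperparameters stated above come from the text, while the framework, ReLU activations, state dimension and action count are assumptions, and the environment, replay buffer and training loop are omitted.

```python
import torch
import torch.nn as nn

# Hedged sketch of the reported DQN configuration. ReLU activations,
# state_dim and n_actions are illustrative assumptions, not from the paper.

state_dim, n_actions = 1, 2               # placeholders for the toy environment

q_net = nn.Sequential(                    # two hidden layers of 256 neurons
    nn.Linear(state_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, n_actions),
)
optimiser = torch.optim.Adam(q_net.parameters(), lr=1e-3)  # Adam, learning rate 0.001
batch_size = 128
n_gradient_steps = 100_000
```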

Here we report the same results as in Figs. 1c and 1d, but over a wider range of the state space. Even outside the main region of interest, in regions where no data is present, the posterior value mean and standard deviation revert to the prior mean and standard deviation, i.e. a low value and a higher standard deviation, which is the desirable behaviour for unsupported regions in the offline RL setting.

[Figure: posterior value mean and standard deviation over a wider state-space range than Figs. 1c and 1d]
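This reversion to the prior can be reproduced with a generic zero-mean GP regression sketch on synthetic one-dimensional data; the data and targets below are arbitrary stand-ins and only illustrate the behaviour far from the data support.

```python
import numpy as np

# Generic GP regression sketch: far from the training inputs the posterior
# mean and standard deviation revert to the prior (0 and sigma_p = 1 here).
# The synthetic data below are arbitrary stand-ins for the offline dataset.

def k_q(x1, x2, sigma_p=1.0, length=0.25):
    d = x1[:, None] - x2[None, :]
    return sigma_p**2 * np.exp(-d**2 / (2 * length**2))

X_train = np.linspace(-0.5, 0.5, 20)
y_train = np.sin(3.0 * X_train)                            # arbitrary synthetic targets
K = k_q(X_train, X_train) + 1e-2 * np.eye(len(X_train))    # small observation noise

x_star = np.array([5.0])                                   # far outside the data support
k_star = k_q(x_star, X_train)

post_mean = k_star @ np.linalg.solve(K, y_train)
post_var = k_q(x_star, x_star) - k_star @ np.linalg.solve(K, k_star.T)
print(post_mean)                                           # ~0: reverts to the prior mean
print(np.sqrt(post_var))                                   # ~1: reverts to the prior standard deviation
```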


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Valdettaro, F., Faisal, A.A. (2024). Towards Offline Reinforcement Learning with Pessimistic Value Priors. In: Cuzzolin, F., Sultana, M. (eds) Epistemic Uncertainty in Artificial Intelligence. Epi UAI 2023. Lecture Notes in Computer Science, vol. 14523. Springer, Cham. https://doi.org/10.1007/978-3-031-57963-9_7


  • DOI: https://doi.org/10.1007/978-3-031-57963-9_7


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-57962-2

  • Online ISBN: 978-3-031-57963-9

  • eBook Packages: Computer Science (R0)
