Safe Policy Improvement with Soft Baseline Bootstrapping

Conference paper

In: Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11908)

Abstract

Batch Reinforcement Learning (Batch RL) consists of training a policy on trajectories collected with another policy, called the behavioural policy. Safe policy improvement (SPI) provides high-probability guarantees that the trained policy performs better than the behavioural policy, also called the baseline in this setting. Previous work shows that the SPI objective improves mean performance compared to the basic RL objective, which boils down to solving the MDP estimated by maximum likelihood (Laroche et al. 2019). Here, we build on that work and improve more specifically the SPI with Baseline Bootstrapping algorithm (SPIBB) by allowing the policy search over a wider set of policies. Instead of classifying the state-action pairs into two sets (the uncertain ones and the safe-to-train-on ones), we adopt a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty. The method can take more risk on uncertain actions while remaining provably safe, and is therefore less conservative than the state-of-the-art methods. We propose two algorithms (one optimal and one approximate) to solve this constrained optimization problem, and empirically show a significant improvement over existing SPI algorithms, both on finite MDPs and on infinite MDPs with neural network function approximation.

K. Nadjahi and R. Laroche—Equal contribution.

K. Nadjahi—Work done while interning at Microsoft Research Montréal.

Finite MDPs code available at https://github.com/RomainLaroche/SPIBB.

SPIBB-DQN code available at https://github.com/rems75/SPIBB-DQN.
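
To make the constrained policy search described in the abstract concrete, here is a minimal, illustrative sketch of a single per-state improvement step under a Soft-SPIBB-style constraint, where the error-weighted change from the baseline policy is kept within a budget epsilon. This is not the authors' reference implementation (see the repositories above for that): the function names (soft_spibb_step, err_bound), the greedy mass-shifting heuristic, and the count-based error bound are assumptions chosen for readability.

# Illustrative sketch (not the reference implementation) of a per-state
# policy-improvement step under a Soft-SPIBB-style constraint:
#     sum_a  e(s, a) * |pi(a|s) - pi_b(a|s)|  <=  epsilon
# where e(s, a) is an error bound on the estimated Q-value, shrinking with
# the number of times (s, a) appears in the batch.
import numpy as np

def err_bound(counts, delta=0.05):
    """Hypothetical count-based error bound: decreases as N(s, a) grows."""
    return np.sqrt(2.0 * np.log(2.0 / delta) / np.maximum(counts, 1))

def soft_spibb_step(pi_b, q, e, epsilon):
    """Greedily move probability mass from low-Q to high-Q actions while
    keeping the error-weighted deviation from the baseline within epsilon."""
    pi = pi_b.copy()
    budget = epsilon
    a_best = int(np.argmax(q))             # destination of all moved mass
    for a in np.argsort(q):                # sources, worst Q first
        if a == a_best or budget <= 0:
            continue
        cost_per_unit = e[a] + e[a_best]   # moving mass changes both terms
        mass = min(pi[a], budget / cost_per_unit)
        pi[a] -= mass
        pi[a_best] += mass
        budget -= mass * cost_per_unit
    return pi

# Toy usage: 4 actions, uniform baseline, one poorly observed action.
pi_b = np.full(4, 0.25)
q = np.array([0.1, 0.5, 0.2, 0.9])
counts = np.array([50, 50, 50, 2])         # action 3 is poorly covered
pi = soft_spibb_step(pi_b, q, err_bound(counts), epsilon=0.5)
print(pi, pi.sum())                        # still a valid distribution

In the paper's setting a step of this kind would sit inside a policy-iteration loop with q estimated from the batch; the toy call illustrates the intended behaviour, namely that mass moved toward a poorly covered action is automatically limited by that action's large error bound.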


References

  • Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS) (2016)

  • Borgwardt, K.H.: The Simplex Method: A Probabilistic Analysis. Springer, Heidelberg (1987). https://doi.org/10.1007/978-3-642-61578-8

  • Burda, Y., Edwards, H., Storkey, A., Klimov, O.: Exploration by random network distillation. In: Proceedings of the 7th International Conference on Learning Representations (ICLR) (2019)

  • Dantzig, G.: Linear Programming and Extensions. Rand Corporation Research Study. Princeton Univ. Press, Princeton (1963)

  • Dantzig, G.B., Thapa, M.N.: Linear Programming 2: Theory and Extensions. Springer, New York (2003). https://doi.org/10.1007/b97283

  • Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. J. Mach. Learn. Res. 6, 503–556 (2005)

  • Fox, L., Choshen, L., Loewenstein, Y.: Dora the explorer: directed outreaching reinforcement action-selection. In: Proceedings of the 6th International Conference on Learning Representations (ICLR) (2018)

  • Geist, M., Scherrer, B., Pietquin, O.: A theory of regularized Markov decision processes. In: Proceedings of the 36th International Conference on Machine Learning (ICML) (2019)

  • Gondzio, J.: Interior point methods 25 years later. Eur. J. Oper. Res. 218(3), 587–601 (2012)

  • Guez, A., Vincent, R.D., Avoli, M., Pineau, J.: Adaptive treatment of epilepsy via batch-mode reinforcement learning. In: Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pp. 1671–1678 (2008)

  • He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852 (2015)

  • Iyengar, G.N.: Robust dynamic programming. Math. Oper. Res. 30(2), 257–280 (2005)

  • Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning (ICML), vol. 2, pp. 267–274 (2002)

  • Klee, V., Minty, G.J.: How good is the simplex algorithm? In: Shisha, O. (ed.) Inequalities, vol. III, pp. 159–175. Academic Press, New York (1972)

  • Lange, S., Gabel, T., Riedmiller, M.: Batch reinforcement learning. In: Wiering, M., van Otterlo, M. (eds.) Reinforcement Learning. Adaptation, Learning, and Optimization, vol. 12, pp. 45–73. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-27645-3_2

  • Laroche, R., Trichelair, P., Tachet des Combes, R.: Safe policy improvement with baseline bootstrapping. In: Proceedings of the 36th International Conference on Machine Learning (ICML) (2019)

  • Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529 (2015)

  • Nilim, A., El Ghaoui, L.: Robust control of Markov decision processes with uncertain transition matrices. Oper. Res. 53(5), 780–798 (2005)

  • Paszke, A., et al.: Automatic differentiation in PyTorch. In: NIPS-W (2017)

  • Petrik, M., Ghavamzadeh, M., Chow, Y.: Safe policy improvement by minimizing robust baseline regret. In: Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS) (2016)

  • Riedmiller, M.: Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 317–328. Springer, Heidelberg (2005). https://doi.org/10.1007/11564096_32

  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: Proceedings of the 32nd International Conference on Machine Learning (ICML) (2015)

  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  • Simão, T.D., Spaan, M.T.J.: Safe policy improvement with baseline bootstrapping in factored environments. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (2019)

  • Singh, S.P., Kearns, M.J., Litman, D.J., Walker, M.A.: Reinforcement learning for spoken dialogue systems. In: Proceedings of the 13th Advances in Neural Information Processing Systems (NIPS), pp. 956–962 (1999)

  • Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. The MIT Press, Cambridge (1998)

  • Thomas, P.S.: Safe reinforcement learning. Ph.D. thesis, University of Massachusetts Amherst (2015)

  • Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4(2), 26–31 (2012)

  • van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double q-learning. CoRR, abs/1509.06461 (2015)

  • Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., Weinberger, M.J.: Inequalities for the L1 deviation of the empirical distribution. Technical report, Hewlett-Packard Labs (2003)

Author information


Correspondence to Kimia Nadjahi or Romain Laroche.

Electronic supplementary material

Supplementary material 1 (PDF 924 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Nadjahi, K., Laroche, R., Tachet des Combes, R. (2020). Safe Policy Improvement with Soft Baseline Bootstrapping. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol. 11908. Springer, Cham. https://doi.org/10.1007/978-3-030-46133-1_4

  • DOI: https://doi.org/10.1007/978-3-030-46133-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46132-4

  • Online ISBN: 978-3-030-46133-1

  • eBook Packages: Computer Science (R0)
