Abstract
We consider model selection for classic Reinforcement Learning (RL) environments – Multi-Armed Bandits (MABs) and Markov Decision Processes (MDPs) – under general function approximations. In the model selection framework, we do not know the function classes, denoted by \(\mathcal {F}\) and \(\mathcal {M}\), in which the true models – the reward generating function for MABs and the transition kernel for MDPs – lie, respectively. Instead, we are given M nested function (hypothesis) classes such that the true models are contained in at least one such class. In this paper, we propose and analyze efficient model selection algorithms for MABs and MDPs that adapt to the smallest function class (among the M nested classes) containing the true underlying model. Under a separability assumption on the nested hypothesis classes, we show that the cumulative regret of our adaptive algorithms matches that of an oracle which knows the correct function classes (i.e., \(\mathcal {F}\) and \(\mathcal {M}\)) a priori. Furthermore, for both settings, we show that the cost of model selection is an additive term in the regret with only weak (logarithmic) dependence on the learning horizon T.
A. Ghosh and S. R. Chowdhury contributed equally.
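To make the nested-classes setup concrete, here is a toy, hypothetical sketch in Python – not the algorithm analyzed in the paper. It fits the best model in each of M nested polynomial classes (a stand-in for \(\mathcal {F}_1 \subset \dots \subset \mathcal {F}_M\)) and returns the smallest class whose in-class residual is consistent with the noise level, mimicking adaptation to the smallest class containing the true model. The fit routine, the acceptance test, and the `slack` threshold are illustrative assumptions.

```python
# A toy, hypothetical illustration of adapting to the smallest of M nested
# hypothesis classes (here: polynomials of increasing degree). This is NOT
# the paper's algorithm; the fit routine and acceptance test are assumptions.
import numpy as np

def fit_best_in_class(X, y, degree):
    """Least-squares fit within the class of degree-`degree` polynomials,
    a stand-in for the nested classes F_1 subset F_2 subset ... subset F_M."""
    coeffs = np.polyfit(X, y, deg=degree)
    residual = np.mean((np.polyval(coeffs, X) - y) ** 2)
    return coeffs, residual

def smallest_consistent_class(X, y, max_degree, noise_var, slack=2.0):
    """Return the index of the smallest class whose best in-class fit
    explains the data up to the noise level."""
    for d in range(1, max_degree + 1):
        _, residual = fit_best_in_class(X, y, d)
        if residual <= slack * noise_var:  # fit consistent with pure noise
            return d
    return max_degree

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=500)
y = 2.0 * X**3 - X + rng.normal(scale=0.1, size=X.shape)  # true model: degree 3
print(smallest_consistent_class(X, y, max_degree=5, noise_var=0.01))  # -> 3
```

Classes of degree 1 and 2 leave a large residual on the cubic signal and are rejected, so the procedure settles on degree 3, the smallest class containing the true model.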
Notes
- 1. Here the roles of \(x_1\) and \(x_2\) are interchangeable without loss of generality.
- 2. We assume that the action set \(\mathcal {X}\) is compact and continuous, and so such action pairs \((x_1,x_2)\) always exist, i.e., given any \(x_1 \in \mathcal {X}\), an action \(x_2\) such that \(D^*(x_1,x_2) \le \eta \) always exists.
- 3. This can be found using a standard trick such as the doubling trick (see the sketch after these notes).
- 4. For any \(\alpha > 0\), we call \(\mathcal {F}^\alpha \) an \((\alpha ,\left\Vert \cdot \right\Vert _{\infty })\) cover of the function class \(\mathcal {F}\) if for any \(f \in \mathcal {F}\) there exists an \(f' \in \mathcal {F}^\alpha \) such that \(\left\Vert f' - f\right\Vert _{\infty }:=\sup _{x \in \mathcal {X}}|f'(x)-f(x)|\le \alpha \).
- 5. We can extend the range to \([0, c]\) without loss of generality.
- 6. One can choose \(\delta = 1/\text {poly}(M)\) to obtain a high-probability bound, which only adds an extra \(\log M\) factor.
- 7. For any \(\alpha > 0\), \(\mathcal {P}^\alpha \) is an \((\alpha ,\left\Vert \cdot \right\Vert _{\infty ,1})\) cover of \(\mathcal {P}\) if for any \(P \in \mathcal {P}\) there exists a \(P' \in \mathcal {P}^\alpha \) such that \(\left\Vert P' - P\right\Vert _{\infty ,1}:=\sup _{s,a}\int _{\mathcal {S}}|P'(s'|s,a)-P(s'|s,a)|\,ds' \le \alpha \).
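To illustrate the doubling trick mentioned in note 3, here is a minimal, hypothetical sketch – not the paper's method. When a quantity the algorithm must be tuned to (here, the horizon \(T\)) is unknown, one restarts a base algorithm over epochs whose guessed horizons double; each epoch's regret is bounded as if its guess were correct, and there are only \(O(\log T)\) epochs, so the overall bound degrades by at most a constant factor. The callable `run_alg` is a placeholder.

```python
# A minimal, hypothetical sketch of the doubling trick from note 3 -- not the
# paper's algorithm. When the horizon T is unknown, restart a base algorithm
# over epochs whose guessed horizons double: 1, 2, 4, ...
def with_doubling(run_alg, total_rounds):
    """Run `run_alg(horizon)` over doubling epochs until `total_rounds`
    rounds have elapsed; returns the number of rounds played per epoch."""
    played, epochs, guess = 0, [], 1
    while played < total_rounds:
        horizon = min(guess, total_rounds - played)  # truncate the last epoch
        run_alg(horizon)   # base algorithm tuned for a `horizon`-round run
        epochs.append(horizon)
        played += horizon
        guess *= 2
    return epochs

# Example: 100 rounds split into epochs of length 1, 2, 4, 8, 16, 32, 37.
print(with_doubling(lambda horizon: None, 100))
```

Similarly, the cover definitions in notes 4 and 7 can be checked numerically on a simple class. The sketch below (an assumption for illustration, with all names hypothetical) covers the class \(\mathcal {F} = \{f_\theta (x) = \theta x : \theta \in [0,1]\}\) on \(x \in [0,1]\) by an \(\alpha \)-spaced parameter grid; here \(\left\Vert f_\theta - f_{\theta '}\right\Vert _{\infty } = |\theta - \theta '|\), so the grid is an \((\alpha ,\left\Vert \cdot \right\Vert _{\infty })\) cover.

```python
# A tiny, hypothetical numerical check of the cover definition in notes 4 and 7
# for the class F = {f_theta(x) = theta * x : theta in [0, 1]} on x in [0, 1].
import numpy as np

alpha = 0.1
grid = np.arange(0.0, 1.0 + alpha, alpha)   # parameters of the cover F^alpha
xs = np.linspace(0.0, 1.0, 1001)            # dense grid to approximate sup_x

for theta in np.random.default_rng(1).uniform(0.0, 1.0, size=100):
    # distance from f_theta to its nearest cover element, in sup-norm over xs
    dist = min(np.max(np.abs(theta * xs - tp * xs)) for tp in grid)
    assert dist <= alpha, (theta, dist)
print("every f in F is within alpha of some f' in F^alpha")
```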
Acknowledgements
We thank the anonymous reviewers for their useful comments. We would also like to thank Prof. Kannan Ramchandran (EECS, UC Berkeley) for insightful discussions on the topic of model selection. SRC gratefully acknowledges support from a CISE postdoctoral fellowship at Boston University.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ghosh, A., Chowdhury, S.R. (2023). Model Selection in Reinforcement Learning with General Function Approximations. In: Amini, M.R., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds.) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science, vol. 13716. Springer, Cham. https://doi.org/10.1007/978-3-031-26412-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26411-5
Online ISBN: 978-3-031-26412-2