A Framework for Computing Bounds for the Return of a Policy
We present a framework for computing bounds on the return of a policy in finite-horizon, continuous-state Markov Decision Processes with bounded state transitions. The state transition bounds can be based either on prior knowledge alone or on a combination of prior knowledge and data. Our framework uses a piecewise-constant representation of the return bounds and a backward iteration process. We instantiate this framework for a previously investigated type of prior knowledge, namely Lipschitz continuity of the transition function. In this context, we show that the existing bounds of Fonteneau et al. (2009, 2010) can be expressed as a particular instantiation of our framework, by bounding the immediate rewards using Lipschitz continuity and choosing a particular form for the regions in the piecewise-constant representation. We also show how different instantiations of our framework can improve upon their bounds.
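The abstract's core idea — a piecewise-constant lower bound on the return, propagated by backward iteration using Lipschitz continuity of the dynamics and reward — can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the paper's actual construction: a hypothetical 1-D MDP on [0, 1] with the policy already composed into the dynamics, known Lipschitz constants, and uniform cells, where each cell stores one constant lower bound on the remaining return.

```python
# Hypothetical 1-D finite-horizon MDP, policy composed into the dynamics.
# Illustrative choices (not from the paper): f(x) = 0.5*x with Lipschitz
# constant L_f = 0.5, reward r(x) = x with Lipschitz constant L_r = 1.0.
L_f, L_r = 0.5, 1.0
f = lambda x: 0.5 * x
r = lambda x: x

n, T = 100, 3                 # number of cells, horizon
h = 1.0 / n                   # cell width
centers = [(i + 0.5) * h for i in range(n)]

def cells_hit(lo, hi):
    """Indices of cells intersecting the interval [lo, hi], clipped to [0, 1]."""
    lo, hi = max(lo, 0.0), min(hi, 1.0 - 1e-12)
    return range(int(lo / h), int(hi / h) + 1)

# Backward iteration: V[i] is a piecewise-constant lower bound on the
# remaining return from any state in cell i (V = 0 at the horizon).
V = [0.0] * n
for t in range(T):
    V_new = []
    for i, c in enumerate(centers):
        reward_lb = r(c) - L_r * h / 2           # worst-case reward in the cell
        # Lipschitz dynamics: images of the whole cell lie in this interval.
        img_lo, img_hi = f(c) - L_f * h / 2, f(c) + L_f * h / 2
        next_lb = min(V[j] for j in cells_hit(img_lo, img_hi))
        V_new.append(reward_lb + next_lb)
    V = V_new

x0 = 0.905
bound = V[min(int(x0 / h), n - 1)]
true_return = sum(0.5 ** t * x0 for t in range(T))
print(bound, true_return)     # the bound never exceeds the true return
```

By construction the per-cell value is a valid lower bound: each step charges the worst reward the cell can contain and the worst cell the Lipschitz-bounded transition image can reach, so the bound tightens as the cells shrink.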
Keywords: Transition Function, Markov Decision Process, Lipschitz Constant, Reward Function, Lipschitz Continuity
- 1. Brunskill, E., Leffler, B., Li, L., Littman, M., Roy, N.: CORL: A continuous-state offset-dynamics reinforcement learner. In: Proceedings of the International Conference on Uncertainty in Artificial Intelligence, pp. 53–61 (2008)
- 3. Ermon, S., Conrad, J., Gomes, C., Selman, B.: Playing games against nature: optimal policies for renewable resource allocation. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (2010)
- 4. Fonteneau, R., Murphy, S., Wehenkel, L., Ernst, D.: Inferring bounds on the performance of a control policy from a sample of trajectories. In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 117–123 (2009)
- 6. Kaelbling, L.P.: Learning in embedded systems. MIT Press (1993)
- 7. Kakade, S., Kearns, M., Langford, J.: Exploration in metric state spaces. In: International Conference on Machine Learning, vol. 20, p. 306 (2003)