Abstract
In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value function. As an alternative to the use of function approximators, we rely on the synthesis of “artificial trajectories” from the given sample of trajectories, and show that this idea opens new avenues for designing and analyzing algorithms for batch mode reinforcement learning.
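To make the idea concrete, the following is a minimal sketch of how "artificial trajectories" can be stitched together from a sample of one-step transitions `(x, u, r, y)` in order to evaluate a fixed policy without a function approximator. The function name `mfmc_estimate`, the nearest-neighbour distance, and all parameter names are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def mfmc_estimate(transitions, policy, x0, horizon, n_traj, gamma=1.0):
    """Estimate the return of `policy` from state `x0` by rebuilding
    `n_traj` artificial trajectories out of the one-step transitions.

    transitions -- list of tuples (x, u, r, y): state, action, reward, next state
    policy      -- callable (x, t) -> action
    """
    estimates = []
    for _ in range(n_traj):
        used = set()            # each transition is consumed at most once per trajectory
        x = np.asarray(x0, float)
        ret = 0.0
        for t in range(horizon):
            u = policy(x, t)
            # pick the unused transition whose (state, action) pair is
            # closest to the pair (x, u) the policy would generate
            best, best_d = None, np.inf
            for i, (xs, us, r, y) in enumerate(transitions):
                if i in used:
                    continue
                d = np.linalg.norm(np.asarray(xs, float) - x) + abs(us - u)
                if d < best_d:
                    best, best_d = i, d
            used.add(best)
            _, _, r, y = transitions[best]
            ret += (gamma ** t) * r
            x = np.asarray(y, float)   # jump to the stored next state
        estimates.append(ret)
    return float(np.mean(estimates))
```

On a dataset that happens to contain the exact transitions visited by the policy, the stitched trajectory reproduces the true return; otherwise the nearest-neighbour jumps introduce an error that shrinks as the sample becomes denser.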
Notes
Here the fundamental assumption is that $w_t$ is independent of $w_{t-1}, w_{t-2}, \ldots, w_0$ given $x_t$ and $u_t$; to simplify all notations and derivations, we furthermore impose that the process is time-invariant and does not depend on the states and actions $x_t, u_t$.
We have chosen to report the average results obtained over 50 runs for both sampling methods, rather than the results of a single run, because (i) the variance of the results obtained by uniform sampling is high and (ii) the variance of the results obtained by the bound-based approach is also significant, since the procedures for approximating the \(\mathop{\arg\min}_{(x,u) \in\mathcal{X} \times\mathcal{U}}\) and \(\max_{ (r,y) \in\mathbb{R} \times \mathcal{X} \, \mathrm{s}.\mathrm{t}.\, (x,u,r,y) \in\mathcal{C}(\mathcal{F}_{m}) } \) operators rely on a random number generator.
Acknowledgements
Raphael Fonteneau is a Post-doctoral Fellow of the F.R.S.-FNRS. This paper presents research results of the European excellence network PASCAL2 and of the Belgian Network DYSCO, funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.
Cite this article
Fonteneau, R., Murphy, S.A., Wehenkel, L. et al. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Ann Oper Res 208, 383–416 (2013). https://doi.org/10.1007/s10479-012-1248-5