Abstract
In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value function. As an alternative to the use of function approximators, we rely on the synthesis of “artificial trajectories” from the given sample of trajectories, and show that this idea opens new avenues for designing and analyzing algorithms for batch mode reinforcement learning.
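To make the idea concrete, the following is a minimal sketch of how "artificial trajectories" can be stitched together from a sample of one-step transitions `(x, u, r, y)` in order to evaluate a fixed policy without a function approximator. The function name `mfmc_estimate`, the nearest-neighbour distance, and all parameter names are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def mfmc_estimate(transitions, policy, x0, horizon, n_traj, gamma=1.0):
    """Estimate the return of `policy` from state `x0` by rebuilding
    `n_traj` artificial trajectories out of the one-step transitions.

    transitions -- list of tuples (x, u, r, y): state, action, reward, next state
    policy      -- callable (x, t) -> action
    """
    estimates = []
    for _ in range(n_traj):
        used = set()            # each transition is consumed at most once per trajectory
        x = np.asarray(x0, float)
        ret = 0.0
        for t in range(horizon):
            u = policy(x, t)
            # pick the unused transition whose (state, action) pair is
            # closest to the pair (x, u) the policy would generate
            best, best_d = None, np.inf
            for i, (xs, us, r, y) in enumerate(transitions):
                if i in used:
                    continue
                d = np.linalg.norm(np.asarray(xs, float) - x) + abs(us - u)
                if d < best_d:
                    best, best_d = i, d
            used.add(best)
            _, _, r, y = transitions[best]
            ret += (gamma ** t) * r
            x = np.asarray(y, float)   # jump to the stored next state
        estimates.append(ret)
    return float(np.mean(estimates))
```

On a dataset that happens to contain the exact transitions visited by the policy, the stitched trajectory reproduces the true return; otherwise the nearest-neighbour jumps introduce an error that shrinks as the sample becomes denser.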
Notes
Here the fundamental assumption is that $w_t$ is independent of $w_{t-1}, w_{t-2}, \ldots, w_0$ given $x_t$ and $u_t$; to simplify all notations and derivations, we furthermore impose that the process is time-invariant and does not depend on the states and actions $x_t, u_t$.
We have chosen to report the average results obtained over 50 runs for both sampling methods, rather than the results of a single run, because (i) the variance of the results obtained by uniform sampling is high and (ii) the variance of the results obtained by the bound-based approach is also significant, since the procedures for approximating the \(\mathop{\arg\min}_{(x,u) \in\mathcal{X} \times\mathcal{U}}\) and \(\max_{ (r,y) \in\mathbb{R} \times \mathcal{X} \, \mathrm{s}.\mathrm{t}.\, (x,u,r,y) \in\mathcal{C}(\mathcal{F}_{m}) } \) operators rely on a random number generator.
Acknowledgements
Raphael Fonteneau is a Post-doctoral Fellow of the F.R.S.-FNRS. This paper presents research results of the European excellence network PASCAL2 and of the Belgian Network DYSCO, funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.
Cite this article
Fonteneau, R., Murphy, S.A., Wehenkel, L. et al. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Ann Oper Res 208, 383–416 (2013). https://doi.org/10.1007/s10479-012-1248-5