
Batch mode reinforcement learning based on the synthesis of artificial trajectories

Annals of Operations Research 208, 383–416 (2013)

Abstract

In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value function. As an alternative to the use of function approximators, we rely on the synthesis of “artificial trajectories” from the given sample of trajectories, and show that this idea opens new avenues for designing and analyzing algorithms for batch mode reinforcement learning.
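
As a concrete illustration of the setting, the sketch below synthesizes artificial trajectories from a batch of one-step transitions by repeatedly concatenating, without replacement, the sampled transition whose state-action pair is closest to the current artificial state and the evaluated policy's action, and then averages the resulting returns. This is only a minimal sketch in the spirit of the model-free Monte Carlo-like policy evaluation from the authors' related work; the function names, the Euclidean distance, and the toy data are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def mfmc_estimate(transitions, policy, x0, horizon, n_traj, gamma=1.0):
        """Illustrative sketch: estimate the return of `policy` from x0 by
        synthesizing artificial trajectories out of a batch of one-step
        transitions (x, u, r, y), each used at most once per trajectory."""
        returns = []
        for _ in range(n_traj):
            pool = list(range(len(transitions)))   # transitions still available
            x, ret = np.asarray(x0, dtype=float), 0.0
            for t in range(horizon):
                u = np.asarray(policy(t, x), dtype=float)
                # take the unused transition whose (state, action) pair is the
                # closest (here: Euclidean distance) to the current artificial
                # state and the action prescribed by the policy
                best = min(pool, key=lambda l: np.linalg.norm(transitions[l][0] - x)
                                               + np.linalg.norm(transitions[l][1] - u))
                pool.remove(best)                  # draw without replacement
                _, _, r, y = transitions[best]
                ret += (gamma ** t) * r            # cumulate the observed reward
                x = np.asarray(y, dtype=float)     # jump to the observed next state
            returns.append(ret)
        return float(np.mean(returns))             # average over artificial trajectories

    # toy 1-D example (illustrative): 200 random transitions of x' = x + 0.1 u,
    # reward -x^2, and the evaluated policy u = -0.5 x
    rng = np.random.default_rng(0)
    data = []
    for _ in range(200):
        x = rng.uniform(-1.0, 1.0, size=1)
        u = rng.uniform(-1.0, 1.0, size=1)
        data.append((x, u, -float(x @ x), x + 0.1 * u))
    print(mfmc_estimate(data, lambda t, x: -0.5 * x, x0=[0.8], horizon=10, n_traj=5))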


Notes

  1. Here the fundamental assumption is that \(w_t\) is independent of \(w_{t-1}, w_{t-2}, \ldots, w_0\) given \(x_t\) and \(u_t\); to simplify all notations and derivations, we furthermore impose that the process is time-invariant and does not depend on the states and actions \(x_t, u_t\) (this assumption is restated as a display just after these notes).

  2. We have chosen to report the average results obtained over 50 runs for both sampling methods, rather than the results of a single run, because (i) the variance of the results obtained by uniform sampling is high and (ii) the variance of the results obtained by the bound-based approach is also significant, since the procedures for approximating the \(\mathop{\arg\min}_{(x,u) \in \mathcal{X} \times \mathcal{U}}\) and \(\max_{(r,y) \in \mathbb{R} \times \mathcal{X} \,\mathrm{s.t.}\, (x,u,r,y) \in \mathcal{C}(\mathcal{F}_{m})}\) operators rely on a random number generator (a generic sketch of such a randomized search is given after these notes).
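
For clarity, the assumption in note 1 can be written compactly as follows; the conditional notation and the symbol \(P_W\) for the common disturbance distribution are introduced here only for this restatement:

    \[
    P\bigl(w_t \,\big|\, x_t, u_t, w_{t-1}, \ldots, w_0\bigr) \;=\; P_W(w_t) \qquad \forall t \geq 0,
    \]

i.e., given \((x_t, u_t)\), the disturbance \(w_t\) is drawn from a fixed distribution \(P_W\) that depends neither on \(t\) nor on \((x_t, u_t)\).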

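Both operators mentioned in note 2 are approximated by randomized procedures, which is what makes individual runs vary. The sketch below shows a generic random-search approximation of an arg min over a bounded box; the objective, the box bounds, and the constraint set \(\mathcal{C}(\mathcal{F}_{m})\) are replaced by placeholders, so this illustrates the kind of procedure involved, not the one used in the experiments.

    import numpy as np

    def random_search_argmin(objective, lows, highs, n_samples=10000, seed=None):
        """Approximate the arg min of `objective` over the box [lows, highs] by
        drawing candidates uniformly at random and keeping the best one
        (illustrative stand-in for the randomized procedures of note 2)."""
        rng = np.random.default_rng(seed)
        lows, highs = np.asarray(lows, dtype=float), np.asarray(highs, dtype=float)
        candidates = rng.uniform(lows, highs, size=(n_samples, lows.size))
        values = np.apply_along_axis(objective, 1, candidates)
        return candidates[np.argmin(values)]

    # two seeds give two slightly different approximate minimizers, which is the
    # run-to-run variability the note refers to
    f = lambda z: float((z - 0.3) @ (z - 0.3))
    print(random_search_argmin(f, [-1.0, -1.0], [1.0, 1.0], seed=0))
    print(random_search_argmin(f, [-1.0, -1.0], [1.0, 1.0], seed=1))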

Acknowledgements

Raphael Fonteneau is a Post-doctoral Fellow of the F.R.S.-FNRS. This paper presents research results of the European excellence network PASCAL2 and of the Belgian Network DYSCO, funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.

Author information

Corresponding author

Correspondence to Raphael Fonteneau.


Cite this article

Fonteneau, R., Murphy, S.A., Wehenkel, L. et al. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Ann Oper Res 208, 383–416 (2013). https://doi.org/10.1007/s10479-012-1248-5
