Boosted Bellman Residual Minimization Handling Expert Demonstrations

  • Bilal Piot
  • Matthieu Geist
  • Olivier Pietquin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8725)


This paper addresses the problem of batch Reinforcement Learning with Expert Demonstrations (RLED). In RLED, the goal is to find an optimal policy of a Markov Decision Process (MDP), using a fixed data set of sampled transitions of the MDP as well as a fixed data set of expert demonstrations. This differs slightly from the batch Reinforcement Learning (RL) framework, where only fixed sampled transitions of the MDP are available. The aim of this article is therefore to propose algorithms that leverage those expert data. The idea proposed here differs from Approximate Dynamic Programming methods in that we minimize the Optimal Bellman Residual (OBR), with the minimization guided by constraints defined by the expert demonstrations. This choice is motivated by the fact that controlling the OBR implies controlling the distance between the estimated and optimal quality functions. However, this approach presents some difficulties, as the criterion to minimize is non-convex, non-differentiable and biased. Those difficulties are overcome via the embedding of distributions in a Reproducing Kernel Hilbert Space (RKHS) and a boosting technique which allows obtaining non-parametric algorithms. Finally, our algorithms are compared to the only state-of-the-art algorithm, Approximate Policy Iteration with Demonstrations (APID), in different experimental settings.
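The guiding idea of expert-constrained OBR minimization can be illustrated with a deliberately simplified sketch: subgradient descent on the squared empirical OBR over sampled transitions, plus a large-margin hinge penalty that pushes the expert's action above the alternatives at demonstrated states. This toy tabular version is not the paper's algorithm (which is non-parametric, using RKHS embeddings of distributions and boosting to handle the bias and non-differentiability); the function name, the loss weighting `lam`, and the toy MDP below are illustrative assumptions only.

```python
import numpy as np

def obr_with_expert_margin(transitions, expert_pairs, n_states, n_actions,
                           gamma=0.9, lam=1.0, margin=1.0, lr=0.1, iters=500):
    """Tabular sketch: minimize squared empirical OBR subject (softly) to
    expert large-margin constraints, via subgradient descent."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        grad = np.zeros_like(Q)
        # Squared empirical OBR term on sampled transitions (s, a, r, s').
        for s, a, r, s2 in transitions:
            residual = Q[s, a] - (r + gamma * Q[s2].max())
            grad[s, a] += 2.0 * residual
            # Subgradient through the max over next actions.
            grad[s2, Q[s2].argmax()] -= 2.0 * gamma * residual
        # Hinge penalty: the expert action a_e should dominate every other
        # action at the demonstrated state s by at least `margin`.
        for s, a_e in expert_pairs:
            others = [a for a in range(n_actions) if a != a_e]
            a_best = max(others, key=lambda a: Q[s, a])
            if Q[s, a_best] + margin > Q[s, a_e]:
                grad[s, a_best] += lam
                grad[s, a_e] -= lam
        Q -= lr * grad
    return Q

# Toy 2-state MDP: action 0 in state 0 yields reward 1 and loops back.
transitions = [(0, 0, 1.0, 0), (0, 1, 0.0, 1), (1, 0, 0.0, 0), (1, 1, 0.0, 1)]
expert_pairs = [(0, 0)]  # the expert takes action 0 in state 0
Q = obr_with_expert_margin(transitions, expert_pairs, n_states=2, n_actions=2)
```

After training, the greedy policy derived from `Q` agrees with the expert at the demonstrated state; the point of the sketch is only how the two loss terms interact, not the boosted RKHS machinery that makes the criterion tractable in the paper.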


Keywords: Optimal Policy · Markov Decision Process · Reproducing Kernel Hilbert Space · Policy Iteration · Expert Policy
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




References

  1. Abbeel, P., Ng, A.: Apprenticeship learning via inverse reinforcement learning. In: Proc. of ICML (2004)
  2. Antos, A., Szepesvári, C., Munos, R.: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning (2008)
  3. Archibald, T., McKinnon, K., Thomas, L.: On the generation of Markov decision processes. Journal of the Operational Research Society (1995)
  4. Aronszajn, N.: Theory of reproducing kernels. Transactions of the American Mathematical Society (1950)
  5. Bertsekas, D.: Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, Belmont (1995)
  6. Bradtke, S., Barto, A.: Linear least-squares algorithms for temporal difference learning. Machine Learning (1996)
  7. Breiman, L.: Classification and Regression Trees. CRC Press (1993)
  8. Clarke, F.: Generalized gradients and applications. Transactions of the American Mathematical Society (1975)
  9. Farahmand, A., Munos, R., Szepesvári, C.: Error propagation for approximate policy and value iteration. In: Proc. of NIPS (2010)
  10. Grubb, A., Bagnell, J.: Generalized boosting algorithms for convex optimization. In: Proc. of ICML (2011)
  11. Judah, K., Fern, A., Dietterich, T.: Active imitation learning via reduction to IID active learning. In: Proc. of UAI (2012)
  12. Kim, B., Farahmand, A., Pineau, J., Precup, D.: Learning from limited demonstrations. In: Proc. of NIPS (2013)
  13. Klein, E., Geist, M., Piot, B., Pietquin, O.: Inverse reinforcement learning through structured classification. In: Proc. of NIPS (2012)
  14. Lagoudakis, M., Parr, R.: Least-squares policy iteration. Journal of Machine Learning Research (2003)
  15. Lever, G., Baldassarre, L., Gretton, A., Pontil, M., Grünewälder, S.: Modelling transition dynamics in MDPs with RKHS embeddings. In: Proc. of ICML (2012)
  16. Munos, R.: Performance bounds in L_p-norm for approximate value iteration. SIAM Journal on Control and Optimization (2007)
  17. Piot, B., Geist, M., Pietquin, O.: Learning from demonstrations: Is it worth estimating a reward function? In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds.) ECML PKDD 2013, Part I. LNCS, vol. 8188, pp. 17–32. Springer, Heidelberg (2013)
  18. Puterman, M.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons (1994)
  19. Ratliff, N., Bagnell, J., Srinivasa, S.: Imitation learning for locomotion and manipulation. In: Proc. of IEEE-RAS International Conference on Humanoid Robots (2007)
  20. Ratliff, N., Bagnell, J., Zinkevich, M.: Maximum margin planning. In: Proc. of ICML (2006)
  21. Ross, S., Gordon, G., Bagnell, J.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proc. of AISTATS (2011)
  22. Shor, N., Kiwiel, K., Ruszczyński, A.: Minimization Methods for Non-Differentiable Functions. Springer (1985)
  23. Sriperumbudur, B., Gretton, A., Fukumizu, K., Schölkopf, B., Lanckriet, G.: Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research (2010)
  24. Syed, U., Bowling, M., Schapire, R.: Apprenticeship learning using linear programming. In: Proc. of ICML (2008)
  25. Yu, B.: Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability (1994)

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Bilal Piot (1, 2)
  • Matthieu Geist (1, 2)
  • Olivier Pietquin (3)

  1. Supélec, IMS-MaLIS Research Group, France
  2. UMI 2958 (GeorgiaTech-CNRS), France
  3. University Lille 1, LIFL (UMR 8022 CNRS/Lille 1), SequeL team, Lille, France
