Bayesian Inference for Least Squares Temporal Difference Regularization

  • Nikolaos Tziortziotis
  • Christos Dimitrakakis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10535)


This paper proposes a fully Bayesian approach to Least-Squares Temporal Differences (LSTD), yielding probabilistic inference of value functions that avoids the overfitting classical LSTD commonly exhibits when the number of features exceeds the number of samples. Sparse Bayesian learning provides an elegant solution through a prior over the value-function parameters. This gives us the advantages of probabilistic predictions, a sparse model, and good generalisation, as irrelevant parameters are marginalised out. The algorithm efficiently approximates the posterior distribution through variational inference. We experimentally demonstrate the algorithm's ability to avoid overfitting.
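The core idea in the abstract can be sketched in code. The following is a minimal illustration, not the paper's algorithm: it combines the standard LSTD linear system with an RVM-style sparse prior, re-estimating one precision hyperparameter per weight so that irrelevant features are pruned. The function name `sparse_bayesian_lstd`, the treatment of the LSTD system as a Gaussian observation model, and the fixed number of iterations are all assumptions for illustration (the paper itself uses variational inference rather than this type-II maximum-likelihood update).

```python
import numpy as np

def sparse_bayesian_lstd(phi, phi_next, rewards, gamma=0.95,
                         noise_prec=1.0, n_iters=50):
    """Hypothetical sketch: RVM-style sparse prior on LSTD weights.

    Standard LSTD solves A w = b, where A = Phi^T (Phi - gamma * Phi')
    and b = Phi^T r. Here that linear system is treated as a Gaussian
    observation model, with an independent zero-mean Gaussian prior of
    precision alpha_i on each weight w_i. The alphas are re-estimated
    with the usual relevance-vector-machine fixed point, so weights of
    irrelevant features are driven towards zero.
    """
    A = phi.T @ (phi - gamma * phi_next)   # (d, d) LSTD matrix
    b = phi.T @ rewards                    # (d,)  LSTD vector
    d = A.shape[0]
    alpha = np.ones(d)                     # per-weight prior precisions
    for _ in range(n_iters):
        # Gaussian posterior over weights given the current precisions.
        S = np.linalg.inv(noise_prec * (A.T @ A) + np.diag(alpha))
        m = noise_prec * S @ (A.T @ b)
        # g_i in [0, 1] measures how well the data determine w_i;
        # type-II ML update alpha_i = g_i / m_i^2, capped for stability.
        g = np.clip(1.0 - alpha * np.diag(S), 0.0, 1.0)
        alpha = np.minimum(g / np.maximum(m ** 2, 1e-12), 1e12)
    return m, S, alpha   # posterior mean, covariance, precisions
```

On synthetic data where rewards depend on only two of six features, the returned posterior mean recovers the relevant weights while the remaining four are shrunk close to zero, which is the sparsification behaviour the abstract refers to.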



This work was partially supported by the École Polytechnique AXA Chair (DaSciS), the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007–2013) under REA grant agreement 608743, and the Future of Life Institute.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. LIX, École Polytechnique, Palaiseau, France
  2. University of Lille, Villeneuve-d'Ascq, France
  3. SEAS, Harvard University, Cambridge, USA
