Machine Learning, Volume 107, Issue 8–10, pp 1385–1429

An online prediction algorithm for reinforcement learning with linear function approximation using cross entropy method

  • Ajin George Joseph
  • Shalabh Bhatnagar
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2018 Journal Track


In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, i.e., estimating the value function of a model-free Markov reward process using a linear function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ a multi-timescale stochastic approximation variant of the popular cross-entropy optimization method, a model-based search method for finding the global optimum of a real-valued function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons: the algorithms achieve good performance fairly consistently on many RL benchmark problems with regard to computational efficiency, accuracy and stability.
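To make the core idea concrete, the sketch below shows the basic batch cross-entropy (CE) method for continuous optimization applied to fitting linear value-function weights. This is only an illustrative sketch: the feature matrix `Phi`, the targets `v`, and the squared-error objective are hypothetical stand-ins, and the paper's actual algorithms are online, multi-timescale stochastic approximation variants of CE rather than this batch form.

```python
import numpy as np

def cross_entropy_minimize(f, dim, n_samples=100, elite_frac=0.1,
                           n_iters=50, seed=0):
    """Basic batch cross-entropy method for minimizing f: R^dim -> R.

    Maintains a diagonal Gaussian sampling distribution and, at each
    iteration, refits its mean and standard deviation to the elite
    (lowest-cost) fraction of the sampled candidates.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)          # mean of the sampling distribution
    sigma = np.ones(dim)        # per-coordinate standard deviation
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        samples = rng.normal(mu, sigma, size=(n_samples, dim))
        costs = np.apply_along_axis(f, 1, samples)
        elite = samples[np.argsort(costs)[:n_elite]]
        # Refit the distribution to the elites; the small floor on
        # sigma keeps the search from collapsing prematurely.
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu

# Hypothetical usage: fit linear value-function weights w so that
# Phi @ w approximates value targets v (least-squares objective).
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # toy features
v = np.array([0.5, 1.5, 2.5])                          # toy targets
w = cross_entropy_minimize(lambda w: np.sum((Phi @ w - v) ** 2), dim=2)
```

Because CE only evaluates the objective at sampled points, it needs no gradient of the error with respect to the weights, which is what makes a gradient-free, model-based search of this kind usable for global optimization of the prediction objective.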


Keywords: Markov decision process · Prediction problem · Reinforcement learning · Stochastic approximation algorithm · Cross-entropy method · Linear function approximation · ODE method



Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. Department of Computing Science, University of Alberta, Edmonton, Canada
  2. Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
  3. Department of Computer Science and Automation and the Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science, Bangalore, India