Machine Learning, Volume 107, Issue 6, pp 969–1011

An incremental off-policy search in a model-free Markov decision process using a single sample path

  • Ajin George Joseph
  • Shalabh Bhatnagar


In this paper, we consider a modified version of the control problem in a model-free Markov decision process (MDP) setting with large state and action spaces. The control problem most commonly addressed in the contemporary literature is to find an optimal policy which maximizes the value function, i.e., the long-run discounted reward of the MDP. These settings also typically assume access to a generative model of the MDP, with the implicit premise that observations of the system behaviour in the form of sample trajectories can be obtained with ease from the model. In the modified version we consider, the cost function is the expectation of a non-convex function of the value function, and no generative model is available; instead, we assume that a single sample trajectory generated using an a priori chosen behaviour policy is made available. In this restricted setting, we solve the modified control problem in its true sense, i.e., we find the best possible policy given this limited information. We propose a stochastic approximation algorithm based on the well-known cross entropy method which is data (sample trajectory) efficient, stable, robust, and computationally and storage efficient. We provide a proof of convergence of our algorithm to a policy which is globally optimal relative to the behaviour policy. We also present experimental results to corroborate our claims, demonstrating that, under an appropriately chosen behaviour policy, the solution produced by our algorithm is superior to those of state-of-the-art algorithms.


Markov decision process · Off-policy prediction · Control problem · Stochastic approximation method · Cross entropy method · Linear function approximation · ODE method · Global optimization



Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. Indian Institute of Science, Bangalore, India
