Learning and planning in environments with delayed feedback

  • Thomas J. Walsh
  • Ali Nouri
  • Lihong Li
  • Michael L. Littman


This work considers the problems of learning and planning in Markovian environments with constant observation and reward delays. We provide a hardness result for the general planning problem and positive results for several special cases with deterministic or otherwise constrained dynamics. We present an algorithm, Model Based Simulation, for planning in such environments and use model-based reinforcement learning to extend this approach to the learning setting in both finite and continuous environments. Empirical comparisons show this algorithm holds significant advantages over others for decision making in delayed-observation environments.


Keywords: Reinforcement learning · Delayed feedback · Markov decision processes



Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Thomas J. Walsh (1)
  • Ali Nouri (1)
  • Lihong Li (1)
  • Michael L. Littman (1)
  1. Department of Computer Science, Rutgers University, Piscataway, USA
