Learning and planning in environments with delayed feedback
This work considers the problems of learning and planning in Markovian environments with constant observation and reward delays. We provide a hardness result for the general planning problem and positive results for several special cases with deterministic or otherwise constrained dynamics. We present an algorithm, Model Based Simulation, for planning in such environments, and use model-based reinforcement learning to extend this approach to the learning setting in both finite and continuous environments. Empirical comparisons show that this algorithm holds significant advantages over existing approaches for decision making in delayed-observation environments.
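The abstract describes Model Based Simulation (MBS) only at a high level; below is a minimal sketch of its forward-simulation idea under the assumptions the positive results rely on: deterministic dynamics and a constant, known delay k. The names `DelayedAgent`, `model`, and `policy` are illustrative, not the authors' implementation; in the learning setting, `model` would be replaced by one estimated online via model-based reinforcement learning.

```python
from collections import deque

class DelayedAgent:
    """Sketch of the Model Based Simulation idea for a constant delay k.

    Assumes deterministic transitions: model(s, a) returns the unique next
    state. `policy` is an optimal policy for the underlying, undelayed MDP.
    """

    def __init__(self, model, policy, delay):
        self.model = model      # model(s, a) -> next state (deterministic)
        self.policy = policy    # policy(s) -> action in the undelayed MDP
        self.delay = delay      # constant observation delay k
        self.pending = deque()  # actions taken since the last observed state

    def act(self, observed_state):
        # The observation is k steps old: roll the model forward through
        # the queued actions to recover the agent's actual current state.
        state = observed_state
        for a in self.pending:
            state = self.model(state, a)
        # Act as the undelayed optimal policy would in that state.
        action = self.policy(state)
        self.pending.append(action)
        if len(self.pending) > self.delay:
            # The oldest action's effect shows up in the next observation.
            self.pending.popleft()
        return action
```

With deterministic dynamics the simulated state is exact, so acting this way recovers undelayed-optimal behavior; with general stochastic dynamics the rolled-forward state is only an estimate, which is where the hardness of the general planning problem enters.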
Autonomous Agents and Multi-Agent Systems, Volume 18, Issue 1, pp. 83–105. Springer US.
Keywords: Reinforcement learning · Delayed feedback · Markov decision processes