Baird, L. Personal Communication.
Baird, L., (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pages 30–37. Morgan Kaufmann.
Barto, A., Bradtke, S. & Singh, S., (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81–138.
Bertsekas, D., (1982). Distributed dynamic programming. IEEE Transactions on Automatic Control, AC-27(3).
Bertsekas, D., (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall.
Blackwell, D., (1962). Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726.
Boutilier, C. & Puterman, M., (1995). Process-oriented planning and average-reward optimality. In Proceedings of the Fourteenth IJCAI, pages 1096–1103. Morgan Kaufmann.
Dayan, P. & Hinton, G., (1992). Feudal reinforcement learning. In Neural Information Processing Systems (NIPS), pages 271–278.
Denardo, E., (1970). Computing a bias-optimal policy in a discrete-time Markov decision problem. Operations Research, 18:272–289.
Dent, L., Boticario, J., McDermott, J., Mitchell, T. & Zabowski, D., (1992). A personal learning apprentice. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI), pages 96–103. MIT Press.
Engelberger, J., (1989). Robotics in Service. MIT Press.
Federgruen, A. & Schweitzer, P., (1984). Successive approximation methods for solving nested functional equations in Markov decision problems. Mathematics of Operations Research, 9:319–344.
Haviv, M. & Puterman, M., (1991). An improved algorithm for solving communicating average reward Markov decision processes. Annals of Operations Research, 28:229–242.
Hordijk, A. & Tijms, H., (1975). A modified form of the iterative method of dynamic programming. Annals of Statistics, 3:203–208.
Howard, R., (1960). Dynamic Programming and Markov Processes. MIT Press.
Jalali, A. & Ferguson, M., (1989). Computationally efficient adaptive control algorithms for Markov chains. In Proceedings of the 28th IEEE Conference on Decision and Control, pages 1283–1288.
Jalali, A. & Ferguson, M., (1990). A distributed asynchronous algorithm for expected average cost dynamic programming. In Proceedings of the 29th IEEE Conference on Decision and Control, pages 1394–1395.
Kaelbling, L., (1993a). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pages 167–173. Morgan Kaufmann.
Kaelbling, L., (1993b). Learning in Embedded Systems. MIT Press.
Lin, L., (1993). Reinforcement Learning for Robots using Neural Networks. PhD thesis, Carnegie Mellon University.
Mahadevan, S. A model-based bias-optimal reinforcement learning algorithm. In preparation.
Mahadevan, S., (1992). Enhancing transfer in reinforcement learning by building stochastic models of robot actions. In Proceedings of the Ninth International Conference on Machine Learning, pages 290–299. Morgan Kaufmann.
Mahadevan, S., (1994). To discount or not to discount in reinforcement learning: A case study comparing R-learning and Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 164–172. Morgan Kaufmann.
Mahadevan, S. & Baird, L. Value function approximation in average reward reinforcement learning. In preparation.
Mahadevan, S. & Connell, J., (1992). Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55:311–365. Appeared originally as IBM TR RC 16359, December 1990.
Moore, A., (1991). Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state spaces. In Proceedings of the Eighth International Workshop on Machine Learning, pages 333–337. Morgan Kaufmann.
Narendra, K. & Thathachar, M., (1989). Learning Automata: An Introduction. Prentice Hall.
Puterman, M., (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley.
Ross, S., (1983). Introduction to Stochastic Dynamic Programming. Academic Press.
Salganicoff, M., (1993). Density-adaptive learning and forgetting. In Proceedings of the Tenth International Conference on Machine Learning, pages 276–283. Morgan Kaufmann.
Schwartz, A., (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, pages 298–305. Morgan Kaufmann.
Singh, S., (1994a). Learning to Solve Markovian Decision Processes. PhD thesis, University of Massachusetts, Amherst.
Singh, S., (1994b). Reinforcement learning algorithms for average-payoff Markovian decision processes. In Proceedings of the 12th AAAI. MIT Press.
Sutton, R., (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.
Sutton, R., (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224. Morgan Kaufmann.
Sutton, R., editor, (1992). Reinforcement Learning. Kluwer Academic Publishers. Special issue of Machine Learning, 8(3–4), May 1992.
Tadepalli, P. Personal Communication.
Tadepalli, P. & Ok, D., (1994). H-learning: A reinforcement learning method to optimize undiscounted average reward. Technical Report 94-30-01, Oregon State University.
Tesauro, G., (1992). Practical issues in temporal difference learning. In R. Sutton, editor, Reinforcement Learning. Kluwer Academic Publishers.
Thrun, S., (1992). The role of exploration in learning control. In D. A. White and D. A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold.
Tsitsiklis, J., (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202.
Veinott, A., (1969). Discrete dynamic programming with sensitive discount optimality criteria. Annals of Mathematical Statistics, 40(5):1635–1660.
Watkins, C., (1989).Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, England.
Wheeler, R. & Narendra, K., (1986). Decentralized learning in finite Markov chains. IEEE Transactions on Automatic Control, AC-31(6).
White, D., (1963). Dynamic programming, Markov chains, and the method of successive approximations. Journal of Mathematical Analysis and Applications, 6:373–376.
Whitehead, S., Karlsson, J. & Tenenberg, J., (1993). Learning multiple goal behavior via task decomposition and dynamic policy merging. In J. Connell and S. Mahadevan, editors, Robot Learning. Kluwer Academic Publishers.