Baird, L. Personal Communication.

Baird, L., (1995). Residual algorithms: Reinforcement learning with function approximation. In *Proceedings of the 12th International Conference on Machine Learning*, pages 30–37. Morgan Kaufmann.

Barto, A., Bradtke, S. & Singh, S., (1995). Learning to act using real-time dynamic programming. *Artificial Intelligence*, 72(1):81–138.

Bertsekas, D., (1982). Distributed dynamic programming. *IEEE Transactions on Automatic Control*, AC-27(3).

Bertsekas, D., (1987). *Dynamic Programming: Deterministic and Stochastic Models*. Prentice-Hall.

Blackwell, D., (1962). Discrete dynamic programming. *Annals of Mathematical Statistics*, 33:719–726.

Boutilier, C. & Puterman, M., (1995). Process-oriented planning and average-reward optimality. In *Proceedings of the Fourteenth IJCAI*, pages 1096–1103. Morgan Kaufmann.

Dayan, P. & Hinton, G., (1992). Feudal reinforcement learning. In *Neural Information Processing Systems (NIPS)*, pages 271–278.

Denardo, E., (1970). Computing a bias-optimal policy in a discrete-time Markov decision problem. *Operations Research*, 18:272–289.

Dent, L., Boticario, J., McDermott, J., Mitchell, T. & Zabowski, D., (1992). A personal learning apprentice. In *Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI)*, pages 96–103. MIT Press.

Engelberger, J., (1989). *Robotics in Service*. MIT Press.

Federgruen, A. & Schweitzer, P., (1984). Successive approximation methods for solving nested functional equations in Markov decision problems. *Mathematics of Operations Research*, 9:319–344.

Haviv, M. & Puterman, M., (1991). An improved algorithm for solving communicating average reward Markov decision processes. *Annals of Operations Research*, 28:229–242.

Hordijk, A. & Tijms, H., (1975). A modified form of the iterative method of dynamic programming. *Annals of Statistics*, 3:203–208.

Howard, R., (1960). *Dynamic Programming and Markov Processes*. MIT Press.

Jalali, A. & Ferguson, M., (1989). Computationally efficient adaptive control algorithms for Markov chains. In *Proceedings of the 28th IEEE Conference on Decision and Control*, pages 1283–1288.

Jalali, A. & Ferguson, M., (1990). A distributed asynchronous algorithm for expected average cost dynamic programming. In *Proceedings of the 29th IEEE Conference on Decision and Control*, pages 1394–1395.

Kaelbling, L., (1993a). Hierarchical learning in stochastic domains: Preliminary results. In *Proceedings of the Tenth International Conference on Machine Learning*, pages 167–173. Morgan Kaufmann.

Kaelbling, L., (1993b). *Learning in Embedded Systems*. MIT Press.

Lin, L., (1993). *Reinforcement Learning for Robots using Neural Networks*. PhD thesis, Carnegie Mellon University.

Mahadevan, S. A model-based bias-optimal reinforcement learning algorithm. In preparation.

Mahadevan, S., (1992). Enhancing transfer in reinforcement learning by building stochastic models of robot actions. In *Proceedings of the Ninth International Conference on Machine Learning*, pages 290–299. Morgan Kaufmann.

Mahadevan, S., (1994). To discount or not to discount in reinforcement learning: A case study comparing R-learning and Q-learning. In *Proceedings of the Eleventh International Conference on Machine Learning*, pages 164–172. Morgan Kaufmann.

Mahadevan, S. & Baird, L. Value function approximation in average reward reinforcement learning. In preparation.

Mahadevan, S. & Connell, J., (1992). Automatic programming of behavior-based robots using reinforcement learning. *Artificial Intelligence*, 55:311–365. Appeared originally as IBM TR RC16359, Dec 1990.

Moore, A., (1991). Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state spaces. In *Proceedings of the Eighth International Workshop on Machine Learning*, pages 333–337. Morgan Kaufmann.

Narendra, K. & Thathachar, M., (1989). *Learning Automata: An Introduction*. Prentice Hall.

Puterman, M., (1994). *Markov Decision Processes: Discrete Stochastic Dynamic Programming*. John Wiley.

Ross, S., (1983). *Introduction to Stochastic Dynamic Programming*. Academic Press.

Salganicoff, M., (1993). Density-adaptive learning and forgetting. In *Proceedings of the Tenth International Conference on Machine Learning*, pages 276–283. Morgan Kaufmann.

Schwartz, A., (1993). A reinforcement learning method for maximizing undiscounted rewards. In *Proceedings of the Tenth International Conference on Machine Learning*, pages 298–305. Morgan Kaufmann.

Singh, S., (1994a). *Learning to Solve Markovian Decision Processes*. PhD thesis, University of Massachusetts, Amherst.

Singh, S., (1994b). Reinforcement learning algorithms for average-payoff Markovian decision processes. In *Proceedings of the 12th AAAI*. MIT Press.

Sutton, R., (1988). Learning to predict by the method of temporal differences. *Machine Learning*, 3:9–44.

Sutton, R., (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In *Proceedings of the Seventh International Conference on Machine Learning*, pages 216–224. Morgan Kaufmann.

Sutton, R., editor, (1992). *Reinforcement Learning*. Kluwer Academic Press. Special issue of *Machine Learning*, Vol. 8, Nos. 3–4, May 1992.

Tadepalli, P. Personal Communication.

Tadepalli, P. & Ok, D., (1994). H learning: A reinforcement learning method to optimize undiscounted average reward. Technical Report 94-30-01, Oregon State University.

Tesauro, G., (1992). Practical issues in temporal difference learning. In R. Sutton, editor, *Reinforcement Learning*. Kluwer Academic Publishers.

Thrun, S. The role of exploration in learning control. In D. A. White and D. A. Sofge, editors, *Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches*. Van Nostrand Reinhold.

Tsitsiklis, J., (1994). Asynchronous stochastic approximation and Q-learning. *Machine Learning*, 16:185–202.

Veinott, A., (1969). Discrete dynamic programming with sensitive discount optimality criteria. *Annals of Mathematical Statistics*, 40(5):1635–1660.

Watkins, C., (1989). *Learning from Delayed Rewards*. PhD thesis, King's College, Cambridge, England.

Wheeler, R. & Narendra, K., (1986). Decentralized learning in finite Markov chains. *IEEE Transactions on Automatic Control*, AC-31(6).

White, D., (1963). Dynamic programming, Markov chains, and the method of successive approximations. *Journal of Mathematical Analysis and Applications*, 6:373–376.

Whitehead, S., Karlsson, J. & Tenenberg, J., (1993). Learning multiple goal behavior via task decomposition and dynamic policy merging. In J. Connell and S. Mahadevan, editors, *Robot Learning*. Kluwer Academic Publishers.