Scaling Model-Based Average-Reward Reinforcement Learning for Product Delivery

  • Scott Proper
  • Prasad Tadepalli
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4212)


Abstract

Reinforcement learning in real-world domains suffers from three curses of dimensionality: explosion of the state space, explosion of the action space, and high stochasticity of outcomes. We present approaches that mitigate each of these curses. To handle the state-space explosion, we introduce “tabular linear functions” that generalize both tile coding and linear value functions. To reduce action-space complexity, we replace exhaustive search of the joint action space with a form of hill climbing. To deal with high stochasticity, we introduce a new algorithm called ASH-learning, an afterstate version of H-learning. Our extensions make it practical to apply reinforcement learning to a domain of product delivery – an optimization problem that combines inventory control and vehicle routing.
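To make the first idea concrete, below is a minimal sketch of a tabular linear function as the abstract describes it: a table indexed by the nominal (discrete) features of the state, where each cell holds a linear function of the remaining numeric features. The class name, the feature split, and the learning-rate update rule are illustrative assumptions, not the paper's actual implementation.

```python
from collections import defaultdict

class TabularLinearFunction:
    """Value-function approximator: one linear function of the numeric
    features per setting of the nominal features (illustrative sketch,
    not the authors' code)."""

    def __init__(self, n_numeric, alpha=0.01):
        self.alpha = alpha
        # Each table cell holds a bias term plus one weight per numeric feature.
        self.table = defaultdict(lambda: [0.0] * (n_numeric + 1))

    def value(self, nominal, numeric):
        w = self.table[tuple(nominal)]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], numeric))

    def update(self, nominal, numeric, target):
        # Gradient step moving this cell's linear function toward the target.
        w = self.table[tuple(nominal)]
        error = target - self.value(nominal, numeric)
        w[0] += self.alpha * error
        for i, xi in enumerate(numeric):
            w[i + 1] += self.alpha * error * xi
```

With an empty numeric vector this degenerates to an ordinary lookup table, and with an empty nominal tuple it degenerates to a single global linear function, which is the sense in which the construction generalizes both tile coding and linear value functions.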


Keywords: Action Space, Markov Decision Process, Vehicle Routing Problem, Hill Climbing, Nominal Feature



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Scott Proper (1)
  • Prasad Tadepalli (1)
  1. Oregon State University, Corvallis, USA
