Scaling Model-Based Average-Reward Reinforcement Learning for Product Delivery

  • Scott Proper
  • Prasad Tadepalli
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4212)


Reinforcement learning in real-world domains suffers from three curses of dimensionality: explosions in state and action spaces, and high stochasticity. We present approaches that mitigate each of these curses. To handle the state-space explosion, we introduce “tabular linear functions” that generalize tile-coding and linear value functions. Action space complexity is reduced by replacing complete joint action space search with a form of hill climbing. To deal with high stochasticity, we introduce a new algorithm called ASH-learning, which is an afterstate version of H-Learning. Our extensions make it practical to apply reinforcement learning to a domain of product delivery – an optimization problem that combines inventory control and vehicle routing.
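The "tabular linear functions" mentioned above combine a table over the nominal (discrete) features with a linear function over the numeric features: each setting of the nominal features indexes its own weight vector. The sketch below is a minimal illustration of that idea under stated assumptions; the class name, learning rate, and gradient update are illustrative, not the paper's exact implementation.

```python
from collections import defaultdict

class TabularLinearFunction:
    """A value function that keeps one linear weight vector per
    setting of the nominal features (hypothetical sketch).

    With only a constant numeric feature this reduces to a plain
    lookup table; with a single nominal key it reduces to one
    global linear function -- the two special cases it generalizes.
    """

    def __init__(self, n_numeric, alpha=0.1):
        self.alpha = alpha  # learning rate (assumed, not from the paper)
        # One weight vector per nominal-feature setting, created lazily.
        self.w = defaultdict(lambda: [0.0] * n_numeric)

    def value(self, nominal_key, numeric):
        # Dot product of this key's weights with the numeric features.
        return sum(wi * xi for wi, xi in zip(self.w[nominal_key], numeric))

    def update(self, nominal_key, numeric, target):
        # One gradient step on squared error toward the target value;
        # only the weight vector for this nominal setting is touched.
        err = target - self.value(nominal_key, numeric)
        w = self.w[nominal_key]
        for i, xi in enumerate(numeric):
            w[i] += self.alpha * err * xi
```

Because updates for different nominal settings never interfere, generalization happens only across the numeric features, which mirrors how a tile coding partitions the state space while fitting a simple function inside each partition.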


Keywords: Action Space · Markov Decision Process · Vehicle Routing Problem · Hill Climbing · Nominal Feature
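The action-space reduction described in the abstract replaces exhaustive search over the joint action space with a form of hill climbing. A minimal sketch of one common form, coordinate ascent over the agents' actions, is given below; the function names, the starting point, and the stopping rule are assumptions for illustration, not the paper's exact procedure.

```python
def hill_climb_joint_action(agents_actions, value, start=None):
    """Greedy coordinate ascent over a joint action space (sketch).

    agents_actions: list of per-agent action lists.
    value: function mapping a joint action (tuple) to a scalar.
    Repeatedly improves one agent's action while holding the others
    fixed, until no single-agent change helps. Evaluates roughly
    O(passes * sum(|A_i|)) joint actions instead of prod(|A_i|).
    """
    joint = list(start if start is not None else (a[0] for a in agents_actions))
    improved = True
    while improved:
        improved = False
        for i, actions in enumerate(agents_actions):
            # Find the best action for agent i given the others fixed.
            best_a, best_v = joint[i], value(tuple(joint))
            for a in actions:
                joint[i] = a
                v = value(tuple(joint))
                if v > best_v:
                    best_a, best_v = a, v
                    improved = True
            joint[i] = best_a
    return tuple(joint)
```

Like any hill climbing, this returns a local optimum of the joint-action value rather than a guaranteed global one, trading optimality for a search cost that grows additively, not multiplicatively, in the number of agents.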



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Scott Proper, Oregon State University, Corvallis, USA
  • Prasad Tadepalli, Oregon State University, Corvallis, USA
