
Part of the book series: Springer Series in Supply Chain Management (SSSCM, volume 18)

Abstract

This chapter briefly reviews some fundamental concepts, standard problem formulations, and classical algorithms of reinforcement learning (RL). Specifically, we first review Markov decision processes (MDPs) and dynamic programming (DP), which provide the mathematical foundations for both problem formulation and algorithm design in RL. We then review some classical RL algorithms, such as Q-learning, Sarsa, policy gradient, and Thompson sampling. Finally, we give a high-level review of exploration schemes in RL and of approximate solution methods for large-scale RL problems. The chapter closes with some pointers for further reading.
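To make one of the classical algorithms mentioned above concrete, the following is a minimal, illustrative sketch of tabular Q-learning with epsilon-greedy exploration. The environment interface (`env.reset()`, `env.step()`) and all hyperparameter values are assumptions for illustration only; they are not the chapter's notation or code.

```python
import numpy as np

def q_learning(env, num_states, num_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch).

    Assumes a simple environment interface:
    env.reset() -> state, env.step(action) -> (next_state, reward, done).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q-learning (off-policy) temporal-difference update.
            td_target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```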


Notes

  1.

    Since we assume the time horizon and the cardinalities of \(\mathcal{S}\) and \(\mathcal{A}\) are all finite, the maximum is always achieved.

  2.

    In general, a randomized policy \(\tilde{\pi}\) is optimal if \(\operatorname{supp} \tilde{\pi}(\cdot | s, h) \subseteq \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^*(s, h, a)\) for all \((s, h)\), where \(\operatorname{supp} \tilde{\pi}(\cdot | s, h)\) is the support of the distribution \(\tilde{\pi}(\cdot | s, h)\); a construction of one such policy is sketched after these notes.

  3.

    Notice that we choose the convention that \(t\) starts from 1; thus, the discount at time \(t\) is \(\gamma^{t-1}\). If \(t\) started from 0, the discount at time \(t\) would be \(\gamma^{t}\). We choose the convention that \(t\) starts from 1 to be consistent with the finite-horizon MDPs; the two conventions are compared after these notes.

  4.

    Chapter 1 in Bertsekas (2011) considers a cost minimization setting, which is equivalent to the reward maximization setting considered in this chapter if we define the cost as one minus the reward.

  5.

    Note that one choice of such policies is a deterministic policy \(\pi^{\prime}\) satisfying

    $$\displaystyle \begin{aligned} \pi^{\prime}(s) \in \operatorname*{\mbox{arg max}}_{a \in \mathcal{A}} \left\{ \bar{r}(s, a) + \gamma \sum_{s^{\prime} \in \mathcal{S}} P(s^{\prime} | s, a) V_K(s^{\prime}) \right\}. \end{aligned}$$

    (A code sketch of this greedy extraction appears after these notes.)
  6.

    More precisely, the time horizon H in a computer game is usually a stopping time, rather than deterministic.

  7.

    Mathematically, it means that π = ψ(χ), where ψ is a function known to the agent.

  8.

    The algorithm breaks ties in a uniformly random manner, as specified in the pseudo-code; a small tie-breaking helper is sketched after these notes.

  9.

    This state-action-reward-state-action quintuple gives rise to the name Sarsa for the algorithm; the corresponding update is sketched after these notes.

  10.

    In Sect. 2.3.2.3, to simplify the notation, we drop the episode subscript t if the discussion/analysis is within one episode.
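For note 2, the following is a small sketch of constructing a randomized policy whose support lies in the argmax set of \(Q^*(s, h, \cdot)\). The array layout of `Q_star` and the tolerance `tol` are assumptions for illustration.

```python
import numpy as np

def randomized_optimal_policy(Q_star, tol=1e-9):
    """Given Q_star of shape (num_states, num_horizons, num_actions), return a
    policy pi[s, h, :] that is uniform over the maximizers of Q_star[s, h, :].
    Its support is contained in the argmax set, so it satisfies the optimality
    condition in note 2 (illustrative sketch)."""
    num_states, num_horizons, num_actions = Q_star.shape
    pi = np.zeros_like(Q_star)
    for s in range(num_states):
        for h in range(num_horizons):
            q = Q_star[s, h]
            maximizers = np.flatnonzero(q >= q.max() - tol)
            pi[s, h, maximizers] = 1.0 / len(maximizers)
    return pi
```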
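For note 3, the two indexing conventions yield the same discounted return; this is a standard change-of-index identity, not a formula taken from the chapter:

$$\displaystyle \begin{aligned} \sum_{t=1}^{\infty} \gamma^{t-1} r_t = \sum_{t^{\prime}=0}^{\infty} \gamma^{t^{\prime}} r_{t^{\prime}+1}, \end{aligned}$$

so starting the index at 1 with discount \(\gamma^{t-1}\) is equivalent to starting it at 0 with discount \(\gamma^{t^{\prime}}\) after relabeling \(t^{\prime} = t - 1\).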
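For note 5, here is a minimal sketch of extracting such a deterministic greedy policy from a value function \(V_K\). The tabular arrays `r_bar` (states by actions) and `P` (states by actions by next states) are illustrative names and layouts, not the chapter's code.

```python
import numpy as np

def greedy_policy(r_bar, P, V_K, gamma):
    """Return pi'(s) in argmax_a { r_bar(s, a) + gamma * sum_s' P(s'|s, a) V_K(s') }.

    r_bar: shape (S, A); P: shape (S, A, S); V_K: shape (S,).
    Illustrative sketch of the greedy extraction in note 5."""
    # One-step look-ahead value of each (state, action) pair.
    q = r_bar + gamma * P @ V_K          # shape (S, A)
    return np.argmax(q, axis=1)          # deterministic greedy policy, shape (S,)
```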
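For note 8, uniformly random tie-breaking over maximizing actions can be done with a small helper such as the one below; the numerical tolerance is an illustrative choice.

```python
import numpy as np

def argmax_random_tie_break(values, rng=None, tol=1e-9):
    """Return an index of the maximum of `values`, breaking ties uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    maximizers = np.flatnonzero(values >= values.max() - tol)
    return int(rng.choice(maximizers))
```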
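For note 9, the quintuple \((s, a, r, s^{\prime}, a^{\prime})\) feeds the on-policy Sarsa update. A minimal sketch follows; the step size `alpha` and the array layout of `Q` are assumptions for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95, done=False):
    """One Sarsa update from the quintuple (s, a, r, s', a').

    Q is assumed to be a NumPy array of shape (num_states, num_actions).
    Unlike Q-learning, the bootstrap target uses the action a' actually chosen
    by the current (behavior) policy in s'. Illustrative sketch."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```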

References

  • Al-Emran, M. (2015). Hierarchical reinforcement learning: A survey. International Journal of Computing and Digital Systems, 4(2). https://dx.doi.org/10.12785/IJCDS/040207

  • Arora, S., & Doshi, P. (2021). A survey of inverse reinforcement learning: Challenges, methods and progress. Artificial Intelligence, 297, 103500.

  • Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.

  • Bertsekas, D. (2019). Reinforcement learning and optimal control. Belmont: Athena Scientific.

  • Bertsekas, D. P. (2000). Dynamic programming and optimal control (Vol. I). Belmont: Athena Scientific.

  • Bertsekas, D. P. (2011). Dynamic programming and optimal control (Vol. II, 3rd ed.). Belmont: Athena Scientific.

  • Bishop, C. M. (2006). Pattern recognition and machine learning (Information science and statistics). Berlin, Heidelberg: Springer.

  • Brafman, R. I., & Tennenholtz, M. (2002). R-max: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct), 213–231.

  • Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2), 156–172.

  • Cesa-Bianchi, N., Gentile, C., Lugosi, G., & Neu, G. (2017). Boltzmann exploration done right. Preprint. arXiv:1705.10257.

  • Chen, X., Li, S., Li, H., Jiang, S., Qi, Y., & Song, L. (2019). Generative adversarial user model for reinforcement learning based recommendation system. In International Conference on Machine Learning, PMLR (pp. 1052–1061).

  • Dann, C., Lattimore, T., & Brunskill, E. (2017). Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Preprint. arXiv:1703.07710.

  • Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8(3–4), 341–362.

  • Degris, T., White, M., & Sutton, R. S. (2012). Off-policy actor-critic. Preprint. arXiv:1205.4839.

  • Fischer, T. G. (2018). Reinforcement learning in financial markets: A survey. Tech. rep., FAU Discussion Papers in Economics.

  • Friedman, J., Hastie, T., Tibshirani, R., et al. (2001). The elements of statistical learning. Springer Series in Statistics. New York: Springer.

  • García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1), 1437–1480.

  • Gosavii, A., Bandla, N., & Das, T. K. (2002). A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking. IIE Transactions, 34(9), 729–742.

  • Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 1471–1530.

  • Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems, 23, 2613–2621.

  • Hussein, A., Gaber, M. M., Elyan, E., & Jayne, C. (2017). Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2), 1–35.

  • Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1185–1201.

  • Jaksch, T., Ortner, R., & Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 1563–1600.

  • Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2), 99–134.

  • Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2), 209–232.

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. Preprint. arXiv:1412.6980.

  • Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), 1238–1274.

  • Kushner, H., & Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications (Vol. 35). New York: Springer Science & Business Media.

  • Kuznetsova, E., Li, Y. F., Ruiz, C., Zio, E., Ault, G., & Bell, K. (2013). Reinforcement learning for microgrid energy management. Energy, 59, 133–146.

  • Kveton, B., Szepesvari, C., Wen, Z., & Ashkan, A. (2015). Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, PMLR (pp. 767–776).

  • Lapan, M. (2018). Deep reinforcement learning hands-on: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more. Birmingham: Packt Publishing Ltd.

  • Lattimore, T., & Szepesvári, C. (2020). Bandit algorithms. Cambridge: Cambridge University Press.

  • Li, Y. (2017). Deep reinforcement learning: An overview. Preprint. arXiv:1701.07274.

  • Lin, L. J. (1992). Reinforcement learning for robots using neural networks. Pittsburgh: Carnegie Mellon University.

  • Lu, X., Van Roy, B., Dwaracherla, V., Ibrahimi, M., Osband, I., & Wen, Z. (2021). Reinforcement learning, bit by bit. Preprint. arXiv:2103.04047.

  • Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1), 159–195.

  • Marbach, P., & Tsitsiklis, J. N. (2001). Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2), 191–209.

  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

  • Ng, A. Y., Russell, S. J., et al. (2000). Algorithms for inverse reinforcement learning. In ICML (Vol. 1, p. 2).

  • Osband, I., Russo, D., & Van Roy, B. (2013). (More) Efficient reinforcement learning via posterior sampling. Preprint. arXiv:1306.0940.

  • Osband, I., Van Roy, B., Russo, D. J., Wen, Z., et al. (2019). Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124), 1–62.

  • Pateria, S., Subagdja, B., Tan, A. H., & Quek, C. (2021). Hierarchical reinforcement learning: A comprehensive survey. ACM Computing Surveys (CSUR), 54(5), 1–35.

  • Powell, W. B. (2007). Approximate dynamic programming: Solving the curses of dimensionality (Vol. 703). New York: Wiley.

  • Ravichandiran, S. (2018). Hands-on reinforcement learning with Python: Master reinforcement and deep reinforcement learning using OpenAI Gym and TensorFlow. Birmingham: Packt Publishing Ltd.

  • Ruder, S. (2016). An overview of gradient descent optimization algorithms. Preprint. arXiv:1609.04747.

  • Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems (Vol. 37). Citeseer.

  • Russo, D., & Van Roy, B. (2014). Learning to optimize via information-directed sampling. Advances in Neural Information Processing Systems, 27, 1583–1591.

  • Russo, D., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2017). A tutorial on Thompson sampling. Preprint. arXiv:1707.02038.

  • Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839), 604–609.

  • van Seijen, H. (2016). Effective multi-step temporal-difference learning for non-linear function approximation. Preprint. arXiv:1608.05151.

  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.

  • Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. Preprint. arXiv:1712.01815.

  • Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017b). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359.

  • Singh, S., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3), 287–308.

  • Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts Amherst.

  • Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.

  • Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems (pp. 1038–1044). Cambridge: MIT Press.

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Cambridge: MIT Press.

  • Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems (pp. 1057–1063).

  • Szepesvári, C. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1), 1–103.

  • Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.

  • Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3), 185–202.

  • Van Seijen, H., Van Hasselt, H., Whiteson, S., & Wiering, M. (2009). A theoretical and empirical analysis of expected Sarsa. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (pp. 177–184). New York: IEEE.

  • Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.

  • Wen, Z., & Van Roy, B. (2017). Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3), 762–782.

  • Wen, Z., O’Neill, D., & Maei, H. (2015). Optimal demand response using device-based reinforcement learning. IEEE Transactions on Smart Grid, 6(5), 2312–2324.

  • Wen, Z., Precup, D., Ibrahimi, M., Barreto, A., Van Roy, B., & Singh, S. (2020). On efficiency in hierarchical reinforcement learning. Advances in Neural Information Processing Systems (Vol. 33).

  • Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256.

  • Zhang, K., Yang, Z., & Başar, T. (2021). Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Handbook of reinforcement learning and control (pp. 321–384).

  • Zhang, W., Zhao, X., Zhao, L., Yin, D., Yang, G. H., & Beutel, A. (2020). Deep reinforcement learning for information retrieval: Fundamentals and advances. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 2468–2471).


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

Wen, Z. (2022). Reinforcement Learning. In: Chen, X., Jasin, S., Shi, C. (eds) The Elements of Joint Learning and Optimization in Operations Management. Springer Series in Supply Chain Management, vol 18. Springer, Cham. https://doi.org/10.1007/978-3-031-01926-5_2
