Markov Decision Processes


Abstract

Markov decision processes (MDPs) offer a powerful framework for tackling sequential decision-making problems in the presence of uncertainty in reinforcement learning. Their applications span various domains, including robotics, finance, and optimal control.


Notes

1.

In their book, Sutton and Barto briefly discuss why they chose $$R_{t+1}$$ rather than $$R_t$$ as the immediate reward (page 48), while also emphasizing that both conventions are widely used in the field.

2.

The notation used in Eq. (2.2), as well as in Eqs. (2.3) and (2.4), may seem unfamiliar to readers who know the work of Sutton and Barto. In their book, they write the undiscounted return as $$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_{T}$$, denoting the immediate reward as $$R_{t+1}$$. In this book, we adopt a simpler reward function and notation, as explained earlier: the immediate reward is $$R_t$$, which we assume depends only on the current state $$S_t$$ and the action $$A_t$$ taken in that state. Therefore, Eqs. (2.2), (2.3), and (2.4) start with $$R_t$$ instead of $$R_{t+1}$$, and the final reward is $$R_{T-1}$$ instead of $$R_T$$. Despite this slight shift in time step, the equations compute the same quantity: the sum of (possibly discounted) rewards over an episode.
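The equivalence of the two conventions can be seen in a short sketch: under this book's convention the episode rewards form a list $$[R_0, R_1, \ldots, R_{T-1}]$$, and the return is the same discounted sum either way. The function name and variable names below are illustrative, not taken from the book's code.

```python
def episode_return(rewards, gamma=1.0):
    """Compute G_0 for a list [R_0, R_1, ..., R_{T-1}] of per-step rewards.

    Uses the backward recursion G_t = R_t + gamma * G_{t+1}, which avoids
    explicitly raising gamma to a power at each step.
    """
    g = 0.0
    # Accumulate from the last reward backward.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(episode_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Shifting every index by one (Sutton and Barto's $$R_{t+1}, \ldots, R_T$$) would pass the identical list and produce the identical result, which is the point of the note above.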

3.

The notation used in Eq. (2.13), and later in Eq. (2.16), may also appear unfamiliar to readers of Sutton and Barto. They write the expression as $$\displaystyle \sum_{a \in A} \pi(a|s) \sum_{s', r} P(s', r|s, a) \Bigl[r + \gamma V_\pi(s')\Bigr]$$, because they combine the reward function and the dynamics function into a single four-argument function $$p(s', r | s, a)$$, from which both expected rewards and transition probabilities can be computed. In this book, we use a slightly simpler notation with separate functions for reward and dynamics, namely $$r = R(s, a)$$ and $$p(s'|s, a)$$, and we assume the reward depends only on the state-action pair, not on the successor state. This allows us to pull the immediate reward $$R(s, a)$$ out of the inner summation in Eq. (2.13). We elaborate on this distinction later in this chapter when we introduce the alternative Bellman equations.
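A minimal sketch may make the separated form concrete. Under the book's convention, one synchronous policy-evaluation sweep computes $$V(s) = \sum_a \pi(a|s)\bigl[R(s,a) + \gamma \sum_{s'} p(s'|s,a) V(s')\bigr]$$. All names here (`pi`, `R`, `P`, the two-state MDP) are hypothetical placeholders, not the book's own code.

```python
def bellman_update(V, pi, R, P, states, actions, gamma=0.9):
    """One synchronous evaluation sweep; returns the updated value table.

    pi[(s, a)] : policy probability of action a in state s
    R[(s, a)]  : immediate reward for taking a in s (independent of s')
    P[(s, a)]  : dict mapping successor state s' -> transition probability
    """
    new_V = {}
    for s in states:
        v = 0.0
        for a in actions:
            # R(s, a) sits outside the inner sum, since it does not
            # depend on the successor state s'.
            expected_next = sum(P[(s, a)].get(s2, 0.0) * V[s2] for s2 in states)
            v += pi[(s, a)] * (R[(s, a)] + gamma * expected_next)
        new_V[s] = v
    return new_V

# Tiny hypothetical two-state MDP: s0 yields reward 1 and moves to the
# absorbing state s1, which yields reward 0 forever.
states, actions = ["s0", "s1"], ["a"]
pi = {(s, "a"): 1.0 for s in states}   # a single, always-taken action
R = {("s0", "a"): 1.0, ("s1", "a"): 0.0}
P = {("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"s1": 1.0}}

V = {s: 0.0 for s in states}
for _ in range(100):                   # iterate toward the fixed point
    V = bellman_update(V, pi, R, P, states, actions, gamma=0.9)
print(V)  # V[s0] converges to 1.0, V[s1] to 0.0
```

Using Sutton and Barto's combined $$p(s', r|s, a)$$ instead would fold `R[(s, a)]` into the inner sum; both forms compute the same value function when the reward does not depend on the successor state.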

References

1. Emma Brunskill. CS234: Reinforcement Learning. Stanford University, http://web.stanford.edu/class/cs234/, 2021.

2. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.


© 2023 The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature

Cite this chapter

Hu, M. (2023). Markov Decision Processes. In: The Art of Reinforcement Learning. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-9606-6_2



• Publisher Name: Apress, Berkeley, CA

• Print ISBN: 978-1-4842-9605-9

• Online ISBN: 978-1-4842-9606-6