Stochastic revision opportunities in Markov decision problems

  • Original Research
  • Annals of Operations Research

Abstract

We extend Markov Decision Processes to situations where the actions are binding and cannot be changed in every period. Instead, the decision maker can revise her actions at random times. We consider two slightly different models. In the first, the revision opportunity appears at a specific stage at which the decision maker can change her action, but is lost if not used. The action taken then remains constant until the next revision opportunity comes up. In the second model, the revision opportunity remains open and can be used at any time after it appears. Only once the action is changed does it become binding again for another random period. We compare different stochastic revision processes and characterize when one is always preferred to another.


Notes

  1. Recall that the random variable X first-order stochastically dominates the random variable Y iff every utility maximizing DM prefers X over Y.

  2. The initial state is randomly chosen according to some distribution over the states which is a parameter of the MDP.

  3. This random variable is denoted by \(\mathbb {1}\).

  4. The strategy should also depend on whether or not it is possible to change the action, but such dependence is trivial.

  5. For some delay strategies, such as \(D(i)=0\) for all \(i\in \mathbb {N}\), the event \(\{X_D=\infty \}\) has a non-zero probability. Our analysis holds for such strategies as well, even though \(X_D\) is not a random variable in the usual sense.

  6. If \(\mathbf{Pr }_{X,D}(k)=\mathbf{Pr }(Y=k)=0\), neither random variable ever draws k, so this poses no problem for the algorithm and \(D(k)\) can be set arbitrarily.

  7. The proof can be found in standard textbooks, such as Whitmore and Findlay (1978).

  8. If \(\mathbf{Pr }_{X,D}(k)=0\) then \(\mathbf{Pr }(Y=k)=0\), and we can arbitrarily set \(D(k)=1\) and still satisfy the equation.

References

  • Ashley, R. A., & Orr, D. (1985). Further results on inventories and price stickiness. The American Economic Review, 75(5), 964–975.

  • Aviv, Y., & Pazgal, A. (2005). A partially observed Markov decision process for dynamic pricing. Management Science, 51(9), 1400–1416.

  • Beggs, A., & Klemperer, P. (1992). Multi-period competition with switching costs. Econometrica: Journal of the Econometric Society, 60, 651–666.

  • Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6, 679–684.

  • Besbes, O., & Zeevi, A. (2009). Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6), 1407–1420.

  • Chan, L. M. A., Shen, Z. J. M., Simchi-Levi, D., & Swann, J. L. (2004). Coordination of pricing and inventory decisions: A survey and classification. In D. Simchi-Levi, S. David Wu, & Z.-J. Shen (Eds.), Handbook of quantitative supply chain analysis (pp. 335–392). Berlin: Springer.

  • Chen, M., & Chen, Z.-L. (2015). Recent developments in dynamic pricing research: Multiple products, competition, and limited demand information. Production and Operations Management, 24(5), 704–731.

  • Chen, Z.-L., & Hall, N. G. (2010). The coordination of pricing and scheduling decisions. Manufacturing and Service Operations Management, 12(1), 77–92.

  • Dewenter, R., & Heimeshoff, U. (2012). Less pain at the pump? The effects of regulatory interventions in retail gasoline markets. DICE Discussion Paper No. 51.

  • Elmaghraby, W., & Keskinocak, P. (2003). Dynamic pricing in the presence of inventory considerations: Research overview, current practices, and future directions. Management Science, 49(10), 1287–1309.

  • Farias, V. F., & Van Roy, B. (2010). Dynamic pricing with a prior on market response. Operations Research, 58(1), 16–29.

  • Friedman, M. (1982). Defining monetarism. Newsweek, p. 64.

  • Gilbert, S. M. (2000). Coordination of pricing and multiple-period production across multiple constant priced goods. Management Science, 46(12), 1602–1616.

  • Lehrer, E., & Shmaya, E. (2008). Two remarks on Blackwell’s theorem. Journal of Applied Probability, 45(2), 580–586.

  • Libich, J., & Stehlík, P. (2011). Endogenous monetary commitment. Economics Letters, 112(1), 103–106.

  • Lipman, B. L., & Wang, R. (2009). Switching costs in infinitely repeated games. Games and Economic Behavior, 66(1), 292–314.

  • Moon, K., Bimpikis, K., & Mendelson, H. (2017). Randomized markdowns and online monitoring. Management Science, 64, 1271–1290.

  • Padilla, A. J. (1995). Revisiting dynamic duopoly with consumer switching costs. Journal of Economic Theory, 67(2), 520–530.

  • Slade, M. E. (1992). Vancouver’s gasoline-price wars: An empirical exercise in uncovering supergame strategies. The Review of Economic Studies, 59(2), 257–276.

  • Taylor, J. B. (1999). Staggered price and wage setting in macroeconomics. Handbook of Macroeconomics, 1, 1009–1050.

  • Uğurlu, K. (2017). Controlled Markov chains with AVaR criteria for unbounded costs. Journal of Computational and Applied Mathematics, 319, 24–37.

  • Uğurlu, K. (2018). Robust optimal control using conditional risk mappings in infinite horizon. Journal of Computational and Applied Mathematics, 344, 275–287.

  • White, D. J. (1978). Finite dynamic programming: An approach to finite Markov decision processes. New York: Wiley.

  • White, D. J. (1993). A survey of applications of Markov decision processes. The Journal of the Operational Research Society, 44(11), 1073–1096.

  • Whitmore, G. A., & Findlay, M. C. (1978). Stochastic dominance: An approach to decision-making under risk. Lanham: Lexington Books.

  • Yousefi, M. R., Datta, A., & Dougherty, E. (2013). Optimal intervention in Markovian gene regulatory networks with random-length therapeutic response to antitumor drug. IEEE Transactions on Biomedical Engineering, 60(12), 3542–3552.

Author information

Corresponding author

Correspondence to Yevgeny Tsodikovich.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors wish to thank Eilon Solan, Galit Golan-Ashkenazi, David Lagziel, Bar Light, Ilan Nehama and two anonymous referees of the Annals of Operations Research for their highly valuable comments. We acknowledge the support of the Israel Science Foundation, Grants #963/15 and #2510/17.

Appendices

The topology of \(\left[ X\right] \)

In this section we show that the set \(\left[ X\right] \) can be projected into \(\mathbb {R}^\infty \), where its topology can be studied in a simpler manner. Specifically, we show that the projection of \(\left[ X\right] \) into \(\mathbb {R}^M\) for any \(M\in \mathbb {N}\) is closed and convex, which is the basis of the separation needed in Theorem 1.

Since the random variables in \(\left[ X\right] \) have support over \(\mathbb {N}\), it is possible to identify them with vectors in \(\mathbb {R}^M\) for any \(M\in \mathbb {N}\) by setting \({\underline{X}}_M=(\mathbf{Pr }(X=1),\ldots ,\mathbf{Pr }(X=M))\) and defining \(\left[ X\right] _M=\{{\underline{Y}}_M \vert Y\in \left[ X\right] \}\) as the projection of \(\left[ X\right] \) into \(\mathbb {R}^M\). In Lemma 1 we show that \(\left[ X\right] _M \subseteq \mathbb {R}^M\) is closed and convex so it can be separated from random variables that are not in \(\left[ X\right] _M\) by standard hyperplane separation theorems. The separating vector is the basis of the MDP constructed in Theorem 1 in which all the revision processes in \(\left[ X\right] \) perform worse than a revision process that is not induced by X, proving the “only if” part of the theorem.

Lemma 1

The set \(\left[ X\right] _M \subseteq \mathbb {R}^M\) is convex and closed.

Proof

Step 1 Convexity

Let \(X_1,X_2\in \left[ X\right] \) (thus \(\underline{X_1}_M,\underline{X_2}_M\in \left[ X\right] _M\)), let \(D_1,D_2\) be the corresponding stationary delay strategies, and let \(\lambda \in (0,1)\). Define \(Y\sim \lambda X_1 + (1-\lambda ) X_2\) in the sense that \(\mathbf{Pr }(Y=k)=\lambda \mathbf{Pr }(X_1=k) + (1-\lambda ) \mathbf{Pr }(X_2=k)\) for all \(k\in \mathbb {N}\). We construct by induction a stationary delay strategy \(D\) that induces a random variable with the same distribution as Y:

$$\begin{aligned} \mathbf{Pr }(X_D=k) = \lambda \mathbf{Pr }(X_1=k) + (1-\lambda ) \mathbf{Pr }(X_2=k) = \mathbf{Pr }(Y=k), \end{aligned}$$
(14)

and

$$\begin{aligned} \mathbf{Pr }_{X,D}(k) = \lambda \mathbf{Pr }_{X,D_1}(k) + (1-\lambda ) \mathbf{Pr }_{X,D_2}(k). \end{aligned}$$
(15)

Clearly, setting \(D(1)=\lambda D_1(1)+(1-\lambda )D_2(1)\) fulfils both equations since \(\mathbf{Pr }(X_D=1)=\mathbf{Pr }(X=1)D(1)=\mathbf{Pr }(Y=1)\) and \(\mathbf{Pr }_{X,D}(1)=\mathbf{Pr }_{X,D_1}(1)=\mathbf{Pr }_{X,D_2}(1)=\mathbf{Pr }(X=1)\).

Assume that we have defined \(D(i)\) for all \(i<k\) so that Eqs. (14) and (15) are satisfied. We shall prove that there exists \(D(k)\in [0,1]\) so that these equations hold for k as well. Recall that \(\mathbf{Pr }_{X,D}(k)\) is the probability that the sum k comes up when using the stationary delay strategy \(D\), which happens if, for some \(i\in \{1,\ldots ,k\}\), the sum was \(k-i\) in the previous step, the DM decided to continue, and the new realization of \(X\) was i:

$$\begin{aligned} \mathbf{Pr }_{X,D}(k)= & {} \sum \limits _{i=1}^k\mathbf{Pr }_{X,D}(k-i)\left( 1-D(k-i)\right) \mathbf{Pr }(X=i) \nonumber \\&\overset{eq. (5)}{=} \sum \limits _{i=1}^k\mathbf{Pr }(X=i)\left( \mathbf{Pr }_{X,D}(k-i)-\mathbf{Pr }(X_D= k-i)\right) . \end{aligned}$$
(16)

Applying the induction assumption on the last equality leads to

$$\begin{aligned} \mathbf{Pr }_{X,D}(k)= & {} \sum \limits _{i=1}^k\mathbf{Pr }(X=i)\Big (\lambda \mathbf{Pr }_{X,D_1}(k-i) + (1-\lambda ) \mathbf{Pr }_{X,D_2}(k-i) \nonumber \\&- \left( \lambda \mathbf{Pr }(X_1=k-i) + (1-\lambda ) \mathbf{Pr }(X_2=k-i)\right) \Big ) \nonumber \\= & {} \sum \limits _{i=1}^k\mathbf{Pr }(X=i)\Big (\lambda \big (\mathbf{Pr }_{X,D_1}(k-i)-\mathbf{Pr }(X_1=k-i)\big ) \nonumber \\&+ (1-\lambda ) \big (\mathbf{Pr }_{X,D_2}(k-i)-\mathbf{Pr }(X_2=k-i)\big )\Big ) \nonumber \\&\overset{eq. (5)}{=} \sum \limits _{i=1}^k\mathbf{Pr }(X=i)\Big (\lambda \mathbf{Pr }_{X,D_1}(k-i)(1-D_1(k-i)) \nonumber \\&+ (1-\lambda )\mathbf{Pr }_{X,D_2}(k-i)(1-D_2(k-i))\Big ) \nonumber \\= & {} \lambda \sum \limits _{i=1}^k\mathbf{Pr }_{X,D_1}(k-i)(1-D_1(k-i))\mathbf{Pr }(X=i) \nonumber \\&+ (1-\lambda ) \sum \limits _{i=1}^k\mathbf{Pr }_{X,D_2}(k-i)(1-D_2(k-i))\mathbf{Pr }(X=i) \nonumber \\= & {} \lambda \mathbf{Pr }_{X,D_1}(k) + (1-\lambda ) \mathbf{Pr }_{X,D_2}(k). \end{aligned}$$
(17)

Equation (15) is therefore satisfied for k as well. In addition

$$\begin{aligned} \mathbf{Pr }_{X,D}(k)= & {} \lambda \mathbf{Pr }_{X,D_1}(k) + (1-\lambda ) \mathbf{Pr }_{X,D_2}(k)\nonumber \\\ge & {} \lambda \mathbf{Pr }_{X,D_1}(k)D_1(k) + (1-\lambda ) \mathbf{Pr }_{X,D_2}(k)D_2(k)\nonumber \\= & {} \lambda \mathbf{Pr }(X_1=k) + (1-\lambda ) \mathbf{Pr }(X_2=k) =\mathbf{Pr }(Y=k) \end{aligned}$$
(18)

so it is possible to define \(D(k)=\frac{{\mathbf{Pr }}(Y=k)}{\mathbf{Pr }_{X,D}(k)}\le 1\), which according to Eq. (5) sets \(\mathbf{Pr }(X_D=k)=\mathbf{Pr }(Y=k)=\lambda \mathbf{Pr }(X_1=k) + (1-\lambda ) \mathbf{Pr }(X_2=k)\) and satisfies Eq. (14) for k as well (Footnote 8). The random variable \(X_{D}\) has the same distribution as Y, thus \(Y\in \left[ X\right] \) and \({\underline{Y}}_M\in \left[ X\right] _M\), which proves convexity.
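
For concreteness, the inductive construction of \(D\) can be carried out numerically. The following is a minimal sketch, not code from the paper: the representation of X as a dictionary of probabilities, the finite horizon K, and the helper names reach_and_stop and mix are our own illustrative choices.

```python
# Sketch of Lemma 1, Step 1: given stationary delay strategies D1, D2 on X, build D such that
# Pr(X_D = k) = lam*Pr(X_{D1} = k) + (1-lam)*Pr(X_{D2} = k) for k = 1..K (illustrative names).

def reach_and_stop(px, D, K):
    """reach[k] = Pr_{X,D}(k) (the accumulated sum hits k), stop[k] = Pr(X_D = k), for k = 0..K.
    px[i] = Pr(X = i); the accumulated sum starts at 0, where the DM never stops."""
    reach, stop = [1.0] + [0.0] * K, [0.0] * (K + 1)
    for k in range(1, K + 1):
        # continue from a previous sum k-i (probability reach - stop) and then draw i
        reach[k] = sum((reach[k - i] - stop[k - i]) * px.get(i, 0.0) for i in range(1, k + 1))
        stop[k] = reach[k] * D[k]                     # Eq. (5): Pr(X_D = k) = Pr_{X,D}(k) * D(k)
    return reach, stop

def mix(px, D1, D2, lam, K):
    """Inductively define D(1..K) so that X_D is the lam-mixture of X_{D1} and X_{D2}."""
    _, s1 = reach_and_stop(px, D1, K)
    _, s2 = reach_and_stop(px, D2, K)
    D = [0.0] * (K + 1)
    for k in range(1, K + 1):
        reach, _ = reach_and_stop(px, D, k)           # uses only D(1),...,D(k-1)
        target = lam * s1[k] + (1 - lam) * s2[k]      # Pr(Y = k), as in Eq. (14)
        D[k] = target / reach[k] if reach[k] > 0 else 1.0   # footnote 8: arbitrary if both vanish
    return D

if __name__ == "__main__":
    px = {1: 0.5, 2: 0.5}                             # X uniform on {1, 2}
    D1 = [0.0] + [1.0] * 6                            # always stop: X_{D1} ~ X
    D2 = [0.0, 0.0] + [1.0] * 5                       # skip the opportunity at sum 1
    D = mix(px, D1, D2, lam=0.5, K=6)
    _, s = reach_and_stop(px, D, 6)
    print([round(p, 4) for p in s[1:4]])              # [0.25, 0.625, 0.125], the 0.5-mixture
```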

Step 2 Closedness

Consider the function \(f:[0,1]^M \rightarrow \mathbb {R}^M\) defined by

$$\begin{aligned} f(d_1,\ldots ,d_M)=\big (\mathbf{Pr }(Y=1),\ldots ,\mathbf{Pr }(Y=M)\big ) \end{aligned}$$
(19)

with Y being the random variable created by applying the following stationary delay strategy:

$$\begin{aligned} D(i)={\left\{ \begin{array}{ll} d_i \quad \text{ if } i\le M, \\ 1 \quad \text{ otherwise, }\end{array}\right. } \end{aligned}$$
(20)

on X. The ith element of \(f(d_1,\ldots ,d_M)\) is \(\mathbf{Pr }(Y=i)\), which according to Eq. (5) is a multivariate polynomial in the variables \(d_1,\ldots ,d_M\), and hence continuous. This means that f itself is a continuous function. According to the closed map lemma (a continuous map from a compact space to a Hausdorff space is a closed map), f is a closed map and thus \(f([0,1]^M)\) is a closed set. It remains to show that this closed set is exactly \(\left[ X\right] _M\).

Let \(Y\in \left[ X\right] \) and let \(D_Y\) be the corresponding stationary delay strategy. Clearly, \(f(D_Y(1),\ldots ,D_Y(M))={\underline{Y}}_M\) since the stationary delay strategy that is defined in Eq. (20) and \(D_Y\) are equal on every \(i\le M\). Therefore, the projection of any \(Y\in \left[ X\right] \) on \(\mathbb {R}^M\) is an image of f which implies that f is onto \(\left[ X\right] _M\) and \(\left[ X\right] _M=f([0,1]^M)\) is a closed set.\(\square \)

As a final remark, we note that a random variable can be in \(\left[ X\right] _M\) even when it is not in \(\left[ X\right] \). This happens when the random variable agrees with some variable in \(\left[ X\right] \) on the probabilities of the first M natural numbers but not on others. Nevertheless, this cannot happen for every M: if a random variable is not in \(\left[ X\right] \), then there is some \(M\in \mathbb {N}\) for which it is not in \(\left[ X\right] _M\). This result is proven in the following lemma and ensures that there are no pathological cases in which \(Y\notin \left[ X\right] \) but \(Y\in \left[ X\right] _M\) for every M. Moreover, the proof introduces an algorithm for determining whether \(Y\in \left[ X\right] \) or not, as shown in Sect. 3.1.3.

Lemma 2

If \(Y\notin \left[ X\right] \) then there exists an \(M\in \mathbb {N}\) such that \({\underline{Y}}_M\notin \left[ X\right] _M\).

Proof

The proof proceeds by induction. We attempt to construct \(D\) satisfying \(X_D\sim Y\) and set M to be the first natural number at which the construction fails.

First, examine \(n=1\). If \(\mathbf{Pr }(X=1)<\mathbf{Pr }(Y=1)\) then for every \(D(1)\in [0,1]\) we get \(\mathbf{Pr }(X_D=1)=\mathbf{Pr }(X=1)D(1)<\mathbf{Pr }(Y=1)\), hence \({\underline{Y}}_1\notin \left[ X\right] _1\) and \(M=1\). Otherwise, by defining \(D(1)=\frac{{\mathbf{Pr }}(Y=1)}{\mathbf{Pr }(X=1)}\) (or arbitrarily setting \(D(1)=1\) if both are zero) we get \({\underline{Y}}_1\in \left[ X\right] _1\).

Second, suppose that we have already defined \(D(k)\) for every \(1\le k<n\) so that \({\underline{Y}}_k\in \left[ X\right] _k\). According to Eq. (5) this determines \(\mathbf{Pr }_{X,D}(n)\), the probability of obtaining a sequence of realizations that sums up to n. There are two possible cases:

  1. Case 1

    \(\mathbf{Pr }_{X,D}(n)\ge \mathbf{Pr }(Y=n)\): Setting \(D(n)=\frac{{\mathbf{Pr }}(Y=n)}{\mathbf{Pr }_{X,D}(n)}\) (or arbitrarily setting \(D(n)=1\) if both are zero) ensures that \(X_D\) agrees with Y on n as well, and thus \({\underline{Y}}_n\in \left[ X\right] _n\).

  2. Case 2

    \(\mathbf{Pr }_{X,D}(n) < \mathbf{Pr }(Y=n)\): For every \(D(n)\in [0,1]\) we get \(\mathbf{Pr }(X_D=n)=\mathbf{Pr }_{X,D}(n)D(n)<\mathbf{Pr }(Y=n)\) and \({\underline{Y}}_n\notin \left[ X\right] _n\).

If Case 1 occurs for every \(n\in \mathbb {N}\), then we are able to define \(D\) so that \(X_D\sim Y\), which contradicts the assumption that \(Y\notin \left[ X\right] \). Therefore, there exists an n for which Case 2 occurs. Any value of \(D(n)\) would induce some random variable \(X_D\) that disagrees with Y on n. Moreover, changing the values of \(D(i)\) for \(i<n\) would only cause the disagreement to occur earlier, because the \(D(i)\) are either determined uniquely (in Case 1) or set arbitrarily to 1 without affecting the induced random variable. Either way, the resulting \(X_D\) for every \(D\) disagrees with Y on some \(i\le n\), which proves that \({\underline{Y}}_n\notin \left[ X\right] _n\).\(\square \)
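
Read constructively, the proof gives a greedy membership test for \(\left[ X\right] \) up to a finite horizon, which is the algorithm referred to in Sect. 3.1.3. A self-contained sketch follows; the function name first_failure, the dictionary inputs and the numerical tolerance are our own assumptions, not the paper's code.

```python
# Sketch of the test read off from Lemma 2: greedily define D(n) while Pr_{X,D}(n) >= Pr(Y=n)
# (Case 1) and report the first n at which Case 2 occurs, i.e. underline{Y}_n is not in [X]_n.

def first_failure(px, py, horizon):
    """px[i] = Pr(X = i), py[i] = Pr(Y = i).  Returns (None, D) if Y agrees with some X_D up to
    'horizon', or (n, None) where n is the first index at which the construction fails."""
    reach, stop, D = [1.0], [0.0], [0.0]                # index 0: the start, never stopped at
    for n in range(1, horizon + 1):
        r = sum((reach[n - i] - stop[n - i]) * px.get(i, 0.0) for i in range(1, n + 1))
        y = py.get(n, 0.0)
        if r < y - 1e-12:                               # Case 2: no D(n) can produce Pr(Y = n)
            return n, None
        D.append(y / r if r > 0 else 1.0)               # Case 1 (footnote 6: arbitrary if both are 0)
        reach.append(r)
        stop.append(r * D[n])                           # Pr(X_D = n) = Pr_{X,D}(n) * D(n)
    return None, D

if __name__ == "__main__":
    px = {1: 0.5, 2: 0.5}
    print(first_failure(px, {2: 0.75, 3: 0.25}, 10))    # feasible: never stop at sum 1
    print(first_failure(px, {1: 0.8, 2: 0.2}, 10))      # fails at n = 1 since Pr(X=1) < Pr(Y=1)
```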

Proofs of Theorems

Theorem 1

A revision process \(X_1\) S-dominates the revision process \(X_2\) (\(X_1\succeq _SX_2\)) if and only if the distribution of \(X_2\) can be achieved by applying some stationary delay strategy on \(X_1\), i.e. \(X_2 \in \left[ X_1\right] \).

Proof of Theorem 1

Assume \(X_2 \in \left[ X_1\right] \).

Fix an MDP with a strict-revision process \(X_2\) and let \(\sigma \) be the optimal pure Markovian policy for the DM. We shall prove that the DM can emulate this policy when she is bounded by \(X_1\); thus the optimal policy under \(X_1\) yields at least as high a payoff. Since \(X_2 \in \left[ X_1\right] \), denote the corresponding stationary delay strategy by \(D\). Consider the following strategy \(\sigma '\):

  1. Step 1

    At \(t=0\) w.l.o.g. the state is s. Set the action according to \(\sigma (s)\) and set \(n=0\).

  2. Step 2

    Suppose the DM has a revision opportunity after k stages and the state is now \(s_i\). Decide to “stop” with probability \(D(n+k)\):

    1. Step 2.1

      If \(D\) chooses to stop: set \(n=0\), change the action according to \(\sigma (s_i)\) and return to Step 2.

    2. Step 2.2

      If \(D\) chooses to continue: increase n by k, keep the previous action and return to Step 2.

When following this strategy, whenever Step 2.2 occurs the DM keeps the previous action, although she could have changed it. Direct computation reveals that the revisions conducted in Step 2.1 occur according to the distribution of the revision process \(X_2\), and the strategy that is carried out is in fact \(\sigma \). Thus, following the strategy \(\sigma '\) in this problem with the revision process \(X_1\) yields \(V_S(X_2)\). Naturally, the optimal strategy with \(X_1\) yields at least this payoff, so \(V_S(X_1)\ge V_S(X_2)\). This result holds for any MDP, thus \(X_1 \succeq _SX_2\).
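
The emulation in \(\sigma '\) is easy to visualize by simulation: opportunities arrive according to \(X_1\), and the DM acts on one only with probability \(D(\cdot )\) evaluated at the time elapsed since the last action change. A short Monte-Carlo sketch is given below (our own illustration; the names sample_x1 and emulate_intervals and the specific choices of \(X_1\) and \(D\) are assumptions).

```python
# Sketch of sigma': accept a revision opportunity with probability D(n), where n is the number of
# stages since the last accepted revision; the accepted inter-revision times are then ~ X_D ~ X_2.

import random
from collections import Counter

def emulate_intervals(sample_x1, D, num_revisions, rng):
    """Lengths of the intervals between the revisions actually carried out in Step 2.1."""
    intervals, n = [], 0
    while len(intervals) < num_revisions:
        n += sample_x1(rng)                 # next opportunity arrives after an X_1-distributed delay
        if rng.random() < D(n):             # Step 2.1: revise the action and reset the counter
            intervals.append(n)
            n = 0
        # else Step 2.2: keep the current action and wait for the next opportunity
    return intervals

if __name__ == "__main__":
    rng = random.Random(0)
    sample_x1 = lambda r: r.choice([1, 2])               # X_1 uniform on {1, 2}
    D = lambda n: 0.0 if n == 1 else 1.0                 # stationary delay strategy: never stop at 1
    freq = Counter(emulate_intervals(sample_x1, D, 100000, rng))
    print({k: round(v / 100000, 3) for k, v in sorted(freq.items())})   # close to {2: 0.75, 3: 0.25}
```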

Assume \(X_2 \notin \left[ X_1\right] \).

According to Lemma 2 there exists an n for which \(X_2 \notin \left[ X_1\right] _n\) and according to Lemma 1 the set \(\left[ X_1\right] _n\) is closed and convex. Due to the strict hyperplane separation theorem there exists \({\underline{v}}=(v_1,\ldots ,v_n)\) that separates the vector \({\underline{X}}_2\in \mathbb {R}^n\) from the set \(\left[ X_1\right] _n\subseteq \mathbb {R}^n\) in the sense that \({\underline{v}}\cdot {\underline{X}}_2 > {\underline{v}}\cdot {\underline{X}}\) for all \({\underline{X}} \in \left[ X_1\right] _n\). Using \({\underline{v}}\) we construct the MDP illustrated in Fig. 5.

Fig. 5: An MDP where the optimal strategy with the strict-revision process \(X_2\notin \left[ X_1\right] \) yields a better payoff than with \(X_1\)

The MDP starts at the state \(s_0\) and in each state the DM can choose either the action “C” or “S”. When choosing “C”, the payoff is 0 and the MDP transitions to the next state. When choosing “S” for the first time, at state \(s_i\), the DM receives \(\frac{v_i}{\delta ^ i}\) and the MDP is absorbed at the state \(0^*\). Formally, as long as the action is “C” the states change deterministically, \(\mathbf{Pr }(s_{i+1}|s_i,C)=1\) for \(i=0,1,\ldots ,n-1\), and when the action is “S”, \(\mathbf{Pr }(0^*|s_i,S)=1\) for \(i=1,\ldots ,n\). Choosing “S” at the initial state is not profitable and leads to a very low payoff of \(-M\). In addition, once the action “S” has been chosen in the first n stages, the MDP remains absorbed at \(0^*\).

Consider the following strategy when the revision process is \(X_2\). Set “C” at state \(s_0\), and “S” at the first revision opportunity. The probability that this revision occurs at state \(s_i\) is \(\mathbf{Pr }(X_2=i)\) and the \(\delta \)-discounted payoff is \(\mathbf{Pr }(X_2=i)\delta ^{i} \frac{v_i}{\delta ^i}=\mathbf{Pr }(X_2=i)v_i\). The expected payoff is therefore \(\sum \nolimits _{i=1}^n \mathbf{Pr }(X_2=i) v_i = {\underline{X}}_2 \cdot {\underline{v}}\) and clearly \(V_S(X_2)\ge {\underline{X}}_2 \cdot {\underline{v}}\).

As stated before, this MDP with the revision process \(X_1\) has an optimal strategy which is pure Markovian, denoted by \(\sigma \). Clearly, \(\sigma (s_0)=C\) so the DM always continues to the states \(s_i\) for \(i>0\) and the optimal strategy simply indicates in which of the states the DM should choose “S”. One can refer to \(\sigma \) as a stationary delay strategy by identifying “C” with continuing and “S” with stopping. Denote by \(X_\sigma \) the induced random variable according to the delay strategy \(\sigma \). The probability of getting to the state \(s_i\) when acting according to \(\sigma \) and choosing “S” there for the first time is \(\mathbf{Pr }(X_\sigma =i)\) which means that \(X_\sigma \) is the distribution on the states \(s_1,\ldots ,s_n\) for which the DM chooses “S”. The expected \(\delta \)-discounted payoff is therefore \(\sum \nolimits _{i=1}^n \mathbf{Pr }(X_\sigma =i)\delta ^i \frac{v_i}{\delta ^i}= {\underline{X}}_\sigma \cdot {\underline{v}}<{\underline{X}}_2 \cdot {\underline{v}}\) where the last inequality is because \(X_\sigma \in \left[ X_1\right] _n\). In this MDP, \(V_S(X_1)< V_S(X_2)\) thus \(X_1 \not \succeq _SX_2\) and the proof is complete.\(\square \)

Proposition 1

Let \(X\in \Delta (\mathbb {N})\) and let Y be some revision process induced by X with some delay strategy, not necessarily stationary. There exists a stationary delay strategy \(D\) so that the corresponding random variable \(X_D\) has the same distribution as Y, i.e., \(Y\sim X_D\in \left[ X\right] \).

Proof of Proposition 1

The proof follows the general idea of the proofs of Theorem 1 and Lemma 2. If \(Y\notin \left[ X\right] \) then there is an \(n\in \mathbb {N}\) such that \({\underline{Y}}_n\notin \left[ X\right] _n\). Therefore there exists a vector \({\underline{v}}\in \mathbb {R}^n\) that separates \({\underline{Y}}_n\) from the set \(\left[ X\right] _n\). Using \({\underline{v}}\), one can construct an MDP as shown in Fig. 5 with strict revision according to \(X\). Interpreting the delay strategy that induces Y as a strategy in this MDP (the probability of choosing “S” for the first time after each history) yields an expected payoff of \({\underline{Y}}_n\cdot {\underline{v}}\), which is strictly larger than the payoff \({\underline{Z}}_n\cdot {\underline{v}}\) of every \(Z\in \left[ X\right] \). However, this MDP with the strict-revision process \(X\) has an optimal stationary strategy, which corresponds to a stationary delay strategy, a contradiction.\(\square \)

Proposition 2

Let \(X\in \Delta (\mathbb {N})\) with CDF \(F_X\) and let Y be some revision process induced by X with the stationary delay strategy \(D^Y(n)\). Set \(F_0(t)=F_X(t)\) and define

$$\begin{aligned} F_n(t)= {\left\{ \begin{array}{ll} F_{n-1}(t)-(1-D^Y(n))(F_{n-1}(n)-F_{n-1}(n-1))(1-F_X(t-n)) \text{ for } t\ge n,\\ F_{n-1}(t) \text{ for } t<n. \end{array}\right. } \end{aligned}$$
(13)

Then the CDF of Y is \(F_Y(t)=\lim \nolimits _{n\rightarrow \infty } F_n(t)\).

Proof of Proposition 2

We shall prove that the random variable \(X_n\) that corresponds to \(F_n\) is induced by X when setting \(D^n(k)=D^Y(k)\) for \(k\le n\) and \(D^n(k)=1\) otherwise. This proves the proposition since assigning values to \(D(k)\) for \(k>n\) does not change the probabilities of stopping at \(1,\ldots ,n\), as was shown in Lemma 2. This trivially holds for \(X_0=X\) and for \(X_1\) since for every \(1<t\in \mathbb {N}\):

$$\begin{aligned} \mathbf{Pr }(X_1=t)= & {} F_1(t)-F_1(t-1)\nonumber \\= & {} F_X(t)-F_X(t-1)-(1-D^Y(1))F_X(1)(-F_X(t-1)+F_X(t-2)) \nonumber \\= & {} \mathbf{Pr }(X=t)+\mathbf{Pr }(X=1)(1-D^1(1))\mathbf{Pr }(X=t-1), \end{aligned}$$
(21)

which is the probability of either getting t and stopping or getting 1, not stopping with probability \(1-D^1(1)\) and then getting \(t-1\) and stopping. For \(t=1\) the above is simply \(\mathbf{Pr }(X_1=1)=F_1(1)=F_X(1)-(1-D^1(1))F_X(1)=D^1(1)\mathbf{Pr }(X=1)\).

Similarly, if the statement is true for \(1,\ldots ,n\) then for \(n+1\) and \(t\ge n+1\) we get:

$$\begin{aligned} \mathbf{Pr }(X_{n+1}=t)= & {} F_{n+1}(t)-F_{n+1}(t-1) \end{aligned}$$
(22)
$$\begin{aligned}= & {} \mathbf{Pr }(X_n=t)+\mathbf{Pr }(X_n=n+1)(1-D^Y(n+1))\mathbf{Pr }(X=t-(n+1)).\nonumber \\ \end{aligned}$$
(23)

The first probability is the probability that \(X_n\) reached t, which includes all the possibilities of reaching t in which the last value randomized by X is larger than \(t-(n+1)\). The second probability is the probability that \(X_n\) reached \(n+1\) and, instead of stopping with probability 1 (as \(X_n\) would have done), the randomization process continues with probability \(1-D^Y(n+1)\) and the next value is \(t-(n+1)\). There are no other possibilities to reach t, as any accumulated sum greater than \(n+1\) stops the process. This holds also for \(t<n+1\), since then \(\mathbf{Pr }(X_{n+1}=t)=\mathbf{Pr }(X_n=t)\), and the proof is complete.\(\square \)
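
The recursion in Eq. (13) can also be evaluated directly. Below is a minimal numerical sketch (our own, with the illustrative names cdf_recursion and DY and a finite truncation horizon T); since \(F_n(t)=F_{n-1}(t)\) for \(t<n\), the first T steps already determine \(F_Y\) on \(\{0,\ldots ,T\}\).

```python
# Sketch of Proposition 2: iterate Eq. (13) starting from F_0 = F_X; each step n removes the mass
# at which D^Y does not stop at sum n and spreads it according to a fresh draw of X.

def cdf_recursion(px, DY, T):
    """px[i] = Pr(X = i); DY(n) is the stationary delay strategy; returns F_T on {0, ..., T}."""
    FX = [sum(px.get(i, 0.0) for i in range(0, t + 1)) for t in range(T + 1)]
    F = FX[:]                                             # F_0 = F_X
    for n in range(1, T + 1):
        mass = F[n] - F[n - 1]                            # F_{n-1}(n) - F_{n-1}(n-1)
        F = [F[t] if t < n else F[t] - (1.0 - DY(n)) * mass * (1.0 - FX[t - n])
             for t in range(T + 1)]
    return F

if __name__ == "__main__":
    px = {1: 0.5, 2: 0.5}                                 # X uniform on {1, 2}
    DY = lambda n: 0.0 if n == 1 else 1.0                 # never stop at sum 1, always stop later
    print([round(v, 4) for v in cdf_recursion(px, DY, 6)])  # [0, 0, 0.75, 1, 1, 1, 1]: the CDF of X_D
```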

Theorem 2

A revision process \(X_1\) P-dominates the revision process \(X_2\) (\(X_1\succeq _PX_2\)) if and only if \(X_2\) first-order stochastically dominates \(X_1\).

Proof of Theorem 2

Assume \(X_2\) first-order stochastically dominates \(X_1\).

Fix an MDP and let \(\sigma \) be the optimal pure Markovian policy for the DM with persistent-revision process \(X_2\). We shall prove that the DM can emulate this policy when the revision process is \(X_1\); thus the optimal policy under \(X_1\) yields at least as high a revenue.

Since \(X_2\) FOSD \(X_1\), there exists a random variable \(Z\in \Delta (\mathbb {N})\) (that depends on \(X_1\)) such that \(X_2\sim X_1+Z\). Consider the following strategy \(\sigma '\):

  1. Step 1

    At \(t=0\) w.l.o.g. the state is s. Choose the action \(a=\sigma (s)\).

  2. Step 2

    Suppose the first revision opportunity arrives after k stages, during which the action was a. Keep this action for \(Z|(X_1=k)\) additional stages.

  3. Step 3

    Suppose that the state after the additional \(Z|(X_1=k)\) stages is \(s'\). Check if \(\sigma (s',a)=a\):

    • YES. Keep the same action for this stage as well and return to step 3.

    • NO. Change the action to \(\sigma (s',a)\) and return to step 2 with the new action as a.

By following \(\sigma '\), the DM waits \(X_1+Z\) stages each time before she starts considering whether or not to change her action. Since \(X_2 \sim X_1+Z\), it follows that the DM de facto executes the optimal strategy for the revision process \(X_2\). Naturally, the optimal strategy for \(X_1\) yields at least this revenue: \(V_P(X_1)\ge V_P(X_2)\). This is true for any MDP, thus \(X_1 \succeq _PX_2\).
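
The coupling \(X_2\sim X_1+Z\) used above is a standard consequence of first-order stochastic dominance. One concrete way to realize it, not spelled out in the paper, is the quantile (monotone) coupling sketched below; the names quantile and sample_delay and the example CDFs are our own.

```python
# Sketch of one standard realization of Z: draw a common uniform U and set X_1 = F_1^{-1}(U),
# X_2 = F_2^{-1}(U). When X_2 FOSD X_1 this gives X_2 >= X_1 pointwise, so Z = X_2 - X_1 >= 0.

import random

def quantile(cdf, u):
    """Smallest k >= 1 with cdf(k) >= u (generalized inverse of a CDF supported on the naturals)."""
    k = 1
    while cdf(k) < u:
        k += 1
    return k

def sample_delay(cdf1, cdf2, rng):
    """One draw of (x1, z): the revision opportunity x1 ~ X_1 and the extra wait z used in Step 2."""
    u = rng.random()
    x1, x2 = quantile(cdf1, u), quantile(cdf2, u)
    return x1, x2 - x1

if __name__ == "__main__":
    cdf1 = lambda k: min(1.0, 0.5 * k)                    # X_1 uniform on {1, 2}
    cdf2 = lambda k: min(1.0, 0.5 * (k - 1))              # X_2 uniform on {2, 3}, which FOSD X_1
    rng = random.Random(1)
    print([sample_delay(cdf1, cdf2, rng) for _ in range(5)])   # z is always nonnegative
```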

Assume \(X_2\) does not first-order stochastically dominate \(X_1\).

Fig. 6: An MDP where the optimal strategy with the persistent-revision process \(X_2\) yields a higher revenue than under \(X_1\), since \(X_2\) does not first-order stochastically dominate \(X_1\)

Since \(X_2\) does not FOSD \(X_1\), there exists \(k\in \mathbb {N}\) such that \(\mathbf{Pr }(X_2\le k)>\mathbf{Pr }(X_1\le k)\). Consider the MDP illustrated in Fig. 6. The MDP starts at the state \(s_0\) and in each state the DM can play either “C” or “S”. Choosing “S” at state \(s_0\) yields a payoff of \(-1\) and at state \(s_k\) a payoff of 1. In the rest of the states the payoff is 0, which is also the payoff in every state when playing “C”, as shown in the table.

Clearly, any optimal strategy is to choose “C” in the initial state \(s_0\) and then change it to “S” sometime before or at the state \(s_k\). The probability that such a revision is possible with the revision process \(X\) is \(\mathbf{Pr }(X\le k)\), and the discounted payoff in this case is \(\delta ^k\). Otherwise the payoff is 0. Since \(\mathbf{Pr }(X_2\le k)>\mathbf{Pr }(X_1\le k)\), we get \(V_P(X_2)=\delta ^k\mathbf{Pr }(X_2\le k)>\delta ^k\mathbf{Pr }(X_1\le k)=V_P(X_1)\) and therefore \(X_1\not \succeq _PX_2\).\(\square \)

Corollary 1

If \(X_1\) S-dominates \(X_2\) then it also P-dominates \(X_2\).

Proof of Corollary 1

Assume \(X_1\succeq _SX_2\). According to Theorem 1, \(X_2\in \left[ X_1\right] \), which implies that there exists a stationary delay strategy \(D\) that, applied to \(X_1\), induces a random variable with the distribution of \(X_2\). Let X be a random variable with the distribution of \(X_1\) that represents the first randomization of \(X_1\) when inducing \(X_2\) according to \(D\), and let Z be the sum of all the future randomizations until \(D\) decides to stop. It follows that \(Z\vert X=k\) is the distribution of the sum of the subsequent randomizations of \(X_1\) if k was the result of the first randomization. Since \(X_1>0\), we always get \(Z\ge 0\). Since X represents the first randomization and Z the rest of the randomizations under the delay strategy \(D\), \(X+Z\) has the same distribution as \(X_2\), so \(X_2 \sim X_1 +Z\), which proves that \(X_2\) FOSD \(X_1\) and, according to Theorem 2, \(X_1\succeq _PX_2\).\(\square \)


About this article

Cite this article

Tsodikovich, Y., Lehrer, E. Stochastic revision opportunities in Markov decision problems. Ann Oper Res 279, 251–270 (2019). https://doi.org/10.1007/s10479-019-03252-9
