Application: Choosing a fast route given uncertain delays, Controlling a Markov chain

Topics: Stochastic Dynamic Programming, Markov Decision Problems

13.1 Model

One is given a finite connected directed graph. Each edge (i, j) is associated with a travel time T(i, j). The travel times are independent and have known distributions. There are a start node s and a destination node d. The goal is to choose a fast route from s to d. We consider a few different formulations (Fig. 13.1).

Fig. 13.1 Road network. How to select a path?

To make the situation concrete, we consider the very simple example illustrated in Fig. 13.2.

Fig. 13.2 A simple graph

The goal is to choose the fastest path from s to d. In this example, the possible paths are sd, sad, and sabd. We assume that the delays T(i, j) on the edges (i, j) are as follows:

$$\displaystyle \begin{aligned} & T(s, a) =_D U[5, 13], T(a, d) = 10, T(a, b) =_D U[2, 10], \\ & T(b, d) = 4, T(s, d) = 20. \end{aligned} $$

Thus, the delay from s to a is uniformly distributed in [5, 13], the delay from a to d is equal to 10, and so on. The delays are assumed to be independent, which is an unrealistic simplification.

13.2 Formulation 1: Pre-planning

In this formulation, one does not observe anything and one plans the journey ahead of time. In this case, the solution is to look at the average travel times E(T(i, j)) = c(i, j) and to run a shortest path algorithm.

For our example, the average delays are c(s, a) = 9, c(a, d) = 10, and so on, as shown in the top part of Fig. 13.3.

Fig. 13.3 The average delays (top) and the successive steps of the Bellman–Ford algorithm to calculate the minimum expected times (shown in red) from the nodes to the destination

Let V (i) be the minimum average travel time from node i to the destination d. The Bellman–Ford Algorithm calculates these values as follows. Let V n(i) be an estimate of the shortest average travel time from i to d, as calculated after the n-th iteration of the algorithm. The algorithm starts with V 0(d) = 0 and V 0(i) = ∞ for i ≠ d. Then, the algorithm calculates

$$\displaystyle \begin{aligned} V_{n+1}(i) = \min_j \{c(i, j) + V_n(j)\}, n \geq 0. \end{aligned} $$
(13.1)

The interpretation is that V n(i) is the minimum expected travel time from i to d over all paths that go through at most n edges. The distance is infinite if no path with at most n edges reaches the destination d. This is exactly the same algorithm we discussed in Sect. 11.2 to develop the Viterbi algorithm.

These relations are justified by the fact that the mean value of a sum is the sum of the mean values. For instance, say that the minimum average travel time from a to d using a path that has at most 2 edges is V 2(a) and that it is achieved by a path with random travel time W 2(a, d). Then, the fastest path from s to d that uses at most 3 edges is either the direct path sd, with travel time T(s, d), or the edge sa followed by the fastest path from a to d that uses at most 2 edges, with travel time W 2(a, d). Accordingly, the minimum expected travel time V 3(s) from s to d using at most three edges is the minimum of E(T(s, d)) = c(s, d) and the mean value of T(s, a) + W 2(a, d). Thus,

$$\displaystyle \begin{aligned} &V_3(s) = \min\{c(s, d), E(T(s, a) + W_2(a, d))\} \\ &~~~~~~~~ = \min\{c(s, d), c(s, a) + V_2(a)\}. \end{aligned} $$

Since the graph is finite, V n converges to V  in at most N steps, where N is the number of edges in the longest cycle-free path to node d. The limit is such that V (i) is the shortest average travel time from i to d. Note that V  satisfies the following fixed-point equations:

$$\displaystyle \begin{aligned} V(i) = \min_j \{c(i, j) + V(j)\}, \forall i \neq d \mbox{ and } V(d) = 0. \end{aligned} $$
(13.2)

These are called the dynamic programming equations (DPE). Thus, (13.1) is an algorithm for solving (13.2).
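To make the iteration concrete, here is a minimal Python sketch of (13.1) on the example of Fig. 13.2, using the average delays c(i, j) obtained from the distributions in Sect. 13.1. The dictionary encoding of the graph and the node labels are illustrative choices, not part of the text.

```python
import math

# Average delays c(i, j) for the example of Fig. 13.2.
c = {('s', 'a'): 9, ('s', 'd'): 20, ('a', 'b'): 6, ('a', 'd'): 10, ('b', 'd'): 4}
nodes = ['s', 'a', 'b', 'd']

# Bellman-Ford iteration (13.1): V_{n+1}(i) = min_j { c(i, j) + V_n(j) }, with V(d) = 0.
V = {i: (0.0 if i == 'd' else math.inf) for i in nodes}
for _ in range(len(nodes) - 1):
    V = {i: (0.0 if i == 'd' else
             min((c[i, j] + V[j] for j in nodes if (i, j) in c), default=math.inf))
         for i in nodes}

print(V)  # expected: V(s) = 19, V(a) = 10, V(b) = 4, V(d) = 0
```

The iteration converges after three steps and gives V (s) = 19, consistent with the expected travel time of the pre-planned route discussed below.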

13.3 Formulation 2: Adapting

We now assume that when we get to a node i, we see the actual travel times along the edges out of i. However, we do not see beyond those edges. How should we modify our path planning? If the travel times are in fact deterministic, then nothing changes. However, if they are random, we may notice that the actual travel times on some edges out of i are smaller than their mean value, whereas others may be larger. Clearly, we should use that information.

Here is a systematic procedure for calculating the best path. Let V (i) be the minimum average time to get to d starting from node i, for i ∈{s, a, b, d}. We see that V (b) = T(b, d) = 4.

To calculate V (a), define W(a) to be the minimum expected time from a to d given the observed delays along the edges out of a. That is,

$$\displaystyle \begin{aligned} W(a) = \min\{T(a, b) + V(b), T(a, d)\}. \end{aligned}$$

Hence, V (a) = E(W(a)). Thus,

$$\displaystyle \begin{aligned} V(a) = E (\min\{T(a, b) + V(b), T(a, d)\}). \end{aligned} $$
(13.3)

For this example, we see that T(a, b) + V (b) =D U[6, 14]. Since T(a, d) = 10, if T(a, b) + V (b) < 10, which occurs with probability 1∕2, we choose the path abd; its travel time is then uniformly distributed in [6, 10], with mean value 8. If T(a, b) + V (b) > 10, which also occurs with probability 1∕2, we choose the direct edge ad with travel time T(a, d) = 10. Thus, W(a) has conditional mean 8 with probability 1∕2 and is equal to 10 with probability 1∕2, so that V (a) = E(W(a)) = 8(1∕2) + 10(1∕2) = 9.

Similarly,

$$\displaystyle \begin{aligned} V(s) = E(\min\{T(s, a) + V(a), T(s, d) \}), \end{aligned}$$

where T(s, a) + V (a) =D U[14, 22] and T(s, d) = 20. Thus, if T(s, a) + V (a) < 20, which occurs with probability (20 − 14)∕(22 − 14) = 3∕4, then we choose a path that goes from s to a and has a delay that is uniformly distributed in [14, 20], with mean value 17. If T(s, a) + V (a) > 20, which occurs with probability 1∕4, we choose the direct path sd that has delay 20. Hence V (s) = 17(3∕4) + 20(1∕4) = 71∕4 = 17.75.

Note that by observing the delays on the next edges and making the appropriate decisions, we reduce the expected travel time from s to d from 19 to 17.75. Not surprisingly, more information helps. Observe also that the decisions we make depend on the observed delays. For instance, starting in node s, we go along edge sd if T(s, a) + V (a) > T(s, d), i.e., if T(s, a) + 9 > 20, or T(s, a) > 11. Otherwise, we follow the edge sa.
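As a sanity check, the two expectations above can be estimated by a straightforward Monte Carlo simulation of (13.3) and of the analogous expression for V (s). The sample size and the seed below are arbitrary; this is only a verification sketch.

```python
import random

random.seed(0)
N = 10**6

# V(b) = 4.  Equation (13.3): V(a) = E( min{ T(a, b) + V(b), T(a, d) } ).
V_a = sum(min(random.uniform(2, 10) + 4, 10) for _ in range(N)) / N

# V(s) = E( min{ T(s, a) + V(a), T(s, d) } ), with V(a) = 9 and T(s, d) = 20.
V_s = sum(min(random.uniform(5, 13) + 9, 20) for _ in range(N)) / N

print(V_a, V_s)  # close to 9 and 17.75, as computed above
```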

Let us now go back to the general model. The key relationships are as follows:

$$\displaystyle \begin{aligned} V(i) = E(\min_j \{T(i, j) + V(j)\}), \forall i. \end{aligned} $$
(13.4)

The interpretation is simple: starting from i, one can choose to go next to j. In that case, one faces a travel time T(i, j) from i to j and a subsequent minimum average time from j to d equal to V (j). Since the path from i to d must necessarily go to a next node j, the minimum expected travel time from i to d is given by the expression above. As before, these equations are justified by the fact that the expected value of a sum is the sum of the expected values.

An algorithm for solving these fixed-point equations is

$$\displaystyle \begin{aligned} V_{n+1}(i) = E(\min_j \{T(i, j) + V_n(j)\}), n \geq 0, \end{aligned} $$
(13.5)

where V 0(d) = 0 and V 0(i) = ∞ for i ≠ d. The interpretation of V n(i) is the same as before: it is the minimum expected time from i to d using a path with at most n edges, given that at each step along the path one observes the delays along the edges out of the current node.

Equations (13.4) are the stochastic dynamic programming equations for the problem. Equations (13.5) are called the value iteration equations.

13.4 Markov Decision Problem

A more general version of the path planning problem is the control of a Markov chain. At each step, one looks at the state and one chooses an action that determines the transition probabilities and also the cost for the next step.

More precisely, to define a controlled Markov chain X(n) on some state space \(\mathcal {X}\), one specifies, for each \(x \in \mathcal {X}\), a set A(x) of possible actions. For each state \(x \in \mathcal {X}\) and each action a ∈ A(x), one has transition probabilities P(x, x′;a) ≥ 0 with \(\sum _{x' \in \mathcal {X}} P(x, x'; a) = 1\). One also specifies a cost c(x, a) of taking the action a when in state x.

The sequence X(n) is then defined by

$$\displaystyle \begin{aligned} P[X(1) &= x_1, X(2) = x_2, \ldots, X(n) = x_n | X(0) = x_0, a_0, \ldots, a_{n-1}] \\ &= P(x_0, x_1; a_0)P(x_1, x_2; a_1) \times \cdots \times P(x_{n-1}, x_n; a_{n-1}). \end{aligned} $$

The goal is to choose the actions to minimize the average total cost

$$\displaystyle \begin{aligned} E\left[ \sum_{m = 0}^n c(X(m), a(m)) | X(0) = x\right]. \end{aligned} $$
(13.6)

For each m = 0, …, n, the action a(m) ∈ A(X(m)) is determined from the knowledge of X(m) and also of the previous states X(0), …, X(m − 1) and previous actions a(0), …, a(m − 1).

This problem is called a Markov decision problem (MDP).

To solve this problem, we follow a procedure identical to the path planning problem where we think of the state as the node that has been reached during the travel. Let V m(x) be the minimum value of the cost (13.6) when n is replaced by m. That is, V m(x) is the minimum average cost of the next m + 1 steps, starting from X(0) = x. The function V m(⋅) is called the value function.

The DPE are

$$\displaystyle \begin{aligned} V_m(x) &= \min_{a \in A(x)} \{c(x, a) + E[V_{m-1}(x') | X(0) = x, a(0) = a]\} \\ &= \min_{a \in A(x)} \left \{c(x, a) + \sum_{x'} P(x, x'; a) V_{m-1}(x') \right \}. {} \end{aligned} $$
(13.7)

Let a = g m(x) be the value of a ∈ A(x) that achieves the minimum in (13.7). Then the choices a(m) = g n−m(X(m)) achieve the minimum of (13.6).

The existence of the minimizing a in (13.7) is clear if \(\mathcal {X}\) and each A(x) are finite and also under weaker assumptions.
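Here is a minimal, generic Python sketch of the backward recursion (13.7) for a finite state space and finite action sets, with the base case V 0(x) = min a c(x, a) that follows from the definition of V m. The interface (functions A, P, c) is an illustrative choice, not something prescribed by the text.

```python
def finite_horizon_dp(states, A, P, c, n):
    """Backward recursion (13.7).

    A(x): list of actions available in state x.
    P(x, a): dict {y: P(x, y; a)} of transition probabilities.
    c(x, a): one-step cost.
    Returns V[m][x] and the minimizing actions g[m][x], for m = 0, ..., n.
    """
    V = [{x: min(c(x, a) for a in A(x)) for x in states}]        # V_0(x)
    g = [{x: min(A(x), key=lambda a: c(x, a)) for x in states}]
    for m in range(1, n + 1):
        Vm, gm = {}, {}
        for x in states:
            # Q(x, a) = c(x, a) + sum_y P(x, y; a) V_{m-1}(y)
            Q = {a: c(x, a) + sum(p * V[m - 1][y] for y, p in P(x, a).items())
                 for a in A(x)}
            gm[x] = min(Q, key=Q.get)
            Vm[x] = Q[gm[x]]
        V.append(Vm)
        g.append(gm)
    return V, g
```

With these tables, the optimal action at step m of an n-step problem is a(m) = g[n − m][X(m)], as explained in the text.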

13.4.1 Examples

Guess a Card

Here is a simple example. You are given a perfectly shuffled deck of 52 cards. The cards are turned over one at a time. Before each card is turned over, you have the option of saying "Stop." If the next card is an ace, you win $1.00. If not, the game stops and you lose. The problem is to decide when to stop (Fig. 13.4).

Fig. 13.4 Guessing if the next card is an Ace

Assume that there are still x aces in a deck with m remaining cards. Then, if you say stop, you win with probability x∕m. If you do not say stop, then after the next card is turned over, x − 1 aces remain with probability x∕m and x remain otherwise.

Let V (m, x) be the maximum probability that you win if there are still x aces in the deck with m remaining cards.

The DPE are

$$\displaystyle \begin{aligned} V(m, x) = \max \left \{ \frac{x}{m}, \frac{x}{m} V(m-1, x - 1) + \frac{m - x}{m}V(m-1, x) \right \}. \end{aligned}$$

Interestingly, the solution of these equations is V (m, x) = x∕m, as you can verify. Also, the two terms in the maximum are equal if x > 0. The conclusion is that you can stop at any time, as long as there is still at least one ace in the deck.
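The claim V (m, x) = x∕m is easy to check numerically by running the DPE forward from m = 0; a small sketch follows. Exact rational arithmetic avoids rounding issues; the deck size 52 and the 4 aces are as in the text.

```python
from fractions import Fraction as F

# DPE: V(m, x) = max{ x/m, (x/m) V(m-1, x-1) + ((m-x)/m) V(m-1, x) }.
V = {(0, 0): F(0)}                        # no cards left: you cannot win
for m in range(1, 53):
    for x in range(0, min(m, 4) + 1):     # at most 4 aces among m remaining cards
        stop = F(x, m)
        wait = (F(x, m) * V.get((m - 1, x - 1), F(0))
                + F(m - x, m) * V.get((m - 1, x), F(0)))
        V[(m, x)] = max(stop, wait)

print(V[(52, 4)])                                            # 1/13 = 4/52
print(all(V[m, x] == F(x, m) for (m, x) in V if m > 0))      # True
```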

Scheduling Jobs

You have two sets of jobs to perform. Jobs of type i (for i = 1, 2) have a waiting cost equal to c i per unit of waiting time until they are completed. Also, when you work on a job of type i, it completes with probability μ i in the next time unit, independently of how long you have worked on it. That is, the job processing times are geometrically distributed with parameter μ i. The problem is to decide which job to work on to minimize the total waiting cost of the jobs.

Let V (x 1, x 2) be the minimum expected total remaining waiting cost given that there are x 1 jobs of type 1 and x 2 jobs of type 2. The DPE are

$$\displaystyle \begin{aligned} V(x_1, x_2) = x_1 c_1 + x_2 c_2 + \min\{V_1(x_1, x_2), V_2(x_1, x_2)\}, \end{aligned}$$

where

$$\displaystyle \begin{aligned} V_1(x_1, x_2) = \mu_1 V((x_1-1)^+,x_2) + (1 - \mu_1) V(x_1, x_2) \end{aligned}$$

and

$$\displaystyle \begin{aligned} V_2(x_1, x_2) = \mu_2 V(x_1,(x_2-1)^+) + (1 - \mu_2) V(x_1, x_2). \end{aligned}$$

As can be verified directly, the solution of the DPE is as follows. Assume that c 1 μ 1 > c 2 μ 2. Then

$$\displaystyle \begin{aligned} V(x_1, x_2) = c_1 \frac{x_1(x_1 + 1)}{2\mu_1} + c_2 \frac{x_2(x_2 + 1)}{2 \mu_2} + c_2 \frac{x_1x_2}{\mu_1}. \end{aligned}$$

Moreover, this minimum expected cost is achieved by performing all the jobs of type 1 first and then the jobs of type 2. This strategy is called the cμ rule. Thus, although one might be tempted to work on the longest queue first, this is not optimal.
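The closed form above can be checked against the DPE numerically: plug it into the right-hand side and compare. The parameter values below are arbitrary, chosen so that c 1 μ 1 > c 2 μ 2.

```python
c1, c2, mu1, mu2 = 3.0, 1.0, 0.6, 0.5      # arbitrary, with c1*mu1 > c2*mu2

def V(x1, x2):
    """Candidate solution: serve all type-1 jobs first, then the type-2 jobs."""
    return (c1 * x1 * (x1 + 1) / (2 * mu1)
            + c2 * x2 * (x2 + 1) / (2 * mu2)
            + c2 * x1 * x2 / mu1)

def rhs(x1, x2):
    """Right-hand side of the DPE evaluated at the candidate V."""
    V1 = mu1 * V(max(x1 - 1, 0), x2) + (1 - mu1) * V(x1, x2)
    V2 = mu2 * V(x1, max(x2 - 1, 0)) + (1 - mu2) * V(x1, x2)
    return x1 * c1 + x2 * c2 + min(V1, V2)

print(all(abs(V(x1, x2) - rhs(x1, x2)) < 1e-9
          for x1 in range(10) for x2 in range(10)))   # True
```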

There is a simple interchange argument to confirm the optimality of the cμ rule. Say that you decide to work on the jobs in the following order: 1221211. Thus, you work on a job of type 1 until it completes, then a job of type 2, then another job of type 2, and so on. Modify the strategy as follows. Instead of working on the second job of type 2, work on the second job of type 1, until it completes. Then work on the second job of type 2 and continue as you would have. Thus, the processing of two jobs has been interchanged: the second job of type 2 and the second job of type 1. Only the waiting times of these two jobs change. The waiting time of the job of type 1 is reduced by 1∕μ 2, on average, since this is the average completion time of the job of type 2 that was previously processed before the job of type 1. Thus, the waiting cost of the job of type 1 is reduced by c 1∕μ 2. Similarly, the waiting cost of the job of type 2 is increased by c 2∕μ 1, on average. Thus, the average cost decreases by c 1∕μ 2 − c 2∕μ 1, which is a positive amount since c 1 μ 1 > c 2 μ 2. By induction, it is optimal to process all the jobs of type 1 first.

Of course, there are very few examples of control problems where the optimality of a policy can be established by a simple argument. Nevertheless, keep this possibility in mind because it can yield elegant results simply. For instance, assume that jobs arrive at the queues shown in Fig. 13.5 according to independent Bernoulli processes. That is, with probability λ i, a job of type i arrives during each time step, independently of the past, for i = 1, 2. The same interchange argument shows that the cμ rule minimizes the long-term average expected waiting cost of the jobs (a cost that we have not defined, but you may be able to imagine what it means). This is useful because the DPE can no longer be solved explicitly and proving the optimality of this rule analytically is quite complicated.

Fig. 13.5 What job to work on next?

Hiring a Helper

Jobs arrive at random times and you must decide whether to work on them yourself or hire some helper. Intuition suggests that you should get some help if the backlog of jobs to be performed exceeds some threshold. We examine a model of this situation.

At time n = 0, 1, …, a job arrives with probability λ ∈ (0, 1). If you work alone, you complete a job with probability μ ∈ (0, 1) in one time unit, independently of the past. If you hire a helper, then together you complete a job with probability αμ ∈ (0, 1) in one unit of time, where α > 1. Let the cost at time n be c(n) = β > 0 if you hire a helper at time step n and c(n) = 0 otherwise. The goal is to minimize

$$\displaystyle \begin{aligned} E\left [\sum_{n=0}^N (X(n) + c(n)) \right], \end{aligned} $$

where X(n) is the number of jobs yet to be processed at time n. This cost measures the waiting cost of the jobs plus the cost of hiring the helper. The waiting cost is minimized if you hire the helper all the time and the helper cost is minimized if you never hire him. The goal of the problem is to figure out when to hire a helper to achieve the best trade-off between these two costs.

The state of the system is X(n) at time n. Let

$$\displaystyle \begin{aligned} V_m(x) = \min E \left[ \sum_{n = 0}^m (X(n) + c(n)) | X(0) = x \right ],\end{aligned} $$

where the minimum is over the possible choices of actions (hiring or not) that depend on the state up to that time. The stochastic dynamic programming equations are

$$\displaystyle \begin{aligned} V_m(x) &= x + \min_{a \in \{0, 1\}} \{ \beta 1\{a = 1\} + (1 - \lambda)(1 - \mu(a))V_{m-1}(x) \\ & \quad + \lambda(1 - \mu(a))V_{m-1}(\min\{x +1, K\}) \\ & \quad + (1 - \lambda) \mu(a) V_{m-1}(\max\{x -1, 0\}) \\ & \quad + \lambda \mu(a) V_{m-1}(x)\}, m \geq 0, \end{aligned} $$

where we defined μ(0) = μ and μ(1) = αμ, and V −1(x) = 0. Also, we limit the backlog of jobs to K, so that if a job arrives when there are already K jobs in the queue, we discard the new arrival.

We solve these equations using Python. As expected, the solution shows that one should hire a helper at time n if X(n) > γ(N − n), where γ(m) is a constant that decreases with m. As the time to go m increases, the cost of holding extra jobs increases and so does the incentive to hire a helper. Figure 13.6 shows the values of γ(n) for β = 14 and β = 20. The figure corresponds to λ = 0.5, μ = 0.6, α = 1.5, K = 20, and N = 200. Not surprisingly, when the helper is more expensive, one waits until the backlog is larger before hiring him.

Fig. 13.6 One should hire a helper at time n if the backlog exceeds γ(N − n)
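A possible reconstruction of that computation is sketched below. It implements the recursion above with the parameters quoted in the text (with helper cost β = 14, named beta_cost in the code) and reads off the hiring threshold as the smallest backlog at which hiring is strictly better; this way of extracting the threshold is an illustrative choice, not the author's code.

```python
import numpy as np

lam, mu, alpha, beta_cost, K, N = 0.5, 0.6, 1.5, 14.0, 20, 200
mu_a = {0: mu, 1: alpha * mu}          # service probability without / with the helper

V = np.zeros(K + 1)                    # V_{-1}(x) = 0
thresholds = []                        # hiring threshold as a function of the time to go
for m in range(N + 1):
    newV = np.zeros(K + 1)
    hire = np.zeros(K + 1, dtype=bool)
    for x in range(K + 1):
        up, down = min(x + 1, K), max(x - 1, 0)
        Q = {a: beta_cost * (a == 1)
                + (1 - lam) * (1 - mu_a[a]) * V[x]
                + lam * (1 - mu_a[a]) * V[up]
                + (1 - lam) * mu_a[a] * V[down]
                + lam * mu_a[a] * V[x]
             for a in (0, 1)}
        newV[x] = x + min(Q.values())
        hire[x] = Q[1] < Q[0]
    thresholds.append(next((x for x in range(K + 1) if hire[x]), K + 1))
    V = newV

print(thresholds[-1])   # threshold when many steps remain
```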

Which Queue to Join?

After shopping in the supermarket, you get to the cashiers and have to choose a queue to join. Naturally, you try to identify the queue with the shortest expected waiting time, and you join that queue. Everyone does the same, and it seems quite natural that this strategy should minimize the expected waiting time of all the customers. Your friend, who has taken this class before, tells you that this is not necessarily the case. Let us try to understand this apparent paradox.

Assume that there are two queues and that a customer arrives with probability λ at each time step. The service times in queue i are geometrically distributed with parameter μ i, for i = 1, 2.

Say that when you arrive, there are x i customers in queue i, for i = 1, 2. You should join queue 1 if

$$\displaystyle \begin{aligned} \frac{x_1 + 1}{\mu_1} < \frac{x_2 + 1}{\mu_2}, \end{aligned}$$

as this will minimize the expected time until you are served. However, if we consider the problem of minimizing the total average waiting time of customers in the two queues, we find that the optimal policy does not agree with the selfish choice of individual customers. Figure 13.7 shows an example with μ 2 < μ 1. It indicates that under the socially optimal policy some customers should join queue 2, even though they will then incur a longer delay than under the selfish policy.

Fig. 13.7 The socially optimal policy is shown in blue and the selfish policy is shown in green

This example corresponds to minimizing the total cost

$$\displaystyle \begin{aligned} \sum_{n=0}^N \beta^n E( X_1(n) + X_2(n) ). \end{aligned}$$

In this expression, X i(n) is the number of customers in queue i at time n. The capacity of each queue is K. To prevent the system from discarding too many customers, one imposes the constraint that if only one queue is full when a customer arrives, he should join the non-full queue. In the expression for the total cost, one uses a discount factor β ∈ (0, 1) to keep the cost bounded. The figure corresponds to K = 8, λ = 0.3, μ 1 = 0.4, μ 2 = 0.2, N = 100, and β = 0.95. (The graphs are in fact for x 1 + 1 and x 2 + 1 as Python does not like the index value 0.)
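The figure can be reproduced, up to modeling details the text does not spell out, by a finite-horizon value iteration over the pair of queue lengths. The sketch below assumes that the arriving customer (if any) is routed first and that the two servers then complete services independently within the same slot; these ordering assumptions, and the variable names, are ours.

```python
import numpy as np
from itertools import product

K, lam, mu1, mu2, N, beta = 8, 0.3, 0.4, 0.2, 100, 0.95

def serve_expect(V, x1, x2):
    """E[V(x1 - D1, x2 - D2)], with Di ~ Bernoulli(mu_i) if queue i is nonempty."""
    total = 0.0
    for d1, d2 in product((0, 1), (0, 1)):
        p1 = (mu1 if d1 else 1 - mu1) if x1 > 0 else (1.0 if d1 == 0 else 0.0)
        p2 = (mu2 if d2 else 1 - mu2) if x2 > 0 else (1.0 if d2 == 0 else 0.0)
        total += p1 * p2 * V[max(x1 - d1, 0), max(x2 - d2, 0)]
    return total

V = np.zeros((K + 1, K + 1))
join1 = np.zeros((K + 1, K + 1), dtype=bool)     # socially optimal routing decision
for _ in range(N + 1):
    newV = np.zeros_like(V)
    for x1, x2 in product(range(K + 1), range(K + 1)):
        stay = serve_expect(V, x1, x2)                       # no arrival this slot
        options = {}
        if x1 < K or x2 == K:                                # may (or must) join queue 1
            options[1] = serve_expect(V, min(x1 + 1, K), x2)
        if x2 < K or x1 == K:                                # may (or must) join queue 2
            options[2] = serve_expect(V, x1, min(x2 + 1, K))
        best = min(options, key=options.get)
        join1[x1, x2] = (best == 1)
        newV[x1, x2] = x1 + x2 + beta * ((1 - lam) * stay + lam * options[best])
    V = newV

# Selfish rule for comparison: join queue 1 iff (x1 + 1)/mu1 < (x2 + 1)/mu2.
```

Comparing join1 with the selfish rule should reproduce the qualitative picture of Fig. 13.7: in some states the socially optimal policy sends customers to the slower queue 2 even though a selfish customer would not join it.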

13.5 Infinite Horizon

The problem of minimizing (13.6) involves a finite horizon. The problem stops at time n. We have seen that the minimum cost to go when there are m more steps is V m(x) when in state x. Thus, not surprisingly, the cost to go depends on the time to go and, consequently, the best action to choose in a given state x generally depends on the time to go.

The problem is simpler when one considers an infinite horizon because the time to go remains the same at each step. To make the total cost finite, one discounts the future costs. That is, one considers the problem of minimizing the expected total discounted cost:

$$\displaystyle \begin{aligned} E \left[ \sum_{m = 0}^\infty \beta^m c(X(m), a(m)) | X(0) = x \right ]. \end{aligned} $$
(13.8)

In this expression, 0 < β < 1 is the discount factor. Intuitively, if β is small, then future costs do not matter much and one tends to be short-sighted. However, if β is close to 1, then one pays a lot of attention to the long term.

Define V (x) to be the minimum value of the cost (13.8), where the minimum is over all the possible choices of the actions at each step. Arguing as before, one can show that

$$\displaystyle \begin{aligned} V(x) &= \min_{a \in A(x)} \{ c(x, a) + \beta E[V(X(1)) | X(0) = x, a(0) = a] \} \\ & = \min_{a \in A(x)} \left \{ c(x, a) + \beta \sum_y P(x, y; a) V(y) \right \}. {} \end{aligned} $$
(13.9)

These equations are similar to (13.7), with two differences: the discount factor and the fact that the value function does not depend on time. Note that these equations are fixed-point equations. A standard method to solve them is to consider the equations

$$\displaystyle \begin{aligned} V_{n+1}(x) = \min_{a \in A(x)} \left \{ c(x, a) + \beta \sum_y P(x, y; a) V_n(y) \right \}, n \geq 0, \end{aligned} $$
(13.10)

where one chooses V 0(x) = 0, ∀x. Note that these equations correspond to

$$\displaystyle \begin{aligned} V_n(x) = \min E \left[ \sum_{m = 0}^n \beta^m c(X(m), a(m)) | X(0) = x \right ]. \end{aligned} $$
(13.11)

One can show that the solution V n(x) of (13.10) is such that V n(x) → V (x) as n →∞, where V (x) is the solution of (13.9).
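A compact Python sketch of the value iteration (13.10) for a finite state space is given below. It uses the same illustrative interface as the finite-horizon version earlier in the chapter and stops when successive iterates are within a small tolerance, which is an implementation choice rather than part of the text.

```python
def discounted_value_iteration(states, A, P, c, beta, tol=1e-8):
    """Solve the fixed-point equations (13.9) by the iteration (13.10).

    A(x): actions available in x; P(x, a): dict {y: P(x, y; a)}; c(x, a): one-step cost.
    Returns the value function V and a minimizing action g(x) for each state.
    """
    V = {x: 0.0 for x in states}                      # V_0(x) = 0
    while True:
        newV, g = {}, {}
        for x in states:
            Q = {a: c(x, a) + beta * sum(p * V[y] for y, p in P(x, a).items())
                 for a in A(x)}
            g[x] = min(Q, key=Q.get)
            newV[x] = Q[g[x]]
        if max(abs(newV[x] - V[x]) for x in states) < tol:
            return newV, g
        V = newV
```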

13.6 Summary

  • Dynamic Programming Equations;

  • Controlled Markov Chain;

  • Markov Decision Problem.

13.6.1 Key Equations and Formulas

Table 1

  • Shortest path DPE and Bellman–Ford iteration: (13.2), (13.1);

  • Stochastic DPE and value iteration (adaptive routing): (13.4), (13.5);

  • Finite-horizon MDP dynamic programming equations: (13.7);

  • Discounted infinite-horizon DPE and value iteration: (13.9), (13.10).

13.7 References

The book Ross (1995) is a splendid introduction to stochastic dynamic programming. We borrowed the "guess a card" example from it. It explains the key ideas simply and illustrates the many variations of the theory with carefully chosen examples. The textbook Bertsekas (2005) is a comprehensive presentation of the algorithms for dynamic programming. It contains many examples and detailed discussions of the theory and practice.

13.8 Problems

Problem 13.1

Consider a single queue with one server in discrete time. At each time, a new customer arrives at the queue with probability λ < 1, and if the server works on the queue at rate μ ∈ [0, 1], it serves one customer in one unit of time with probability μ. Due to energy constraints, you want your server to work at the smallest possible rate without making the queue unstable. Thus, you want your server to work at rate μ∗ = λ. Unfortunately, you do not know the value of λ. All you can observe is the queue length. We try to design an algorithm based on stochastic gradient to learn μ∗ in the following steps:

  1. (a)

    Minimize the function \(V(\mu ) = \frac 12 (\lambda -\mu )^2\) over μ using gradient descent.

  2. (b)

    Find E[Q(n + 1) − Q(n)|Q(n) = q], for some q > 0, given that the server allocates capacity μ n during time slot n, where Q(n) is the queue length at time n. What happens if q = 0?

  3. (c)

    Use the stochastic gradient projection algorithm and write Python code based on parts (a) and (b) to learn μ∗. Note that 0 ≤ μ ≤ 1.

Hint

To avoid the case when the queue length is 0, start with a large initial queue length.

Problem 13.2

Consider a routing network with three nodes: the start node s, the destination node d, and an intermediate node r. There is a direct path from s to d with travel time 20. The travel time from s to r is 7. There are two paths from r to d. They have independent travel times that are uniformly distributed between 8 and 20.

  1. (a)

    If you want to do pre-planning, which path should be chosen to go from s to d?

  2. (b)

    If the travel times from r to d are revealed at r, which path should be chosen?

Problem 13.3

Consider a single queue in discrete time with a Bernoulli arrival process of rate λ. The queue can hold K jobs, and there is a fee γ when its backlog reaches K. There is one server dedicated to the queue, with service rate μ(0). You can decide to allocate another server to the queue, which increases the rate to μ(1) ∈ (μ(0), 1). However, using the additional server has some cost. You want to minimize the cost

$$\displaystyle \begin{aligned}\sum_{n=0}^\infty \beta^n E(X(n)+ \alpha H(n) + \gamma 1\{X(n) = K\}), \end{aligned}$$

where H(n) is equal to one if you use an extra helper at time n and is zero otherwise.

  1. (a)

    Write the dynamic programming equations.

  2. (b)

    Solve the DPE with MATLAB for λ = 0.4, μ(0) = 0.35, μ(1) = 0.5, α = 2.5, β = 0.95, and γ = 30.

Problem 13.4

We want to plan routing from node 1 to 5 in the graph of Fig. 13.7. The travel times on the edges of the graph are as follows: T(1, 2) = 2, T(1, 3) ∼ U[2, 4], T(2, 4) = 1, T(2, 5) ∼ U[4, 6], T(4, 5) ∼ U[3, 5], and T(3, 5) = 4. Note that X ∼ U[a, b] means X is a random variable uniformly distributed between a and b.

  1. (a)

    If you want to do pre-planning, which path would you choose? What is the expected travel time?

    Fig. 13.7 Route planning

  2. (b)

    Now suppose that at each node, the travel times of the edges up to two steps ahead are revealed. Thus, at node 1 all the travel times are revealed except T(4, 5). Write the dynamic programming equations that solve the route planning problem and solve them. That is, let V (i) be the minimum expected travel time from i to 5, and find V (i) for 1 ≤ i ≤ 5.

Problem 13.5

Consider a factory, DilBox, that stores boxes. At the beginning of year k, they have x k boxes in storage. Now at the end of every year k they are mandated by contracts to provide d k boxes. However, the number of boxes d k is unknown until the year actually ends.

At the beginning of the year, they can request u k boxes. Using very shoddy Elbonian labor, each box costs A to produce. At the end of the year, DilBox is able to borrow y k boxes from BoxR’Us at the cost s(y k) to meet the contract.

The boxes remaining after meeting the demand are carried over to the next year x k+1 = x k + u k + y k − d k. Sadly, they need to pay to store the boxes at a cost given by a function r(x k+1).

Now your job is to provide a box creation and storage plan for the upcoming 20 years. Your goal is to minimize the total cost over the 20 years. You can treat costs as being paid at the end of the year, and there is no inflation. Also, you get your pension after 20 years, so you do not care about costs beyond those paid in the 20th year. (Assume you start with zero boxes; of course, it does not really matter.)

  1. (a)

    Formulate the problem as a Markov decision problem;

  2. (b)

    Write the dynamic programming equations;

  3. (c)

    Use Python to solve the equations with the following parameters:

    • r(x k) = 5x k;

    • s(y k) = 20y k;

    • A = 1;

    • d k =D U{1, …, 10}.

Problem 13.6

Consider a video game duel where Bob starts at time 0 at distance T = 10 from Alice and gets closer to her at speed 1. For instance, Alice is at location (0, 0) in the plane and Bob starts at location (0, T) and moves toward Alice, so that after t seconds, Bob is at location (0, T − t). Alice has picked a random time, uniformly distributed in [0, T], when she will shoot Bob. If Alice shoots first, Bob is dead. Alice never misses. [This is only a video game.]

  1. (a)

    Bob has to find at what time t he should shoot Alice to maximize the probability of killing her. If Bob shoots from a distance x, the probability that he hits (and kills) Alice is 1∕(1 + x)2. Bob has only one bullet.

  2. (b)

    What is the maximum probability that Bob wins the duel?

  3. (c)

    Assume now that Bob has two bullets. You must find the times t 1 and t 2 when Bob should shoot Alice to maximize the probability that he wins the duel. Again, for each bullet that Bob shoots from distance x, the probability of success is 1∕(1 + x)2, independently for each bullet.

Problem 13.7

You play a game where you win the amount you bet with probability p ∈ (0, 0.5) and you lose it with probability 1 − p. Your initial fortune is 16 and you gamble a fixed amount γ at each step, where γ ∈{1, 2, 4, 8, 16}. Find the probability that you reach a fortune equal to 256 before you go broke. What is the gambling amount that maximizes that probability?