# Encyclopedia of Systems and Control

Living Edition
| Editors: John Baillieul, Tariq Samad

# Stochastic Dynamic Programming

Qing Zhang
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4471-5102-9_230-1

## Abstract

This article is concerned with one of the traditional approaches for stochastic control problems: stochastic dynamic programming. Brief descriptions of stochastic dynamic programming methods and related terminology are provided. Two asset-selling examples are presented to illustrate the basic ideas. A list of topics and references is also provided for further reading.

## Keywords

Optimality principle · Bellman equation · Hamilton-Jacobi-Bellman equation · Markov decision problem · Stochastic control · Viscosity solution · Asset-selling rule

## Introduction

The term dynamic programming was introduced by Richard Bellman in the 1940s. It refers to a method for solving dynamic optimization problems by breaking them down into smaller and simpler subproblems.

To solve a given problem, one often needs to solve each part of the problem (subproblems) and then put together their solutions to obtain an overall solution. Some of these subproblems are of the same type. The idea behind the dynamic programming approach is to solve each subproblem only once in order to reduce the overall computation.
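As a minimal illustration of solving each subproblem only once, consider the classic Fibonacci recursion in Python (an example of our own choosing, not from the sources cited here); caching each subproblem's answer (memoization) turns an exponential-time recursion into a linear-time one:

```python
from functools import lru_cache

@lru_cache(maxsize=None)          # remember each subproblem's answer
def fib(n):
    # Without the cache, fib(n - 1) and fib(n - 2) recompute the same
    # subproblems exponentially often; with it, each fib(m) for m <= n
    # is evaluated exactly once.
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```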

The cornerstone of dynamic programming (DP) is the so-called principle of optimality which is described by Bellman in his 1957 book (Bellman 1957):

Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

This principle of optimality gives rise to DP (or optimality) equations, which are referred to as Bellman equations in discrete-time optimization problems and as Hamilton-Jacobi-Bellman (HJB) equations in continuous-time ones. Such equations provide a necessary condition for optimality in terms of the value of the underlying decision problem. In most cases, an optimal control policy can be obtained by solving the associated Bellman (HJB) equation. In view of this, dynamic programming is a powerful tool for a broad range of control and decision-making problems. When the underlying system is driven by a certain type of random disturbance, the corresponding DP approach is referred to as stochastic dynamic programming.

## Terminology

The following concepts are often used in stochastic dynamic programming.

An objective function describes the objective of a given optimization problem (e.g., maximizing profits, minimizing cost, etc.) in terms of the states of the underlying system, decision (control) variables, and possible random disturbance.

State variables represent the information about the current system under consideration. For example, in a manufacturing system, one needs to know the current product inventory in order to decide how much to produce at the moment. In this case, the inventory level would be one of the state variables.

The variables chosen at any time are called the decision or control variables. For instance, the rate of production over time in the manufacturing system is a control variable. Typically, control variables are functions of state variables. They affect the future states of the system and the objective function.

In stochastic control problems, the system is also affected by random events (noise). Such noise is referred to as the system disturbance. The noise is typically not known a priori; only its probability distribution is.

The goal of the optimization problem is to choose the control variables over time so as to maximize or minimize the corresponding objective function. For example, in order to maximize its overall profits, a manufacturing firm has to decide how much to produce over time so as to maximize the revenue from meeting product demand and to minimize the costs associated with inventory. The best possible value of the objective is called the value function, which is given in terms of the state variables.

In the next two sections, we give two examples to illustrate how stochastic DP methods are used in discrete and continuous time.

## An Asset-Selling Example (Discrete Time)

Consider a person who wants to sell an asset (e.g., a car or a house). She is offered an amount of money every period (say, a day). Let $$v_{0},v_{1},\ldots ,v_{N-1}$$ denote these random offers. Assume they are independent and identically distributed. At the end of each period, the person has to decide whether to accept or reject the offer. If she accepts the offer, she can put the money in a bank account and receive a fixed interest rate r > 0; if she rejects it, she waits until the next period. Rejected offers cannot be recalled. In addition, she has to sell the asset by the end of the Nth period and accept the last offer $$v_{N-1}$$ if all previous offers have been rejected. The goal is to decide when to accept an offer so as to maximize the overall return at the Nth period.

In this example, for each k, $$v_{k}$$ is the random disturbance. The control variable $$u_{k}$$ takes values in {sell, hold}. The state variables $$x_{k}$$ are given by the equations
$$x_{0} = 0;\quad x_{k+1} = \left \{\begin{array}{ll} \mbox{ sold}&\mbox{ if }u_{k} = \mbox{ sell} \\ v_{k} &\mbox{ otherwise}.\\ \end{array} \right .$$
Let
$$\begin{array}{l} h_{N}(x_{N}) = \left \{\begin{array}{ll} x_{N}&\mbox{ if }x_{N}\not =\mbox{ sold},\\ 0 &\mbox{ otherwise} .\\ \end{array} \right . \\ h_{k}(x_{k},u_{k},v_{k}) = \left \{\begin{array}{ll} {(1 + r)}^{N-k}x_{k}&\mbox{ if }x_{k}\not =\mbox{ sold} \\ &\mbox{ and }u_{k} = \mbox{ sell} \\ 0 &\mbox{ otherwise}.\\ \end{array} \right . \\ \mbox{ for }k = 0,1,\ldots ,N - 1.\end{array}$$
Then, the payoff function is given by
$$E_{\{v_{k}\}}\left (h_{N}(x_{N}) +\displaystyle\sum _{ k=0}^{N-1}h_{ k}(x_{k},u_{k},v_{k})\right ).$$
Here, $$E_{\{v_{k}\}}$$ denotes the expected value over $$\{v_{k}\}$$. The corresponding value functions $$V_{k}(x_{k})$$ satisfy the following Bellman equations:
$$\begin{array}{l} V _{N}(x_{N}) = \left \{\begin{array}{ll} x_{N}&\mbox{ if }x_{N}\not =\mbox{ sold},\\ 0 &\mbox{ otherwise} .\\ \end{array} \right . \\ V _{k}(x_{k}) = \left \{\begin{array}{ll} \max \left ({(1 + r)}^{N-k}x_{ k},EV _{k+1}(v_{k})\right )&\mbox{ if }x_{k}\not =\mbox{ sold} \\ 0 &\mbox{ otherwise}.\\ \end{array} \right . \\ \mbox{ for }k = 0,1,\ldots ,N - 1.\end{array}$$
The optimal selling rule can be given as follows (assuming $$x_{k}\neq$$ sold; see Bertsekas 1987):
$$\begin{array}{ll} \mbox{ accept the offer}&v_{k-1} = x_{k}\mbox{ if }{(1 + r)}^{N-k}x_{k} \geq EV _{k+1}(v_{k}), \\ \mbox{ reject the offer} &v_{k-1} = x_{k}\mbox{ if }{(1 + r)}^{N-k}x_{k} < EV _{k+1}(v_{k}).\end{array}$$
Given the distribution of $$v_{k}$$, one can solve the Bellman equations by computing $$V_{k}$$ backward in k, which in turn yields the above optimal selling rule.
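For a concrete offer distribution (the uniform distribution below is our own assumption, purely for illustration), the backward recursion can be sketched in a few lines of Python. Since the continuation value $$EV _{k+1}(v_{k})$$ does not depend on the current state, it suffices to propagate these expected values backward:

```python
def selling_values(offers, probs, r, N):
    """Backward recursion for the Bellman equations of the asset-selling
    example.  offers/probs give the (discrete) distribution of each
    i.i.d. offer v_k; returns EV[k] = E V_k(v_{k-1}) for k = 1, ..., N.
    At stage k (asset unsold), accept offer x_k iff
        (1 + r)**(N - k) * x_k >= EV[k + 1]."""
    EV = [0.0] * (N + 1)
    EV[N] = sum(p * v for v, p in zip(offers, probs))      # V_N(x) = x
    for k in range(N - 1, 0, -1):
        growth = (1 + r) ** (N - k)                        # bank growth
        EV[k] = sum(p * max(growth * v, EV[k + 1])
                    for v, p in zip(offers, probs))
    return EV

# Offers uniform on {1, 2, 3, 4} (a hypothetical distribution):
EV = selling_values([1.0, 2.0, 3.0, 4.0], [0.25] * 4, r=0.05, N=5)
```

Note that the acceptance threshold EV[k + 1] shrinks as the deadline approaches: the less time remains, the less demanding the seller can afford to be.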

Note that such backward iteration works only for finite-horizon dynamic programming. When working with an infinite-horizon (discounted or long-run average) payoff function, commonly used methods are value iteration (successive approximation) and policy iteration. The idea is to construct a sequence of functions recursively so that they converge pointwise to the value function. For a description of these iteration methods, their convergence properties, and error-bound analysis, we refer the reader to Bertsekas (1987).
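For a discounted infinite-horizon Markov decision problem, value iteration can be sketched as follows; the two-state model at the bottom is hypothetical, chosen only to make the fixed-point iteration visible:

```python
def value_iteration(P, R, gamma, tol=1e-12):
    """Successive approximation for a finite MDP.
    P[a][s][t]: probability of moving s -> t under action a;
    R[a][s]: one-step reward; gamma: discount factor in (0, 1).
    Returns an approximate value function and a greedy policy."""
    n = len(R[next(iter(R))])
    V = [0.0] * n
    while True:
        # Q-values under the current value estimate
        Q = {a: [R[a][s] + gamma * sum(P[a][s][t] * V[t] for t in range(n))
                 for s in range(n)] for a in P}
        V_new = [max(Q[a][s] for a in P) for s in range(n)]
        if max(abs(u - v) for u, v in zip(V, V_new)) < tol:
            policy = [max(P, key=lambda a: Q[a][s]) for s in range(n)]
            return V_new, policy
        V = V_new

# Two states; "stay" keeps the state (reward 1 in state 1), "go" flips it.
P = {"stay": [[1, 0], [0, 1]], "go": [[0, 1], [1, 0]]}
R = {"stay": [0.0, 1.0], "go": [0.0, 0.0]}
V, policy = value_iteration(P, R, gamma=0.5)
```

Because the Bellman operator is a contraction with modulus gamma, the iterates converge geometrically to the value function, which is what makes the stopping test on successive differences valid.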

Next, we consider a continuous-time asset-selling problem.

## An Asset-Selling Example (Continuous Time)

Suppose a person wants to sell her asset. The price $$x_{t}$$ of the asset at time $$t \in [0,\infty )$$ is given by the stochastic differential equation
$$\frac{dx_{t}} {x_{t}} =\mu dt +\sigma dw_{t},$$
where μ and σ are known constants and $$w_{t}$$ is a standard Brownian motion representing the disturbance. Suppose the transaction cost is K and the discount rate is r. She has to decide when to sell the asset so as to maximize an expected return. In this example, the state variable is the price $$x_{t}$$, the control variable is the selling (stopping) time τ, and the payoff function is given by
$$J(x,\tau ) = E{e}^{-r\tau }(x_{\tau } - K).$$
Let $$V (x)$$ denote the value function, i.e., $$V (x) =\sup _{\tau }J(x,\tau )$$. Then the associated HJB equation is given by
$$\min \left \{rV (x) - \mu x\frac{dV (x)} {dx} -\frac{{\sigma }^{2}{x}^{2}} {2} \frac{{d}^{2}V (x)} {d{x}^{2}},\ V (x) - (x - K)\right \} = 0.$$
(1)
Let
$${x}^{{\ast}} = \frac{K\beta } {\beta -1},$$
where
$$\beta = \frac{1} {{\sigma }^{2}} \left (\frac{{\sigma }^{2}} {2} -\mu +\sqrt{{\left (\mu -\frac{{\sigma }^{2 } } {2}\right )}^{2} + 2r{\sigma }^{2}}\right ).$$
Then the optimal selling rule can be given as (see Øksendal 2007):
$$\left \{\begin{array}{ll} \mbox{ sell} &\mbox{ if }x_{t} \geq {x}^{{\ast}}, \\ \mbox{ hold}&\mbox{ if }x_{t} < {x}^{{\ast}}.\\ \end{array} \right .$$
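With illustrative parameter values (our own assumptions, not part of the source example), the threshold can be evaluated directly from the formulas above. As a consistency check, β solves the characteristic quadratic $$\frac{{\sigma }^{2}} {2}\beta (\beta -1) +\mu \beta -r = 0$$ obtained by substituting $$V (x) = A{x}^{\beta }$$ into the continuation part of the HJB equation:

```python
import math

def selling_threshold(mu, sigma, r, K):
    """Evaluate beta and the threshold x* = K*beta/(beta - 1);
    requires beta > 1, which holds when r > mu."""
    s2 = sigma ** 2
    beta = (s2 / 2 - mu + math.sqrt((mu - s2 / 2) ** 2 + 2 * r * s2)) / s2
    return beta, K * beta / (beta - 1)

# Hypothetical parameters: 2% drift, 30% volatility, 5% discounting.
beta, x_star = selling_threshold(mu=0.02, sigma=0.3, r=0.05, K=1.0)
```

For these values beta is about 1.37 and x* about 3.72, so the seller holds until the price exceeds roughly 3.7 times the transaction cost.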
In general, to solve an optimal control problem via the DP approach, one first needs to solve the associate Bellman (HJB) equations. Then, these solutions can be used to come up with an optimal control policy. For example, in the above case, given the value function V (x), one should hold if
$$rV (x) - \mu x\frac{dV (x)} {dx} -\frac{{\sigma }^{2}{x}^{2}} {2} \frac{{d}^{2}V (x)} {d{x}^{2}} = 0$$
and sell when $$V (x) - (x - K) = 0$$. The threshold level $${x}^{{\ast}}$$ is the dividing point between the region where the first part equals zero and the region where the second part vanishes. In addition, one can provide a theoretical justification, in the form of a verification theorem, showing that the solution obtained this way is indeed optimal (see Fleming and Rishel (1975), Fleming and Soner (2006), or Yong and Zhou (1999)).

## HJB Equation Characterization and Computational Methods

In continuous-time optimal control problems, one major difficulty in solving the associated HJB equations (e.g., (1)) is the characterization of their solutions. In most cases, there is no guarantee that the derivatives or partial derivatives of the value function exist. In this connection, the concept of viscosity solutions, developed by Crandall and Lions in the 1980s, can often be used to characterize the solutions and establish their uniqueness. We refer the reader to Fleming and Soner (2006) for related literature and applications. In addition, we would like to point out that closed-form solutions are rare in stochastic control theory and difficult to obtain in most cases. In many applications, one needs to resort to computational methods. One typical way to solve an HJB equation is via finite difference methods. An alternative is Kushner's Markov chain approximation method; see Kushner and Dupuis (1992).
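As a sketch of the Markov chain approximation idea, the variational inequality (1) for the continuous-time selling example can be discretized on a log-price grid and solved by fixed-point iteration. All parameter values and grid choices below are illustrative assumptions, not part of the original example:

```python
import math

def solve_vi(mu=0.02, sigma=0.3, r=0.05, K=1.0, h=0.1, max_iter=20000):
    """Kushner-style Markov chain approximation for the selling problem,
    on a grid in y = log x.  Returns the price grid, the approximate
    value function, and the first grid point where selling is optimal."""
    b = mu - sigma ** 2 / 2                    # drift of y = log x
    denom = sigma ** 2 + h * abs(b)
    dt = h ** 2 / denom                        # interpolation interval
    p_up = (sigma ** 2 / 2 + h * max(b, 0.0)) / denom
    p_dn = (sigma ** 2 / 2 + h * max(-b, 0.0)) / denom
    disc = math.exp(-r * dt)
    ymin, ymax = math.log(0.1), math.log(12.0)
    M = int(round((ymax - ymin) / h))
    x = [math.exp(ymin + i * h) for i in range(M + 1)]
    V = [max(xi - K, 0.0) for xi in x]         # start from the obstacle
    for _ in range(max_iter):
        V_new = V[:]                           # boundary values kept fixed
        for i in range(1, M):
            cont = disc * (p_up * V[i + 1] + p_dn * V[i - 1])
            V_new[i] = max(x[i] - K, cont)     # stop vs. continue
        if max(abs(u - v) for u, v in zip(V, V_new)) < 1e-10:
            V = V_new
            break
        V = V_new
    threshold = next(xi for xi, vi in zip(x, V)
                     if xi > K and vi - (xi - K) < 1e-5)
    return x, V, threshold
```

Since the discounted update is a sup-norm contraction and the iteration starts from the stopping payoff, the iterates increase monotonically to the approximate value function; the detected threshold should sit near the analytic level (about 3.7 for these parameters), up to grid error.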

## Summary and Future Directions

In this article, we have briefly described stochastic DP methods, shown how they work in two simple examples, and discussed related issues. One serious limitation of the DP approach is the so-called curse of dimensionality: DP does not scale to problems of high dimensionality. Various efforts have been devoted to searching for approximate solutions. One approach developed in recent years is the multi-time-scale approach. The idea is to classify random events according to the frequency of their occurrence; frequently occurring events are grouped together and treated as a single "state" to achieve a reduction of dimensionality. We refer the reader to Yin and Zhang (2005, 2013) for related literature and theoretical development. Finally, we would like to mention that stochastic DP has been used in many applications in economics, engineering, management science, and finance. Some applications can be found in Sethi and Thompson (2000). Additional references are provided at the end for further reading.

## Bibliography

1. Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton
2. Bertsekas DP (1987) Dynamic programming. Prentice Hall, Englewood Cliffs
3. Davis MHA (1993) Markov models and optimization. Chapman & Hall, London
4. Elliott RJ, Aggoun L, Moore JB (1995) Hidden Markov models: estimation and control. Springer, New York
5. Fleming WH, Rishel RW (1975) Deterministic and stochastic optimal control. Springer, New York
6. Fleming WH, Soner HM (2006) Controlled Markov processes and viscosity solutions, 2nd edn. Springer, New York
7. Hernandez-Lerma O, Lasserre JB (1996) Discrete-time Markov control processes: basic optimality criteria. Springer, New York
8. Kushner HJ, Dupuis PG (1992) Numerical methods for stochastic control problems in continuous time. Springer, New York
9. Kushner HJ, Yin G (1997) Stochastic approximation algorithms and applications. Springer, New York
10. Øksendal B (2007) Stochastic differential equations, 6th edn. Springer, New York
11. Pham H (2009) Continuous-time stochastic control and optimization with financial applications. Springer, New York
12. Sethi SP, Thompson GL (2000) Optimal control theory: applications to management science and economics, 2nd edn. Kluwer, Boston
13. Sethi SP, Zhang Q (1994) Hierarchical decision making in stochastic manufacturing systems. Birkhäuser, Boston
14. Yin G, Zhang Q (2005) Discrete-time Markov chains: two-time-scale methods and applications. Springer, New York
15. Yin G, Zhang Q (2013) Continuous-time Markov chains and applications: a two-time-scale approach, 2nd edn. Springer, New York
16. Yin G, Zhu C (2010) Hybrid switching diffusions: properties and applications. Springer, New York
17. Yong J, Zhou XY (1999) Stochastic control: Hamiltonian systems and HJB equations. Springer, New York