Annals of Operations Research, Volume 200, Issue 1, pp 247–263

Dynamic consistency for stochastic optimal control problems

  • Pierre Carpentier
  • Jean-Philippe Chancelier
  • Guy Cohen
  • Michel De Lara
  • Pierre Girardeau

Abstract

For a sequence of dynamic optimization problems, we discuss a notion of consistency over time. This notion can be informally introduced as follows. At the very first time step t0, the decision maker formulates an optimization problem that yields optimal decision rules for all the forthcoming time steps t0,t1,…,T; at the next time step t1, he is able to formulate a new optimization problem starting at time t1 that yields a new sequence of optimal decision rules. This process can be continued until the final time T is reached. A family of optimization problems formulated in this way is said to be dynamically consistent if the optimal strategies obtained when solving the original problem remain optimal for all subsequent problems. The notion of dynamic consistency, well-known in the field of economics, has been recently introduced in the context of risk measures, notably by Artzner et al. (Ann. Oper. Res. 152(1):5–22, 2007), and studied in the stochastic programming framework by Shapiro (Oper. Res. Lett. 37(3):143–147, 2009) and for Markov Decision Processes (MDP) by Ruszczynski (Math. Program. 125(2):235–261, 2010). We here link this notion with the concept of “state variable” in MDP, and show that a significant class of dynamic optimization problems is dynamically consistent, provided that an adequate state variable is chosen.

Keywords

Stochastic optimal control · Dynamic consistency · Dynamic programming · Risk measures

1 Introduction

Stochastic Optimal Control (SOC) is concerned with sequential decision-making under uncertainty. Consider a dynamical process that can be influenced by exogenous noises as well as decisions one has to make at every time step. The decision maker wants to optimize the behavior of the dynamical system (for instance, minimize a production cost) over a certain time horizon. As the system evolves, observations of the system are made; we here suppose that the decision maker is able to keep in memory all the past observations. Naturally, it is generally more profitable for him to adapt his decisions to the observations he makes of the system. He is hence looking for strategies rather than simple decisions. In other words, he is looking for mappings from every possible history of the observations to corresponding decisions. Because the number of time steps may be large, the representation of such an object is in general numerically intractable.

However, an amount of information lighter than the whole history of the system is often sufficient to make an optimal decision. In the seminal work of Bellman (1957), the minimal information on the system that is necessary to make the optimal decision plays a crucial role; it is called the state variable (see Whittle 1982, for a more formal definition). Moreover, the Dynamic Programming (DP) principle provides a way to compute the optimal strategies when the state space dimension is not too large (see Bertsekas 2000, for a broad overview on DP). The aim of this paper is to establish a link between the concept of state variable and the notion of dynamic consistency.

The notion of “consistent course of action” (see Peleg and Yaari 1973) is well-known in the field of economics, with the seminal work of Strotz (1955–1956): an individual having planned his consumption trajectory is consistent if, reevaluating his plans later on, he does not deviate from the originally chosen plan. This idea of consistency as “sticking to one’s plan” may be extended to the uncertain case where plans are replaced by decision rules (“Do thus-and-thus if you find yourself in this portion of state space with this amount of time left”, Richard Bellman cited in Dreyfus 2002): Hammond (1976) addresses “consistency” and “coherent dynamic choice”, Kreps and Porteus (1978) refers to “temporal consistency”. Dynamic or time consistency has been introduced in the context of risk measures (see Riedel 2004; Detlefsen and Scandolo 2005; Cheridito et al. 2006; Artzner et al. 2007, for definitions and properties of coherent and consistent dynamic risk measures). Time consistency has then been studied in the stochastic programming framework by Shapiro (2009) and for Markov Decision Processes by Ruszczynski (2010). In this paper, we rather use the (almost equivalent) definition of time consistency given by Ekeland and Lazrak (2006), which is more intuitive and seems better suited in the framework of optimal control problems. We see that terminology varies among authors between dynamic, temporal or time consistency. The term “time consistency” being attached in our community to dynamic risk measures, we will rather use the term “dynamic consistency” for optimal control problems. In this context, the property of dynamic consistency is loosely stated as follows. The decision maker formulates an optimization problem at time t0 that yields a sequence of optimal decision rules for t0 and for the following time steps t1,…,tN=T. Then, at the next time step t1, he formulates a new problem starting at t1 that yields a new sequence of optimal decision rules from time steps t1 to T. Suppose the process continues until time T is reached. The sequence of optimization problems is said to be dynamically consistent if the optimal strategies obtained when solving the original problem at time t0 remain optimal for all subsequent problems. In other words, dynamic consistency means that strategies obtained by solving the problem at the very first stage do not have to be questioned later on.

The notion of information here plays a crucial role. Indeed, we show in this paper that a sequence of problems may be consistent for some information structure while inconsistent for a different one. Consider for example a standard stochastic optimization problem solvable using DP. We will observe that the sequence of problems formulated after the original one at the later time steps is dynamically consistent. Add now a probabilistic constraint involving the state at the final time T. We will show that such a constraint brings dynamic inconsistency, in the sense that optimal strategies based on the usual state variable have to be reconsidered at each time step. This is because, roughly speaking, a probabilistic constraint involves not only the state variable values but also their probabilistic distributions. Hence knowledge of the usual state variable of the system alone is insufficient to formulate consistent problems at subsequent time steps. So, in addition to the usual technical difficulties regarding probabilistic constraints (mainly related to the non-convexity of the feasible set of strategies), an additional problem arises in the dynamic case. We will see that, in fact, this new difficulty stems from the information on which the optimal decision is based. Therefore, with a well-suited state variable, the sequence of problems regains dynamic consistency.

In Sect. 2, we carefully examine the notion of dynamic consistency in the context of a deterministic optimal control problem. The main ideas of the paper are thus introduced, and then extended, in Sect. 3, to a sequence of SOC problems. Next, in Sect. 4, we show that simply adding a probability constraint (or, equivalently in our context, an expectation constraint) to the problem makes dynamic consistency fall apart when using the original state variable. We then establish that dynamic consistency can be recovered provided that an adequate state variable is chosen. A toy example is presented in Sect. 5. We conclude that, for a broad class of SOC problems, dynamic consistency has to be considered with respect to the notion of a state variable and of DP.

2 A first example

We introduce sequential deterministic optimal control problems, indexed by time, and derive the notion of dynamic consistency for this instance. We then illustrate the fact that the decision making process may be dynamically consistent or not, depending on the information on which decisions are based. The discussion is informal, in the sense that we do not enter technical details regarding existence of the solutions for the problems we introduce.

Let us consider a discrete and finite time horizon t0,…,tN=T.1 The decision maker has to optimize (according to a cost function we introduce below) the management of an amount of stock xt, which lies in some space \(\mathcal{X}_{t}\), at every time step t=t0,…,T. Let \(\mathcal{U}_{t}\) be some other space, for every time step t=t0,…,T−1. At each time step t, a decision \(u_{t}\in\mathcal{U}_{t}\) has to be made. Then a cost Lt is incurred by the system, depending on the values of the control and on the auxiliary variable xt that we call the state of the system. This state variable is driven from time t to time t+1 by some dynamics \(f_{t}: \mathcal{X}_{t} \times\mathcal{U}_{t} \rightarrow \mathcal{X}_{t+1}\). The aim of the decision maker is to minimize the sum of the intermediate costs Lt at all time steps plus a final cost K.

Denoting by \(x=(x_{t_{0}},\ldots,x_{T})\) (resp. \(u=(u_{t_{0}},\ldots,u_{T-1})\)) the trajectory of the state variable (resp. control variable), the problem reads:
$$\min_{u} \; \sum_{t=t_0}^{T-1} L_t\left(x_t, u_t\right) + K\left(x_T\right),$$
(1a)
subject to the initial condition:
$$x_{t_0} \;\text{given},$$
(1b)
and dynamic constraints:
$$x_{t+1} = f_t\left(x_t, u_t\right), \quad t=t_0,\dots,T-1.$$
(1c)
Note that here the decision at time t is taken knowing the current time step and the initial condition (the decision is generally termed “open loop”). A priori, there is no need for more information since the model is deterministic (hence the evolution of the system is entirely determined by the initial condition of the state and by the control values).
Suppose a solution to this problem exists. This is a sequence of controls that we denote by \(u_{t_{0},t_{0}}^{*}, \dots, u_{t_{0},T-1}^{*}\), where the first index refers to the initial time step and the second index refers to the time step to which the decision applies. Moreover, we suppose a solution exists for each one of the natural subsequent problems, i.e. for every ti=t1,…,T−1:
$$\min_{u} \; \sum_{t=t_i}^{T-1} L_t\left(x_t, u_t\right) + K\left(x_T\right),$$
(2a)
subject to the initial condition:
$$x_{t_i} \;\text{given},$$
(2b)
and dynamic constraints:
$$x_{t+1} = f_t\left(x_t, u_t\right), \quad t=t_i,\dots,T-1.$$
(2c)
We denote the solutions of these problems by \(u_{t_{i},t_{i}}^{*},\dots,u_{t_{i},T-1}^{*}\), for every time step ti=t1,…,T−1. Those notations however make implicit the fact that the solutions do generally depend on the initial condition \(x_{t_{i}}\). We now make a first observation.

Lemma 1

(Independence of the initial condition)

In the very particular case when the solution to Problem (1a)–(1c) and the solutions to Problems (2a)–(2c) for every time step ti=t1,…,T−1 do not depend on the initial state conditions, problems are dynamically consistent.

Proof

Let us denote by \(x_{t_{0}, t_{i}}^{*}\) the optimal value of the state variable within Problem (1a)–(1c) at time ti. If we suppose that solutions to Problems (2a)–(2c) do not depend on the initial condition, then they are the same as the solutions obtained with the initial condition \(x_{t_{0}, t_{i}}^{*}\), namely \(u_{t_{0},t_{i}}^{*}, \dots, u_{t_{0},T-1}^{*}\). In other words, the sequence of decisions \(u_{t_{0},t_{0}}^{*}\), …, \(u_{t_{0},T-1}^{*}\) remains optimal for the subsequent problems starting at a later date. □

This property is of course not true in general, but we see in Example 1 hereafter and in Sect. 3 that some very practical problems do have this surprising property.

Example 1

Let us introduce, for every t=t0,…,T−1, functions \(l_{t}:\mathcal{U}_{t} \rightarrow\mathbb{R}\) and \(g_{t}: \mathcal{U}_{t}\rightarrow\mathbb{R}\), and assume that xt is scalar. Let K be a scalar constant and consider the following deterministic optimal control problem:
$$\min_{u} \; \sum_{t=t_0}^{T-1} l_t\left(u_t\right) x_t + K x_T \quad\text{s.t.}\quad x_{t+1} = g_t\left(u_t\right) x_t ,\ t=t_0,\dots,T-1,\ x_{t_0}\ \mbox{given}.$$
Variables xt can be recursively replaced using dynamics gt. Therefore, the above optimization problem can be written:
$$\min_{u} \; \sum_{t=t_0}^{T-1} l_t\left(u_t\right)g_{t-1}\left(u_{t-1}\right) \ldots g_{t_0}(u_{t_0})x_{t_0} + K g_{T-1}\left(u_{T-1}\right) \ldots g_{t_0}(u_{t_0}) x_{t_0}.$$
Hence the optimal cost of the problem is linear with respect to the initial condition \(x_{t_{0}}\). Suppose that \(x_{t_{0}}\) only takes positive values. Then the value of \(x_{t_{0}}\) has no influence on the minimizer (it only influences the optimal cost). The same argument applies at subsequent time steps ti>t0 provided that dynamics are such that xt remains positive for every time step t=t1,…,T. Now, formulate the same problem at a later date ti=t1,…,T−1, with initial condition \(x_{t_{i}}\) given. By the same token as for the first stage problem, the value of the initial condition \(x_{t_{i}}\) has no influence on the optimal controls. Assumptions made in Lemma 1 are fulfilled, so that the dynamic consistency property holds true for open-loop decisions without reference to initial state conditions.
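As a quick sanity check of this independence, one can verify numerically that the minimizing control sequence does not change with the (positive) initial stock. The following Python sketch is purely illustrative: the functions l_t and g_t, the constant K, the horizon and the control grid are assumptions chosen for the example, not data from the paper.

```python
import itertools

# Brute-force check of Example 1: cost = sum_t l_t(u_t) * x_t + K * x_T with
# multiplicative dynamics x_{t+1} = g_t(u_t) * x_t.  The cost is linear in
# x_{t0} > 0, so the argmin should be the same for any positive initial stock.

T = 3                                    # number of decision stages (assumed)
U = [i / 10 for i in range(11)]          # finite control grid on [0, 1] (assumed)
K = 2.0                                  # final cost coefficient (assumed)

def l(t, u):                             # instantaneous cost factor (assumed)
    return (u - 0.3 * (t + 1)) ** 2

def g(t, u):                             # positive multiplicative dynamics (assumed)
    return 1.0 + 0.5 * u

def total_cost(x0, controls):
    x, cost = x0, 0.0
    for t, u in enumerate(controls):
        cost += l(t, u) * x              # l_t(u_t) * x_t
        x = g(t, u) * x                  # x_{t+1} = g_t(u_t) * x_t
    return cost + K * x                  # + K * x_T

def best_controls(x0):
    return min(itertools.product(U, repeat=T), key=lambda us: total_cost(x0, us))

# Same argmin for any positive initial stock; only the optimal cost scales.
print(best_controls(1.0))
print(best_controls(7.5))
```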

Although, for the time being, this example may look very special, we will see later on that it is analogous to familiar SOC problems.

As already noticed, the property assumed at Lemma 1 does not hold true in general. Moreover, the deterministic formulation (1a)–(1c) comes in general from the representation of a real-life process which may indeed be subject to unmodeled disturbances. Think of an industrial context, for example, in which sequential decisions are taken in the following manner.
  • At time t0, Problem (1a)–(1c) is solved. One obtains a decision \(u_{t_{0}, t_{0}}^{*}\) to apply at time t0, as well as decisions \(u_{t_{0}, t_{1}}^{*}\), …, \(u_{t_{0}, T-1}^{*}\) for future time steps.

  • At time t1, one formulates and solves the problem starting at time t1 with initial condition \(x_{t_{1}}=f_{t_{0}}(x_{t_{0}}, u_{t_{0}, t_{0}}^{*}) +\varepsilon_{t_{1}}\), \(\varepsilon_{t_{1}}\) being some perturbation of the model. There is no reason not to use the observation of the actual value of the variable \(x_{t_{1}}\) at time t1 as long as we have it at our disposal.

  • Hence a decision \(u_{t_{1}, t_{1}}^{*}\) is obtained, which is different from the initially obtained optimal decision \(u_{t_{0},t_{1}}^{*}\) (once again, in general).

  • The same process continues at times t2,…,T−1.

Let us now state the following two lemmas.

Lemma 2

(True deterministic world)

If the deterministic model is actually exact, i.e. if all perturbations \(\varepsilon_{t_{i}}\) introduced above equal zero, then Problems (2a)–(2c) with initial conditions \(x_{t_{i}} = x_{t_{i}}^{*} := f_{t_{i-1}}(x_{t_{i-1}}^{*}, u_{t_{0}, t_{i-1}}^{*})\) are dynamically consistent.

Proof

Since decisions \(u_{t_{0}, t_{0}}^{*}, \dots, u_{t_{0}, T-1}^{*}\) are optimal for Problem (1a)–(1c), it follows that decisions \(u_{t_{0}, t_{1}}^{*}, \dots, u_{t_{0}, T-1}^{*}\) are optimal for the problem:
$$\min_{u_{t_1},\dots,u_{T-1}} \; L_{t_0}\left(x_{t_0}, u_{t_0,t_0}^{*}\right) + \sum_{t=t_1}^{T-1} L_t\left(x_t, u_t\right) + K\left(x_T\right) \quad\text{s.t.}\quad x_{t_1} = f_{t_0}\left(x_{t_0}, u_{t_0,t_0}^{*}\right),\ x_{t+1} = f_t\left(x_t, u_t\right),$$
which, since the first term is a constant, has the same argmin as Problem (2a)–(2c) at time t1. The same argument applies recursively for subsequent time steps. □

It is clear that the assumption made at Lemma 2 is not satisfied in real life. Therefore, adding disturbances to the problem seems to bring inconsistency to the sequence of optimization problems. Decisions that are optimal for the first stage problem do not remain optimal for the subsequent problems if we do not let decisions depend on the initial conditions.

In fact, as it is stated next, dynamic consistency is recovered provided we let decisions depend upon the right information.

Lemma 3

(Right amount of information)

Suppose that the solutions of Problem (1a)–(1c) are looked for in the set of control strategies \((\phi_{t_{0},t_{0}}^{*}, \dots, \phi_{t_{0}, T-1}^{*})\), where each feedback function \(\phi_{t_{0}, t}^{*}\) is defined on \(\mathcal{X}_{t}\) and takes values in \(\mathcal{U}_{t}\). Then Problems (2a)–(2c) are dynamically consistent for every time step t=t0,…,T−1.

Proof

The result is a direct application of the DP principle, which states that there exists such a feedback function \(\phi^{*}_{t_{0}, t_{i}}\) that is optimal for Problem (1a)–(1c) and is still optimal for Problem (2a)–(2c) at time ti, whatever initial condition \(x_{t_{i}}\) is. □

We thus retrieve the dynamic consistency property provided that we use the feedback functions \(\phi_{t_{0},t}^{*}\) rather than the controls \(u_{t_{0},t}^{*}\). In other words, problems are dynamically consistent as soon as the control strategy is based on a sufficiently rich amount of information (time instant t and state variable x in the deterministic case).

There is of course an obvious link between these optimal strategies and the controls \((u_{t_{0},t_{0}}^{*},\ldots,u_{t_{0},T-1}^{*})\), namely:
$$u_{t_0,t}^{*} = \phi_{t_0,t}^{*}\left(x_{t_0,t}^{*}\right), \quad t=t_0,\dots,T-1,$$
where
$$x_{t_0,t_0}^{*} = x_{t_0} \quad\text{and}\quad x_{t_0,t+1}^{*} = f_t\left(x_{t_0,t}^{*}, \phi_{t_0,t}^{*}\left(x_{t_0,t}^{*}\right)\right), \quad t=t_0,\dots,T-1.$$

The considerations we made so far seem to be somewhat trivial. However, we shall observe that for SOC problems, which may seem more complicated at first sight, the same considerations remain true. Most of the time, decision making processes are dynamically consistent, provided we choose the correct information on which decisions are based.

3 Stochastic optimal control without constraints

We now consider a more general case in which a controlled dynamical system is influenced by modeled exogenous disturbances. The decision maker has to find strategies to drive the system so as to minimize some objective function over a certain time horizon. This is a sequential decision-making process for which the question of dynamic consistency can be raised. As in the previous example, the family of optimization problems is derived from the original one by truncating the dynamics and the cost function (the final time step T remains unchanged in each problem), and strategies are defined relying on the same information structure as in the original problem. In the sequel, random variables will be denoted using bold letters.

3.1 The classical case

Consider a dynamical system characterized by state2 variables \(\uppercase{\boldsymbol{x}}=(\uppercase{\boldsymbol{x}}_{t})_{t=t_{0},\dots,T}\), where \(\uppercase{\boldsymbol{x}}_{t}\) takes values in \(\mathcal{X}_{t}\). The system can be influenced by control variables \(\uppercase{\boldsymbol{u}}=(\uppercase{\boldsymbol{u}}_{t})_{t=t_{0},\dots,T-1}\) and by exogenous noise variables \(\uppercase{\boldsymbol{w}} = (\uppercase{\boldsymbol{w}}_{t})_{t=t_{0}, \dots, T}\) (\(\uppercase{\boldsymbol{u}}_{t}\) and \(\uppercase{\boldsymbol{w}}_{t}\) taking values in \(\mathcal{U}_{t}\) and \(\mathcal{W}_{t}\) respectively). Noises that affect the system can be correlated through time. All random variables are defined on a probability space \((\Omega, \mathcal{A}, \mathbb{P})\). The problem we consider consists in minimizing the expectation of a sum of costs depending on the state, the control and the noise variables over a discrete finite time horizon. The state variable evolves with respect to some dynamics that depend on the current state, noise and control values. The problem starting at t0 is written:3
$$\min_{\uppercase{\boldsymbol{u}}} \; \mathbb {E} \Biggl (\sum_{t=t_0}^{T-1} L_t\bigl(\uppercase{\boldsymbol{x}}_t, \uppercase{\boldsymbol{u}}_t, \uppercase{\boldsymbol{w}}_{t+1}\bigr) + K\bigl(\uppercase{\boldsymbol{x}}_T\bigr)\Biggr ),$$
(3a)
$$\uppercase{\boldsymbol{x}}_{t_0} \sim\mu_{t_0},$$
(3b)
$$\uppercase{\boldsymbol{x}}_{t+1} = f_t\bigl(\uppercase{\boldsymbol{x}}_t, \uppercase{\boldsymbol{u}}_t, \uppercase{\boldsymbol{w}}_{t+1}\bigr), \quad t=t_0,\dots,T-1,$$
(3c)
$$\uppercase{\boldsymbol{u}}_{t} \preceq\bigl(\uppercase{\boldsymbol{x}}_{t_0}, \uppercase{\boldsymbol{w}}_{t_1}, \dots, \uppercase{\boldsymbol{w}}_{t}\bigr), \quad t=t_0,\dots,T-1.$$
(3d)
On the one hand, (3b) states that the probability law \(\mu_{t_{0}}\) of the initial state \(\uppercase{\boldsymbol{x}}_{t_{0}}\) is part of the known data of the problem (such knowledge is mandatory for evaluating the criterion in (3a)). On the other hand, (3d) implies that each control variable \(\uppercase{\boldsymbol{u}}_{t}\) may be modeled as a function notably depending on the realization of the random variable \(\uppercase{\boldsymbol{x}}_{t_{0}}\). Of course, such a function may also parametrically depend on the data of the problem, and thus on the probability law \(\mu_{t_{0}}\). We shall see now that this last dependency does not occur for Problem (3a)–(3d) under the so-called Markovian assumption (Assumption 1 hereafter).

A general approach in optimal control consists in assuming that noise variables \(\uppercase{\boldsymbol{w}}_{t_{1}}, \dots, \uppercase{\boldsymbol{w}}_{T}\) are independent through time, so that all necessary information is included in the state variables \(\uppercase{\boldsymbol{x}}\).4 We hence make the following assumption.

Assumption 1

(Markovian setting)

The random variables \(\uppercase{\boldsymbol{x}}_{t_{0}}, \uppercase{\boldsymbol{w}}_{t_{1}}, \dots, \uppercase{\boldsymbol{w}}_{T}\) are independent.

Using Assumption 1, it is well known (see Bertsekas 2000) that:
  • there is no loss of optimality in looking for the optimal strategy \(\uppercase{\boldsymbol{u}}_{t}\) at time t as a feedback function depending on the state variable \(\uppercase{\boldsymbol{x}}_{t}\), i.e. as a (measurable) function of the form \(\phi_{t_{0}, t}: \mathcal{X}_{t} \rightarrow\mathcal{U}_{t}\);

  • the optimal strategies \(\phi_{t_{0}, t_{0}}^{*}, \dots, \phi_{t_{0},T-1}^{*}\) can be obtained by solving the classical DP equation. Let Vt(x) denote the optimal cost when starting at time step t with state value x; this equation reads:
$$V_T\left(x\right) = K\left(x\right),$$
(4a)
$$V_t\left(x\right) = \min_{u \in\mathcal{U}_t} \; \mathbb {E} \Bigl (L_t\left(x, u, \uppercase{\boldsymbol{w}}_{t+1}\right) + V_{t+1}\bigl(f_t\left(x, u, \uppercase{\boldsymbol{w}}_{t+1}\right)\bigr)\Bigr ), \quad t=T-1,\dots,t_0.$$
(4b)
We call this case the classical case. A consequence of the DP equation is that the argmin in (4b) depends on both x and the probability law of \(\uppercase{\boldsymbol{w}}_{t+1}\), but not on \(\mu_{t_{0}}\), so that the optimal feedback functions do not depend on the initial condition \(\mu_{t_{0}}\). The probability law of \(\uppercase{\boldsymbol{x}}_{t_{0}}\) only affects the optimal cost value, but not the argmin. Moreover, it is clear upon inspecting the DP equation that optimal strategies \(\phi_{t_{0},t_{0}}^{*}\), …, \(\phi_{t_{0}, T-1}^{*}\) remain optimal for the subsequent optimization problems:
$$\min_{\uppercase{\boldsymbol{u}}} \; \mathbb {E} \Biggl (\sum_{t=t_i}^{T-1} L_t\bigl(\uppercase{\boldsymbol{x}}_t, \uppercase{\boldsymbol{u}}_t, \uppercase{\boldsymbol{w}}_{t+1}\bigr) + K\bigl(\uppercase{\boldsymbol{x}}_T\bigr)\Biggr ),$$
(5a)
$$\uppercase{\boldsymbol{x}}_{t_i} \sim\mu_{t_i},$$
(5b)
$$\uppercase{\boldsymbol{x}}_{t+1} = f_t\bigl(\uppercase{\boldsymbol{x}}_t, \uppercase{\boldsymbol{u}}_t, \uppercase{\boldsymbol{w}}_{t+1}\bigr), \quad t=t_i,\dots,T-1,$$
(5c)
$$\uppercase{\boldsymbol{u}}_{t} \preceq\bigl(\uppercase{\boldsymbol{x}}_{t_i}, \uppercase{\boldsymbol{w}}_{t_i+1}, \dots, \uppercase{\boldsymbol{w}}_{t}\bigr), \quad t=t_i,\dots,T-1,$$
(5d)
for every ti=t1,…,T−1. In other words, these problems are dynamically consistent provided the information variable at time t contains at least the state variable \(\uppercase{\boldsymbol{x}}_{t}\), that is, the control variable \(\uppercase{\boldsymbol{u}}_{t}\) is modeled as a function of the realizations of \(\uppercase{\boldsymbol{x}}_{t}\). While building an analogy with properties described in the deterministic example in Sect. 2, the reader should be aware that the case we consider here is closer to Lemma 1 than to Lemma 3, as we explain now in more detail.
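To make the classical case concrete, here is a minimal Python sketch of the backward recursion (4a)–(4b) on a small finite problem: the feedback table at each stage is computed state by state and never refers to the initial law μt0. All numerical data (state and control sets, noise law, costs) are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Backward DP recursion (4a)-(4b) on an assumed finite problem.
X = range(3)                      # state space (assumed)
U = range(2)                      # control space (assumed)
W = [-1, 0, 1]                    # noise support (assumed)
pw = [0.25, 0.5, 0.25]            # noise law, independent across time (Assumption 1)
T = 4                             # number of decision stages (assumed)

def f(x, u, w):                   # dynamics, clipped to the state space
    return min(max(x - u + w, 0), 2)

def L(t, x, u):                   # instantaneous cost (assumed, noise-free here)
    return u * (1.0 + 0.1 * t) + 0.2 * x

def K(x):                         # final cost (assumed)
    return float((x - 2) ** 2)

V = np.array([K(x) for x in X])                    # V_T
feedback = []                                      # one feedback table per time step
for t in reversed(range(T)):
    Q = np.array([[sum(p * (L(t, x, u) + V[f(x, u, w)]) for w, p in zip(W, pw))
                   for u in U] for x in X])
    feedback.insert(0, Q.argmin(axis=1))           # argmin over u for each state x
    V = Q.min(axis=1)                              # V_t
print(feedback)   # the tables depend on x and the noise law only, not on mu_{t0}
```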

3.2 The distributed formulation

In fact, we are within the same framework as in Example 1. Indeed, Problem (3a)–(3d) can be written as a deterministic distributed optimal control problem involving the probability laws of the state variable, the dynamics of which are given by the so-called Fokker–Planck equation. Let us detail this last formulation (see Witsenhausen 1973).

Let Ξt be a linear space of ℝ-valued measurable functions on \(\mathcal{X}_{t}\) (equipped with an appropriate σ-field). We can consider a dual pairing between Ξt and a space ϒt of signed measures on this σ-field by considering the bilinear form:5
$$\left\langle\xi, \mu\right\rangle= \int_{\mathcal{X}_t}\xi(x) \mathrm{d} \mu(x) \quad\text{for} \;\xi\in\varXi_{t} \;\text{and}\; \mu\in\varUpsilon_{t} . $$
(6)
For ξt ∈ Ξt, let \(\uppercase{\boldsymbol{x}}_{t}\) be a random variable taking values in \(\mathcal{X}_{t}\) and distributed according to the probability law μt. The above bilinear form reads as follows:
$$\langle\xi_t, \mu_t \rangle= \mathbb {E} \bigl (\xi_t(\uppercase{\boldsymbol{x}}_t)\bigr ) .$$
Consider again Problem (3a)–(3d). Given feedback laws \(\phi_{t}: \mathcal{X}_{t} \rightarrow\mathcal{U}_{t}\) for every time step t=t0,…,T−1, we define the operator \(\mathbb {T}_{t}^{\phi_{t}}: \varXi_{t+1} \rightarrow\varXi_{t}\) as:
$$\big(\mathbb {T}_t^{\phi_t} \xi_{t+1}\big) \left(x \right) := \mathbb {E} \bigl (\xi_{t+1}(\uppercase{\boldsymbol{x}}_{t+1})\ \big |\ \uppercase{\boldsymbol{x}}_{t} = x\bigr )\quad\forall x \in\mathcal{X}_t .$$
Using the state dynamics (3c) and thanks to Assumption 1 (Markovian setting), we obtain:
$$\big(\mathbb {T}_t^{\phi_t} \xi_{t+1}\big) \left(x \right)= \mathbb {E} \Bigl (\xi_{t+1}\big(f_t(x,\phi_t(x), \uppercase{\boldsymbol{w}}_{t+1})\big)\Bigr )\quad\forall x \in\mathcal{X}_t .$$
Denoting by \(\mu_{t_{0}}\) the probability law of the first stage state \(\uppercase{\boldsymbol{x}}_{t_{0}}\) and by \((\mathbb {T}_{t}^{\phi_{t}})^{\star}\) the adjoint operator of \(\mathbb {T}_{t}^{\phi_{t}}\)—defined by the identity \(\langle \mathbb {T}_{t}^{\phi} \xi\:,\mu\rangle = \langle\xi\:,(\mathbb {T}_{t}^{\phi})^{\star}\mu \rangle\)—we can describe the evolution of the state probability law (as driven by the chosen feedback laws ϕt) by the following Fokker–Planck equation:
$$\mu_{t+1} = \big(\mathbb {T}_t^{\phi_t}\big)^\star\mu_t\quad\forall t=t_0,\dots,T-1,\ \mu_{t_0} \, \mbox{given} .$$
Indeed, for every ξt+1 ∈ Ξt+1 we have:
$$\big\langle\xi_{t+1}\:,\mu_{t+1}\big\rangle= \big\langle\xi_{t+1}\:,\big(\mathbb {T}_t^{\phi_t}\big)^\star\mu_t\big\rangle= \big\langle \mathbb {T}_t^{\phi_t}\xi_{t+1}\:,\mu_t\big\rangle= \mathbb {E} \Bigl (\mathbb {E} \bigl (\xi_{t+1}(\uppercase{\boldsymbol{x}}_{t+1})\ \big |\ \uppercase{\boldsymbol{x}}_{t}\bigr )\Bigr )= \mathbb {E} \bigl (\xi_{t+1}(\uppercase{\boldsymbol{x}}_{t+1})\bigr ) ,$$
so that μt+1 is indeed the probability law of \(\uppercase{\boldsymbol{x}}_{t+1}\). Next we introduce the operator \(\varLambda_{t}^{\phi_{t}} : \mathcal{X}_{t} \rightarrow\mathbb{R}\):
$$\varLambda_t^{\phi_t}\left(x\right) := \mathbb {E} \bigl (L_t\left(x, \phi_t\left(x\right),\uppercase{\boldsymbol{w}}_{t+1}\right)\bigr ) = \mathbb {E} \Bigl (L_t\big(\uppercase{\boldsymbol{x}}_t, \phi_t\left(\uppercase{\boldsymbol{x}}_t\right),\uppercase{\boldsymbol{w}}_{t+1}\big)\ \Big |\ \uppercase{\boldsymbol{x}}_t = x\Bigr ) \quad \forall x \in\mathcal{X}_t ,$$
which is meant to be the expected cost at time t for each possible state value when feedback function ϕt is applied. Using \(\varLambda_{t}^{\phi_{t}}\) we have:
$$ \mathbb {E} \bigl (L_t\left(\uppercase{\boldsymbol{x}}_t, \phi_t\left(\uppercase{\boldsymbol{x}}_t\right),\uppercase{\boldsymbol{w}}_{t+1}\right)\bigr )= \mathbb {E} \bigl ( \varLambda_t^{\phi_t}(\uppercase{\boldsymbol{x}}_t)\bigr ) =\big\langle\varLambda_t^{\phi_t}\:,\mu_t\big\rangle.$$
Replacing the expectation by the bilinear form (6) and replacing the state dynamics by the Fokker–Planck dynamics of the state law, we can now write a deterministic infinite-dimensional optimal control problem that is equivalent to Problem (3a)–(3d):
$$\min_{\phi_{t_0}, \dots, \phi_{T-1}} \; \sum_{t=t_0}^{T-1}\big\langle\varLambda_t^{\phi_t}\:,\mu_t\big\rangle+ \big\langle K\:,\mu_T\big\rangle\quad\text{s.t.}\quad\mu_{t+1} = \big(\mathbb {T}_t^{\phi_t}\big)^\star\mu_t ,\ t=t_0,\dots,T-1,\ \mu_{t_0}\ \mbox{given}.$$
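For a finite state space, the objects above have a concrete matrix form, which may help fix ideas: the operator \(\mathbb {T}_t^{\phi_t}\) acts on functions as the transition matrix of the controlled chain, its adjoint acts on probability laws as the transpose, and the pairing 〈⋅,⋅〉 is an ordinary dot product. The numbers in the following Python sketch are illustrative assumptions.

```python
import numpy as np

# Finite-state illustration of the transition operator, its adjoint and the pairing.
P = np.array([[0.7, 0.3, 0.0],     # row i: assumed law of x_{t+1} given x_t = i under phi_t
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
mu_t = np.array([0.5, 0.3, 0.2])   # assumed law of x_t
xi = np.array([1.0, 4.0, 9.0])     # an assumed test function xi_{t+1} on the state space

Txi = P @ xi                       # (T^phi xi)(x) = E[xi(x_{t+1}) | x_t = x]
mu_next = P.T @ mu_t               # Fokker-Planck step: mu_{t+1} = (T^phi)* mu_t

# Duality <T^phi xi, mu_t> = <xi, (T^phi)* mu_t>; both equal E[xi(x_{t+1})].
print(np.dot(Txi, mu_t), np.dot(xi, mu_next))
```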

Remark 1

An alternative formulation is:
$$\min_{\phi_{t_0}, \dots, \phi_{T-1}} \;\big\langle\xi_{t_0}\:,\mu_{t_0}\big\rangle\quad\text{s.t.}\quad\xi_t = \varLambda_t^{\phi_t}+ \mathbb {T}_t^{\phi_t}\xi_{t+1} ,\ t=T-1,\dots,t_0,\ \xi_T = K ,$$
where the ξt ∈ Ξt are real-valued functions on \(\mathcal{X}_{t}\). This may be called “the backward formulation” since the “state” ξt follows an affine dynamics which is backward in time, with an initial-only cost function (whereas the previous forward formulation follows a forward linear dynamics with an integral + final cost function). Both formulations are infinite-dimensional linear programming problems which are dual of each other (see Theorem 1 in Witsenhausen 1973). The measures μt and functions ξt are the distributed state and/or co-state (according to which one is considered the primal problem) of this distributed deterministic optimal control problem of which ϕ is the distributed control.
Probability laws μt are by definition positive and appear only in a multiplicative manner in the problem. Hence we are in a similar case to Example 1. The main difference is rather technical: since we here have probability laws instead of scalars, we need to apply interchange theorems between expectation and minimization backwards in time in order to prove that the solution of the problem actually does not depend on the initial condition \(\mu_{t_{0}}\). Indeed, suppose that μT−1 is given at time step T−1. Then the innermost optimization problem reads:
$$\min_{\phi_{T-1}} \; \big\langle\varLambda_{T-1}^{\phi_{T-1}}\:,\mu_{T-1}\big\rangle+ \big\langle K\:,\big(\mathbb {T}_{T-1}^{\phi_{T-1}}\big)^\star\mu_{T-1}\big\rangle ,$$
which is equivalent to:
$$ \min_{\phi_{T-1}}\big\langle\varLambda_{T-1}^{\phi_{T-1}}+\mathbb {T}_{T-1}^{\phi_{T-1}} K\:,\mu _{T-1}\big\rangle .$$
(8)
Note that the functional minimization in (8) can be reduced to a pointwise one, that is minimization has to be done independently for every x. Indeed, let \(g:\,\mathcal{X}_{T-1}\times\mathcal{U}_{T-1}\rightarrow\mathbb{R}\) be defined as:
$$g(x,u) := \mathbb {E} \Bigl (L_{T-1}\left(x,u,\uppercase{\boldsymbol{w}}_{T}\right)+K\big(f_{T-1}(x,u,\uppercase{\boldsymbol{w}}_{T})\big)\Bigr ) .$$
Then Problem (8) equivalently writes:
$$\min_{\phi_{T-1}} \int_{\mathcal{X}_{T-1}}g \big(x,\phi_{T-1}(x)\big) \mathrm{d} \mu_{T-1}(x) ,$$
ϕT−1 being a (measurable) function defined on \(\mathcal{X}_{T-1}\). The interchange theorem (Rockafellar and Wets 1998, Theorem 14.60) applies.6 We are in the case of Example 1 for every x. Therefore, the minimizer does not depend on μT−1. The same argument applies recursively to every time step before T−1 so that, at time t0, the initial condition \(\mu_{t_{0}}\) only influences the optimal cost of the problem, but not the argument of the minimum itself (here, the feedback laws \(\phi_{t_{0},t}^{*}\)).
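The interchange argument can be checked directly on a finite grid: minimizing over all feedback functions φT−1 gives the same value as minimizing pointwise in x, so the pointwise argmin, hence the optimal feedback, does not involve μT−1. The cost g and the law μ in this Python sketch are illustrative assumptions.

```python
import itertools
import numpy as np

# Pointwise minimization versus minimization over feedback functions (assumed data).
X = range(3)                                                     # state grid
U = range(4)                                                     # control grid
g = np.array([[(x - u) ** 2 + 0.1 * u for u in U] for x in X])   # assumed cost g(x, u)
mu = np.array([0.2, 0.5, 0.3])                                   # assumed law of x_{T-1}

# Minimum over all feedback functions phi: X -> U ...
brute = min(sum(mu[x] * g[x, phi[x]] for x in X)
            for phi in itertools.product(U, repeat=len(X)))
# ... equals the integral of the pointwise minimum (interchange of min and expectation).
pointwise = float(mu @ g.min(axis=1))
print(brute, pointwise)            # identical; the argmin per x is mu-independent
```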

Hence, following Lemma 1, Problems (5a)–(5d) are naturally dynamically consistent when strategies are searched as feedback functions on \(\uppercase{\boldsymbol{x}}_{t}\) only. It thus appears that the rather general class of stochastic optimal control problems shaped as Problem (3a)–(3d) is in fact very specific. However, such a property does not remain true when adding new ingredients in the problem, as we show in the next subsection.

4 Stochastic optimal control with constraints

We now give an example in which the state variable, as defined notably by Whittle (1982), cannot be reduced to variable \(\uppercase{\boldsymbol{x}}_{t}\) as above. Let us make Problem (3a)–(3d) more complex by adding to the model a probability constraint applying to the final time step T. For instance, we want the system to be in a certain state at the final time step with a given probability:
$$\mathbb {P}\left (h\left(\uppercase{\boldsymbol{x}}_T\right) \geq b\right ) \leq \pi.$$
Such chance constraints can equivalently be modeled as an expectation constraint in the following way:
$$\mathbb {E}\left (\mathbf {1}_{\left\{h\left(\uppercase{\boldsymbol{x}}_T\right) \geq b\right\}}\right )\leq\pi,$$
where 1A refers to the indicator function of set A. Note however that chance constraints bring important theoretical and numerical difficulties, notably regarding connectedness and convexity of the feasible set of controls, even in the static case. The interested reader should refer to the work of Prékopa (1995), and to the handbook by Ruszczynski and Shapiro (2003, Chap. 5) for mathematical properties and numerical algorithms in Probabilistic Programming (see also Henrion 2002; Henrion and Strugarek 2008, for related studies). We do not discuss them here. The difficulty we are interested in is common to both chance and expectation constraints. This is why we concentrate in the sequel on adding an expectation constraint to Problem (3a)–(3d) of the form:
$$ \mathbb {E}\left (g\left(\uppercase{\boldsymbol{x}}_T\right)\right ) \leq a. $$
(9)
The reader familiar with chance constraints might want to see the level a as a level of probability that one wants to satisfy for a certain event at the final time step.
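As a small illustration of this equivalence, the expectation of the indicator can be estimated by plain Monte Carlo sampling; the law of xT, the function h and the threshold b below are illustrative assumptions.

```python
import numpy as np

# Estimating P(h(x_T) >= b) as the expectation E[1_{h(x_T) >= b}] (assumed data).
rng = np.random.default_rng(1)
x_T = rng.normal(loc=2.0, scale=1.0, size=100_000)   # assumed samples of the terminal state
h = lambda x: x ** 2                                 # assumed function h
b = 9.0                                              # assumed threshold

prob_estimate = np.mean(h(x_T) >= b)                 # sample mean of the indicator
print(prob_estimate)
```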

We now show that when adding such an expectation constraint, the dynamic consistency property falls apart. More precisely, the sequence of SOC problems is not dynamically consistent anymore when using the usual state variable. Nevertheless, we observe that the lack of consistency comes from an inappropriate choice for the state variable. By choosing the appropriate state variable, one regains dynamic consistency.

4.1 Problem setting

We introduce a measurable function \(g: \mathcal{X}_{T}\rightarrow \mathbb {R}\) and a∈ℝ and consider Problem (3a)–(3d) with the additional final expectation constraint:
$$ \mathbb {E}\left (g\left(\uppercase{\boldsymbol{x}}_T\right)\right ) \leq a.$$
The subsequent optimization problems formulated at an initial time ti>t0 are naturally deduced from this problem. The level a of the expectation constraint remains the same for every problem. Admittedly this corresponds to a (naive) formulation of the family of optimization problems under consideration. Such a choice is questionable since the perception of the risk level a may evolve over time.

Suppose there exists a solution for the problem at t0. As previously, we are looking for the optimal control at time t as a feedback function \(\phi_{t_{0}, t}^{*}\) depending on the realizations of the variable \(\uppercase{\boldsymbol{x}}_{t}\). The first index t0 refers to the time step at which the problem is stated, while the second index t refers to the time step at which the decision is taken.

One has to be aware that these solutions now parametrically depend on the initial condition \(\mu_{t_{0}}\). Indeed, let μT be the probability law of \(\uppercase{\boldsymbol{x}}_{T}\). Constraint (9) can be written \(\left\langle g, \mu_{T} \right\rangle\leq a\), so that the equivalent distributed formulation of the initial time problem is:
$$\min_{\phi_{t_0}, \dots, \phi_{T-1}} \; \sum_{t=t_0}^{T-1}\big\langle\varLambda_t^{\phi_t}\:,\mu_t\big\rangle+ \big\langle K\:,\mu_T\big\rangle ,$$
(10a)
subject to the Fokker–Planck dynamics:
$$\mu_{t+1} = \big(\mathbb {T}_t^{\phi_t}\big)^\star\mu_t\quad\forall t=t_0,\dots,T-1,$$
(10b)
\(\mu_{t_{0}}\) being given by the initial condition, and the final expectation constraint:
$$\left\langle g, \mu_T \right\rangle\leq a.$$
(10c)
Even though this problem seems linear with respect to variables μt, the last constraint introduces an additional highly nonlinear term in the cost function, namely:
$$\chi _{{\left\{\left\langle g,\mu_T \right\rangle\leq a \right\}}},$$
where χA stands for the characteristic function7 of set A. The dynamics are still linear and variables μt are still positive, but the objective function is not linear with respect to μT anymore, and therefore not linear with respect to the initial law \(\mu_{t_{0}}\) either. Hence there is no reason for feedback laws to be independent of the initial condition as in the case without constraint presented in Sect. 3.
Let us now make a remark on this initial condition. Since the information structure is such that the state variable is fully observed, the initial condition is in fact of a deterministic nature:
$$\uppercase{\boldsymbol{x}}_{t_{0}} = x_{t_{0}},$$
where \(x_{t_{0}}\) is a given (observed) value of the system state. The probability law of \(\uppercase{\boldsymbol{x}}_{t_{0}}\) is accordingly the Dirac function \(\delta_{x_{t_{0}}}\).8 The reasoning made for the problem initiated at time t0 remains true for the subsequent problems starting at time ti: an observation \(x_{t_{i}}\) of the state variable \(\uppercase{\boldsymbol{x}}_{t_{i}}\) becomes available before solving Problem (5a)–(5d), so that its natural initial condition is in fact:
$$\uppercase{\boldsymbol{x}}_{t_{i}} = x_{t_{i}}.$$
Otherwise stated, the initial state probability law in each optimization problem we consider should correspond to a Dirac function. Note that such a sequence of Dirac functions is not driven by the Fokker–Planck equation, but is in fact associated to some dynamics of the degenerate filter corresponding to this perfect observation scheme. In the sequel, we assume such an initial condition for every problem we consider.
Now, according to Lemma 2, the subsequent optimization problems formulated at time ti will be dynamically consistent provided their initial conditions are given by the optimal Fokker–Planck equation:
$$\mu_{t_0, t_{i}}^* =\Bigl (\mathbb {T}_{t_{i-1}}^{\phi_{t_{0},t_{i-1}}^*}\Bigr )^\star\cdots \Bigl (\mathbb {T}_{t_{0}}^{\phi_{t_{0},t_{0}}^*}\Bigr )^\star\mu_{t_0}.$$
However, except for noise free problems, such a probability law \(\mu_{t_{0}, t_{i}}^{*}\) is always different from a Dirac function, which is, as already explained, the natural initial condition for the subsequent problem starting at time ti. As a conclusion, the sequence of subsequent optimization problems deduced from (10a)–(10c) for initial times ti>t0 is not dynamically consistent as long as we consider feedback laws ϕt depending on \(\uppercase{\boldsymbol{x}}_{t}\) only.

Remark 2

(Joint probability constraints)

Rather than \(\mathbb {P}\left (g\left(\uppercase{\boldsymbol{x}}_{T}\right) \geq b\right ) \leq a\), let us consider a more general chance constraint of the form:
$$ \mathbb {P}\left (g_t\left(\uppercase{\boldsymbol{x}}_t\right) \geq b_t, \forall t=t_1, \dots, T\right )\leq a.$$
This last constraint can be modeled, like the previous one, through an expectation constraint by introducing a new binary state variable:
$$\uppercase{\boldsymbol{y}}_{t_1} = \mathbf {1}_{\left\{g_{t_1}\left(\uppercase{\boldsymbol{x}}_{t_1}\right) \geq b_{t_1}\right\}}, \qquad\uppercase{\boldsymbol{y}}_{t+1} = \uppercase{\boldsymbol{y}}_{t} \cdot\mathbf {1}_{\left\{g_{t+1}\left(\uppercase{\boldsymbol{x}}_{t+1}\right) \geq b_{t+1}\right\}}, \quad t=t_1,\dots,T-1,$$
and considering constraint \(\mathbb {E}\left (\uppercase{\boldsymbol{y}}_{T}\right ) \leq a\).
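A short Monte Carlo sketch of this construction follows: the binary variable yt is multiplied at each step by the indicator of the current constraint, so that E(yT) is exactly the joint probability. The dynamics of xt, the functions gt and the thresholds bt below are illustrative assumptions.

```python
import numpy as np

# Tracking a joint chance constraint with a binary state variable y_t (assumed data).
rng = np.random.default_rng(2)
T, n_samples = 5, 50_000
b = np.full(T + 1, -0.5)        # assumed thresholds b_t

def g(t, x):                    # assumed functions g_t
    return x

x = np.zeros(n_samples)         # assumed initial state
y = np.ones(n_samples)          # y before any constraint is checked
for t in range(1, T + 1):
    x = x + rng.normal(size=n_samples)       # assumed random-walk dynamics
    y = y * (g(t, x) >= b[t])                # y_t = y_{t-1} * 1_{g_t(x_t) >= b_t}
print(y.mean())                 # estimates P(g_t(x_t) >= b_t for all t) = E[y_T]
```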

4.2 Back to dynamic consistency

We now show that dynamic consistency can be recovered provided we choose the right state variable on which to base decisions. We hence establish a link between dynamic consistency of a family of optimization problems and the notion of state variable.

We claim that a better-suited state variable for the family of problems with final time expectation constraint introduced above is the probability law of the variable \(\uppercase{\boldsymbol{x}}\). Let us denote by Vt(μt) the optimal cost of the problem starting at time t with initial condition μt. Using notations of the distributed formulation of a SOC problem, one can write a DP equation depending on the probability laws μ on \(\mathcal{X}\):
$$V_T\left(\mu\right) = \left\langle K, \mu \right\rangle+ \chi _{{\left\{\left\langle g,\mu \right\rangle\leq a \right\}}} ,$$
and, for every t=t0,…,T−1 and every probability law μ on \(\mathcal{X}\):
$$V_t\left(\mu\right) = \min_{\phi_t} \;\big\langle\varLambda_t^{\phi_t}\:,\mu\big\rangle+ V_{t+1}\Big(\big(\mathbb {T}_t^{\phi_t}\big)^\star\mu\Big) .$$
The context is similar to the one of the deterministic example of Sect. 2, and Lemma 3 states that solving the deterministic infinite-dimensional problem associated with the constrained problem leads to dynamic consistency provided DP is used. For the problem under consideration, we thus obtain optimal feedback functions ϕt which depend on the probability laws μt. Otherwise stated, the family of constrained problems introduced in Sect. 4.1 is dynamically consistent provided one looks for strategies as feedback functions depending on both the realization of variable \(\uppercase{\boldsymbol{x}}_{t}\) and the probability law of \(\uppercase{\boldsymbol{x}}_{t}\).

Naturally, this DP equation is rather conceptual. The resolution of such an equation is intractable in practice since probability laws μt are infinite-dimensional objects.
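When the state space is finite, however, the recursion on laws can be written down directly, which anticipates the toy example of the next section. The following recursive Python sketch evaluates Vt(μ) for a three-state chain by enumerating the feedback laws at each stage; the transition matrices, costs, constraint function g and level a are illustrative assumptions.

```python
import itertools
import numpy as np

# Recursive evaluation of the DP equation on probability laws (assumed finite data):
# V_T(mu) = <K, mu> + chi_{<g, mu> <= a},
# V_t(mu) = min_phi <Lambda^phi, mu> + V_{t+1}((T^phi)* mu).
n_states, T = 3, 3
P = {0: np.array([[0.7, 0.3, 0.0], [0.2, 0.5, 0.3], [0.0, 0.4, 0.6]]),
     1: np.array([[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.2, 0.5, 0.3]])}
cost = {u: np.full(n_states, -1.0 * u) for u in (0, 1)}   # assumed per-state cost of u
K = np.zeros(n_states)                                    # assumed final cost
g = np.array([1.0, 0.0, 0.0])                             # <g, mu_T> = P(x_T = state 0)
a = 0.4                                                   # assumed constraint level

feedbacks = list(itertools.product((0, 1), repeat=n_states))

def V(t, mu):
    if t == T:
        return float(K @ mu) if g @ mu <= a else np.inf   # characteristic function term
    best = np.inf
    for phi in feedbacks:
        Pphi = np.array([P[phi[i]][i] for i in range(n_states)])    # rows selected by phi
        lam = np.array([cost[phi[i]][i] for i in range(n_states)])  # Lambda^phi
        best = min(best, float(lam @ mu) + V(t + 1, Pphi.T @ mu))
    return best

print(V(0, np.array([0.0, 1.0, 0.0])))   # optimal cost from a Dirac initial law
```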

5 A toy example with finite probabilities

Let us highlight the results of Sect. 4 using a rather simple case-study, corresponding to a discrete controlled Markov chain. The state space \(\mathbb{X}=\{1,2,3\}\) represents the discrete possible values of a reservoir, and we denote by \(\underline{x}=1\) (resp. \(\overline{x}=3\)) the lower (resp. upper) level of the reservoir. The control action u takes values in \(\mathbb{U}=\{0,1\}\), the value 1 corresponding to using some given water release to produce some electricity. The noise variable affecting the reservoir (stochastic inflow/outflow) takes its values in \(\mathbb{W}=\{-1,0,1\}\). We suppose that Assumption 1 is fulfilled, the discrete probability law of each noise variable \(\uppercase{\boldsymbol{w}}_{t}\) being characterized by the weights \(\{\varpi_{-},\varpi_{0},\varpi_{+}\}\). Finally the reservoir dynamics is:
$$\uppercase{\boldsymbol{x}}_{t+1} =\min\big(\overline{x} , \max (\,\underline{x},\uppercase{\boldsymbol{x}}_{t}-\uppercase{\boldsymbol{u}}_{t}+\uppercase{\boldsymbol{w}}_{t+1})\big).$$
Let μt be the discrete probability law associated with the state random variable \(\uppercase{\boldsymbol{x}}_{t}\), the initial probability law \(\mu_{t_{0}}\) being given. In such a discrete case, it is easy to compute the transition matrix Pu giving the Markov chain transitions for each possible value u of the control, with the following interpretation:
$$P^{u}_{ij} = \mathbb {P}\left (\uppercase{\boldsymbol{x}}_{t+1} = j \mid\uppercase{\boldsymbol{x}}_{t} = i, \uppercase{\boldsymbol{u}}_{t} = u\right ).$$
We obtain for the reservoir problem:
$$P^{0} =\left(\begin{array}{c@{\quad}c@{\quad}c}\varpi_{-}\!\!+\!\!\varpi_{0} \;\; & \varpi_{+} \;\; & 0 \\\varpi_{-} & \varpi_{0} & \varpi_{+} \\0 & \varpi_{-} & \varpi_{0}\!\!+\!\!\varpi_{+}\end{array}\right), \qquad P^{1} =\left(\begin{array}{c@{\quad}c@{\quad}c}1 & 0 & 0 \\\varpi_{-}\!\!+\!\!\varpi_{0} \;\; & \varpi_{+} \;\; & 0 \\\varpi_{-} & \varpi_{0} & \varpi_{+}\end{array}\right),$$
and we denote by \(P^{u}_{i}\) the ith row of matrix Pu.
Let \(\phi_{t} : \, \mathbb{X} \rightarrow\mathbb{U}\) be a feedback law at time t. We denote by Φt the set of admissible feedbacks at time t. In our discrete case, \(\mathrm{card}(\boldsymbol{\varPhi}_{t}) =\mathrm{card}(\mathbb{U})^{\mathrm{card}(\mathbb{X})} =8\) for all t. The transition matrix \(P^{\phi_{t}}\) associated with such a feedback is obtained by properly selecting rows of the transition matrices Pu, namely:
$$P^{\phi_{t}} =\left(\begin{array}{c}P^{\phi_{t}(1)}_{1} \\[3pt]P^{\phi_{t}(2)}_{2} \\[3pt]P^{\phi_{t}(3)}_{3}\end{array}\right) .$$
Then the dynamics given by the Fokker–Planck equation writes:
$$\mu_{t+1} = \big(P^{\phi_{t}}\big)^{\top }\mu_{t} ,$$
and the state μt involved in this dynamic equation is a three-dimensional vector:9
$$\mu_{t} =\left(\begin{array}{c}\mu_{1,t} \\\mu_{2,t} \\\mu_{3,t}\end{array}\right) .$$
Let us now consider a cost function corresponding to the sale of the produced electricity: the cost at time t is supposed to be linear, equal to \(p_{t}\uppercase{\boldsymbol{u}}_{t}\), pt being a (negative) deterministic price. There is no final cost (K≡0), but the reservoir level at final time T must be equal to \(\overline{x}\) with a probability level at least equal to π. The distributed formulation (10a)–(10c) associated with the reservoir control problem is:
$$\min_{\phi_{t_0}, \dots, \phi_{T-1}} \; \sum_{t=t_0}^{T-1} p_t \left\langle\phi_t , \mu_t \right\rangle ,$$
(11a)
$$\mu_{t+1} = \big(P^{\phi_{t}}\big)^{\top }\mu_{t} \quad\forall t=t_0,\dots,T-1,\ \mu_{t_0} \, \mbox{given} ,$$
(11b)
$$\left\langle\mathbf {1}_{\{\overline{x}\}} , \mu_T \right\rangle\geq\pi,$$
(11c)
with \(\mathbf {1}_{\{\overline{x}\}}(x)=1\) if \(x=\overline{x}\), and 0 otherwise. Condition (11c) just expresses that the last component of the three-dimensional vector μT has to be greater than or equal to π. Here, 〈⋅,⋅〉 denotes the standard scalar product in ℝ3.
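For a short horizon, Problem (11a)–(11c) can even be solved by brute force, which makes the row-selection and the Fokker–Planck propagation explicit: enumerate the 8 feedback laws per stage, push the law μt forward with (11b), and keep the cheapest plan satisfying (11c). The noise weights, prices pt, horizon and level π in the Python sketch below are illustrative assumptions.

```python
import itertools
import numpy as np

# Brute-force resolution of the reservoir problem (11a)-(11c) for assumed data.
w_minus, w_0, w_plus = 0.3, 0.4, 0.3          # assumed noise weights
P = {0: np.array([[w_minus + w_0, w_plus, 0.0],
                  [w_minus, w_0, w_plus],
                  [0.0, w_minus, w_0 + w_plus]]),
     1: np.array([[1.0, 0.0, 0.0],
                  [w_minus + w_0, w_plus, 0.0],
                  [w_minus, w_0, w_plus]])}

T = 3                                         # number of decision stages (assumed)
prices = [-1.0, -1.2, -0.8]                   # negative prices p_t (assumed)
pi_level = 0.3                                # required P(x_T = xbar) (assumed)
mu0 = np.array([0.0, 1.0, 0.0])               # Dirac initial law: level 2 observed

feedbacks = list(itertools.product((0, 1), repeat=3))   # the 8 maps phi: X -> U

def transition(phi):
    # Row i of P^phi is row i of P^{phi(i)} (rows selected according to the feedback).
    return np.array([P[phi[i]][i] for i in range(3)])

best = None
for plan in itertools.product(feedbacks, repeat=T):
    mu, cost = mu0.copy(), 0.0
    for t, phi in enumerate(plan):
        cost += prices[t] * float(np.dot(phi, mu))       # p_t * <phi_t, mu_t>
        mu = transition(phi).T @ mu                      # Fokker-Planck step (11b)
    if mu[2] >= pi_level and (best is None or cost < best[0]):
        best = (cost, plan, mu)

print("optimal cost:", best[0])
print("optimal feedbacks:", best[1])
print("final law:", best[2])
```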

Problem (11a)–(11c) can be solved using dynamic programming: the state equation is a three-dimensional one,10 so that the computational burden of such a resolution remains achievable. Suppose now that the number of reservoir discrete levels is n (rather than 3), with n big enough: the resolution of the distributed formulation suffers from the curse of dimensionality ((n−1)-dimensional state), the ultimate curse being to consider that the level takes values in the interval \([\,\underline{x},\overline{x}\,]\), so that the μt’s are continuous probability laws.

6 Conclusion

We informally introduced a notion of dynamic consistency of a sequence of decision-making problems, which basically requires that plans made at the very first time step remain optimal if one rewrites the optimization problems at subsequent time steps. We show that, for several classes of optimal control problems, this concept is not new and can be directly linked with the notion of state variable, which is the minimal information one must use to be able to make the optimal decision.

We show that, in general, feedback laws have to depend on the probability law of the usual state variable for Stochastic Optimal Control problems to be dynamically consistent. This is necessary, for example, when the model contains expectation or chance constraints.

Future work will focus on three main directions. The first concern will be to better formalize the notion of state in the vein of the works by Witsenhausen (1971, 1973) and Whittle (1982). The second will be to establish the link with the literature concerning risk measures, in particular the work by Ruszczynski (2010). Finally, the last DP equations we introduced are in general intractable. In a forthcoming paper, we will provide a way to get back to a finite-dimensional information variable, which makes a resolution by DP tractable.

Footnotes

  1. Where \(t_{i+1} = t_{i} + 1\).

  2. The use of the terminology “state” is somewhat abusive until we make Assumption 1.

  3. We here use the notations ∼ for “is distributed according to” and ⪯ for “is measurable with respect to”. We suppose that all measurability and integrability conditions are fulfilled, so that all stochastic optimization problems in the sequel are well defined.

  4. If the noise variables are not independent, one has to include at most all the past values of the noise variable to make a new state variable.

  5. We do not aim at discussing technical details concerning spaces here. Following Witsenhausen (1973), a natural choice of Ξt is the Banach space of real bounded measurable functions with the supremum norm, ϒt being the Banach space of signed measures with the total variation norm. In the same vein, as far as integrability is concerned, we will suppose that all the operators we introduce are well-defined.

  6. One needs several technical assumptions concerning spaces and measurability in order to use (Rockafellar and Wets 1998, Theorem 14.60). We do not intend to discuss them in this paper.

  7. As defined in convex analysis:
     $$\chi _{{A}}(x) =\left\{\begin{array}{l@{\quad}l} 0 & \text{if } x \in A, \\+\infty&\text{otherwise} .\end{array}\right.$$

  8. The initial law \(\mu_{t_{0}}\) in Problem (3a)–(3d) corresponds to the information available on \(\uppercase{\boldsymbol{x}}_{t_{0}}\) before \(\uppercase{\boldsymbol{x}}_{t_{0}}\) is observed, but it seems more reasonable in a practical situation to use all the available information when setting the problem again at each new initial time, and thus to use a Dirac function as the initial condition.

  9. The dynamics of a controlled Markov chain is traditionally written \(\mu_{t+1} = \mu_{t} P^{\phi_{t}}\), μt being represented by a row vector. We rather use here a column vector (transpose of the previous one) in order to have consistent notations throughout the paper.

  10. In fact two-dimensional, because μ1,t+μ2,t+μ3,t=1.


Acknowledgements

This study was made within the Systems and Optimization Working Group (SOWG), which is composed of Laetitia Andrieu, Kengy Barty, Pierre Carpentier, Jean-Philippe Chancelier, Guy Cohen, Anes Dallagi, Michel De Lara and Pierre Girardeau, and based at Université Paris-Est, CERMICS, Champs sur Marne, 77455 Marne la Vallée Cedex 2, France.

References

  1. Artzner, P., Delbaen, F., Eber, J.-M., Heath, D., & Ku, H. (2007). Coherent multiperiod risk-adjusted values and Bellman’s principle. Annals of Operations Research, 152(1), 5–22.
  2. Bellman, R. (1957). Dynamic programming. Princeton: Princeton University Press.
  3. Bertsekas, D. (2000). Dynamic programming and optimal control (2nd ed.). Nashua: Athena Scientific.
  4. Cheridito, P., Delbaen, F., & Kupper, M. (2006). Dynamic monetary risk measures for bounded discrete-time processes. Electronic Journal of Probability, 11(3), 57–106.
  5. Detlefsen, K., & Scandolo, G. (2005). Conditional and dynamic convex risk measures. Finance and Stochastics, 9(4), 539–561.
  6. Dreyfus, S. (2002). Richard Bellman on the birth of dynamic programming. Operations Research, 50(1), 48–51.
  7. Ekeland, I., & Lazrak, A. (2006). Being serious about non-commitment: subgame perfect equilibrium in continuous time. arXiv:math.OC/0604264.
  8. Hammond, P. J. (1976). Changing tastes and coherent dynamic choice. Review of Economic Studies, 43(1), 159–173.
  9. Henrion, R. (2002). On the connectedness of probabilistic constraint sets. Journal of Optimization Theory and Applications, 112(3), 657–663.
  10. Henrion, R., & Strugarek, C. (2008). Convexity of chance constraints with independent random variables. Computational Optimization and Applications, 41(2), 263–276.
  11. Kreps, D. M., & Porteus, E. L. (1978). Temporal resolution of uncertainty and dynamic choice theory. Econometrica, 46(1), 185–200.
  12. Peleg, B., & Yaari, M. E. (1973). On the existence of a consistent course of action when tastes are changing. Review of Economic Studies, 40(3), 391–401.
  13. Prékopa, A. (1995). Stochastic programming. Dordrecht: Kluwer Academic.
  14. Riedel, F. (2004). Dynamic coherent risk measures. Stochastic Processes and Their Applications, 112(2), 185–200.
  15. Rockafellar, R. T., & Wets, R. J.-B. (1998). Variational analysis. Berlin: Springer.
  16. Ruszczynski, A. (2010). Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2), 235–261.
  17. Ruszczynski, A., & Shapiro, A. (Eds.) (2003). Handbooks in operations research and management science: Vol. 10. Stochastic programming. Amsterdam: Elsevier.
  18. Shapiro, A. (2009). On a time consistency concept in risk averse multistage stochastic programming. Operations Research Letters, 37(3), 143–147.
  19. Strotz, R. H. (1955–1956). Myopia and inconsistency in dynamic utility maximization. Review of Economic Studies, 23(3), 165–180.
  20. Whittle, P. (1982). Optimization over time. New York: Wiley.
  21. Witsenhausen, H. S. (1971). On information structures, feedback and causality. SIAM Journal on Control, 9(2), 149–160.
  22. Witsenhausen, H. S. (1973). A standard form for sequential stochastic control. Mathematical Systems Theory, 7(1), 5–11.

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Pierre Carpentier (1)
  • Jean-Philippe Chancelier (2)
  • Guy Cohen (2)
  • Michel De Lara (2)
  • Pierre Girardeau (1, 2, 3)

  1. ENSTA ParisTech, Paris Cedex 15, France
  2. CERMICS, École des Ponts ParisTech, Université Paris-Est, Marne-la-Vallée Cedex 2, France
  3. EDF R&D, Clamart Cedex, France
